Why logs: the one data structure streaming is built on

The fraud team at Razorpay sits down to design a real-time fraud-block service. They sketch a queue: producers push, consumers pop, the consumer that pops the message processes it, the message disappears. Two days later they discover that when the rule-engine consumer crashes mid-message, the message is gone. They add an "ack". Then they discover that when a second rule-engine wants to read the same stream — the analytics team also needs every transaction, in the same order — the queue can only deliver each message once, so they fork it, badly. Then they discover that during the IPL final, the queue grows to 40 GB in memory and OOM-kills the broker. Each fix moves them, step by step, toward the same data structure every other team converges on: an append-only log on disk. The lesson the industry learnt the hard way is that you cannot dodge the log; the queue is just the log with three of its best properties amputated.

A log is an immutable, totally-ordered, append-only sequence where every record gets a position (an offset) and is never modified after being written. That sounds simple, almost trivial. It is the substrate of every real streaming system because it is the only data structure that simultaneously gives you replay, multiple independent consumers, durability, and total order — without having to coordinate any of the readers. Kafka, Pulsar, Kinesis, Redpanda, Postgres WAL, MySQL binlog, and Redis Streams are all the same idea wearing different clothes.

What "a log" actually means in this chapter

In a data-engineering context a log is not the line your print statement writes to stderr. It is a specific, ancient data structure with three rules: records are appended in order, the order is total (every record has a unique position called the offset, and the offsets are monotonically increasing), and once written a record is immutable — never modified, never deleted out of order, only ever truncated from the head when retention says so. There is no "update record 47". There is "append a new record that supersedes record 47". The log does not care what supersedes means; that's the consumer's job.
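The supersede-by-append idea fits in a few lines. A minimal sketch (the user/email records here are hypothetical, not from the chapter's demo): the log only ever appends; it is the reader's fold that decides "latest record per key wins".

```python
# Hypothetical sketch: "updates" are superseding appends; the reader resolves them.
log = []  # the log: an append-only list of records

def append(record):
    log.append(record)
    return len(log) - 1  # the record's offset

append({"user": "u1", "email": "old@example.com"})
append({"user": "u2", "email": "b@example.com"})
append({"user": "u1", "email": "new@example.com"})  # supersedes offset 0

# The consumer's fold decides what "supersedes" means: here, last write wins per user.
latest = {}
for offset, rec in enumerate(log):
    latest[rec["user"]] = rec["email"]

print(latest)  # {'u1': 'new@example.com', 'u2': 'b@example.com'}
```

Note that offset 0 is still in the log — nothing was modified in place — and a different consumer could fold the same three records into a full edit history instead.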

Pat Helland, in his 2015 paper Immutability Changes Everything, made the case bluntly: when records never change, every consequence we worry about in mutable systems — concurrency, locks, cache invalidation, lost updates, write skew — collapses into a non-problem. The append-only log is the most extreme expression of that idea. The cost is that you give up the ability to "fix" a record in place; the buy is that everything downstream becomes radically simpler.

[Figure: Anatomy of an append-only log. A horizontal log with eight slots (tx_001 through tx_008) at offsets 0 to 7, monotonically increasing. A producer appends only at the right end; a retention boundary at the left shows offsets 0 and 1 fading out. Three readers sit at different offsets — Reader A at 3 (fraud rules), Reader B at 5 (analytics), Reader C at 1 (backfill replay) — each holding its own offset. The log doesn't know who has read what; readers track themselves.]
Three readers reading the same log at three different offsets. The producer never blocks waiting for a slow reader; the slow reader never blocks the fast one. The log's only job is to grow at the right end and forget at the left end.

The picture above hides one detail that is doing all the work: the log does not push messages to readers. Readers pull — they hold their own offset, ask the log "give me what's after offset X", and advance their offset on their own clock. This is the inversion that makes everything else possible.
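The pull inversion can be sketched in a dozen lines. Everything here is hypothetical scaffolding — a plain list standing in for the broker's log, a fetch function standing in for the fetch request — but the shape is the point: the log never tracks any reader; the reader owns and advances its own offset.

```python
# Hypothetical pull loop: the reader owns its offset; the log is passive.
log = [{"tx": i} for i in range(5)]   # stand-in for a broker-held log

def fetch(after_offset, max_records=2):
    """'Give me what's after offset X' — the log keeps no per-reader state."""
    start = after_offset + 1
    return log[start:start + max_records]

seen = []
offset = -1                # this reader has seen nothing yet
while True:
    batch = fetch(offset)
    if not batch:
        break              # caught up; a real reader would sleep and poll again
    for rec in batch:
        offset += 1        # the reader advances its own offset, on its own clock
        seen.append((offset, rec))

print(seen)  # [(0, {'tx': 0}), (1, {'tx': 1}), ..., (4, {'tx': 4})]
```

A second reader running the same loop with its own `offset` variable would never interact with this one — which is exactly the multi-reader property the next section names.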

The three properties that no other primitive gives you all at once

A queue (RabbitMQ-style), a database table, a pub/sub topic, a shared file — each gives you some subset of what streaming needs. The log is the only primitive that gives you all three at once.

Property 1: total order, established once at write time. When the producer writes record 4 after record 3, every reader sees record 4 after record 3. There is no race, no "did A see this update before B did" question, no consensus needed at read time. The order was decided when the broker assigned the offset, full stop. For workloads where order matters — and "order matters" is the default in any stream of state changes (transactions for one user, edits to one document, GPS pings for one rider) — this single property eliminates an entire class of bugs that distributed databases spend enormous effort trying to recover from.

Property 2: replay from any offset. Because records are immutable and offsets are stable, a reader can ask "give me everything from offset 14,239,772 onwards" and get exactly that. A new fraud-rules version that needs to be tested against the last week of traffic just starts a new reader from the offset that corresponds to a week ago. A backfill of a downstream warehouse just starts a reader from offset 0. A bug-fix re-run of an analytics job rewinds its offset by 6 hours and re-reads. None of this needs the producer to do anything special. The producer wrote the data once; it is sitting there, immutable, waiting to be re-read by whoever asks. Why this is harder than it sounds in a queue: queues, by design, delete a message once it is acked. To replay, you have to either build a "dead-letter" plus "redrive" mechanism, which is a partial log glued onto the queue, or persist the messages elsewhere — which is just admitting you needed a log.

Property 3: multiple independent consumers, with their own offsets, no coordination. The fraud team's reader, the analytics team's reader, the warehouse-loader, the search-index updater, the audit archive — all five read the same log. None of them slow down the others. None of them need to know about each other. Each holds its own offset; the log doesn't even know how many readers it has. A new team that joins six months later just creates a new offset starting wherever they want and reads. This decoupling — the producer doesn't need to know who reads, and the readers don't need to know about each other — is the heart of why logs scale to organisations with thousands of streaming consumers.

A queue has property 1 (kind of — within one queue, in one broker, with one consumer group). It does not have property 2 (messages are deleted on ack). It does not have property 3 cleanly (each "fan-out" subscriber needs its own copy). A database has property 1 (transaction order via the WAL — itself a log!) and property 2 (you can replay history from a snapshot + WAL). It does not have property 3 cleanly — readers running long queries either take locks or pile up MVCC versions against writers, and the system spends effort hiding the WAL from you. A shared file has properties 1 (file order) and 2 (re-readable) but not 3 (no offset bookkeeping; nobody coordinates retention). The log is the data structure where all three are first-class.

Building the smallest log that demonstrates the idea

A log fits in 60 lines of Python. We won't replicate it, partition it, or give it network I/O — those are the jobs of Build 7's later chapters. Right now: just the data structure, on disk, with three readers running at different offsets to prove they don't interfere.

# tiny_log.py — the data structure, stripped to its essentials.
# Append-only, ordered, replayable, multi-reader. No replication, no network.
import os, struct, json, threading, time

class TinyLog:
    """A log file: 4-byte length header, then JSON payload, repeating."""
    def __init__(self, path):
        self.path = path
        self.lock = threading.Lock()
        if not os.path.exists(path):
            open(path, "wb").close()

    def append(self, record: dict) -> int:
        # Returns the offset of the just-written record.
        payload = json.dumps(record).encode("utf-8")
        with self.lock:
            with open(self.path, "ab") as f:
                offset = f.tell()             # byte position = our offset
                f.write(struct.pack(">I", len(payload)))
                f.write(payload)
                f.flush()
                os.fsync(f.fileno())          # durable before we return
            return offset

    def read_from(self, start_offset: int):
        # Generator: yields (offset, record) for every record at or after start.
        with open(self.path, "rb") as f:
            f.seek(start_offset)
            while True:
                here = f.tell()
                header = f.read(4)
                if len(header) < 4:           # reached current end-of-log
                    return
                (length,) = struct.unpack(">I", header)
                payload = f.read(length)
                yield here, json.loads(payload.decode("utf-8"))

# --- Demo: one producer, three readers at three different offsets ---
log = TinyLog("/tmp/tiny.log")
for i in range(8):
    log.append({"tx_id": f"tx_{i:03d}", "amount_paise": 19999_00, "merchant": "razorpay-test"})

print("--- Reader A: from start (analytics backfill) ---")
for off, rec in log.read_from(0):
    print(f"  offset={off:>5}  {rec}")

print("--- Reader B: from offset 0, takes first 3 then pauses ---")
gen = log.read_from(0)
for _ in range(3):
    print(f"  {next(gen)}")
print("  (Reader B's offset is now wherever it stopped; it does not affect Reader A.)")

print("--- Reader C: tail mode, only NEW records written after now ---")
tail_start = os.path.getsize(log.path)
log.append({"tx_id": "tx_008", "amount_paise": 50000_00, "merchant": "razorpay-test"})
for off, rec in log.read_from(tail_start):
    print(f"  offset={off:>5}  {rec}  (Reader C only saw this one)")
A sample run:
--- Reader A: from start (analytics backfill) ---
  offset=    0  {'tx_id': 'tx_000', 'amount_paise': 1999900, 'merchant': 'razorpay-test'}
  offset=   72  {'tx_id': 'tx_001', 'amount_paise': 1999900, 'merchant': 'razorpay-test'}
  offset=  144  {'tx_id': 'tx_002', 'amount_paise': 1999900, 'merchant': 'razorpay-test'}
  offset=  216  {'tx_id': 'tx_003', 'amount_paise': 1999900, 'merchant': 'razorpay-test'}
  offset=  288  {'tx_id': 'tx_004', 'amount_paise': 1999900, 'merchant': 'razorpay-test'}
  offset=  360  {'tx_id': 'tx_005', 'amount_paise': 1999900, 'merchant': 'razorpay-test'}
  offset=  432  {'tx_id': 'tx_006', 'amount_paise': 1999900, 'merchant': 'razorpay-test'}
  offset=  504  {'tx_id': 'tx_007', 'amount_paise': 1999900, 'merchant': 'razorpay-test'}
--- Reader B: from offset 0, takes first 3 then pauses ---
  (0, {'tx_id': 'tx_000', 'amount_paise': 1999900, 'merchant': 'razorpay-test'})
  (72, {'tx_id': 'tx_001', 'amount_paise': 1999900, 'merchant': 'razorpay-test'})
  (144, {'tx_id': 'tx_002', 'amount_paise': 1999900, 'merchant': 'razorpay-test'})
  (Reader B's offset is now wherever it stopped; it does not affect Reader A.)
--- Reader C: tail mode, only NEW records written after now ---
  offset=  576  {'tx_id': 'tx_008', 'amount_paise': 5000000, 'merchant': 'razorpay-test'}  (Reader C only saw this one)

What this 60-line program already gives you, that a queue does not: replay from any offset, multiple independent readers, durable on-disk storage, total order, and the ability for a brand-new reader to walk the entire history. What it does not yet give you: partitioning across machines (Chapter 49), replication for fault tolerance (Chapter 51), retention with bounded storage (Chapter 52), or the producer/consumer protocols a network system needs (Chapters 48 and 50). Build 7 fills in those gaps one at a time. Today's chapter is just the why.

Why the log keeps reappearing inside other systems

Here is a list of systems that, internally, are built on a log even when they look like something else: Postgres (the WAL), MySQL (the binlog, which is also its replication stream), Redis (Streams), and the journals inside ext4, ZFS, and JFS.

The pattern is so consistent that Jay Kreps's 2013 essay The Log: What every software engineer should know about real-time data's unifying abstraction (the founding text of Kafka's design) makes a stronger claim: the log is the universal substrate of distributed data systems, and most of what we call "the database" is a particular fold over the log. The CDC chapter (Build 11) makes this point concrete — once you can read the database's internal log, you can rebuild the database's table state anywhere else.

[Figure: A log feeds many materialisations. A central horizontal log of events (e1 through e12) is the immutable, ordered source of truth. Three arrows fan out to three downstream views, each labelled as a fold over the log: a search index (Elasticsearch — insert tokens per event, delete on tombstone events; consumer offset 142,901, lag 12 events), a Postgres mirror (UPSERT rows by primary key, DELETE on tombstone events; offset 142,910, lag 3 events), and a live counter (ClickHouse — increment per-merchant counters, window-aggregate; offset 142,913, lag 0 events). Each materialisation is replayable: rebuild from offset 0 to recompute, or rewind to fix a bug. The log is the truth.]
Three downstream stores fed by the same log, each at a slightly different offset. Each is a "fold" — a deterministic function from log to derived state. Rebuilding any of them just means replaying from offset 0.

The deep idea here is that the log de-couples the source of truth from the materialisation. The log says what happened; the search index, the SQL mirror, and the live counter each say what currently is. If a materialisation is corrupted by a bug, you don't recover from a backup — you delete it and replay the log. This is why logs go hand-in-hand with idempotent consumers (Build 2): replay only works if processing the same record twice produces the same downstream state.
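A fold is just a loop over the log that updates derived state. A minimal sketch with hypothetical events (made-up keys, upserts, and a tombstone): two different folds over the same log produce two different materialisations, and rebuilding either one after a bug is just re-running the (fixed) function from offset 0.

```python
# Hypothetical sketch: two materialisations as deterministic folds over one log.
events = [
    {"key": "m_1", "op": "upsert", "amount": 100},
    {"key": "m_2", "op": "upsert", "amount": 250},
    {"key": "m_1", "op": "upsert", "amount": 175},
    {"key": "m_2", "op": "delete"},                 # tombstone
]

def fold_table(log):
    """SQL-mirror fold: UPSERT by key, DELETE on tombstone."""
    table = {}
    for e in log:
        if e["op"] == "delete":
            table.pop(e["key"], None)
        else:
            table[e["key"]] = e["amount"]
    return table

def fold_counter(log):
    """Live-counter fold: count upserts per key (tombstones don't decrement)."""
    counts = {}
    for e in log:
        if e["op"] == "upsert":
            counts[e["key"]] = counts.get(e["key"], 0) + 1
    return counts

print(fold_table(events))    # {'m_1': 175}
print(fold_counter(events))  # {'m_1': 2, 'm_2': 1}
```

The log said what happened; each fold says what currently is — and because both folds are deterministic, replaying the log reproduces their state exactly.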

What "queue" gives up that "log" keeps

Indian platform teams that grew up on RabbitMQ or AWS SQS often arrive at Kafka with the wrong mental model. They expect "a faster queue". They get a log, and the differences trip them up for the first three months. The trip-ups are worth listing because they are the same trip-ups, in the same order, every time.

Acks delete messages. In RabbitMQ, a successful ack removes the message. In Kafka, an "ack" by a consumer is just an offset commit — the message stays on disk, available for re-read, available for the next consumer group, available for replay. Engineers who worked on RabbitMQ for years instinctively reach for "redrive" or "dead letter" patterns in Kafka and then realise they don't need them; the log already handles it.

Fan-out is a separate concept. In RabbitMQ, fan-out to multiple consumers requires a fan-out exchange and one queue per consumer. In Kafka, every consumer group reads the same topic independently; fan-out is free. A new use case for the data is a new consumer group, full stop, no producer change.

Order is per-partition, not per-topic. In a single-queue system, order is whatever the queue decided. In Kafka, order is preserved within a partition but not across partitions. This is the price of horizontal scale and is the topic of Chapter 49.

Retention is by time/size, not by ack. A queue's retention is "until acked". A log's retention is "until the time-or-size policy says delete it". Set retention to 7 days and a backfill engineer can rewind a week. Set retention to 1 hour and you cannot. The choice is operational, not architectural.

The reader, not the broker, tracks progress. In Kafka the consumer commits its offset to a special internal topic (__consumer_offsets). The broker's job is just to hold the data; offset bookkeeping is the consumer's. This is why a buggy consumer can be reset by simply rewinding its committed offset — the broker doesn't care.
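A sketch of the same division of labour: the "broker" below is a plain list that never changes, and the consumer's only durable state is a committed offset in a file (a stand-in for __consumer_offsets). The file name and record shape are made up for the demo.

```python
# Hypothetical sketch: progress lives with the consumer, not the broker.
import json, os, tempfile

log = [f"event_{i}" for i in range(10)]            # stand-in for broker-held data
commit_path = os.path.join(tempfile.gettempdir(), "group_fraud.offset")
if os.path.exists(commit_path):
    os.remove(commit_path)                         # fresh start for the demo

def committed_offset():
    try:
        with open(commit_path) as f:
            return json.load(f)["next"]
    except FileNotFoundError:
        return 0                                   # new group: start at the beginning

def consume(n):
    start = committed_offset()
    batch = log[start:start + n]
    with open(commit_path, "w") as f:              # commit AFTER processing
        json.dump({"next": start + len(batch)}, f)
    return batch

print(consume(4))   # ['event_0', 'event_1', 'event_2', 'event_3']
print(consume(4))   # a "restart" resumes at 4: ['event_4', ..., 'event_7']

# Resetting a buggy consumer is just rewriting the committed offset:
with open(commit_path, "w") as f:
    json.dump({"next": 0}, f)
print(consume(4))   # replays events 0-3; the broker's data never moved
```

Throughout all three calls the `log` list is untouched — which is the whole point: rewinding a consumer is a consumer-side operation.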

Going deeper

Why offsets are byte positions, and what indexes that

In our 60-line TinyLog, the offset is the byte position. Kafka is more sophisticated: each record gets a logical offset (a 64-bit integer that increments per record), and Kafka maintains a separate sparse index file that maps logical offsets to byte positions every 4 KB or so. To serve a fetch request like "give me records starting at offset 14,239,772", Kafka binary-searches the index file (O(log N)), seeks to the resulting byte position, and scans forward a few records until it finds the exact one. This costs roughly one disk seek per fetch — not per record, per fetch. Once the seek is done, the kernel's sendfile(2) system call streams the bytes from page cache to network socket without copying through user-space memory. This is the basis of Kafka's famous throughput-per-broker numbers.

Why sendfile is the unlock: a naive "read into a Java buffer, write to the socket" pipeline copies bytes four times (disk → kernel page cache → JVM heap → socket buffer → NIC). sendfile is one DMA from the page cache directly to the NIC. On a 2026 NVMe + 25 Gbps NIC node, the difference is roughly 6× throughput.
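The sparse-index mechanics are easy to demonstrate in miniature. This is a toy in-memory version, not Kafka's file format: it indexes every 100th record instead of every 4 KB, binary-searches with bisect, then scans forward to the exact record.

```python
# Hypothetical sketch of a sparse offset index: every Nth record gets an entry
# mapping logical offset -> byte position; a fetch binary-searches the index,
# "seeks" once, then scans forward at most N-1 records.
import bisect

records = [f"record_{i}".encode() for i in range(1000)]

# Build the "segment": concatenated length-prefixed records, indexing every 100th.
index = []            # list of (logical_offset, byte_position), sorted by offset
segment = bytearray()
for i, payload in enumerate(records):
    if i % 100 == 0:
        index.append((i, len(segment)))
    segment += len(payload).to_bytes(4, "big") + payload

def fetch(logical_offset):
    # Binary search for the greatest indexed offset <= the requested one...
    pos = bisect.bisect_right([off for off, _ in index], logical_offset) - 1
    base_offset, byte_pos = index[pos]
    # ...then scan forward from that byte position to the exact record.
    for off in range(base_offset, logical_offset + 1):
        length = int.from_bytes(segment[byte_pos:byte_pos + 4], "big")
        payload = segment[byte_pos + 4:byte_pos + 4 + length]
        byte_pos += 4 + length
    return payload

print(fetch(357))  # b'record_357'
```

The trade-off is the classic sparse-index one: a denser index means less forward scanning per fetch but a bigger index to search and store.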

Sequential I/O is what makes append-only fast

A common worry from engineers seeing the log idea for the first time is: "Disk is slow. How can a system that's basically a flat file beat an in-memory queue?" The answer is that disks are slow at random I/O and very, very fast at sequential I/O. A spinning disk that streams sequential writes at well over 100 MB/s collapses to roughly 1 MB/s when you scatter 4 KB writes randomly across it; a modern NVMe SSD does 2–7 GB/s sequentially and, while vastly better at random I/O than spinning rust, still strongly rewards the sequential pattern. The log structure is only sequential writes, by design. The producer always appends at the tail. The consumer always reads forward. There is no random seek in the steady state. This is the same insight LSM trees (Build 12, on Iceberg/Delta merge-on-read) use; it is the same insight database WALs use; it is the same insight the JFS, ZFS, and ext4 journals use. The log is the data structure that maximises the use of disks' best mode and avoids their worst.
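Here is a quick, illustrative micro-benchmark of the two access patterns. The absolute numbers depend entirely on your disk, filesystem, and page cache — on a laptop SSD the gap may be small — so treat this as a probe to run on your own hardware, not a proof.

```python
# Illustrative micro-benchmark: append-only (sequential) vs scattered (random)
# 4 KB writes. The append path never seeks; the scattered path seeks every write.
import os, random, tempfile, time

CHUNK, COUNT = 4096, 256
data = os.urandom(CHUNK)
seq_path = os.path.join(tempfile.gettempdir(), "seq_demo.bin")
rnd_path = os.path.join(tempfile.gettempdir(), "rnd_demo.bin")

t0 = time.perf_counter()
with open(seq_path, "wb") as f:
    for _ in range(COUNT):
        f.write(data)               # always at the tail: sequential
    f.flush(); os.fsync(f.fileno())
seq_s = time.perf_counter() - t0

t0 = time.perf_counter()
with open(rnd_path, "wb") as f:
    f.truncate(CHUNK * COUNT)       # pre-size the file
    for pos in random.sample(range(COUNT), COUNT):
        f.seek(pos * CHUNK)         # scatter: seek before every write
        f.write(data)
    f.flush(); os.fsync(f.fileno())
rnd_s = time.perf_counter() - t0

print(f"sequential: {seq_s:.4f}s  random: {rnd_s:.4f}s")
os.remove(seq_path); os.remove(rnd_path)
```

Scale `COUNT` up (and drop the page cache) to see the pattern dominate; at this toy size the OS cache absorbs most of the difference.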

The economic consequence is that on a single Kafka-class broker (one machine, 24 cores, 256 GB RAM, NVMe drives), throughput numbers like 1 GB/s ingest, 4 GB/s egress (because of fan-out), and a million events/sec are routine. Razorpay's payments-events Kafka cluster, as of public 2024 talks, ran around 8 brokers handling 4 million events per minute with peak headroom for IPL-finals traffic at roughly 12× normal. The cost-per-event, including replication, was about ₹0.0008 per event — less than a tenth of a paisa.

Compaction: the log that "remembers state, not history"

Sometimes you don't want every event ever; you want only the latest event per key. Kafka offers a second retention mode called log compaction where, instead of deleting records older than 7 days, the broker periodically rewrites the log to keep only the most recent record for each key. The result is still an append-only log, but old superseded records get garbage-collected. This is how Kafka stores its own __consumer_offsets topic: each commit is a record, and only the most recent commit per consumer-group-and-partition matters. Chapter 52 (retention-compaction-tiered-storage) covers the mechanism in detail. The point for now: compaction does not break any of the three properties — order, replay, multi-reader. It only changes what "the log's history" means. Why this matters for state stores: a compacted topic is the canonical way to seed a new state store. The state of a stream operator's materialised view is just the latest value per key; that's exactly what a compacted log gives you. Reseeding a crashed operator is "replay the compacted topic from offset 0". This is the design of Kafka Streams' state-store backups.
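A compaction pass is small enough to sketch directly. The records below are hypothetical, shaped like consumer-offset commits as (offset, key, value): keep each key's most recent record, leave the survivors in their original order.

```python
# Hypothetical compaction pass: keep only the most recent record per key,
# preserving the surviving records' original offsets and order.
records = [
    (0, "group_a", {"committed": 10}),
    (1, "group_b", {"committed": 7}),
    (2, "group_a", {"committed": 25}),   # supersedes offset 0
    (3, "group_a", {"committed": 31}),   # supersedes offset 2
    (4, "group_b", {"committed": 9}),    # supersedes offset 1
]

def compact(log):
    latest = {}                          # key -> offset of its most recent record
    for offset, key, _ in log:
        latest[key] = offset
    # Rewrite the log keeping only each key's winner; order is untouched.
    return [(off, key, val) for off, key, val in log if latest[key] == off]

print(compact(records))
# [(3, 'group_a', {'committed': 31}), (4, 'group_b', {'committed': 9})]
```

Note what survives: still an ordered, append-only, replayable sequence — compaction changed the history's contents, not the log's three properties.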

Where the log idea breaks down

The log is not a free lunch; there are workloads it does not fit.

Where this leads next

The next chapter, /wiki/a-single-partition-log-in-python, takes the 60-line TinyLog you just read and grows it into a single-partition broker — adding a record format, segment files, an offset index, and a clean producer/consumer protocol. Read it next; it makes the abstraction concrete enough that Kafka's later complications stop being scary.

After that, /wiki/partitions-and-parallelism introduces the second axis: how do you split a single logical log across many machines so that throughput scales horizontally, while preserving per-key order? That's where the partitioning function and the per-key consistent-hash idea enter. /wiki/consumer-groups-and-offset-management handles the "how do many consumers cooperate?" question — the load-balancing fan-out pattern that turns a broadcast log into a work-sharing queue. /wiki/replication-and-isr-how-kafka-stays-up gives each partition fault tolerance: leader/follower replication and the ISR (in-sync replica) protocol that decides when a write is "committed". /wiki/retention-compaction-tiered-storage bounds the storage cost — time-based retention, log compaction, and the 2024-era trick of tiering old segments to S3 to make brokers nearly stateless. /wiki/kafka-vs-pulsar-vs-kinesis-vs-redpanda closes Build 7 with a head-to-head of the four real-world implementations, so you can pick the right one for your team.

The whole arc has one thesis: every streaming system is the same primitive (the log) plus a different choice on three knobs (partitioning, replication, retention). Once you internalise that, the vendor docs stop reading like marketing and start reading like a small set of design choices made on top of a structure you already understand.

References