The stream / table duality
Open your terminal. Type tail -f /var/log/auth.log in one window, and last | head in another. The first window is a stream — events scrolling past as they happen. The second is a table — the current state, summarised. Both are computed from the same bytes. That tiny duality, scaled up, is the engine that runs every modern data platform.
A stream is the changelog of a table — every insert, update, and delete that ever happened, in order. A table is the materialized view of a stream — the result of replaying the changelog and keeping only the latest value per key. The two are the same information, read with different rules: the stream is the history, the table is the snapshot. Build a stream, you can derive any table; keep only a table, you have thrown the history away.
The duality, stated precisely
In chapter 2 you built an append-only log and noticed that the current value of a key was just the last line for that key in the file. The file itself was the history. The "table" — the dictionary view — was something you computed on demand by scanning.
That observation is the whole of the stream/table duality. State it both directions:
- Stream → Table. Take a sequence of (key, value) change events and fold them. Last write per key wins. The result is a table.
- Table → Stream. Take a table that someone is mutating and emit one event per change — (key, new_value) for inserts and updates, (key, tombstone) for deletes. The result is a stream.
The two operations compose. Fold a table's changelog and you recover the table; diff two snapshots, replay the events on the old one, and you land on the new one. (The fold forgets history, so going table → stream gives you back the compacted changelog, not the original stream — more on that below.) This is what people mean when they say Kafka and Postgres are the same thing wearing different clothes.
A reader who has done the append-only log chapter will recognise this immediately. The log was the stream; get(key) was the fold; the dictionary the fold returned was the table. We never named it then because we were focused on storage. With Kafka in the picture, the duality becomes the architectural primitive.
Building the fold from scratch
Open your editor. Type this in. Do not copy-paste — type it. Every line is a step in the fold that turns a stream into a table.
```python
# duality.py — turn a stream of change events into a table, and back
from collections import OrderedDict

def fold(stream):
    """Stream -> Table. Replay events, keep last value per key."""
    table = OrderedDict()
    for key, value in stream:
        if value is None:
            # tombstone: delete the row
            table.pop(key, None)
        else:
            table[key] = value
    return table

def changelog(old_table, new_table):
    """Table -> Stream. Emit one event per row that changed."""
    events = []
    for k, v in new_table.items():
        if old_table.get(k) != v:
            events.append((k, v))
    for k in old_table:
        if k not in new_table:
            events.append((k, None))  # tombstone
    return events
```
Two functions, twenty lines. Let me walk through every piece, because the simplicity is the point.
fold(stream). Iterates the events in order. Each event is a pair (key, value). The OrderedDict is the table — keys map to the latest value seen. If the value is None, we delete the key (the streaming community calls this a tombstone — the same tombstone you saw for deletions in chapter 5 of Build 1). Otherwise we overwrite. Last write wins — exactly the rule of the append-only log.
changelog(old_table, new_table). The inverse. Diff two snapshots and emit one event per row that changed. New row → emit (key, value). Updated row → emit (key, new_value). Deleted row → emit (key, None) (a tombstone). Replay these on top of old_table and you get new_table back.
Why the fold is the dictionary you already know: dict[k] = v in Python is a single-event fold step. Every database that processes one record at a time is doing a fold. The "table" you see at the end of the day is the result of folding all the writes that ever hit the database, in order, with last-write-wins.
Try it:
```python
stream = [
    ("riya", 500),
    ("rahul", 200),
    ("riya", 300),
    ("asha", 800),
    ("rahul", 250),
    ("riya", 450),
]
table = fold(stream)
print(table)
# OrderedDict([('riya', 450), ('rahul', 250), ('asha', 800)])

# Now run the inverse: emit events that turn an empty table into this one
print(changelog({}, table))
# [('riya', 450), ('rahul', 250), ('asha', 800)]
```
```python
# Round-trip: folding old's own events plus changelog(old, new) yields new
old = {"riya": 100, "kiran": 50}
events = changelog(old, dict(table))
new = fold(list(old.items()) + events)
print(new)
# OrderedDict([('riya', 450), ('rahul', 250), ('asha', 800)])
```
Notice kiran is gone. kiran appears in old_table but not in new_table, so changelog emitted a tombstone (kiran, None), and the fold deleted the row when it replayed that event. That is the strict round-trip property: replay old's events, then the changelog from old to new, and you land exactly on new. Tombstones are what make the inverse exact — and retaining them long enough for consumers to see the delete is exactly what Kafka's compacted topics have to guarantee.
Why tombstones are non-optional: without (key, None) events, the changelog cannot represent a delete. A consumer rebuilding the table by folding the stream would never know that a key disappeared. Every CDC system — Debezium, Maxwell, Kafka Connect — emits delete events explicitly for this reason.
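To feel the failure mode, compare a correct fold with a consumer that drops tombstones. naive_fold below is a deliberately broken sketch, not anything a real CDC system ships:

```python
def fold(stream):
    """Correct fold: a None value is a tombstone and deletes the row."""
    table = {}
    for key, value in stream:
        if value is None:
            table.pop(key, None)  # honour the delete
        else:
            table[key] = value
    return table

def naive_fold(stream):
    """Broken consumer: silently skips tombstones, so rows never die."""
    table = {}
    for key, value in stream:
        if value is not None:
            table[key] = value
    return table

stream = [("riya", 500), ("rahul", 200), ("riya", None)]  # riya's row deleted
print(fold(stream))        # {'rahul': 200} -- riya is gone
print(naive_fold(stream))  # {'riya': 500, 'rahul': 200} -- riya lives forever
```

The two tables disagree forever after the first delete, and nothing downstream of naive_fold will ever notice.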
Where the duality shows up in real systems
Once you have the fold in your head, half the streaming-systems vocabulary becomes obvious. Each of these is the same trick.
Postgres WAL → replica table
When a Postgres replica boots, it reads the write-ahead log from the primary and replays every record. The WAL is a stream — a sequence of redo records, each saying which page changed and how, in log order. The replica is a table — the heap pages, after every change has been folded in. pg_basebackup ships the table; streaming replication ships the stream from that point on. Together they reconstruct the database elsewhere.
This is also what powers logical replication and Debezium. The pgoutput plugin (or older wal2json) reads the WAL, decodes each row change to a (table, key, value) event, and publishes it to a Kafka topic. Now downstream services can fold that stream into whatever shape they need — a search index, an analytics warehouse, a cache. The Postgres source has not changed — every existing query still works against the table — but a new architectural option has appeared: anyone who wants a different view of the same data can build it by subscribing to the stream and folding.
Kafka compacted topic → key-value store
Kafka has two retention modes. The default is time/size-based — keep the last seven days, or the last 100 GB. The other is log compaction: for each key, keep only the latest record; older records for that key are eligible for garbage collection.
A compacted topic is a key-value store viewed as a stream. Producers write events; the broker's cleaner periodically rewrites older segment files, keeping only the latest value per key. A consumer that reads the topic from the beginning ends up with the table. Kafka Streams' KTable changelogs and ksqlDB's tables are backed by compacted topics. The fold is implicit in the broker's compaction job.
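A toy cleaner makes this concrete — a sketch of the idea, not Kafka's actual implementation. Scan the segment, remember the latest offset per key, keep only those records:

```python
def compact(segment):
    """Keep only the record at the latest offset for each key.

    segment is a list of (offset, key, value). Offsets are preserved,
    which is why consumer positions stay valid after compaction.
    """
    latest = {}
    for offset, key, value in segment:
        latest[key] = offset  # last offset seen per key wins
    return [r for r in segment if latest[r[1]] == r[0]]

def fold(records):
    """Rebuild the table by replaying records in order."""
    table = {}
    for _, key, value in records:
        table[key] = value
    return table

segment = [(0, "riya", 500), (1, "rahul", 200), (2, "riya", 300), (3, "riya", 450)]
compacted = compact(segment)
print(compacted)                         # [(1, 'rahul', 200), (3, 'riya', 450)]
print(fold(segment) == fold(compacted))  # True: same table either way
```

The real cleaner also keeps tombstones around for a grace period so lagging consumers learn about deletes; the property this sketch shows is the important one — folding the compacted log and folding the full log produce the same table.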
Materialized views
In Postgres, CREATE MATERIALIZED VIEW runs a query and caches the result as a table. REFRESH MATERIALIZED VIEW re-runs the query. In a streaming system — Materialize, ksqlDB, RisingWave — a materialized view is continuously folded. Every time a new event arrives on the input stream, the view is incrementally updated. The view is the table; the input is the stream; the SQL between them is the fold.
The mental model: in batch SQL, the table is primary and the stream is a "change feed" you optionally subscribe to. In streaming SQL, the stream is primary and the table is one possible view of it. Same data, opposite point of focus.
Concretely, when you write CREATE MATERIALIZED VIEW v AS SELECT user_id, COUNT(*) FROM clicks GROUP BY user_id in Materialize, the engine compiles the SQL into a dataflow graph of stateful operators — a GROUP BY becomes a per-key counter, a JOIN becomes a state-store lookup, a WHERE becomes a filter. Every input event walks the graph, updating the operators and emitting deltas to the output. The output is the materialized table; the input is the stream; nothing more. The fold has been spread out across many operators, but it is still the same operation: replay events in order, derive a table.
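A minimal sketch of one such stateful operator — hypothetical, not Materialize's actual API. A per-key counter consumes one click event at a time and emits the delta, the changed row, downstream:

```python
class CountOperator:
    """Incremental GROUP BY user_id, COUNT(*): one fold step per event."""

    def __init__(self):
        self.counts = {}  # the operator's state: the materialized table

    def on_event(self, user_id):
        # Update state and emit the changed row as a delta.
        self.counts[user_id] = self.counts.get(user_id, 0) + 1
        return (user_id, self.counts[user_id])

op = CountOperator()
for user in ["riya", "rahul", "riya", "riya"]:
    print(op.on_event(user))
# ('riya', 1)
# ('rahul', 1)
# ('riya', 2)
# ('riya', 3)
print(op.counts)  # {'riya': 3, 'rahul': 1} -- the materialized view
```

The deltas it emits are themselves a changelog stream — which is how one materialized view can feed another.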
Event sourcing
In an event-sourced application — Razorpay's ledger, Zerodha's order book, an Uber-style driver-state service — the database of record is a sequence of events: OrderPlaced, PaymentCaptured, Refunded, Shipped. The "current state" of an order is the fold of all events with that order id. Querying by id replays the events. Querying by some other dimension — "all orders shipped today" — runs a fold over the whole stream into a different table shape.
This is the philosophical commitment of event sourcing: the events are the truth; the table is one of many possible projections. Throw the events away and you have lost information. Throw the table away and you can rebuild it.
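A sketch of that fold, using the event names from the text — the transition rules here are illustrative, not any real payment system's:

```python
def order_state(events):
    """Fold one order's events into its current state."""
    state = {"status": "new", "refunded": False}
    for event in events:
        if event == "OrderPlaced":
            state["status"] = "placed"
        elif event == "PaymentCaptured":
            state["status"] = "paid"
        elif event == "Shipped":
            state["status"] = "shipped"
        elif event == "Refunded":
            state["refunded"] = True
    return state

print(order_state(["OrderPlaced", "PaymentCaptured", "Shipped"]))
# {'status': 'shipped', 'refunded': False}

def shipped_orders(event_log):
    """A different projection over the whole stream of (order_id, event)."""
    return {oid for oid, ev in event_log if ev == "Shipped"}

log = [("o1", "OrderPlaced"), ("o1", "Shipped"), ("o2", "OrderPlaced")]
print(shipped_orders(log))  # {'o1'}
```

Two folds, two tables, one stream — the "all orders shipped today" question is just a second projection over the same events.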
CQRS (Command Query Responsibility Segregation) is event sourcing's structural sibling. Writes go to a command side that produces events and folds them into one canonical state; reads go to one of several read-side projections — each a different table tuned for the queries it serves. The shared substrate is the event stream. CQRS is rarely the right answer for a small product, but at the scale of a Swiggy or a Flipkart it becomes hard to avoid, because the read shape the analytics team needs and the read shape the order-tracking page needs are not the same table — and forcing them to be one is what makes the database melt under load.
This picture is the architectural payoff. If you commit to a shared log, you can add a new table — a new search index, a new ML feature store, a new cache — by writing one consumer that folds the stream. You do not touch the source database. You do not break any other consumer. You read from the offset of your choice and catch up. Jay Kreps's 2013 essay, The Log: What every software engineer should know about real-time data's unifying abstraction, made this canon.
A worked example: a wallet at Razorpay
Make it concrete. A wallet service at Razorpay holds, for every user, a current balance in paise. Riya tops up ₹500. Rahul gets paid ₹200. Riya pays ₹200 to Asha. The "table" view, after these three operations, looks like this:
| user | balance (paise) |
|---|---|
| riya | 30,000 |
| rahul | 20,000 |
| asha | 20,000 |
The "stream" view of the same situation is the changelog of every balance change, in commit order:
```
t=1  riya.balance  = 50000
t=2  rahul.balance = 20000
t=3  riya.balance  = 30000   # after paying Asha
t=4  asha.balance  = 20000
```
Both views answer different questions. The table answers how much does Riya have right now? in O(1). The stream answers show me every change that ever happened to Riya's balance, in order — which is the only honest answer to a customer who calls support disputing a charge. A bank that throws away the stream and keeps only the table is committing financial malpractice. A bank that throws away the table and keeps only the stream is committing performance malpractice. Real systems keep both, and the stream is the source of truth.
Why the stream has to be the source of truth: every customer-support escalation, every regulatory audit, every internal fraud investigation needs the history of what happened — not just the current balance. RBI's KYC-AML guidelines, SEBI's record-retention rules, and Razorpay's own internal disputes process all assume the existence of an immutable changelog. Lose the stream and these obligations cannot be met. The table is a performance optimisation; the stream is the legal record.
The shape of the production system follows from this. Razorpay's wallet writes go to a Postgres table for O(log n) point lookups (balance check) and to a Kafka topic for the changelog (audit, downstream consumers, replay-on-bug-fix). Both are written in the same transaction, using either two-phase commit or the transactional outbox pattern (write to a Postgres outbox table inside the transaction, then a separate worker copies the outbox rows to Kafka). The duality is not a free architectural property — it is a discipline you must enforce on every write path. Bypass it once with a direct UPDATE wallet SET balance = ? that does not produce an event, and the audit trail is broken forever.
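A sketch of the outbox half of that discipline, using sqlite3 as a stand-in for Postgres — table and column names here are illustrative, not Razorpay's schema. The balance update and the changelog row commit atomically; a separate worker would later ship unsent outbox rows to Kafka:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE wallet (user_id TEXT PRIMARY KEY, balance INTEGER);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         user_id TEXT, new_balance INTEGER,
                         shipped INTEGER DEFAULT 0);
    INSERT INTO wallet VALUES ('riya', 50000);
""")

def debit(conn, user_id, amount_paise):
    # One transaction: the table update AND the changelog event, or neither.
    with conn:  # commits on success, rolls back on exception
        conn.execute(
            "UPDATE wallet SET balance = balance - ? WHERE user_id = ?",
            (amount_paise, user_id))
        row = conn.execute(
            "SELECT balance FROM wallet WHERE user_id = ?",
            (user_id,)).fetchone()
        conn.execute(
            "INSERT INTO outbox (user_id, new_balance) VALUES (?, ?)",
            (user_id, row[0]))

debit(conn, "riya", 20000)
print(conn.execute("SELECT balance FROM wallet WHERE user_id='riya'").fetchone())
# (30000,)
print(conn.execute("SELECT user_id, new_balance FROM outbox").fetchall())
# [('riya', 30000)]
```

Because the outbox row rides the same transaction as the balance update, there is no window in which the table changed but the stream did not.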
The reconciliation that the duality enables
A junior engineer at a fintech writes a query against the wallet table and sees riya.balance = 30000. A senior engineer is sceptical. She runs:
```python
events = wallet_stream.read(key="riya")  # all changes for riya (pseudocode)
print(sum_signed(events))                # fold the events as deltas
# 30000
```
The two numbers must match. If they do not, the table is corrupted — somebody did an in-place write that did not go through the stream. The stream is authoritative; the table is a cache; the duality is what makes the audit possible at all. Without it, you cannot prove the database is honest.
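The audit in miniature — plain Python lists standing in for the stream-reading API, which is pseudocode above:

```python
# Reconciliation: the table is honest only if it equals the fold of the stream.
table = {"riya": 30000}                        # what the wallet table says

deltas = [("riya", +50000), ("riya", -20000)]  # every signed change, in order

def sum_signed(events, key):
    """Fold the per-key deltas: the balance implied by the stream."""
    return sum(amount for k, amount in events if k == key)

folded = sum_signed(deltas, "riya")
print(folded)  # 30000
assert folded == table["riya"], "table corrupted: a write bypassed the stream"
```

Run the same check over every key on a schedule and you have a reconciliation job — the standard nightly control at any payments company.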
A second worked example: rebuilding a leaderboard
To feel the duality in code, build a small leaderboard. Imagine a BookMyShow-style movie booking system: every ticket purchase is a (movie_id, seats_sold) event, and the homepage shows the top five movies by total seats sold today. The events are a stream; the leaderboard is a table; the fold is SUM.
```python
# leaderboard.py — fold a stream of bookings into a top-N table
from collections import defaultdict
import heapq

def build_leaderboard(events, n=5):
    totals = defaultdict(int)
    for movie_id, seats in events:
        totals[movie_id] += seats  # the fold step
    return heapq.nlargest(n, totals.items(), key=lambda kv: kv[1])

events = [
    ("pathaan", 120), ("jawan", 200), ("rrr", 80),
    ("pathaan", 60), ("animal", 300), ("jawan", 150),
    ("rrr", 40), ("pathaan", 90),
]
print(build_leaderboard(events, n=3))
# [('jawan', 350), ('animal', 300), ('pathaan', 270)]
```
Read the code carefully. The fold is one line — totals[movie_id] += seats — applied event-by-event. The leaderboard is a snapshot of totals at the moment you stop iterating. If you stop after the first four events, you get a different leaderboard. The same stream produces different tables depending on where you stop folding; choosing that boundary explicitly — especially when events arrive late — is what windowing and watermarks are for, and chapter 177 covers them. For now, the takeaway is that the table is a function of two inputs: the stream, and a stop point. Most people forget the second one.
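The stop point, made explicit — the same booking events folded up to two different boundaries (totals_at is a name invented here for the demonstration):

```python
from collections import defaultdict

def totals_at(events, stop):
    """Fold only the first `stop` events: the table as of that point."""
    totals = defaultdict(int)
    for movie_id, seats in events[:stop]:
        totals[movie_id] += seats
    return dict(totals)

events = [
    ("pathaan", 120), ("jawan", 200), ("rrr", 80),
    ("pathaan", 60), ("animal", 300), ("jawan", 150),
    ("rrr", 40), ("pathaan", 90),
]
print(totals_at(events, 4))
# {'pathaan': 180, 'jawan': 200, 'rrr': 80}
print(totals_at(events, len(events)))
# {'pathaan': 270, 'jawan': 350, 'rrr': 120, 'animal': 300}
```

Same stream, same fold, two stop points, two tables.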
Common confusions
- "A stream is a queue." Queues delete on consume; streams keep history. A Kafka topic from which you can replay last week's events is not a queue — it is a log. RabbitMQ, ActiveMQ, SQS are queues. Kafka, Pulsar, Kinesis are streams. (See chapter 174.)
- "A table is just a snapshot of a stream." Yes, but a snapshot at what time? A streaming system distinguishes a GlobalKTable (snapshot as of latest), a windowed table (snapshot over the last hour), and a time-travel table (snapshot as of 10am yesterday). All three are folds of the same stream, with different stop conditions.
- "If I have the table I can throw away the stream." You cannot — once you do, the audit trail is gone, downstream consumers can never bootstrap, and you cannot recompute a corrupted index. Treat the stream as the source of truth and the table as a cache. Postgres, MongoDB, and every event-sourced system follow this rule even if it is invisible from the API.
- "Stream/table duality is a Kafka invention." It is not. Postgres replicas have always rebuilt themselves from the WAL; LSM-tree compaction has always folded a stream of writes into the latest value per key (see chapter 41 on compaction). Jay Kreps named the abstraction in 2013, but the mechanism predates him by decades.
- "Tombstones are a Kafka quirk." They are universal. Every system that derives a table from a changelog needs a way to say "this key is gone." LSM-trees write tombstones in their memtables; Cassandra writes explicit tombstone markers; Debezium emits delete events with a null value. No tombstone, no delete.
- "The fold is always last-write-wins." Last-write-wins is the simplest fold. Other folds are valid: SUM (running balance), MAX (high-water mark), HLL (approximate distinct count), LIST (full history). Streaming SQL exposes all of these as aggregations. The duality is not "stream becomes a key-value table" — it is "stream becomes whatever shape your fold function produces."
Going deeper
Compaction is the fold materialised on disk
A Kafka compacted topic implements the fold by rewriting segment files. The cleaner thread reads old segments, builds an in-memory map of key → latest_offset, and writes a new segment containing only the records at those offsets. The superseded records are gone once the new segment is swapped in; the surviving records sit at the same offsets they did before. Crucially, the compacted topic is still a stream — consumers reading it sequentially see at most one record per key in the cleaned portion, and a downstream KTable can be built from it without re-fetching from a database. This is how Kafka Streams ships state stores: the state is materialised in a local RocksDB, but the backup is a compacted changelog topic, and recovery is a fold over that topic. See the Kafka docs on log compaction and the log cleaner (LogCleaner.scala) in the Kafka source.
Kappa architecture vs Lambda
The duality has architectural consequences. The Lambda architecture (Nathan Marz, 2011) ran two parallel pipelines — a batch path that periodically recomputed the table from scratch, and a streaming path that updated it incrementally — and merged the results. It existed because streaming engines were not yet trustworthy enough to be the only path. Jay Kreps's 2014 reply was the Kappa architecture: keep only the streaming path, and recompute the table by replaying the stream from offset zero whenever the logic or schema changes. The duality is what makes Kappa work — if the stream is durable and replayable, every table is regenerable, and the batch path is redundant. Today Kappa is the default; Lambda survives only in legacy systems and in places where the batch engine has features the streaming engine lacks.
Table partitioning and the co-partitioning rule
If you fold a stream of events partitioned by user id, your table is partitioned by user id too — naturally, because each partition contains all the events for its keys. This is why Kafka Streams requires that two streams being joined on the same key share the same partition count: if orders is partitioned by user_id into 12 partitions, and payments is partitioned by user_id into 12 partitions, then partition 7 of one matches partition 7 of the other, and the join can run locally without shuffling. This is called the co-partitioning rule, and it is a direct consequence of the duality — the partitioning of the stream becomes the partitioning of the table.
Why CDC needs strict ordering
If you build a table from a stream by folding last-write-wins, the order of the events matters. (riya, 300) followed by (riya, 450) ends with riya = 450. The other order ends with riya = 300. Get the order wrong and your table is wrong.
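Executed, the two orderings give two different tables (the last-write-wins fold is repeated here so the snippet stands alone):

```python
def fold(stream):
    """Last-write-wins fold: the final value per key is whatever came last."""
    table = {}
    for key, value in stream:
        table[key] = value
    return table

print(fold([("riya", 300), ("riya", 450)]))  # {'riya': 450}
print(fold([("riya", 450), ("riya", 300)]))  # {'riya': 300} -- wrong table
```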
This is why CDC connectors like Debezium preserve strict per-key ordering: events are keyed by the row's primary key, so all changes to the same row land in the same Kafka partition, and a partition is by definition consumed in order. Across partitions, no ordering is guaranteed; within a partition, it is. The duality only works if the stream is per-key totally ordered.
The corollary is that you cannot trivially "rescale" a stream by changing its partition count without doing a careful rekeying step. Doubling the partition count of a topic with riya already keyed to partition 3 may now route new riya events to partition 7, and the consumer that built the table from partition 3 will silently stop seeing updates. Production migrations either drain the topic completely before switching or use a side-by-side approach where both old and new topics receive writes for a transition window.
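The rerouting is easy to see numerically. Kafka's default partitioner hashes the key (with murmur2) and takes it modulo the partition count; CRC32 stands in here — any deterministic hash shows the same effect:

```python
import zlib

def partition(key: str, num_partitions: int) -> int:
    # Stand-in for Kafka's key partitioner (Kafka uses murmur2;
    # the modulo behaviour is what matters here).
    return zlib.crc32(key.encode()) % num_partitions

keys = [f"user{i}" for i in range(1000)]
moved = sum(1 for k in keys if partition(k, 12) != partition(k, 24))
print(f"{moved} of 1000 keys change partitions when going 12 -> 24")
```

Roughly half the keys land somewhere new, and every consumer that built its table from a key's old partition silently stops seeing that key's updates.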
What the duality does not promise
The duality says you can recompute the table from the stream. It does not say the recomputation is cheap. A stream of a billion events rebuilds a table of a million rows by reading the billion. That is the bargain — the table is fast to query, the stream is slow to fold, and you keep both because each one buys what the other cannot. Snapshot + tail strategies (Debezium's snapshot phase, Postgres's pg_basebackup + WAL) exist exactly to skip the long fold: snapshot the table once, then incrementally apply the stream from that point. The duality holds; it just gets help from a periodic copy of the table to keep recovery time bounded.
Folds beyond last-write-wins
Last-write-wins is the simplest fold; it is what makes a stream into a key-value table. Streaming SQL systems generalise this. In Kafka Streams, groupByKey().aggregate(...) takes an initializer and an aggregator function, so you can fold a stream of order events into a running revenue table — (key, running_total) — by adding each new amount to whatever was there before. Materialize and ksqlDB do the same with SQL: SELECT user_id, SUM(amount) FROM payments GROUP BY user_id is a fold whose accumulator is +. CRDTs (chapter 184) take this further — they design folds that converge regardless of order, so a stream that arrives in any order produces the same table. The unifying claim of the duality survives every choice of fold function; what changes is whether the table at the end is correct under reordering, retries, or duplicates.
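A sketch of the generalised fold — one driver, three accumulator functions. fold_with is a name invented here for the demonstration, not a Kafka Streams API:

```python
# The same stream, three different folds, three different tables.
payments = [("riya", 500), ("rahul", 200), ("riya", 300), ("riya", 450)]

def fold_with(stream, step, init):
    """Generic per-key fold: step(accumulator, new_value) per event."""
    table = {}
    for key, value in stream:
        table[key] = step(table.get(key, init), value)
    return table

lww  = fold_with(payments, lambda old, new: new, None)         # last write wins
sums = fold_with(payments, lambda old, new: old + new, 0)      # running total
hi   = fold_with(payments, lambda old, new: max(old, new), 0)  # high-water mark

print(lww)   # {'riya': 450, 'rahul': 200}
print(sums)  # {'riya': 1250, 'rahul': 200}
print(hi)    # {'riya': 500, 'rahul': 200}
```

Note that sums and hi are order-insensitive (addition and max are commutative) while lww is not — which is exactly the reordering concern the paragraph above raises.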
The duality as a debugging tool
Once you internalise the duality, an entire class of bugs becomes easy to spot. The cache disagrees with the database — the cache is a derived table, the database is a derived table, and somebody updated one without producing an event the other could fold. The search index is missing yesterday's products — same diagnosis. The replica fell behind — the consumer is lagging on the stream; check its offset. A migration silently lost rows — the migration wrote directly to the new table without producing changelog events, so anything subscribed to the stream missed them. The first question to ask in any "two systems disagree" incident is: what is the shared stream, and is each side actually folding from it? Nine times out of ten, somebody bypassed the stream.
Where this leads next
- Kafka as a distributed log — chapter 174: the substrate this article assumed. Partitions, offsets, retention, and replication.
- Exactly-once semantics: how it actually works — chapter 176: when the fold is non-idempotent (a SUM, not last-write-wins), duplicate events corrupt the table. This is the chapter that explains how producers and consumers cooperate to avoid that.
- Windowing, watermarks, and event time — chapter 177: folds with a time bound. The table of clicks per user in the last five minutes is a windowed fold.
- Change streams: CDC built in — the database-side of the duality: how Postgres, MongoDB, and friends emit their internal changelog so other systems can fold it.
- The append-only log: simplest store — chapter 2: the original duality, unnamed, in thirty lines of Python.
References
- Jay Kreps, The Log: What every software engineer should know about real-time data's unifying abstraction (LinkedIn Engineering, 2013) — the essay that named the duality. engineering.linkedin.com.
- Martin Kleppmann, Designing Data-Intensive Applications (O'Reilly, 2017), Ch. 11 Stream Processing — the duality, derived patiently with worked examples. dataintensive.net.
- Apache Kafka, Log Compaction — the design doc for compacted topics, the fold made physical. kafka.apache.org/documentation/#compaction.
- Confluent, Streams and Tables in Apache Kafka: A Primer (2020) — the canonical Kafka Streams treatment, covering KStream, KTable, GlobalKTable. confluent.io/blog/kafka-streams-tables-part-1-event-streaming.
- Jay Kreps, Questioning the Lambda Architecture (O'Reilly Radar, 2014) — the Kappa essay. The duality, applied to system architecture. oreilly.com/radar/questioning-the-lambda-architecture.
- Debezium documentation, How Debezium connectors work — a concrete CDC implementation that exposes Postgres / MySQL WALs as Kafka streams. debezium.io/documentation.
- padho-wiki, The append-only log: simplest store — the duality, unnamed, in thirty lines.
- padho-wiki, Compaction: reclaiming deleted and overwritten keys — the same fold, applied to LSM-tree storage.