Wall: stateless stream processing isn't enough
It is 21:14 IST on a Friday at a Bengaluru fintech. The new fraud-rules service, shipped two weeks ago, reads from a Kafka topic of card transactions, runs each event through a stateless filter — `if amount > 1_00_000 and merchant_country != 'IN' then flag` — and writes flagged events to a downstream queue. The dashboards say "p99 latency 38 ms, throughput 24k events/sec". The on-call channel is quiet. Then risk-ops messages: a coordinated card-testing attack just drained ₹47 lakh from the issuing-bank float over the last ninety minutes. Each individual transaction was ₹2,499 — well below the ₹1,00,000 threshold. The pattern was 12 transactions per minute on the same BIN, from 4 different merchants, but no rule looked across events. The consumer is stateless. It sees one record, decides, forgets. Every operator on the team can read the code in five minutes; the production behaviour is still useless against the actual attacker.
Build 7 gave you a message log: events flow in, a consumer reads them in order, and for the first time the pipeline reacts in seconds instead of hours. But a stateless consumer can only answer questions about one event at a time. Count, sum, top-k, joins, sessions, dedup, watermarks — anything that requires "this event in the context of the last five minutes" — needs memory that survives across events. Adding that memory naively — a Python dict, a Redis cache — breaks the moment a partition is reassigned, a worker restarts, or you need exactly-once semantics. Build 8 is the architecture for that memory: windows, state stores, watermarks, checkpoints. The wall is the moment a stateless filter stops being enough; the response is stateful stream processing.
Where stateless ends — the questions that need memory
A stateless operator is a pure function from one input event to zero or more output events: `map(x → f(x))`, `filter(x → predicate(x))`, `flatMap(x → expand(x))`. It has no recollection of any earlier event, and that property is what makes it trivial to scale, restart, and reason about. If a worker crashes, you re-read from the last committed offset, and the function produces the same outputs because nothing was carried across.
The trouble is that almost every interesting question a business asks is not a question about one event. "Count the transactions in the last five minutes" needs to remember the last five minutes. "Was this user's last login from a different country?" needs the previous login. "Average order value this hour" needs the running sum and count. "Detect a sudden 5× spike in error rate" needs a moving baseline. The moment the question depends on more than the current event, the operator has to remember something between calls — and that "something" is state.
The list of stateless-only operators is short and well-known: per-event projection, per-event filter, per-event enrichment from a static reference table, format conversion, schema validation. Everything else — windowed counts, top-k, sessions, joins, dedup, anomaly detection, watermarks, "sudden spikes", running averages — is stateful. Why this list is so short: a stateless operator can't even dedup an event, because dedup requires remembering "have I seen this id before?" — which is state. The illusion that "streaming = map and filter" comes from toy examples; in production, the share of operators that are truly stateless is small.
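To make the boundary concrete, here is a minimal sketch of the genuinely stateless operators, written as pure Python functions (the event fields and thresholds are illustrative, not from any real pipeline):

```python
# A stateless operator is a pure function of one event: no instance
# variables, no globals, no reads of earlier events.

def project(ev: dict) -> dict:
    """map: keep only the fields downstream cares about."""
    return {"user_id": ev["user_id"], "amount": ev["amount"]}

def is_foreign_high_value(ev: dict) -> bool:
    """filter: the per-event fraud rule from the opening story."""
    return ev["amount"] > 1_00_000 and ev["merchant_country"] != "IN"

def explode_items(ev: dict) -> list:
    """flatMap: one order event becomes one event per line item."""
    return [{"order_id": ev["order_id"], "item": it} for it in ev["items"]]

# Because each function depends only on its argument, replaying the same
# input after a crash reproduces the same output: restart-safe for free.
ev = {"user_id": "u_1", "amount": 2_499, "merchant_country": "US"}
assert project(ev) == {"user_id": "u_1", "amount": 2_499}
assert is_foreign_high_value(ev) is False   # ₹2,499 slips under the rule
```

Everything below this point in the chapter gives up that replay-for-free property, which is the entire difficulty.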
What goes wrong when you bolt state on naively
The first instinct, when someone asks for "transactions per user in the last five minutes", is to add a Python dict keyed by user. The consumer reads each event, increments the count, periodically purges old entries, and emits a number. This works on a developer laptop. In production, four problems show up within the first week.
Problem 1: a worker restart loses everything. The dict is in process memory. A kubectl rollout restart to ship a config change wipes every counter. After the restart, the first user's "transactions in the last five minutes" reads zero — even though they made eight transactions four minutes ago. Every operator becomes a sudden-amnesia bug. The fix is to persist the dict to disk between updates, and now you've started writing your own state store.
Problem 2: scale-out re-shards the keyspace. When you go from 2 consumers to 4, Kafka reassigns partitions. The consumer that used to own user_42's events now sees user_99's events. The state for user_42 is on the wrong worker — and the consumer that just took over partition P3 has no state for any of the users in P3. Either you migrate state across workers (which is hard and slow), or you accept that scaling out resets state (which is unacceptable for fraud). The fix is to make state co-partitioned with the input so each worker holds only the state for keys it owns; that requires a state-aware partitioner and a way to move state when partitions move.
Problem 3: late events ruin windowed aggregates. If a transaction at event-time 21:08 arrives at processing-time 21:14 — six minutes late, because of a network blip on the merchant side — does it count toward the 21:05–21:10 window or not? The naive dict closes the window when wall-clock time passes 21:10 and ignores the late event. The "last five minutes" answer for that user is now wrong by one event. Fixing this requires a separate notion of event time and a watermark that says "I think I've seen everything up to T" — neither of which a dict has.
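The two clocks give mechanically different answers, which a few lines make visible (a sketch with timestamps modelled as seconds since midnight; the numbers mirror the 21:08-arrives-at-21:14 example):

```python
WINDOW = 300  # 5-minute tumbling windows

def window_for(event_time):
    """Assign an event to its tumbling window by EVENT time, not arrival time."""
    start = event_time - (event_time % WINDOW)
    return (start, start + WINDOW)

# A transaction stamped 21:08 arrives at 21:14, six minutes late.
et_2108 = 21 * 3600 + 8 * 60    # event time
arrival = 21 * 3600 + 14 * 60   # processing time

# Event-time assignment puts it in the 21:05-21:10 window it belongs to...
assert window_for(et_2108) == (21 * 3600 + 5 * 60, 21 * 3600 + 10 * 60)
# ...while processing-time assignment shifts it to 21:10-21:15,
# silently corrupting both windows' counts.
assert window_for(arrival) == (21 * 3600 + 10 * 60, 21 * 3600 + 15 * 60)
```

The assignment function is trivial; what is hard is knowing when it is safe to emit a window's result, which is the watermark's job.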
Problem 4: exactly-once is impossible without coordination. When the consumer commits offset 1000 to Kafka but crashes before persisting the dict update for offset 1000, the next worker re-reads offset 1000 and double-counts. Going the other way — persisting the dict first and committing offset second — leads to lost counts on the same crash pattern. The fix is to make the offset commit and the state update atomic, which is the entire job of a checkpointing protocol like Flink's Chandy-Lamport or Kafka Streams' transactional log-append.
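The race is easy to reproduce in a toy simulation (pure Python; `state` and `offset` stand in for a persisted store and Kafka's committed offset, and the crash is a raised exception that fires only once):

```python
def run(n_events, crash_at, offset_first):
    """Count n_events, crash once at index crash_at, restart, finish.
    The correct answer is n_events; both orderings get it wrong."""
    state = 0       # durable state store: survives the crash
    offset = -1     # durable committed offset: survives the crash
    crashed = False

    def consume():
        nonlocal state, offset, crashed
        for i in range(n_events):
            if i <= offset:
                continue                    # broker redelivers only uncommitted
            if offset_first:
                offset = i                  # 1) commit offset
                if i == crash_at and not crashed:
                    crashed = True
                    raise RuntimeError      # crash before state update
                state += 1                  # 2) update state
            else:
                state += 1                  # 1) update state
                if i == crash_at and not crashed:
                    crashed = True
                    raise RuntimeError      # crash before offset commit
                offset = i                  # 2) commit offset

    try:
        consume()
    except RuntimeError:
        pass
    consume()       # restart: resume from the last committed offset
    return state

assert run(5, crash_at=2, offset_first=True) == 4   # event 2 lost
assert run(5, crash_at=2, offset_first=False) == 6  # event 2 double-counted
```

Neither ordering can be patched from inside the consumer; only making the two writes one transaction closes the gap.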
```python
# naive_stateful_consumer.py — the version that breaks in production
from kafka import KafkaConsumer
from collections import defaultdict, deque
import time, json

# In-memory per-user history. Looks innocent. Is not.
counts = defaultdict(deque)  # user_id -> deque[(event_time, amount)]
WINDOW_SECONDS = 300

c = KafkaConsumer(
    "txns",
    bootstrap_servers="kafka.razorpay.local:9092",
    auto_offset_reset="latest",
    enable_auto_commit=True,  # silently commits before state is durable
    value_deserializer=lambda v: json.loads(v),
)

for msg in c:
    ev = msg.value  # {user_id, event_time, amount, merchant}
    uid, et, amt = ev["user_id"], ev["event_time"], ev["amount"]
    dq = counts[uid]
    # purge expired entries based on PROCESSING TIME — first bug
    now = time.time()
    while dq and now - dq[0][0] > WINDOW_SECONDS:
        dq.popleft()
    dq.append((now, amt))  # storing now, not et — second bug
    n = len(dq)
    total = sum(a for _, a in dq)
    if n >= 10 and total > 50_000:
        print(f"ALERT user={uid} count={n} total=₹{total} in {WINDOW_SECONDS//60}m")
    # If process dies here, the in-memory deque dies with it — third bug.
    # If we add a second consumer, partition reassignment moves user keys
    # to a worker with an empty deque — fourth bug.
```

A representative run during the attack window:

```text
ALERT user=u_8821 count=12 total=₹29988 in 5m  # late 21:09 events processed at 21:14 — counted under processing-time
ALERT user=u_5530 count=11 total=₹27495 in 5m  # would have been the right alert at 21:11; we caught it 3 min late
(crash, restart, all counts reset)
ALERT user=u_8821 count=2 total=₹4998 in 5m    # post-restart amnesia: we saw 12 a moment ago, now we see 2
```
Walk through the lines that decide everything:
- `counts = defaultdict(deque)` — this is the entire state store, in process memory. Survival of a worker restart: zero.
- `enable_auto_commit=True` — Kafka periodically marks offsets as consumed even though our deque might not have processed them yet. Why this breaks exactly-once: auto-commit decouples "the broker thinks we've consumed it" from "our state has actually been updated for it". On crash, we skip events whose offsets were committed but whose state effects were never applied — so we lose them; or we re-process events whose state effects were applied but whose offsets were not committed — so we double-count. Either way, the stateful answer is wrong.
- `now = time.time()` vs `et` (the event's own timestamp) — using wall-clock time means a delayed event is "now" rather than at its real event time. The 5-minute window becomes a 5-minute-of-processing window, which is a different shape from what the business asked for.
- `dq.popleft()` — the purge runs only when a new event arrives for that user. A user with no traffic for 10 minutes still has stale entries in their deque; memory grows, and a later "spike" might double-count those old entries depending on timestamp resolution.
- No checkpoint, no snapshot, no offset coordination — the dict can drift arbitrarily far from the consumer's offset position. After a long-running session, recovering this consumer requires re-reading days of Kafka, which is feasible only when retention is generous and the operator doesn't need an answer for several hours.
The moral: the naive version is fast to write and impossible to operate. Every production failure mode you'll see — fraud false-negatives after a deploy, dashboards that "blip to zero" during scale-out, alerts that fire late and again — traces back to one of these four problems.
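For contrast, fixing just the two timestamp bugs is a small change, sketched below. This is illustration code, not a fix: the restart, re-sharding, and exactly-once problems are untouched, because no in-process edit can solve them.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 300
counts = defaultdict(deque)   # user_id -> deque[(event_time, amount)]

def on_event(uid, event_time, amount):
    """Window by EVENT time: purge against et and store et, fixing the
    first and second bugs. Assumes roughly in-order arrival per user;
    truly out-of-order input still needs a watermark."""
    dq = counts[uid]
    while dq and event_time - dq[0][0] > WINDOW_SECONDS:
        dq.popleft()
    dq.append((event_time, amount))
    return len(dq), sum(a for _, a in dq)

# The late 21:09 transaction now counts in the window it belongs to,
# but the deque is still process memory: a restart, a partition
# reassignment, or a crash between offset commit and update still loses it.
```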
What "stateful stream processing" means in practice
The architecture Build 8 builds toward isn't more code on top of the consumer — it's a runtime that owns state on your behalf. Apache Flink, Kafka Streams, Spark Structured Streaming, Materialize, ksqlDB all ship variants of the same five components:
The five components, mapped to the four problems above:
- Keyed operators mean every event for `user_42` lands on the same operator instance, deterministically. The state for `user_42` lives on that one instance and nowhere else. This solves Problem 2: when partitions move, the runtime moves the corresponding state with them.
- A state backend (RocksDB on local disk for big state, on-heap for small, or external like Redis for special cases) gives you a structured key-value API the operator writes to instead of a raw dict. The backend is responsible for persistence, eviction, compaction, snapshotting. This solves Problem 1.
- Event-time semantics with watermarks: every event carries an `event_time` field, and the runtime tracks a watermark — the largest event time up to which it believes it has seen everything. A 5-minute tumbling window doesn't close because wall-clock time advanced; it closes because the watermark crossed the window boundary. Late events that arrive within some grace period (`allowed_lateness`, typically tuned to the empirical p99 of the source's delay — around 90 seconds for Swiggy partner pings, 5 minutes for Razorpay merchant webhooks) still count. Setting `allowed_lateness = 0` keeps answers prompt but drops every late event permanently; setting it to an hour means results stay open to revision — and alerts can fire — up to an hour late. The trade-off is explicit and tunable. This solves Problem 3.
- Distributed snapshot checkpoints (Chandy-Lamport in Flink, or transactional commits in Kafka Streams) atomically capture "the state at offset X, across all operators". On restart, the runtime rolls back to the most recent checkpoint, restores state, and resumes from the matching offset. This solves Problem 4 — exactly-once for processing, not just delivery.
- Sinks with two-phase commit plug the last gap: writing the result to ClickHouse or Postgres atomically with the upstream checkpoint, so the answer the user sees never reflects "uncommitted" state. Build 9 covers this in detail.
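Of the five components, the watermark is the least familiar, so here is the simplest generator most runtimes ship, bounded out-of-orderness, sketched in plain Python (class name and constants are illustrative):

```python
# Bounded-out-of-orderness watermark: the watermark trails the largest
# event time seen by a fixed allowance for lateness.

class Watermark:
    def __init__(self, max_delay=90.0):   # tune to the source's p99 lateness
        self.max_et = float("-inf")
        self.max_delay = max_delay

    def observe(self, event_time):
        self.max_et = max(self.max_et, event_time)

    def current(self):
        """The runtime's promise: 'I believe I've seen everything before this.'"""
        return self.max_et - self.max_delay

wm = Watermark(max_delay=90.0)
for et in [100.0, 130.0, 125.0, 210.0]:   # slightly out of order
    wm.observe(et)

# A window covering [0, 120) closes only once the watermark passes 120,
# i.e. once an event at 210 has been seen, not because a wall clock ticked.
assert wm.current() == 120.0
# The 125.0 event, though it arrived after 130.0, was still on time: it
# was newer than the watermark at the moment it arrived.
```

A too-small `max_delay` closes windows before stragglers arrive; a too-large one delays every answer. That is the same trade-off as `allowed_lateness`, surfaced at the generator.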
Why a runtime is necessary instead of a clever library: the four problems are coupled. You cannot solve checkpointing without keyed state (because you need to know what to snapshot per key). You cannot solve scale-out without co-partitioning (because state has to follow keys). You cannot solve late-event correctness without watermarks (because there's no other signal that "the past is done"). A library that fixes one without the others doesn't help; you need a coordinated system that owns all of them. That's what Flink, Kafka Streams, and the rest are — coordinated runtimes for stateful operators.
Indian-context examples — five workloads that needed Build 8
The wall is concrete. These are workloads that, between roughly 2020 and 2025, forced Indian platforms off their stateless first cut and into stateful runtimes.
| Workload | Why stateless failed | Stateful primitive needed |
|---|---|---|
| Razorpay card-testing detection | The attack pattern is ≥ 10 small txns/min on the same BIN; no single event triggers a rule | Tumbling 1-min window + count-distinct per BIN |
| Swiggy delivery-partner ETA | ETA depends on the partner's last 60 sec of GPS pings, not just the latest one | Sliding window + last-value-by-key state |
| Zerodha tick deduplication | Exchange resends ticks during recovery; downstream cannot double-count fills | Per-key dedup state with a sliding 30-min retention |
| Cred reward attribution | A reward fires when a user completes A then B within 7 days; A and B are separate events | Session windows + per-user state with 7-day TTL |
| Dream11 leaderboard during a match | Score = sum of player events for users who joined this contest; match runs 4 hours | Stream-stream join (events × contest-membership) + windowed sum |
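One row of the table made concrete: a sketch of per-key dedup state with time-based retention, roughly the shape the Zerodha workload needs. Identifiers and the linear purge are illustrative; a real job keeps this in the runtime's state backend with native TTL, not a dict.

```python
class Deduper:
    def __init__(self, retention=30 * 60):   # 30-minute retention
        self.seen = {}                        # tick_id -> first-seen event_time
        self.retention = retention

    def accept(self, tick_id, event_time):
        """True if this tick is new; False if it's a resend within retention."""
        # Expire entries older than the retention horizon.
        # (Linear rebuild is fine for a sketch; a real backend uses TTL.)
        horizon = event_time - self.retention
        self.seen = {k: t for k, t in self.seen.items() if t >= horizon}
        if tick_id in self.seen:
            return False        # exchange resent this tick; drop it
        self.seen[tick_id] = event_time
        return True

d = Deduper()
assert d.accept("NSE:INFY:10001", 1000.0) is True
assert d.accept("NSE:INFY:10001", 1002.0) is False          # resend, dropped
assert d.accept("NSE:INFY:10001", 1000.0 + 31 * 60) is True  # expired, new again
```

Note that `self.seen` is exactly the state that must survive restarts and follow partition reassignment; a stateless consumer cannot express this at all.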
The pattern is that each of these is a question across events, not a question about one event, and adding a tablespoon of state to a stateless consumer doesn't get you there. The first three teams I'm aware of who attempted these on a Python+Kafka stack ended up reimplementing parts of a state backend (snapshots, RocksDB, dedup by Bloom filter) before retreating to Flink or Kafka Streams. That cost — six months of in-house infrastructure development by a team that was supposed to be writing fraud rules — is the wall's price.
A pragmatic note: not every team needs Flink. For modest scale (under 50k events/sec, state under 10 GB, latency target above 5 seconds), Kafka Streams running embedded in your Java service is meaningfully simpler. For very large state (TBs) and complex windowing, Flink is the standard. For SQL-first teams, Materialize and ksqlDB let you express stateful operators as `SELECT count(*) FROM txns WINDOW TUMBLING (5 MINUTES)` and the runtime owns the rest. The choice is part of the next several chapters.
A second pragmatic note, on cost: a Flink job processing 50k events/sec with 200 GB of state typically runs on 8 c6i.2xlarge instances (~₹1.2 lakh/month on AWS Mumbai) plus the S3 cost for checkpoints (a rounding error). The same workload as a stateless Python service would cost about ₹40,000/month — but it would not produce correct answers. The 3× cost premium is the price of having state, and most teams discover it during the first capacity review after the wall. Budgeting for stateful streaming as "a streaming line item plus a state line item" rather than a single Kafka-consumer line item is the framing that holds up.
A third pragmatic note, on team shape: stateful streaming demands a smaller set of people who go deeper than what stateless consumers required. The Kafka-consumer cottage industry that grew up around microservices in 2018–2022 — every team rolling their own consumer, in their own service — does not scale to stateful jobs. The state-evolution story, the checkpoint-recovery story, the watermark-tuning story, the resource-allocation story: these reward centralisation. Most Indian platforms past 200 engineers end up with a dedicated streaming-platform team that owns the runtime (Flink cluster, Kafka Streams operators, monitoring) while application teams write the operators. This is the same evolution Kubernetes triggered for compute: at first every team ran their own deployment, eventually a platform team ran the cluster and others wrote workloads. The wall pushes the org chart as much as the architecture diagram.
What changes for the on-call engineer
The four problems above don't just bite at design time; they reshape what 2 a.m. pages look like. A stateless consumer that's running fine has two failure modes: it's down (so events back up in Kafka), or it's slow (so consumer lag grows monotonically). Both are visible from a single Kafka-lag dashboard, and both have a simple remediation — restart, scale up, or skip ahead.
A stateful consumer adds an entirely new dimension. It can be caught up on offsets and still wrong on state: lag is zero, throughput is healthy, and the answers are still off because the state was corrupted by a bad checkpoint or a mid-deploy reassignment. Detecting this requires a second class of monitoring — state-size metrics, checkpoint-age metrics, watermark-lag metrics, "last successful checkpoint" alerts. The on-call playbook for stateful streaming is at least 3× longer than for stateless, because the failure surface is genuinely larger. Why this is unavoidable: state is data the consumer wrote, derived from the input but no longer one-to-one with it. Once a system has data of its own, you have to monitor it as such — its size, its freshness, its correctness — not just the input pipeline. The complexity isn't accidental; it's intrinsic to having memory.
For a Razorpay or PhonePe team running a fraud-rules service on Flink, the on-call runbook typically includes: how to roll back to checkpoint N-1, how to clear and rebuild RocksDB for one operator, how to pin a specific job graph version, and how to drain a job before a deploy without losing in-flight state. These are not optional. The runbook is the work — the application code is small.
A pattern that recurs in the post-mortem of every "we accidentally lost state" incident: someone changes the operator's logic in a way that breaks state-schema compatibility (renames a key field, changes a value type, drops a column from the keyed state) and the new job, on restart, can't read the old checkpoint. The choices at that moment are bad: roll back the code (lose the bug fix), accept a state reset (lose the running counters), or run a parallel re-derivation job (operationally complex). Every mature streaming team has a state-evolution playbook by year two — versioned state schemas, dual-write transitions, and explicit migration jobs — and the cost of building it is real. None of this exists for stateless consumers.
Common confusions
- "Stateless means simpler so we should always start there." Stateless means no memory, which means most production-grade questions cannot be answered. The ranking is "use stateless when the question genuinely requires no memory" — not "default to stateless to keep the code simple". A pipeline whose code is simple but whose answers are wrong is not simpler than one with a state backend.
- "Stateful = slower." A correctly-tuned Flink or Kafka Streams job processes 100k+ events/sec per core with sub-second p99. The slowness people associate with stateful processing is usually mis-tuned RocksDB, oversized checkpoints, or external state stores being used where local would suffice. Stateful done right is not slower than stateless done right.
- "Redis is a state backend." Redis can hold per-key counters, but it doesn't participate in your stream-processor's checkpoint protocol. On crash, your Flink job rolls back to checkpoint T, but Redis still has the writes from after T — your view is now inconsistent. Using Redis as a stream-processor state backend is a known foot-gun; use it for serving features that the stream-processor publishes, not as the processor's own working memory.
- "Watermarks are just a timestamp filter." A watermark is a promise by the runtime — "I believe I've seen all events up to time T". It is not a filter; it triggers downstream actions like closing a window or emitting a result. A wrong watermark closes a window early, dropping late events permanently. Watermark generation is a design choice with correctness consequences, not a knob.
- "Once we go stateful, we don't need batch anymore." Most production architectures keep a parallel batch path for backfills, schema migrations, and the occasional "recompute everything from scratch" — because reprocessing seven days of data through a streaming job is slower than running a Spark batch job over the same data. The Lambda and Kappa debates of 2014–2018 ended in pragmatism: you usually run both. The streaming path covers fresh data; batch covers history and corrections.
Going deeper
Why the "tablespoon of state" doesn't work — the impossibility result
There's a quiet impossibility lurking in trying to bolt state onto a stateless consumer: if your state lives outside the consumer's commit unit, you cannot make state updates and offset commits atomic. Either you commit offsets first (lose state on crash) or persist state first (double-count on crash). The only escape is to make the state and the offset commit one transaction — which means either (a) the state lives in Kafka itself (Kafka Streams' approach: state is a compacted topic, the offset commit IS the state commit) or (b) the runtime coordinates state and offsets with a 2PC-like protocol (Flink's Chandy-Lamport snapshot). There is no middle ground that gets you exactly-once with bolt-on state. This isn't a software-engineering preference; it's a distributed-systems result. A library that says "exactly-once stateful streaming" without addressing this either embeds Kafka Streams / Flink internally or is wrong.
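A toy simulation of the escape hatch: when the state update and the offset commit are one atomic record, a crash at any point loses both or neither, and replay from the committed record is exactly-once. A single tuple assignment stands in for the transactional write; everything else is simulated.

```python
def run_atomic(n_events, crash_at):
    """Count n_events with one simulated crash; the answer is exact."""
    durable = (0, -1)    # (state, committed offset): one atomic record
    crashed = False

    def consume():
        nonlocal durable, crashed
        state, offset = durable          # restore both from the same record
        for i in range(n_events):
            if i <= offset:
                continue
            if i == crash_at and not crashed:
                crashed = True
                raise RuntimeError       # in-flight work simply vanishes
            state, offset = state + 1, i
            durable = (state, offset)    # the single atomic commit point

    try:
        consume()
    except RuntimeError:
        pass
    consume()            # restart: state and offset can never disagree
    return durable[0]

assert run_atomic(5, crash_at=2) == 5   # exactly-once: no loss, no double count
```

The entire engineering content of Kafka Streams' transactions and Flink's snapshots is making that one-line atomic commit true across partitions, operators, and machines.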
What RocksDB actually does for stream processors
RocksDB is the default state backend for Flink and Kafka Streams not because it's fast on every workload — it's slower on small, hot states than on-heap — but because it scales to state that doesn't fit in memory. Flink's RocksDB backend stores per-key state on local disk, with the working set in RocksDB's block cache, and periodically uploads incremental SST file diffs to S3 or HDFS for the checkpoint. Why incremental: a Flink job with 500 GB of state that checkpoints fully every 60 seconds would saturate the network with 500 GB of upload per minute. Incremental snapshots upload only the SST files that changed since the last checkpoint — typically a few hundred MB — making checkpointing cheap enough to do every 30 seconds. The trick is that RocksDB's LSM file layout naturally produces a stable set of "old, unchanged" SSTs and a small set of "recent, changed" ones; the snapshot is the union with reference counting.
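Because SST files are immutable, the incremental-checkpoint bookkeeping reduces to set arithmetic over file names; a sketch (file names are invented):

```python
def incremental_upload(previous_sst, current_sst):
    """Files to upload = files in the current snapshot the last one lacked."""
    return current_sst - previous_sst

checkpoint_1 = {"sst_001", "sst_002", "sst_003"}     # full baseline upload
# A minute later: compaction merged 002+003 into 004, new writes made 005.
checkpoint_2 = {"sst_001", "sst_004", "sst_005"}

assert incremental_upload(checkpoint_1, checkpoint_2) == {"sst_004", "sst_005"}
# sst_001 (the bulk of the state) is referenced, not re-uploaded; the
# checkpoint store keeps it alive by reference counting until no retained
# checkpoint mentions it.
```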
Kafka Streams vs Flink — the architectural fork
Kafka Streams runs as a library embedded in your application; the application is the runtime. Scaling means deploying more instances of your app. State lives on the local disk of each instance, plus replicated as a "changelog topic" in Kafka. Flink runs as a cluster (job manager + task managers), and your job runs on it. Scaling means asking the cluster for more slots. State lives on task-manager disks plus checkpointed to S3. The Kafka Streams model is operationally simpler if you already deploy Java services and don't want to run another cluster; the Flink model is more flexible for very large state, complex SQL, and multi-tenancy. Most Indian teams I've seen pick Kafka Streams for "every microservice gets a small streaming job" and Flink for "the central data-platform team owns 30 critical jobs". The choice often follows the org chart, not just the technology.
The runtime landscape: Spark, Materialize, and the SQL future
Beyond Flink and Kafka Streams, two more architectural choices matter. Spark Structured Streaming is a stateful streaming runtime — windowing, watermarks, checkpoints, RocksDB-backed state — but it is fundamentally micro-batch: every 200 ms (or whatever the trigger interval is), Spark reads the last batch of events from the source, runs them through the operators, updates state, and writes the output. This is fine for analytics latency targets (1–10 second p99) but not for sub-second targets, because the trigger interval is itself a floor on latency. For a Flipkart catalogue-update pipeline that needs results in 5 seconds, Spark Structured Streaming is great; for a Razorpay fraud-decision path that needs results in 200 ms, it's not. Continuous Processing mode in Spark 3.x narrows but does not close the gap, and most teams that started on Spark for streaming end up with a separate Flink or Kafka Streams path for the hot operational workloads.
A separate line of evolution — Materialize, RisingWave, ksqlDB to a degree — implements stateful stream processing as incrementally-maintained materialised views over streams. The user writes `CREATE MATERIALIZED VIEW fraud_alerts AS SELECT user_id, count(*) FROM txns WHERE amount > 100 GROUP BY user_id, date_trunc('minute', event_time)` and the system maintains the result incrementally as new events arrive. The mental model collapses "stream processing" into "queries on streams", and the runtime handles state, watermarks, and checkpointing without exposing them as application concerns. This is where the field is heading for analytical use cases; for very low-latency operational use cases (sub-100 ms RTB, dispatch), the imperative Flink/Kafka-Streams model still dominates because it gives finer control. The convergence — declarative SQL for both batch and stream, incrementally maintained — is the topic of Build 10. The architectural diversity here (continuous vs micro-batch vs IVM) is part of what makes the streaming-runtime decision consequential and worth understanding before committing.
State versus a database — when do you stop and just use Postgres
A natural question after seeing all this complexity: "Why not just write each event to Postgres and run the windowed query there?" The answer is the throughput-latency frontier. Postgres handles ~10k writes/sec on a well-tuned single node, with ~5 ms p99 read latency on small queries. A Flink job handles 100k+ events/sec per core, with sub-millisecond state lookups for keyed state on RocksDB, because the state is colocated with the operator and there is no network hop or query-planning overhead per event. For low-throughput, low-latency questions (under 5k events/sec, query latency above 50 ms), Postgres or Redis is genuinely the simpler answer; for the throughput-latency regime where streaming wins, the state has to live where the operator runs. The "use Postgres" path also breaks down for windowing: TTL-driven cleanup of expired window state in Postgres is a nightmare at scale (tombstones, vacuum, index bloat), while RocksDB-backed Flink state has TTL as a first-class feature. The line where "switch from Postgres to a real streaming runtime" lives is workload-specific but reliably crosses around 10k–50k events/sec for most teams.
Where this leads next
Build 8 is the response to this wall. The next chapter, /wiki/stateless-operators-map-filter-the-easy-part, formally separates what's easy from what's hard, and the rest of Build 8 walks through the hard part one primitive at a time: state stores (/wiki/state-stores-why-rocksdb-is-in-every-streaming-engine), windowing (/wiki/windowing-tumbling-sliding-session), event-time vs processing-time (/wiki/event-time-vs-processing-time-the-whole-ballgame), watermarks (/wiki/watermarks-how-we-know-were-done), late data, joins, and checkpoints (/wiki/checkpointing-the-consistent-snapshot-algorithm).
The closing wall of Build 8 — /wiki/wall-exactly-once-is-a-lie-everyone-tells — looks at the gap between "exactly-once for the operator's state" and "exactly-once for the sink that the user sees", which is what Build 9 then tackles with idempotent producers and 2PC sinks.
The mental model to take forward is: stateless is the easy 5%, stateful is the hard 95%, and the architecture you adopt for the hard part is one of the most consequential decisions on a streaming team. The choice (Flink, Kafka Streams, Materialize, Spark Structured Streaming, build-your-own) is not interchangeable — each commits you to a different operational model, a different state-evolution story, and a different on-call burden.
Build 8 gives you the conceptual vocabulary to make that choice well, primitive by primitive: keyed state, windowing, watermarks, late events, joins, checkpoints. Each one is small in isolation; the system that combines them coherently is what a stream-processing runtime is. The chapters that follow build the primitives in the order a senior engineer would explain them at a whiteboard — and end at the point where the "exactly-once" promise of these systems gets honest about its limits.
References
- Streaming Systems (Akidau, Chernyak, Lax) — the canonical book on event-time, watermarks, and the Dataflow model.
- Apache Flink — Stateful Functions documentation — Flink's mental model for keyed state, snapshots, and exactly-once.
- Kafka Streams — Stateful processing — the embedded-runtime alternative to Flink, with state-as-changelog-topic.
- Chandy-Lamport: Distributed Snapshots — the 1985 algorithm Flink's checkpointing is built on.
- Materialize — Streaming SQL with strong consistency — incrementally-maintained materialised views as the SQL future for streaming.
- Confluent — How Kafka Streams handles exactly-once — the transactional log-append approach for end-to-end EOS.
- /wiki/replication-and-isr-how-kafka-stays-up — the broker-level guarantee that the consumer sits on top of; without it none of this is durable.
- /wiki/kafka-vs-pulsar-vs-kinesis-vs-redpanda — the previous chapter; the message-log substrate that stateful stream processors run on.