Kappa architecture: stream-only and reprocess from the log
A Razorpay analytics engineer in 2018 finds a bug. The "fraud risk score" metric on the merchant dashboard has been wrong for six weeks because a feature flag was inverted. In Lambda, the fix would mean two PRs (Storm and Hadoop), two deploys, and a 14-hour Hadoop backfill of all the historical Parquet partitions. In Kappa, the fix is one PR to the Flink job, one deploy, and pressing "start from offset 0" on a parallel reprocessing topology. The new topology streams through six weeks of Kafka data in 90 minutes and writes to a shadow output table; the team verifies, then atomically swaps the dashboard to point at the new table. The old job and the old table are garbage-collected a week later. This is the operational picture that Kappa was designed to make routine.
Kappa architecture is Lambda with the batch layer deleted. You keep one streaming engine and one log with long retention, and you reprocess history by replaying the log from offset 0 with a new version of the same job. The architectural simplification — one codebase, one runtime, one storage tier — is the whole point. Kappa works only if your log retention covers your reprocessing horizon and your stream engine can do exactly-once stateful computation on bounded replays.
The premise: replay is the only operation you need
Jay Kreps, then at LinkedIn, posted Questioning the Lambda Architecture in July 2014. The argument was sharp: the speed layer's job is to compute a result from a stream of events; the batch layer's job is to compute the same result from a different stream of events (the historical replay of the same log). If your streaming engine can correctly process arbitrary replays of the log, the batch layer is redundant — it is just the speed layer started with an older offset.
Kreps gave the recipe in three lines:
- Use Kafka (or any durable, ordered, replayable log) as the system of record. Retention measured in months, not hours.
- Use one stream processor (Storm Trident at the time, eventually Flink, Spark Structured Streaming, or Kafka Streams) for both live and historical processing.
- To reprocess, start a parallel job with a new version of the code, point it at offset 0 (or whatever historical offset you need), let it catch up to live, then atomically switch the serving layer to read from the new output.
The architecture diagram is uninteresting on purpose. It is one log, one job, one output. The interesting properties live in the operational story.
The first time you draw this diagram, it looks anticlimactic. There is no reconciliation logic, no merge query, no two-codebase split. That is the point. The complexity has moved from the architecture into one place — the stream processor's correctness guarantees — and that is exactly the place where the last decade of streaming engine work has paid off.
Why "structurally identical" is load-bearing: in Lambda, the batch and speed layers were two different programs that had to produce the same answer. In Kappa, the live job and the reprocess job are the same program with different start offsets. There is no semantic gap to reconcile because there is no second program. Any bug in the live job is a bug in the reprocess job, and any fix to one is a fix to both.
What it looks like in code: one job, two start offsets
The discipline of writing a Kappa job is: write a Flink (or Beam, or Kafka Streams) topology that is replay-safe. Replay-safe means the output depends only on the input log and the code version, not on wall-clock time, external API calls, or any state that wasn't itself derived from the log. Here is the same "distinct merchants today" metric from the Lambda chapter, written once for Kappa:
```python
# kappa_distinct_merchants.py — single Flink Python job, run as live or reprocess
import argparse
import datetime

from pyflink.common import Duration, Types, WatermarkStrategy
from pyflink.common.watermark_strategy import TimestampAssigner
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors.kafka import KafkaOffsetsInitializer, KafkaSource
from pyflink.datastream.formats.json import JsonRowDeserializationSchema

parser = argparse.ArgumentParser()
parser.add_argument("--mode", choices=["live", "reprocess"], required=True)
parser.add_argument("--output-table", required=True)  # e.g. distinct_merchants_v4
args = parser.parse_args()

env = StreamExecutionEnvironment.get_execution_environment()
env.enable_checkpointing(60_000)  # 60s checkpoints, exactly-once
env.set_parallelism(8)

# The only operational difference between the two modes: where to start reading.
start_offset = (KafkaOffsetsInitializer.earliest()
                if args.mode == "reprocess"
                else KafkaOffsetsInitializer.committed_offsets())

source = (KafkaSource.builder()
          .set_bootstrap_servers("kafka.razorpay-internal:9092")
          .set_topics("payments.captured.v1")
          .set_group_id(f"distinct-merchants-{args.output_table}")
          .set_starting_offsets(start_offset)
          .set_value_only_deserializer(
              JsonRowDeserializationSchema.builder()
              .type_info(Types.ROW_NAMED(
                  ["status", "amount", "merchant_id", "event_time"],
                  [Types.STRING(), Types.LONG(), Types.STRING(), Types.LONG()]))
              .build())
          .build())


class EventTimeAssigner(TimestampAssigner):
    """Event time comes from the payload, never from the wall clock."""
    def extract_timestamp(self, value, record_timestamp):
        return value["event_time"]


events = env.from_source(
    source,
    WatermarkStrategy.for_bounded_out_of_orderness(Duration.of_seconds(30))
                     .with_timestamp_assigner(EventTimeAssigner()),
    "payments-source")

result = (events
          .filter(lambda e: e["status"] == "captured" and e["amount"] > 0)
          .key_by(lambda e: datetime.datetime.fromtimestamp(
              e["event_time"] / 1000, tz=datetime.timezone.utc).date().isoformat())
          # Stateful distinct count: per-day set of merchant_ids,
          # a KeyedProcessFunction defined elsewhere in the repo.
          .process(DistinctCountByDay()))

# Transactional, exactly-once Iceberg sink (wrapper defined elsewhere).
result.add_sink(IcebergSink(args.output_table))
env.execute(f"distinct-merchants-{args.mode}")
```
```
# Live mode (running continuously since 2026-03-01):
$ python kappa_distinct_merchants.py --mode live --output-table distinct_merchants_v3
[2026-04-25 11:32:18] processed 8,450,221 events, watermark=2026-04-25T11:31:48Z
[2026-04-25 11:32:18] state size: 4.2 MB (in RocksDB), checkpoint completed in 1.8s

# Reprocessing mode (kicked off after a logic fix):
$ python kappa_distinct_merchants.py --mode reprocess --output-table distinct_merchants_v4
[2026-04-25 11:35:00] processed 0 events, starting from offset 0 of payments.captured.v1
[2026-04-25 11:42:14] processed 12,400,000 events (week of 2026-03-01), 28k events/sec
[2026-04-25 12:18:09] processed 187,500,000 events, watermark=2026-04-22T00:00:00Z
[2026-04-25 12:51:33] caught up: watermark within 30s of HEAD
[2026-04-25 12:51:33] reprocess complete; output_v4 has full corrected history
```
Walk through the load-bearing parts:
- One source of truth: `payments.captured.v1`. Kafka is the system of record; the same topic feeds both modes. Why the topic name is versioned: schema evolution is a real problem. When the upstream payments service changes its event schema, you cut over to `payments.captured.v2` rather than mutating `v1` in place, and old reprocess jobs can still read `v1` unchanged. This is the Kappa equivalent of "immutable raw data" — the topic is forever, the schema is versioned.
- `--mode` is the only operational difference between live and reprocess. Same code path, same operators, same state primitives. The `KafkaOffsetsInitializer.earliest()` vs `committed_offsets()` switch is the only line that changes. Everything downstream — the filter, the keying, the stateful distinct-count operator, the sink — runs identically.
- `enable_checkpointing(60_000)` plus a transactional Iceberg sink gives you end-to-end exactly-once. This is the property Lambda's speed layer could not reliably provide in 2011 and the entire reason a separate batch layer was needed. With Flink's checkpoint barriers (covered in /wiki/checkpointing-the-consistent-snapshot-algorithm) and a 2PC sink (/wiki/flinks-two-phase-commit-sink), the streaming engine itself produces correct output, no merge-with-batch required.
- `--output-table distinct_merchants_v4` writes to a shadow table. The reprocess job does not overwrite the live job's output; it writes a parallel Iceberg table. The serving layer reads `distinct_merchants_v3` (live) until verification completes, then is reconfigured to read `distinct_merchants_v4`. Why the swap is atomic at the catalog level: Iceberg lets you alter a view definition or a table pointer in a single metadata transaction. The application doesn't need to coordinate; it queries `metric.distinct_merchants_today` and the catalog resolves to whichever underlying table is current. This makes the cutover invisible to dashboards.
- Watermarks (30-second bounded out-of-orderness) are the same in both modes. Reprocess sees historical event timestamps in chronological-ish order at fast wall-clock rate; live sees current timestamps in chronological-ish order at slow wall-clock rate. The watermark logic doesn't care about wall clock — it advances based on event timestamps in the stream.
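The catalog-level swap is easiest to picture as a single pointer update. A minimal in-memory sketch — the `Catalog` class and its methods are illustrative, not Iceberg's API; in production the same role is played by an Iceberg view replacement or table-pointer commit:

```python
import threading

class Catalog:
    """Toy metadata catalog: logical metric name -> current physical table.

    Illustrative only: stands in for an Iceberg/Trino catalog where a
    view replacement (or table-pointer commit) performs this atomically.
    """
    def __init__(self):
        self._lock = threading.Lock()
        self._views = {}

    def bind(self, logical_name, physical_table):
        # One metadata transaction: readers observe either the old or the
        # new binding, never a half-updated state.
        with self._lock:
            self._views[logical_name] = physical_table

    def resolve(self, logical_name):
        with self._lock:
            return self._views[logical_name]

catalog = Catalog()
catalog.bind("metric.distinct_merchants_today", "distinct_merchants_v3")

# Dashboards query the logical name and never notice the cutover.
assert catalog.resolve("metric.distinct_merchants_today") == "distinct_merchants_v3"

# After the v4 reprocess is verified, the swap is one bind call.
catalog.bind("metric.distinct_merchants_today", "distinct_merchants_v4")
assert catalog.resolve("metric.distinct_merchants_today") == "distinct_merchants_v4"
```

The design point is that the application only ever holds the logical name; versioned physical tables come and go behind it.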
A reprocess of roughly 188 million events takes about 90 minutes here on 8 parallel slots. The same data took 14 hours in Lambda's hourly Hadoop job. The per-event compute is comparable; the difference is that the Hadoop backfill restarted from cold for every hourly partition, while the Flink reprocess makes one pass over the log and exits.
Where Kappa works, where it breaks
Kappa is not a free upgrade from Lambda. There are four prerequisites that must hold, and several anti-patterns where Kappa is the wrong choice.
Prerequisites
- Log retention covers the reprocessing horizon. If you ever need to reprocess six months of history, your Kafka topic needs six months of data accessible. With Kafka's tiered storage to S3 (KIP-405, generally available 2024), this is now financially reasonable — hot retention is small, cold retention is S3 prices (~₹1.6/GB/month at ap-south-1 list pricing in 2026). Without tiered storage, retaining six months in hot Kafka brokers is uneconomical for high-volume topics.
- The stream engine can do exactly-once stateful computation. Flink, Kafka Streams, and Spark Structured Streaming all qualify in 2026. Storm classic and pre-2017 Spark Streaming did not. This is why Kappa was technically possible in 2014 but practical only after about 2018.
- Sinks are idempotent or transactional. Reprocessing produces the same output records as the live run. Either the sink can dedupe (idempotent — typical for keyed upserts into an OLAP store) or it commits transactionally (Iceberg, Delta, a modern Kafka producer with `transactional.id`).
- Reprocessing throughput beats data growth rate. If a reprocess of one day takes more than one wall-clock day, you can never catch up. In practice you want at least 5–10× headroom — reprocessing a week in a few hours.
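The headroom requirement is simple arithmetic: while the job replays the backlog, new events keep arriving, so it gains on the head of the log only at the difference between the two rates. A quick sanity check (all numbers illustrative):

```python
def catch_up_minutes(backlog_events, process_rate, ingest_rate):
    """Time for a reprocess to reach the head of the log.

    Rates are events/sec. Returns minutes, or None if the job can
    never catch up (processing no faster than ingest).
    """
    if process_rate <= ingest_rate:
        return None  # reprocess falls further behind forever
    return backlog_events / (process_rate - ingest_rate) / 60

# Six weeks of history at ~50 events/sec ingest, replayed at 28k events/sec:
backlog = 50 * 86_400 * 42          # ~181M events
print(round(catch_up_minutes(backlog, 28_000, 50)))           # 108 (minutes)

# With only 1.2x headroom the same backlog takes months:
print(round(catch_up_minutes(backlog, 60, 50) / (60 * 24)))   # 210 (days)
```

This is why the 5–10× headroom rule exists: near 1× headroom, catch-up time explodes toward infinity.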
Where Kappa breaks
Regulatory deletion deserves a paragraph of its own. India's DPDP Act 2023 requires merchant deletion-on-request to propagate within 30 days. A pure Kafka log doesn't support deleting a user's records — it's append-only. Real Kappa-shaped systems in Indian fintech use Kafka log compaction with tombstone records, plus an Iceberg lakehouse table that mirrors the topic and supports DELETE FROM ... WHERE merchant_id = ? semantics (/wiki/iceberg-the-format-and-the-catalog-that-runs-on-it). When such deletes have happened, the reprocess job reads from the Iceberg mirror rather than from raw Kafka. This is technically a hybrid, but the operational model — one job, two start positions — is preserved.
Common confusions
- "Kappa is just deleting the batch layer from Lambda." Mechanically yes, but the implication is wrong. The batch layer existed because the streaming engine couldn't be trusted with correctness. Kappa works only because the stream engine can now produce exactly-once results on bounded replays. You aren't removing a redundant component — you're upgrading the streaming engine until the batch layer becomes redundant.
- "Kappa means you only run streaming jobs." No, Kappa means your historical jobs and your live jobs are the same job with different inputs. A separate periodic Spark batch job that reads raw events to produce, say, a yearly cohort retention report is fine and complementary; it just isn't part of the metric pipeline that needed the Lambda merge. Kappa is about the unification of batch and streaming for the same metric, not about banning batch from the org.
- "Reprocessing is the same as backfilling." Related but not identical. Reprocessing replays the entire log into a fresh output table to fix a logic bug. Backfilling (/wiki/backfills-re-running-history-correctly) typically fills a hole — a missing partition, a failed run, a few days of bad data. In Kappa, both operations use the same mechanism (start a job at a historical offset), but the offset range and the output target differ.
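The "same mechanism, different parameters" distinction can be made concrete. A sketch in plain Python — the `JobLaunch` descriptor and `plan` helper are hypothetical, standing in for launching a Flink job with the analogous earliest-vs-timestamp offset initializer:

```python
from dataclasses import dataclass

@dataclass
class JobLaunch:
    """Hypothetical descriptor for launching one Kappa-style job."""
    start_position: str   # where in the log the job begins reading
    output_table: str     # fresh shadow table, or the existing one
    stop_at_head: bool    # a backfill can stop once the hole is filled

def plan(operation, current_version, hole_start_ms=None):
    if operation == "reprocess":
        # Logic bug: replay everything into a brand-new versioned table.
        return JobLaunch("earliest",
                         f"distinct_merchants_v{current_version + 1}",
                         stop_at_head=False)
    if operation == "backfill":
        # Hole in history: replay only from the hole's start, same table.
        return JobLaunch(f"timestamp:{hole_start_ms}",
                         f"distinct_merchants_v{current_version}",
                         stop_at_head=True)
    raise ValueError(operation)

print(plan("reprocess", 3))
print(plan("backfill", 3, hole_start_ms=1_772_400_000_000))
```

Both plans drive the identical topology; only the start position and output target differ.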
- "Kappa requires Kafka." Any durable, ordered, replayable log works. Pulsar, Kinesis (with longer retention), Pravega, and the WAL of a CDC source like Debezium with logical decoding (/wiki/cdc-with-debezium-the-canonical-pattern) all qualify. Kafka is dominant because tiered storage made long retention economical, but the architecture isn't Kafka-specific.
- "With infinite retention you never need batch." Wrong for two reasons. Heavy historical-aggregation workloads (5-year cohort, lifetime-value over a decade) can run faster as a Parquet-based Spark job than as a stream replay because Parquet is column-pruned and the stream replay is row-by-row. Also, ML training over history typically needs random access (point-in-time joins) which streaming engines aren't optimised for. The lakehouse pattern handles both — Iceberg tables for batch reads, the same data exposed as a CDC stream for real-time consumers.
- "Kappa eliminates the freshness/correctness tradeoff." It eliminates the codebase tradeoff. A version of the tension survives at the output-table level: while a reprocess job is catching up, the serving layer either reads the old table (stale logic, but live data) or the new one (corrected logic, but still mid-build). The trade-off moved from per-query merge logic to a serving-layer view-switch policy.
Going deeper
How LinkedIn (Kreps' team) actually deployed Kappa
The 2014 Kreps post wasn't speculative — it described what LinkedIn was building. Their Samza framework (donated to Apache in 2013) was the early Kappa engine. The classic example was the "people you may know" feature, computed over the same Kafka log of profile updates and connection events. A bug-fix to the recommendation algorithm meant deploying a parallel Samza job on the same topic from a 2-week-old offset, letting it catch up over a few hours, then atomically swapping the production view to read from the new RocksDB-backed key-value store. The interesting piece is that LinkedIn used this pattern on tens of features simultaneously — the operational discipline was "every output is in a versioned namespace, every job has a version, every cutover is a single config change." Kreps' The Log: What every software engineer should know makes this explicit. The architectural discipline of versioned outputs is what makes Kappa work at scale; without it, the parallel-job pattern collapses into "production now has two of everything, what do we do?".
Why exactly-once is the make-or-break property
If the streaming engine cannot produce exactly-once outputs across replays, the reprocess job's output will not match the live job's output. Bug-fix migrations break: you swap to the reprocessed table and discover the daily totals are off by 0.3% because the reprocess saw a few duplicate events that the live job had deduplicated differently. The 2017–2019 work on Flink's checkpoint/2PC story (KIP-98 on the Kafka producer side, Flink's TwoPhaseCommitSinkFunction) is what made Kappa industrial. Before that, every team running Kappa had a homegrown idempotency layer (typically a unique event-id-per-event scheme written into the sink). The 2PC sink moved this from "every team's responsibility" to "the framework handles it" — and that is the inflection point where Kappa stopped being a LinkedIn-only technique.
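The homegrown idempotency layer mentioned above usually amounted to a sink wrapper that drops records whose event id has already been written. A sketch of the scheme, with assumed field names (`event_id`); real versions kept the seen-id set in the sink store itself, e.g. as a unique-key constraint, so it survived restarts:

```python
class IdempotentSink:
    """Pre-2PC-era dedupe: write each event id at most once.

    Sketch only. Production variants persisted seen ids in the sink
    database so at-least-once redelivery stayed safe across crashes.
    """
    def __init__(self):
        self.seen_ids = set()
        self.rows = []

    def write(self, event):
        if event["event_id"] in self.seen_ids:
            return False          # duplicate delivery from a replay/retry
        self.seen_ids.add(event["event_id"])
        self.rows.append(event)
        return True

sink = IdempotentSink()
sink.write({"event_id": "pay_001", "amount": 500})
sink.write({"event_id": "pay_001", "amount": 500})  # redelivered after a crash
assert len(sink.rows) == 1  # totals stay correct despite the duplicate
```

The 2PC sink made exactly this bookkeeping the framework's job instead of every team's.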
The Confluent-promoted variant: streaming materialised views
Confluent's marketing of "real-time materialized views" via ksqlDB (and later Flink SQL) is essentially Kappa with a SQL surface. You write CREATE TABLE distinct_merchants AS SELECT ... against a Kafka topic, and the engine continuously maintains the table. To reprocess, you DROP TABLE and CREATE TABLE with the new logic — the engine replays from offset 0 automatically, collapsing the operational story to a SQL DDL operation. Materialize, RisingWave, and Flink SQL on Confluent Cloud all sit in this lineage. The trade-off is that you accept the engine's view-maintenance semantics (typically incremental view maintenance, /wiki/incremental-view-maintenance-as-the-endgame) instead of writing the topology yourself.
Where Indian fintech sits in 2026
Razorpay's analytics platform, post-2022 migration, is Kappa-shaped: Kafka (on Amazon MSK) with 90-day retention plus tiered storage to S3 (roughly a 120-day reprocess horizon), Flink on Kubernetes for both live and reprocess workloads, Iceberg as the sink for serving-layer tables, and a Trino layer whose view definitions point dashboards at the current "blessed" table version. PhonePe runs a similar topology with their own Kafka fork. Zerodha, with stricter regulatory and reconciliation requirements, runs a hybrid — Kappa for trader-facing real-time analytics, traditional batch reconciliation for end-of-day SEBI reporting (which has a 14-second freeze window and immutability requirements that are easier to satisfy with a snapshot-based batch). The pattern: Kappa for inward-facing engineering metrics, batch retained where regulatory immutability or complex historical aggregation dominates.
What goes wrong at 02:00 IST
The sharp edges of Kappa show up in incident response. A bad deploy of the live Flink job at 22:00 ships a metric definition that double-counts. By 02:00 someone notices the dashboard. The fix-forward path is: deploy a corrected v4 job with --mode reprocess from offset 0, wait for it to catch up (90 minutes), swap the serving view. During those 90 minutes, the dashboard is showing wrong numbers. The mitigation is either to keep the previous version's output table around as a "rollback target" (cheap with Iceberg, just don't expire the old snapshots until the new one is verified), or to run reprocesses against a smaller offset window first — say, only the last 24 hours — and verify before doing the full reprocess. Lambda's 14-hour batch backfill was painful; Kappa's 90-minute reprocess is fast, but you still want the operational instinct of "deploy → verify on a small window → expand to full" rather than firing the full reprocess and waiting.
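The "verify on a small window" step can be partly automated. A sketch of the comparison, with tables represented as plain dicts; in the real pipeline these would be two Trino queries over the v3 and v4 Iceberg tables, and the surfaced dates are for a human to confirm the divergence matches the known bug:

```python
def compare_window(old_table, new_table, days, rel_tolerance=0.001):
    """Compare the last `days` dates of two metric tables.

    Tables are {date_iso: metric_value} dicts. Returns (date, old, new)
    tuples for dates whose values disagree beyond rel_tolerance.
    """
    window = sorted(new_table)[-days:]
    bad = []
    for day in window:
        old, new = old_table.get(day), new_table[day]
        if old is None or abs(new - old) > rel_tolerance * max(abs(old), 1):
            bad.append((day, old, new))
    return bad

# Live (buggy) v3 output vs the reprocessed v4 output over a small window:
v3 = {"2026-04-23": 1200, "2026-04-24": 1185, "2026-04-25": 610}
v4 = {"2026-04-23": 1200, "2026-04-24": 1150, "2026-04-25": 610}

print(compare_window(v3, v4, days=3))  # → [('2026-04-24', 1185, 1150)]
```

Only after the diff matches the expected shape of the bug do you fire the full reprocess and swap the view.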
Where this leads next
Kappa is one stop in the unified-batch-stream evolution. The neighbouring chapters trace the full arc:
- /wiki/lambda-architecture-why-it-was-a-good-idea-that-didnt-last — what Kappa was responding to. Read first if you skipped it.
- /wiki/the-dataflow-model-batch-as-bounded-streams — Google's framing that gave Kappa a theoretical foundation: batch is just a bounded stream, so a stream engine that does both is the right primitive.
- /wiki/beam-and-flink-write-once-run-on-both — the engine that took Kappa from "LinkedIn-only" to "default for new builds".
- /wiki/incremental-view-maintenance-as-the-endgame — the database-theory descendant: don't replay, propagate deltas. The endgame of the unification arc.
The thing to internalise is the simplification thesis: when an engineering decision lets you delete a whole layer (the batch layer) and replace its operational role with a configuration knob (the start offset), take the deletion. The decade of Lambda taught teams to manage two pipelines in parallel, and that skill — coordinated migrations, semantic-drift detection, dual deploys — turned out to be a workaround for a missing capability in the streaming engine. Once that capability landed, the workaround was technical debt.
References
- Questioning the Lambda Architecture (Jay Kreps, O'Reilly Radar, 2014) — the founding post. Short, direct, and where the name "Kappa" comes from.
- The Log: What every software engineer should know about real-time data's unifying abstraction (Kreps, LinkedIn Engineering, 2013) — the conceptual scaffolding for why a log is the right system-of-record primitive.
- Apache Flink: Stream and Batch Processing in a Single Engine (Carbone et al., IEEE Bulletin 2015) — the Flink team's articulation of "batch is bounded streaming", the engine that made Kappa practical.
- KIP-98: Exactly-Once Delivery and Transactional Messaging in Kafka (Confluent, 2017) — the producer-side correctness story that Kappa relies on.
- KIP-405: Tiered Storage in Kafka (Apache Kafka, GA 2024) — the change that made Kafka retention long enough for Kappa to be the default architecture, not an aspirational one.
- Designing Data-Intensive Applications, Chapter 11 (Kleppmann, 2017) — the cleanest textbook treatment of the Lambda → Kappa transition.
- /wiki/lambda-architecture-why-it-was-a-good-idea-that-didnt-last — what Kappa replaced.
- /wiki/the-append-only-log-simplest-store — the immutability primitive at the bottom of every Kappa deployment.