Kappa architecture: stream-only and reprocess from the log

A Razorpay analytics engineer in 2018 finds a bug. The "fraud risk score" metric on the merchant dashboard has been computing wrongly for six weeks because a feature flag was inverted. In Lambda, this fix would mean two PRs (Storm and Hadoop), two deploys, and a 14-hour Hadoop backfill of all the historical Parquet partitions. In Kappa, the fix is one PR to the Flink job, one deploy, and pressing "start from offset 0" on a parallel reprocessing topology. The new topology streams through six weeks of Kafka data in 90 minutes, writes to a shadow output table, the team verifies, then atomically swaps the dashboard to point at the new table. The old job and the old table get garbage-collected a week later. This is the operational picture that Kappa was designed to make routine.

Kappa architecture is Lambda with the batch layer deleted. You keep one streaming engine and one log with long retention, and you reprocess history by replaying the log from offset 0 with a new version of the same job. The architectural simplification — one codebase, one runtime, one storage tier — is the whole point. Kappa works only if your log retention covers your reprocessing horizon and your stream engine can do exactly-once stateful computation on bounded replays.

The premise: replay is the only operation you need

Jay Kreps, then at LinkedIn, posted Questioning the Lambda Architecture in July 2014. The argument was sharp: the speed layer's job is to compute a result from a stream of events; the batch layer's job is to compute the same result from a different stream of events (the historical replay of the same log). If your streaming engine can correctly process arbitrary replays of the log, the batch layer is redundant — it is just the speed layer started with an older offset.

Kreps gave the recipe in three lines:

  1. Use Kafka (or any durable, ordered, replayable log) as the system of record. Retention measured in months, not hours.
  2. Use one stream processor (Storm Trident at the time, eventually Flink, Spark Structured Streaming, or Kafka Streams) for both live and historical processing.
  3. To reprocess, start a parallel job with a new version of the code, point it at offset 0 (or whatever historical offset you need), let it catch up to live, then atomically switch the serving layer to read from the new output.

The architecture diagram is uninteresting on purpose. It is one log, one job, one output. The interesting properties live in the operational story.

Kappa architecture: one log, one job, one output

[Diagram: Kafka as the durable log (30-day+ retention, older segments tiered to S3; durable, ordered, immutable, offsets 0 through HEAD) feeding a single stream processor (Flink). The live job v3 reads from HEAD at low latency and writes to output_v3, the table currently being served. A parallel reprocess job v4 replays from offset 0, catches up to HEAD, and writes to output_v4 as a shadow table. The serving layer's view switch flips from output_v3 to output_v4 once the reprocess has caught up and been verified.]
The reprocessing job is structurally identical to the live job. It just reads from a different offset and writes to a different output table. The serving layer chooses which output to read.
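
The "serving layer chooses which output" step can be made concrete with a small sketch. This is a hypothetical registry, not Razorpay's actual system; the class and table names are illustrative.

```python
# Hypothetical sketch of the serving layer's versioned-output switch.

class ServingRegistry:
    """Maps each metric to the output-table version currently being served."""

    def __init__(self, initial):
        self._current = dict(initial)   # metric -> table name, e.g. "..._v3"

    def serve(self, metric):
        return self._current[metric]

    def cutover(self, metric, new_table, caught_up, verified):
        # A cutover is one pointer update, allowed only after the shadow job
        # has caught up to HEAD and its output has been verified.
        if not (caught_up and verified):
            raise RuntimeError("refusing cutover: shadow output not ready")
        old_table = self._current[metric]
        self._current[metric] = new_table
        return old_table   # keep as a rollback target until the new table is trusted

registry = ServingRegistry({"distinct_merchants": "distinct_merchants_v3"})
old = registry.cutover("distinct_merchants", "distinct_merchants_v4",
                       caught_up=True, verified=True)
```

The point of the sketch is that the swap is a single atomic pointer update, which is why the cutover can be "a single config change" rather than a migration.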

The first time you draw this diagram, it looks anticlimactic. There is no reconciliation logic, no merge query, no two-codebase split. That is the point. The complexity has moved from the architecture into one place — the stream processor's correctness guarantees — and that is exactly the place where the last decade of streaming engine work has paid off.

Why "structurally identical" is load-bearing: in Lambda, the batch and speed layers were two different programs that had to produce the same answer. In Kappa, the live job and the reprocess job are the same program with different start offsets. There is no semantic gap to reconcile because there is no second program. Any bug in the live job is a bug in the reprocess job, and any fix to one is a fix to both.
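
The "same program, different start offsets" property can be shown in miniature. This is an illustrative toy, not the production topology: the job is a pure function of the log slice and the code version, so any two replays agree.

```python
# A replay-safe job in miniature: output depends only on (log slice, code).

def distinct_merchants_by_day(log, start_offset=0):
    """Fold events from start_offset into {day: set(merchant_id)}."""
    state = {}
    for event in log[start_offset:]:
        if event["status"] == "captured" and event["amount"] > 0:
            state.setdefault(event["day"], set()).add(event["merchant_id"])
    return state

log = [
    {"day": "2026-04-24", "merchant_id": "m1", "status": "captured", "amount": 500},
    {"day": "2026-04-24", "merchant_id": "m2", "status": "failed",   "amount": 0},
    {"day": "2026-04-25", "merchant_id": "m1", "status": "captured", "amount": 900},
]

live      = distinct_merchants_by_day(log)                  # "live" job from offset 0
reprocess = distinct_merchants_by_day(log, start_offset=0)  # fixed-code replay
assert live == reprocess      # same code + same log = same output, always
```

No wall-clock reads, no external calls, no state outside the fold: that is what makes the reprocess job's output trustworthy as a replacement for the live job's.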

What it looks like in code: one job, two start offsets

The discipline of writing a Kappa job is: write a Flink (or Beam, or Kafka Streams) topology that is replay-safe. Replay-safe means the output depends only on the input log and the code version, not on wall-clock time, external API calls, or any state that wasn't itself derived from the log. Here is the same "distinct merchants today" metric from the Lambda chapter, written once for Kappa:

# kappa_distinct_merchants.py — single Flink Python job, run as live or reprocess

from pyflink.datastream import StreamExecutionEnvironment
from pyflink.common import WatermarkStrategy, Duration, Types
from pyflink.datastream.connectors.kafka import KafkaSource, KafkaOffsetsInitializer
from pyflink.datastream.formats.json import JsonRowDeserializationSchema
import argparse
import datetime

parser = argparse.ArgumentParser()
parser.add_argument("--mode", choices=["live", "reprocess"], required=True)
parser.add_argument("--output-table", required=True)        # e.g. distinct_merchants_v4
args = parser.parse_args()

env = StreamExecutionEnvironment.get_execution_environment()
env.enable_checkpointing(60_000)                            # 60s checkpoints, exactly-once
env.set_parallelism(8)

start_offset = (KafkaOffsetsInitializer.earliest()
                if args.mode == "reprocess"
                else KafkaOffsetsInitializer.committed_offsets())

source = (KafkaSource.builder()
          .set_bootstrap_servers("kafka.razorpay-internal:9092")
          .set_topics("payments.captured.v1")
          .set_group_id(f"distinct-merchants-{args.output_table}")
          .set_starting_offsets(start_offset)
          .set_value_only_deserializer(JsonRowDeserializationSchema.builder()
              .type_info(Types.ROW_NAMED(
                  ["status", "amount", "merchant_id", "event_time"],
                  [Types.STRING(), Types.LONG(), Types.STRING(), Types.LONG()]))
              .build())
          .build())

# PyFlink's with_timestamp_assigner takes a TimestampAssigner, not a bare lambda.
from pyflink.common.watermark_strategy import TimestampAssigner

class PaymentTimestampAssigner(TimestampAssigner):
    def extract_timestamp(self, value, record_timestamp):
        return value["event_time"]              # event time in epoch millis

events = env.from_source(source,
    WatermarkStrategy.for_bounded_out_of_orderness(Duration.of_seconds(30))
                     .with_timestamp_assigner(PaymentTimestampAssigner()),
    "payments-source")

result = (events
    .filter(lambda e: e["status"] == "captured" and e["amount"] > 0)
    .key_by(lambda e: datetime.date.fromtimestamp(e["event_time"] / 1000).isoformat())
    .process(DistinctCountByDay()))     # stateful KeyedProcessFunction (defined
                                        # elsewhere): ValueState<HashSet<merchant_id>>

result.add_sink(IcebergSink(args.output_table))   # transactional sink wrapper
                                                  # (defined elsewhere), exactly-once
env.execute(f"distinct-merchants-{args.mode}")

# Live mode (running continuously since 2026-03-01):
$ python kappa_distinct_merchants.py --mode live --output-table distinct_merchants_v3
[2026-04-25 11:32:18] processed 8,450,221 events, watermark=2026-04-25T11:31:48Z
[2026-04-25 11:32:18] state size: 4.2 MB (in RocksDB), checkpoint completed in 1.8s

# Reprocessing mode (kicked off after a logic fix):
$ python kappa_distinct_merchants.py --mode reprocess --output-table distinct_merchants_v4
[2026-04-25 11:35:00] processed 0 events, starting from offset 0 of payments.captured.v1
[2026-04-25 11:42:14] processed 12,400,000 events (week of 2026-03-01), 28k events/sec
[2026-04-25 12:18:09] processed 187,500,000 events, watermark=2026-04-22T00:00:00Z
[2026-04-25 12:51:33] caught up: watermark within 30s of HEAD
[2026-04-25 12:51:33] reprocess complete; output_v4 has full corrected history

Walk through the load-bearing parts:

  1. --mode changes only the starting offset: committed_offsets() for the live tail, earliest() for a full replay. Nothing else in the topology differs.
  2. The consumer group id embeds the output table name, so the live and reprocess jobs track offsets independently and never steal partitions from each other.
  3. Checkpointing every 60 seconds gives exactly-once state recovery; the transactional Iceberg sink extends that guarantee to the output table.
  4. The watermark is derived from event_time in the payload, so windowing behaves identically whether the events are seconds or six weeks old.

A reprocess of 188 million events completes in roughly 90 minutes here on 8 parallel slots (the transcript shows ~28k events/sec mid-replay, with the aggregate rate higher as the job warms up). The same data took 14 hours in Lambda's hourly Hadoop job. The compute density is comparable; the difference is that the Hadoop job ran from cold every hour, while the Flink reprocess runs once and exits.

Where Kappa works, where it breaks

Kappa is not a free upgrade from Lambda. There are four prerequisites that must hold, and several anti-patterns where Kappa is the wrong choice.

Prerequisites

  1. Log retention covers the reprocessing horizon. If you ever need to reprocess six months of history, your Kafka topic needs six months of data accessible. With Kafka's tiered storage to S3 (KIP-405, generally available 2024), this is now financially reasonable — hot retention is small, cold retention is S3 prices (~₹1.6/GB/month at ap-south-1 list pricing in 2026). Without tiered storage, retaining six months in hot Kafka brokers is uneconomical for high-volume topics.
  2. The stream engine can do exactly-once stateful computation. Flink, Kafka Streams, and Spark Structured Streaming all qualify in 2026. Storm classic and pre-2017 Spark Streaming did not. This is why Kappa was technically possible in 2014 but practical only after about 2018.
  3. Sinks are idempotent or transactional. Reprocessing produces the same output records as the live run. Either the sink can dedupe (idempotent — typical for keyed upserts into an OLAP store) or it commits transactionally (Iceberg, Delta, modern Kafka producer with transactional.id).
  4. Reprocessing throughput beats data growth rate. If a reprocess of one day takes more than one wall-clock day, you can never catch up. In practice you want at least 5–10× headroom — reprocessing a week in a few hours.
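
Prerequisite 4 is simple arithmetic worth writing down. A minimal sketch with illustrative numbers: while the replay runs, the log keeps growing, so catch-up time is the backlog divided by the throughput surplus.

```python
# Back-of-envelope check for prerequisite 4: a replay only terminates if
# reprocess throughput exceeds the live ingest rate.

def catch_up_seconds(backlog_events, reprocess_eps, ingest_eps):
    """Time for a replay to reach HEAD; the log grows while we replay."""
    if reprocess_eps <= ingest_eps:
        raise ValueError("reprocess slower than ingest: will never catch up")
    return backlog_events / (reprocess_eps - ingest_eps)

# Six weeks of backlog at a steady 50 events/sec ingest (~181M events):
backlog = 42 * 86_400 * 50
t = catch_up_seconds(backlog, reprocess_eps=50_000, ingest_eps=50)
print(f"catch-up in {t / 3600:.1f} hours")   # ~1.0 hours with ~1000x headroom
```

With only 2x headroom the same backlog would take about 42 days of wall-clock replay, which is why the 5-10x figure in the list above is a floor, not a target.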

Where Kappa breaks

When Kappa is the wrong choice

[Diagram: four anti-pattern quadrants where Kappa breaks down, each pointing back to batch or a hybrid. (1) Heavy historical aggregation — a 5-year monthly cohort report; replay is too slow and a Spark batch over Parquet wins, so keep batch. (2) ML training over history — offline feature generation on a year of data needs point-in-time joins and random access, not sequential replay, so use a feature store. (3) GDPR/DPDP delete-on-request — the log is immutable; deleting a user's events from Kafka requires log compaction plus tombstones, which is fragile, so pair with Iceberg plus the stream. (4) Log too short — a 7-day Kafka topic cannot serve a 90-day replay; Lambda's batch layer was the workaround, tiering to S3 is the fix.]
Kappa requires the log to be the genuine system of record. Cases where the log is bounded, the workload is heavy random-access, or where regulatory deletes are common — these push back toward Lambda or to lakehouse-on-stream hybrids.

The fourth quadrant deserves a sentence. India's DPDP Act 2023 requires merchant deletion-on-request to propagate within 30 days. A pure Kafka log doesn't support deleting a user's records — it's append-only. Real Kappa-shaped systems in Indian fintech use Kafka log compaction with tombstone records, plus an Iceberg lakehouse table that mirrors the topic and supports DELETE FROM ... WHERE merchant_id = ? semantics (/wiki/iceberg-the-format-and-the-catalog-that-runs-on-it). The reprocess job reads from the Iceberg mirror, not from raw Kafka, when GDPR-style deletes have happened. This is technically a hybrid, but the operational model — one job, two start positions — is preserved.
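
The compaction-plus-tombstone mechanics can be sketched in a few lines. This is an illustrative model, not Kafka's implementation: real compaction runs per partition, asynchronously, and the event shapes here are hypothetical.

```python
# Sketch: how compaction + tombstones emulate deletes on an append-only log.

def compact(log):
    """Keep only the latest record per key; a None value is a tombstone,
    and compaction eventually drops the key entirely."""
    latest = {}
    for key, value in log:          # log is ordered (offset 0 .. HEAD)
        latest[key] = value
    return {k: v for k, v in latest.items() if v is not None}

log = [
    ("merchant:m1", {"risk": 0.2}),
    ("merchant:m2", {"risk": 0.7}),
    ("merchant:m1", {"risk": 0.3}),   # update supersedes the earlier record
    ("merchant:m2", None),            # tombstone: a DPDP delete request
]
print(compact(log))
```

The fragility the text mentions is visible even here: until compaction actually runs, the deleted records still exist at their old offsets, which is exactly why the reprocess job reads the Iceberg mirror instead.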

Common confusions

  1. "Kappa means no historical data." No — it means history lives in the log (and its S3 tier) rather than in a separate batch store. Long retention is a first-class requirement, not an afterthought.
  2. "The reprocess job is a batch job." It is the same streaming program started at an older offset; it consumes a bounded replay and is retired once it reaches HEAD and the serving layer has switched.
  3. "Kappa is always simpler than Lambda." The architecture is simpler, but the prerequisites — long retention, exactly-once state, transactional sinks, throughput headroom — are real; where one fails, Lambda or a hybrid remains the right call.

Going deeper

How LinkedIn (Kreps' team) actually deployed Kappa

The 2014 Kreps post wasn't speculative — it described what LinkedIn was building. Their Samza framework (donated to Apache in 2013) was the early Kappa engine. The classic example was the "people you may know" feature, computed over the same Kafka log of profile updates and connection events. A bug-fix to the recommendation algorithm meant deploying a parallel Samza job on the same topic from a 2-week-old offset, letting it catch up over a few hours, then atomically swapping the production view to read from the new RocksDB-backed key-value store. The interesting piece is that LinkedIn used this pattern on tens of features simultaneously — the operational discipline was "every output is in a versioned namespace, every job has a version, every cutover is a single config change." Kreps' The Log: What every software engineer should know makes this explicit. The architectural discipline of versioned outputs is what makes Kappa work at scale; without it, the parallel-job pattern collapses into "production now has two of everything, what do we do?"

Why exactly-once is the make-or-break property

If the streaming engine cannot produce exactly-once outputs across replays, the reprocess job's output will not match the live job's output. Bug-fix migrations break: you swap to the reprocessed table and discover the daily totals are off by 0.3% because the reprocess saw a few duplicate events that the live job had deduplicated differently. The 2017–2019 work on Flink's checkpoint/2PC story (KIP-98 on the Kafka producer side, Flink's TwoPhaseCommitSinkFunction) is what made Kappa industrial. Before that, every team running Kappa had a homegrown idempotency layer (typically a unique event-id-per-event scheme written into the sink). The 2PC sink moved this from "every team's responsibility" to "the framework handles it" — and that is the inflection point where Kappa stopped being a LinkedIn-only technique.
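
What that "homegrown idempotency layer" looked like can be sketched directly. This is a toy, not any team's real sink: a dedupe set keyed by a unique per-event id, applied before the write.

```python
# Sketch of the pre-2PC idempotency layer: the sink remembers event ids it
# has already applied, so replayed duplicates are no-ops.

class IdempotentSink:
    def __init__(self):
        self.totals = {}
        self._seen = set()            # event-id dedupe state, itself checkpointed

    def apply(self, event_id, day, amount):
        if event_id in self._seen:    # duplicate from a replay or retry
            return False
        self._seen.add(event_id)
        self.totals[day] = self.totals.get(day, 0) + amount
        return True

sink = IdempotentSink()
events = [("e1", "2026-04-25", 500), ("e2", "2026-04-25", 900),
          ("e1", "2026-04-25", 500)]          # e1 delivered twice
for eid, day, amount in events:
    sink.apply(eid, day, amount)
print(sink.totals)                            # duplicates do not inflate totals
```

Note that the dedupe set is state that must itself survive failures and replays — which is exactly the burden the framework-level 2PC sink lifted off application teams.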

The Confluent-promoted variant: streaming materialised views

Confluent's marketing of "real-time materialized views" via ksqlDB (and later Flink SQL) is essentially Kappa with a SQL surface. You write CREATE TABLE distinct_merchants AS SELECT ... against a Kafka topic, and the engine continuously maintains the table. To reprocess, you DROP TABLE and CREATE TABLE with new logic — the engine replays from offset 0 automatically. This collapses the operational story to a SQL DDL operation. Materialize, RisingWave, and Flink SQL on Confluent Cloud all sit in this lineage. The trade-off is that you accept the engine's view-maintenance semantics (typically incremental view maintenance, /wiki/incremental-view-maintenance-as-the-endgame) instead of writing the topology yourself.

Where Indian fintech sits in 2026

Razorpay's analytics platform, post-2022 migration, is Kappa-shaped: Kafka MSK with 90-day retention plus tiered storage to S3 (~120-day reprocess horizon); Flink on Kubernetes for both live and reprocess workloads; Iceberg as the sink for serving-layer tables; and a Trino layer whose view definitions point the dashboards at the current "blessed" table version. PhonePe runs a similar topology on their own Kafka fork. Zerodha, with stricter regulatory and reconciliation requirements, runs a hybrid: Kappa for the trader-facing real-time analytics, traditional batch reconciliation for the end-of-day SEBI reporting (which has a 14-second freeze window and immutability requirements that are easier to satisfy with a snapshot-based batch). The pattern: Kappa for inward-facing engineering metrics; batch retained where regulatory immutability or complex historical aggregation dominates.

What goes wrong at 02:00 IST

The sharp edges of Kappa show up in incident response. A bad deploy of the live Flink job at 22:00 ships a metric definition that double-counts. By 02:00 someone notices the dashboard. The fix-forward path is: deploy a corrected v4 job with --mode reprocess from offset 0, wait for it to catch up (90 minutes), swap the serving view. During those 90 minutes, the dashboard is showing wrong numbers. The mitigation is either to keep the previous version's output table around as a "rollback target" (cheap with Iceberg, just don't expire the old snapshots until the new one is verified), or to run reprocesses against a smaller offset window first — say, only the last 24 hours — and verify before doing the full reprocess. Lambda's 14-hour batch backfill was painful; Kappa's 90-minute reprocess is fast, but you still want the operational instinct of "deploy → verify on a small window → expand to full" rather than firing the full reprocess and waiting.
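
The "verify on a small window" gate is cheap to automate. A minimal sketch under stated assumptions — the table names, window shape, and tolerance are illustrative, not any team's real tooling:

```python
# Sketch of the pre-cutover verification gate: compare the shadow table
# against the currently served table over a recent window, flagging rows
# whose relative difference exceeds a tolerance.

def diff_window(current, shadow, rel_tol=0.003):
    """current/shadow: {day: metric_value} over the verification window."""
    suspect = {}
    for day in current.keys() | shadow.keys():
        a, b = current.get(day, 0), shadow.get(day, 0)
        if a == b == 0:
            continue
        if abs(a - b) / max(abs(a), abs(b)) > rel_tol:
            suspect[day] = (a, b)
    return suspect

current = {"2026-04-24": 10_000, "2026-04-25": 10_030}
shadow  = {"2026-04-24": 10_000, "2026-04-25": 10_100}   # the fix changes one day
print(diff_window(current, shadow))   # days needing human eyes before cutover
```

Flagged days are exactly where the bug-fix changed the answer; an empty diff on a window the fix should have changed is itself a signal that the reprocess ran the wrong code version.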

Where this leads next

Kappa is one stop in the unified-batch-stream evolution, and the neighbouring chapters trace the full arc.

The thing to internalise is the simplification thesis: when an engineering decision lets you delete a whole layer (the batch layer) and replace its operational role with a configuration knob (the start offset), take the deletion. The decade of Lambda taught teams to manage two pipelines in parallel, and that skill — coordinated migrations, semantic-drift detection, dual deploys — turned out to be a workaround for a missing capability in the streaming engine. Once that capability landed, the workaround was technical debt.
