Lambda architecture: why it was a good idea that didn't last
A Flipkart analytics engineer in 2014 has two pipelines computing the same metric. One reads Kafka and updates a Redis counter every 5 seconds — the homepage shows "items sold today: 12.4 lakh" within seconds of the actual sale. The other reads HDFS once an hour and runs a Hive query that produces the correct number — 12,387,452, not 12.4 lakh. The dashboard reads from both, preferring the hourly number when it is fresh enough and falling back to the streaming counter otherwise. This is Lambda architecture, named by Nathan Marz in 2011, and for about five years it was the default way to do large-scale analytics. By 2020 almost no one was building it from scratch. This chapter is the autopsy.
Lambda architecture has three layers — a batch layer that recomputes the truth from immutable raw data, a speed layer that approximates recent data from a stream, and a serving layer that merges the two for queries. It works, but it forces you to write every transformation twice, in two different runtimes, and keep them semantically identical. That maintenance tax is the reason Kappa, the Dataflow model, and modern unified engines replaced it.
The 2011 problem Lambda was designed to solve
In 2011 the streaming world was Storm and the batch world was Hadoop MapReduce. Storm gave you sub-second latency but its exactly-once story was weak — duplicates, dropped tuples, no real fault-tolerant state. Hadoop gave you correctness — re-runnable, deterministic, replayable from raw input — but a job took hours. If you needed both "the dashboard updates within 10 seconds" and "the monthly reconciliation matches to the rupee", neither runtime could give you both.
Marz's insight was: stop trying to make one runtime do both. Run two pipelines. Make the batch pipeline the source of truth, run it on immutable raw data so it is fully reproducible, and accept the latency. Run the streaming pipeline as a best-effort approximation of the most recent window — say, the last 6 hours — and accept the correctness gaps. At query time, merge: take the batch result for everything older than 6 hours, the stream result for everything inside that window, and serve the union.
The three principles Marz baked in are still good ideas:
- Immutable raw data is the system of record. You append events to a log, you never mutate. Any view can be recomputed from scratch. Why immutability matters: a bug in your aggregation logic discovered three months later is fixable. You re-run the batch job from raw data with the corrected logic, and the truth catches up. With mutable state, the bug has already corrupted the state — there is nothing to re-derive from.
- Recomputation is the reset button. Whenever the batch layer finishes, it overwrites the speed-layer's approximation for that window. Errors in the streaming layer have a bounded blast radius — they live for at most one batch interval before the truth arrives.
- Latency and correctness are separate concerns that can be served by separate systems. The dashboard's "real time" panel shows you 12.4 lakh; the finance team's reconciliation pulls from the batch view and gets 12,387,452. Both are answers to "what happened today" — at different points on the latency-correctness curve.
These principles outlived Lambda itself. Modern architectures bake them in differently, but they are still the foundation.
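The first two principles fit in a few lines of Python. This is a toy sketch, not any particular system's API; every name in it is hypothetical:

```python
# A minimal sketch of immutability plus recomputation: an append-only log,
# views as pure functions of it, and a bug fixed by recomputing the view.
log = []  # the immutable system of record: append-only, never mutated

def append(event):
    log.append(event)

def compute_view(events, should_count):
    # Any view is derived from the log, so it can always be rebuilt from scratch.
    return sum(1 for e in events if should_count(e))

append({"status": "captured", "amount": 100})
append({"status": "captured", "amount": -5})   # refund; should not count
append({"status": "failed",   "amount": 50})

# Buggy logic ships, but corrupts nothing: the raw events are untouched.
buggy = compute_view(log, lambda e: e["status"] == "captured")
# The bug is found months later; fix the logic, recompute, and the truth catches up.
fixed = compute_view(log, lambda e: e["status"] == "captured" and e["amount"] > 0)
print(buggy, fixed)  # 2 1
```

The point of the sketch: with mutable state, `buggy` would have permanently poisoned the counter; with an immutable log, the fix is a single recomputation.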
What it actually looks like in code: the duplication tax
The killer flaw is not in the architecture diagram. It is in the codebase. Here is the same metric — "count distinct merchants who received a payment today" — written for both layers. The shape of the duplication is what eventually killed Lambda.
# lambda_metric_duplication.py — the same metric, two pipelines, two languages

# ============== BATCH LAYER (Spark, runs hourly on S3 parquet) ==============
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-merchants-batch").getOrCreate()
events = spark.read.parquet("s3://razorpay-events/payments/dt=2026-04-25/")

batch_view = (
    events
    .filter(F.col("status") == "captured")
    .filter(F.col("amount") > 0)
    .groupBy(F.to_date("created_at").alias("dt"))
    .agg(F.countDistinct("merchant_id").alias("distinct_merchants"))
)
batch_view.write.mode("overwrite").parquet("s3://razorpay-views/batch/daily-merchants/")
# Output: dt=2026-04-25, distinct_merchants=4_82_117 (after 47 minutes of compute)

# ============== SPEED LAYER (Storm topology, sub-second latency) ============
# In real Storm this is Java; the Python flavour below is pyleus-style for clarity.
from pyleus.storm import SimpleBolt
import redis, datetime

class DistinctMerchantsBolt(SimpleBolt):
    def initialize(self, conf, ctx):
        self.r = redis.Redis(host="speed-layer-redis", db=0)

    def process_tuple(self, tup):
        event = tup.values[0]  # {"status": ..., "amount": ..., "merchant_id": ...}
        if event["status"] != "captured":  # filter — same as batch
            return
        if event["amount"] <= 0:  # filter — same as batch
            return
        dt = datetime.datetime.fromtimestamp(event["created_at"]).date().isoformat()
        # Approximate distinct count via HyperLogLog (Redis PFADD)
        # — the batch uses exact countDistinct; the speed layer trades accuracy for memory
        self.r.pfadd(f"merchants_today:{dt}", event["merchant_id"])

# ============== SERVING LAYER (the merge query) =============================
def serve_distinct_merchants(dt: str) -> int:
    # Read the batch view first; if its watermark is recent enough, return it.
    # (read_parquet_view is a stand-in for whatever view-store client is in use.)
    batch_row = read_parquet_view(f"s3://razorpay-views/batch/daily-merchants/dt={dt}/")
    batch_watermark = batch_row["computed_at"]  # last batch run timestamp
    now = datetime.datetime.utcnow()
    if (now - batch_watermark) < datetime.timedelta(hours=1):
        return batch_row["distinct_merchants"]
    # Else fall back to the speed layer's HLL estimate for the whole day so far
    return int(redis.Redis(host="speed-layer-redis").pfcount(f"merchants_today:{dt}"))

# A typical day in production:
# 09:00  batch run completes for dt=2026-04-25 partition (00:00-08:00 data)
#        batch_view: distinct_merchants = 1_67_342 (exact)
# 09:00-10:00  the batch view is fresh; the serving layer returns the exact value
# 10:00  batch run completes for 00:00-09:00 data
#        batch_view: distinct_merchants = 2_04_881 (exact)
# If a batch run is delayed past the freshness window, queries fall back to the
# speed layer's HLL estimate covering the whole day so far
# — approximate, error bound roughly ±0.81% per HLL standard error
Walk through what makes the duplication painful:
- The two filters appear in both layers. `status == "captured"` and `amount > 0` are written in PySpark DSL on the batch side and in plain Python on the speed side. Why this drift is silent: a product manager asks to also exclude `merchant_id = "internal-test"`. The Spark engineer adds the filter to the batch job and ships. Two weeks later the dashboard shows different numbers from real-time vs hourly because the Storm topology was never updated. There is no compiler error — the two layers are different services, different code repos, different deploy schedules. You only find the drift when an analyst notices the dashboard "jumps" at the top of every hour.
- The batch layer uses `countDistinct` (exact); the speed layer uses HyperLogLog (approximate). This is a deliberate compromise — exact distinct over a streaming window needs unbounded memory. Why this matters semantically: a serving layer that merges the two views (rather than picking one, as the simplified code above does) is implicitly mixing two different aggregations. "Today's distinct merchants" at 09:30 would be (exact count of 1,67,342 for 00:00–08:00) + (HLL estimate of new merchants since 08:00). The two are not directly addable — the HLL only knows about merchants in its window; it cannot tell which of those were already counted in the batch.
- The usual workaround is to seed the HLL with the batch view at the boundary. Most Lambda implementations either accept the small overlap error (~1% for HLL) or rebuild the HLL state from the batch view at every batch completion — which is its own coordination problem.
- The serving layer's "freshness" check is a heuristic. "If the batch is less than an hour old, use it" works most of the time but breaks during incidents. If the batch job is delayed by 4 hours, the serving layer falls back to speed-layer-only — and a speed layer that retains only 6 hours of state means a long enough delay leaves the oldest part of the day covered by neither layer.
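The non-addability of an exact batch count and a windowed distinct count is easy to demonstrate with plain sets (a toy stand-in for the HLL; the overlap problem is identical):

```python
# Merchants active on both sides of the batch boundary get counted twice
# if the serving layer naively adds the two views together.
batch_merchants = {"m1", "m2", "m3"}  # exact batch view, 00:00-08:00
speed_merchants = {"m3", "m4"}        # speed-layer window since 08:00 (m3 repeats)

naive_merge = len(batch_merchants) + len(speed_merchants)
true_distinct = len(batch_merchants | speed_merchants)
print(naive_merge, true_distinct)  # 5 vs 4: m3 was double-counted
```

Seeding the speed layer's state with the batch view at the boundary is exactly the set-union operation above; the HLL version trades the exact union for a mergeable sketch.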
The point is not that any single one of these is unfixable. The point is that you fix all of them, twice. Every change to a metric definition becomes a coordinated migration across two codebases, two test suites, two deployment pipelines, two on-call rotations.
Why it died: the triple cost
By 2018, the same teams that championed Lambda were dismantling it. Three forces converged.
The duplication tax was the loudest complaint but the most fixable one in principle. Teams that wrote a metric DSL on top of both layers (Twitter's Summingbird, LinkedIn's internal tooling) could express a metric once and compile it down to both. Summingbird in particular was a heroic effort — a Scala DSL that produced a Storm topology and a Scalding (Hadoop) job from the same code. It mostly worked, but it required a small army of people to keep the two backends in sync. Most teams did not have that army.
Runtime convergence was the larger force. Apache Flink shipped in 2014 with a model that treated batch as a special case of streaming — a bounded stream that ends. By 2017 Flink could do exactly-once stateful computation with windowing semantics that matched (and exceeded) what Spark Batch could do. The argument "we run batch because streaming isn't accurate enough" stopped being true. Apache Beam (Google's open-sourcing of the Dataflow model in 2015) cemented this — one API, multiple runners.
Kafka grew up and replaced HDFS as the system of record for many use cases. Once Kafka had effectively infinite retention via tiered storage to S3 (standardised in KIP-405, which shipped as early access in Kafka 3.6 in 2023; available informally via Confluent for years before that), the question "what is the immutable log of all events" had a single answer: Kafka itself. There was no need for a separate "raw data on HDFS" tier — the Kafka topic, replayed from offset 0, was the raw data. This is the conceptual jump that led to Kappa architecture (covered in /wiki/kappa-architecture-stream-only-and-reprocess-from-the-log).
By 2020, a startup standing up analytics for the first time would not pick Lambda. They would pick a single streaming engine (Flink, or a managed equivalent like Confluent's ksqlDB or Materialize) with the source of truth in Kafka or a lakehouse table format (/wiki/iceberg-the-format-and-the-catalog-that-runs-on-it).
What survives, what dies, what mutates
Lambda's specific architecture — separate Storm and Hadoop pipelines with a serving-layer merge — is dead. Its principles live on in three places:
- Immutable raw data as the system of record. Every modern lakehouse design (Iceberg, Delta, Hudi) is built on this principle. Append-only Parquet files, manifest snapshots, time-travel reads — all direct descendants of Marz's "immutable atomic data".
- Recomputation as the bug-fix mechanism. Backfills (/wiki/backfills-re-running-history-correctly) are still how you recover from logic bugs. The mechanism changed (replay a Kafka topic vs re-run a Hadoop job), the principle didn't.
- The latency-vs-correctness trade-off is now negotiated at the operator level, not the architecture level. A single Flink job can have an exactly-once pipeline path and an at-least-once approximate path coexisting — windowing semantics, watermarks, allowed lateness, side outputs. The two pipelines fused into one runtime with knobs.
In Indian production systems, the trace of Lambda is everywhere if you know what to look for. Flipkart's Big Billion Days dashboard in 2017 was Lambda-shaped — Storm fed Redis for the live counters, Hive batch jobs fed the post-event reconciliation. Razorpay's payment analytics had a Spark Streaming + Spark Batch split through 2019. By 2024 both teams had migrated to Flink-on-Iceberg topologies, with the speed/batch distinction surviving only as a watermark configuration. The architecture diagrams got simpler; the underlying problems — late data, exactly-once, windowing — stayed exactly as hard.
Common confusions
- "Lambda is just any architecture with both batch and streaming." No — Lambda is specifically the three-layer pattern with a merge at query time and the batch layer as the system of truth that overwrites the streaming layer's approximations. A pipeline that has Spark Batch for ETL and Flink for real-time alerting, where the two never share a metric, is not Lambda. Lambda's defining property is the redundant computation of the same metric for the purpose of merge.
- "Kappa is just Lambda without the batch layer." Close but wrong — Kappa is the streaming layer reused as the batch layer, by replaying the Kafka log from offset 0 with a longer-retention topology. The compute and the code are unified; only the input changes (live tail vs replay from offset 0). Lambda's batch and speed layers were genuinely different runtimes; Kappa's "batch" is the same Flink job started with a different start offset.
- "Lambda was a bad idea." It was a correct idea for 2011's tooling. Storm couldn't do exactly-once. Hadoop couldn't do real-time. The two-pipeline pattern was the only way to get both properties at the same time. The criticism is not "Lambda was wrong then" but "Lambda became wrong as the underlying engines improved, and many teams kept it longer than they should have."
- "The batch layer in Lambda is the same as the warehouse." No. Lambda's batch layer recomputes views from raw immutable data on every run — full recomputation, not incremental. A modern warehouse like Snowflake or BigQuery does incremental MERGE upserts; that's not Lambda's batch layer. The "recomputed from scratch" property is what made Lambda's batch layer self-healing — and what made it expensive.
- "You need HyperLogLog because the batch layer uses exact counts." Only if you literally need cardinality estimates. The general pattern — speed layer uses an approximate algorithm, batch uses exact — is a symptom of Lambda's split, not a requirement of it. Some metrics (sums, counts of events not entities) are exact in both layers; some metrics (top-K, percentiles) are approximate in both. The duplication tax exists either way.
- "Lambda's serving layer is a query engine." No, it's a thin merge layer in front of two precomputed view stores. The actual computation is done in batch and speed layers; the serving layer just picks which view to read or how to combine them. Modern equivalents (Druid, Pinot, ClickHouse) bake the merge inside a single query engine — the architectural distinction collapses.
Going deeper
Why Marz emphasised immutability so hard
Re-read "How to beat the CAP theorem" and the framing becomes clear. Marz's argument is that mutability is the cause of most distributed-systems pain — partial failures, conflicting writes, the entire CAP-theorem trade-off space. If your system of record is append-only immutable events, then any view derived from those events is recomputable; any inconsistency is eventual; any bug is a recompute away from being fixed. This insight is independently true and has propagated far beyond Lambda — event sourcing, CQRS, lakehouses, Datomic, even the "log-structured merge tree" inside RocksDB share the same DNA.
Summingbird and the heroism of metric-DSL approaches
Twitter's Summingbird (2013, open-sourced 2014) was the most ambitious attempt to fix the duplication tax. You wrote a metric in Scala using a typed monoid algebra:
def metric: Producer[Platform, (Date, MerchantId), Long] =
  source.filter(_.status == "captured")
        .filter(_.amount > 0)
        .map(e => (e.date, e.merchantId) -> 1L)
        .sumByKey(merchants)
A Platform parameter let the same expression compile to a Storm topology (online) and a Scalding job (batch). Same code, two runtimes, guaranteed semantic match. This was elegant — and required the team to maintain two backends plus the DSL plus the type-system gymnastics that made the monoid-based aggregation work for both. Summingbird was deprecated in favour of Heron (Twitter's Storm successor) and eventually a Flink-based pipeline. The lesson is that even with heroic DSL effort, the runtime fragmentation kept leaking through — Storm's fault tolerance had different failure modes than Scalding's, and the DSL couldn't paper over those.
The CAP angle: why "beat" is too strong
Marz's "How to beat the CAP theorem" is a good read but the title overpromises. Lambda doesn't beat CAP — it sidesteps the consistency/availability question by accepting that the speed layer is eventually consistent and the batch layer is strongly consistent on its own snapshot. At any moment, reading the merged view is reading a stale-but-consistent batch result joined with a fresh-but-approximate streaming result. There's no global consistency point. CAP says that under a partition you cannot have both consistency and availability; Lambda says "I'll have both, but not for the same query at the same time." It's a useful trick, not a theorem violation.
Where Lambda's serving-layer merge breaks: out-of-order events that span the boundary
The boundary between batch view (e.g., closed at 08:00) and speed view (covers 08:00 onwards) assumes events arrive in event-time order, which they don't. Suppose a delayed payment event with event_time = 07:55:00 arrives at the streaming layer at processing_time = 08:30:00. The batch view for the 00:00–08:00 partition is already closed. The speed layer will count this event, but the next batch run (covering 00:00–09:00) will also count it — unless you carefully track watermarks across the boundary. Real Lambda deployments addressed this with allowed-lateness windows in the speed layer plus idempotent merge in the serving layer (e.g., distinct counts via HLL, sums via per-event-id keys). This boundary problem is exactly what the Dataflow model (/wiki/the-dataflow-model-batch-as-bounded-streams) fixed by treating the boundary as a watermark, not a wall.
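The double-count can be simulated in a few lines (a hypothetical toy; timestamps are same-day "HH:MM" strings, so lexicographic comparison works):

```python
# A late event straddles the batch/speed boundary and lands in both views.
events = [
    {"id": "p1", "event_time": "07:10", "arrival": "07:11"},
    {"id": "p2", "event_time": "07:55", "arrival": "08:30"},  # late by 35 min
    {"id": "p3", "event_time": "08:05", "arrival": "08:06"},
]
BOUNDARY = "08:00"

# The batch run at the boundary sees only what has ARRIVED by then:
batch = {e["id"] for e in events
         if e["event_time"] < BOUNDARY and e["arrival"] < BOUNDARY}  # misses p2

# The speed layer counts everything that arrives after the boundary:
speed = {e["id"] for e in events if e["arrival"] >= BOUNDARY}        # p2, p3

# The next batch run (00:00-09:00) recomputes by event time and finds p2 again:
next_batch = {e["id"] for e in events if e["event_time"] < "09:00"}  # p1, p2, p3

# p2 now lives in both views; summing the counts double-counts it.
# An idempotent merge keyed by event id dedupes:
merged = next_batch | speed
print(len(next_batch) + len(speed), len(merged))  # 5 vs 3
```

The per-event-id union is the "idempotent merge" mentioned above; a watermark-based engine avoids the problem earlier, by holding the window open until the late event has a chance to arrive.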
How Indian fintech moved past Lambda
PhonePe's transaction analytics platform was Lambda-shaped through 2019 — Spark Batch for the daily settlement reports, Spark Streaming for the live dashboard. The migration to a unified Flink-on-Iceberg platform took about 18 months and was driven less by engineering elegance and more by a single operational pain: every metric definition change required two PRs, two reviewers, two deploys, and a 24-hour soak test where engineers stared at both pipelines to verify no drift. The new platform reduced that to a single PR. The savings showed up not in compute cost (which actually grew, slightly) but in engineering throughput — the analytics team could ship 5× more metric changes per quarter.
Where this leads next
Lambda's intellectual successors are the next four chapters in this build:
- /wiki/the-dataflow-model-batch-as-bounded-streams — Google's 2015 paper that reframed batch as a special case of streaming, removing the conceptual basis for two pipelines.
- /wiki/kappa-architecture-stream-only-and-reprocess-from-the-log — the direct response to Lambda's duplication tax: kill the batch layer, reprocess by replaying the log.
- /wiki/beam-and-flink-write-once-run-on-both — the engine that made unified batch/stream practical for production teams.
- /wiki/incremental-view-maintenance-as-the-endgame — the database community's parallel answer to the same problem, via differential dataflow.
The pattern to internalise: Lambda was the right answer to a specific 2011 problem (weak streaming engines + slow batch engines, no shared abstraction). The principles it surfaced — immutability, recomputability, latency-correctness separation — are still load-bearing. The specific three-layer topology is a museum piece. Read Marz's original post for the principles; do not copy the architecture.
References
- How to beat the CAP theorem (Nathan Marz, 2011) — the founding essay. Read it for the principles, not the implementation advice.
- Big Data: Principles and best practices of scalable real-time data systems (Marz & Warren, 2015) — the book-length treatment, including the full three-layer reference architecture.
- Questioning the Lambda Architecture (Jay Kreps, 2014) — the most influential critique, by Kafka's co-creator. The seed of Kappa architecture.
- Summingbird: A Framework for Integrating Batch and Online MapReduce Computations (Boykin et al., VLDB 2014) — the academic write-up of the most ambitious DSL attempt to bridge the two layers.
- The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing (Akidau et al., VLDB 2015) — Google's paper that obsoleted Lambda's intellectual basis.
- /wiki/kappa-architecture-stream-only-and-reprocess-from-the-log — the simplification that replaced Lambda for most production teams.
- /wiki/backfills-re-running-history-correctly — the modern descendant of Lambda's "recompute from raw" principle.
- /wiki/the-append-only-log-simplest-store — the immutability primitive at the bottom of every Lambda-descendant architecture.