Lambda architecture: why it was a good idea that didn't last

A Flipkart analytics engineer in 2014 has two pipelines computing the same metric. One reads Kafka and updates a Redis counter every 5 seconds — the homepage shows "items sold today: 1.24 crore" within seconds of the actual sale. The other reads HDFS once an hour and runs a Hive query that produces the correct number — 12,387,452, not 1.24 crore. The dashboard reads from both, preferring the hourly number when it is fresh enough and falling back to the streaming counter otherwise. This is Lambda architecture, named by Nathan Marz in 2011, and for about five years it was the default way to do large-scale analytics. By 2020 almost no one was building it from scratch. This chapter is the autopsy.

Lambda architecture has three layers — a batch layer that recomputes the truth from immutable raw data, a speed layer that approximates recent data from a stream, and a serving layer that merges the two for queries. It works, but it forces you to write every transformation twice, in two different runtimes, and keep them semantically identical. That maintenance tax is the reason Kappa, the Dataflow model, and modern unified engines replaced it.

The 2011 problem Lambda was designed to solve

In 2011 the streaming world was Storm and the batch world was Hadoop MapReduce. Storm gave you sub-second latency but its exactly-once story was weak — duplicates, dropped tuples, no real fault-tolerant state. Hadoop gave you correctness — re-runnable, deterministic, replayable from raw input — but a job took hours. If you needed both "the dashboard updates within 10 seconds" and "the monthly reconciliation matches to the rupee", neither runtime could give you both.

Marz's insight was: stop trying to make one runtime do both. Run two pipelines. Make the batch pipeline the source of truth, run it on immutable raw data so it is fully reproducible, and accept the latency. Run the streaming pipeline as a best-effort approximation of the most recent window — say, the last 6 hours — and accept the correctness gaps. At query time, merge: take the batch result for everything older than 6 hours, the stream result for everything inside that window, and serve the union.
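That merge rule fits in a few lines. The sketch below uses hypothetical hourly buckets and in-memory dicts, not a real serving layer: the batch view is authoritative up to its high-water mark, and the stream fills in everything after it.

```python
def merge_views(batch_rows, stream_rows, batch_high_water):
    # Batch is authoritative for every bucket at or before the high-water
    # mark; the speed layer supplies the buckets batch has not reached yet.
    merged = {hour: count for hour, count in batch_rows.items()
              if hour <= batch_high_water}
    for hour, count in stream_rows.items():
        if hour > batch_high_water:
            merged[hour] = count
    return merged

batch = {"07:00": 1200, "08:00": 1310}    # exact, from the hourly batch run
stream = {"08:00": 1280, "09:00": 640}    # approximate, last-6h window
view = merge_views(batch, stream, batch_high_water="08:00")
# Batch wins for 08:00 (1310, exact); the stream supplies 09:00 (640, approximate).
```

Note that where the two layers overlap (the 08:00 bucket), the batch number silently wins — that is the "recomputation is the reset button" principle expressed as a dictionary merge.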

Lambda architecture: three layers, two pipelines, one query. [Diagram: immutable raw events (Kafka / HDFS, append-only) fan out to a BATCH layer (Hadoop / Spark, hourly, recomputes from scratch) and a SPEED layer (Storm / Spark Streaming, approximate, last 6h only); batch views land in HBase / Cassandra (truth, ≤ 1h stale), real-time views in Redis / Memcached (approximate, ≤ 5s stale); the SERVING layer answers each query as batch_view UNION recent_view (last 6h).]
Raw events fan out to both layers. The serving layer merges. The batch layer is the source of truth; the speed layer is a band-aid for the latency window the batch layer can't cover.

The three principles Marz baked in are still good ideas:

  1. Immutable raw data is the system of record. You append events to a log, you never mutate. Any view can be recomputed from scratch. Why immutability matters: a bug in your aggregation logic discovered three months later is fixable. You re-run the batch job from raw data with the corrected logic, and the truth catches up. With mutable state, the bug has already corrupted the state — there is nothing to re-derive from.
  2. Recomputation is the reset button. Whenever the batch layer finishes, it overwrites the speed-layer's approximation for that window. Errors in the streaming layer have a bounded blast radius — they live for at most one batch interval before the truth arrives.
  3. Latency and correctness are separate concerns that can be served by separate systems. The dashboard's "real time" panel shows you 1.24 crore; the finance team's reconciliation pulls from the batch view and gets 12,387,452. Both are answers to "what happened today" — at different points on the latency-correctness curve.
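Principles 1 and 2 fit in a few lines. A toy sketch (hypothetical events, not production code): because the log is append-only, a view built with buggy logic can simply be thrown away and re-derived.

```python
# Append-only event log — the system of record. Never mutated.
events = [
    {"merchant_id": "m1", "amount": 500, "status": "captured"},
    {"merchant_id": "m2", "amount": 0,   "status": "captured"},  # zero-amount test ping
    {"merchant_id": "m1", "amount": 250, "status": "captured"},
]

def buggy_view(log):
    # Original logic: forgot to exclude zero-amount events.
    return len({e["merchant_id"] for e in log if e["status"] == "captured"})

def fixed_view(log):
    # Corrected logic, discovered months later. Re-run it over the same
    # immutable log and the truth catches up — nothing to "repair" in place.
    return len({e["merchant_id"] for e in log
                if e["status"] == "captured" and e["amount"] > 0})

buggy_view(events)   # 2 — counts the test ping's merchant
fixed_view(events)   # 1 — the recompute erases the bug's entire blast radius
```

With a mutable counter instead of a log, the m2 increment would already be baked into state, and no amount of fixed code could un-count it.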

These principles outlived Lambda itself. Modern architectures bake them in differently, but they are still the foundation.

What it actually looks like in code: the duplication tax

The killer flaw is not in the architecture diagram. It is in the codebase. Here is the same metric — "count distinct merchants who received a payment today" — written for both layers. The shape of the duplication is what eventually killed Lambda.

# lambda_metric_duplication.py — the same metric, two pipelines, two languages

# ============== BATCH LAYER (Spark, runs hourly on S3 parquet) ==============
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-merchants-batch").getOrCreate()
events = spark.read.parquet("s3://razorpay-events/payments/dt=2026-04-25/")

batch_view = (
    events
      .filter(F.col("status") == "captured")
      .filter(F.col("amount") > 0)
      .groupBy(F.to_date("created_at").alias("dt"))
      .agg(F.countDistinct("merchant_id").alias("distinct_merchants"))
)
batch_view.write.mode("overwrite").parquet("s3://razorpay-views/batch/daily-merchants/")
# Output: dt=2026-04-25, distinct_merchants=4_82_117  (after 47 minutes of compute)


# ============== SPEED LAYER (Storm topology, sub-second latency) ============
# In real Storm this is Java; the Python flavour below is pyleus-style for clarity.
from pyleus.storm import SimpleBolt
import redis, datetime

class DistinctMerchantsBolt(SimpleBolt):
    def initialize(self, conf, ctx):
        self.r = redis.Redis(host="speed-layer-redis", db=0)

    def process_tuple(self, tup):
        event = tup.values[0]                # {"status": ..., "amount": ..., "merchant_id": ...}
        if event["status"] != "captured":     # filter — same as batch
            return
        if event["amount"] <= 0:               # filter — same as batch
            return
        dt = datetime.datetime.fromtimestamp(event["created_at"]).date().isoformat()
        # Approximate distinct count via HyperLogLog (Redis PFADD)
        # — the batch uses exact countDistinct; the speed layer trades accuracy for memory
        self.r.pfadd(f"merchants_today:{dt}", event["merchant_id"])

# ============== SERVING LAYER (the merge query) =============================
def serve_distinct_merchants(dt: str) -> int:
    # Read batch view first; if its watermark is recent enough, return it.
    # read_parquet_view is a stand-in for however your view store is read.
    batch_row = read_parquet_view(f"s3://razorpay-views/batch/daily-merchants/dt={dt}/")
    batch_watermark = batch_row["computed_at"]   # last batch run timestamp
    now = datetime.datetime.utcnow()
    if (now - batch_watermark) < datetime.timedelta(hours=1):
        return batch_row["distinct_merchants"]
    # Else fall back to the speed layer's HLL estimate
    return int(redis.Redis(host="speed-layer-redis").pfcount(f"merchants_today:{dt}"))

# A typical day in production:
# 09:00  batch run completes for the dt=2026-04-25 partition (00:00-08:00 data)
#          batch_view: distinct_merchants = 1_67_342 (exact)
# 09:00-10:00  the watermark is under an hour old, so queries return the
#          batch value — exact, but stale by up to 60 minutes
# 10:00  batch run completes for 00:00-09:00 data
#          batch_view: distinct_merchants = 2_04_881 (exact)
# If a batch run is delayed past the 1h threshold, queries fall back to the
# speed layer's HLL count — fresh to ~5s but approximate, with an error
# bound of roughly ±0.81% per the HLL standard error at Redis's register count
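The ±0.81% figure is not arbitrary: it is the HyperLogLog relative standard error, 1.04/√m, evaluated at Redis's m = 2^14 registers. A quick check, plain Python, no Redis needed:

```python
import math

m = 2 ** 14                        # registers in Redis's HLL implementation
std_err = 1.04 / math.sqrt(m)      # HLL relative standard error: 1.04 / sqrt(m)
# std_err = 0.008125, i.e. ±0.81% of the true cardinality.
# For a delta of ~40,000 distinct merchants, that is roughly ±325 either way.
```

This is the explicit trade the speed layer makes: a few kilobytes of Redis memory per day-key instead of an exact distinct set, at the cost of an error the batch layer erases an hour later.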

Walk through what makes the duplication painful:

  1. Two languages, two runtimes. The batch predicates are Spark SQL expressions; the speed predicates are hand-written Python (Java, in real Storm). The status == "captured" and amount > 0 filters exist twice, and nothing but code review keeps them identical.
  2. Two aggregation semantics. The batch layer computes an exact countDistinct; the speed layer keeps a Redis HyperLogLog. The two numbers are not supposed to match exactly, so drift detection ("is this a bug, or just HLL error?") is genuinely hard.
  3. Two notions of time. The batch job derives the day with to_date on a timestamp column; the speed layer parses a Unix epoch with datetime.fromtimestamp, which applies the worker's local timezone. A UTC worker and an IST-keyed partition can disagree about which day an 11:45 pm payment belongs to.
  4. Two failure stories. A failed batch job re-runs idempotently from S3; a crashed Storm worker may replay tuples into PFADD. That replay happens to be harmless here only because PFADD is idempotent — the same metric as a sum instead of a distinct count would double-count.

The point is not that any single one of these is unfixable. The point is that you fix all of them, twice. Every change to a metric definition becomes a coordinated migration across two codebases, two test suites, two deployment pipelines, two on-call rotations.

Why it died: the triple cost

By 2018, the same teams that championed Lambda were dismantling it. Three forces converged.

The three forces that killed Lambda architecture. [Diagram, three panels: 1. Duplication tax: two codebases, two languages, two test suites, two on-calls; silent semantic drift; every metric shipped twice, at a cost that scales with metric count. 2. Runtime convergence: Flink (2014) treats batch as a bounded stream; Beam / Dataflow (2015) offers one API over multiple runners; stream engines learned exactly-once and windowing, removing the need for a batch backup and a second codebase. 3. Replay-from-log: Kafka retention grew from 7 days toward infinite via tiered storage to S3; the log became the system of record, and "reprocess from offset 0" replaced the nightly Hadoop run, leading to Kappa architecture and a single storage tier.]
Each pillar of Lambda — separate engines, separate storage, separate codebases — became unnecessary as the underlying platforms matured.

The duplication tax was the loudest complaint but the most fixable one in principle. Teams that wrote a metric DSL on top of both layers (Twitter's Summingbird, LinkedIn's internal tooling) could express a metric once and compile it down to both. Summingbird in particular was a heroic effort — a Scala DSL that produced a Storm topology and a Scalding (Hadoop) job from the same code. It mostly worked, and it required a small army of people to keep the two backends in sync. Most teams did not have that army.

Runtime convergence was the larger force. Apache Flink shipped in 2014 with a model that treated batch as a special case of streaming — a bounded stream that ends. By 2017 Flink could do exactly-once stateful computation with windowing semantics that matched (and exceeded) what Spark Batch could do. The argument "we run batch because streaming isn't accurate enough" stopped being true. Apache Beam (Google's open-sourcing of the Dataflow model in 2015) cemented this — one API, multiple runners.

Kafka grew up and replaced HDFS as the system of record for many use cases. Once Kafka had infinite retention via tiered storage to S3 (officially in 2022 via KIP-405; informally via Confluent for years), the question "what is the immutable log of all events" had a single answer: Kafka itself. There was no need for a separate "raw data on HDFS" tier — the Kafka topic, replayed from offset 0, was the raw data. This is the conceptual jump that led to Kappa architecture (covered in /wiki/kappa-architecture-stream-only-and-reprocess-from-the-log).
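The "reprocess from offset 0" move can be sketched with a toy append-only log. This is a hypothetical in-memory class, not the Kafka API; a real deployment would seek a consumer to offset 0 on a long-retention topic.

```python
class Log:
    """Toy append-only log: the topic itself is the system of record."""
    def __init__(self):
        self.records = []

    def append(self, event):
        self.records.append(event)            # append-only; never mutated

    def replay(self, from_offset=0):
        yield from self.records[from_offset:]  # any consumer can re-read history

log = Log()
for amount in [500, 250, 125]:
    log.append({"amount": amount})

def build_view(events):                        # v1 derivation logic: sum in rupees
    return sum(e["amount"] for e in events)

view_v1 = build_view(log.replay())             # 875

# Derivation logic changes? No nightly Hadoop run over a separate HDFS tier —
# just rebuild the view by replaying the same log from offset 0.
def build_view_v2(events):                     # v2 logic: sum in paise
    return sum(e["amount"] * 100 for e in events)

view_v2 = build_view_v2(log.replay(from_offset=0))   # 87500
```

Once the log retains everything, the batch layer's one remaining job — recomputing the truth from raw data — is just this replay, run by the same streaming engine. That is the whole of Kappa.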

By 2020, a startup standing up analytics for the first time would not pick Lambda. They would pick a single streaming engine (Flink, or a managed equivalent like Confluent's ksqlDB or Materialize) with the source of truth in Kafka or a lakehouse table format (/wiki/iceberg-the-format-and-the-catalog-that-runs-on-it).

What survives, what dies, what mutates

Lambda's specific architecture — separate Storm and Hadoop pipelines with a serving-layer merge — is dead. Its principles live on in three places: immutability as the event-sourced log and the lakehouse table, recomputability as Kappa's replay-from-offset-0, and the latency-correctness split as watermark and allowed-lateness configuration inside a single engine.

In Indian production systems, the trace of Lambda is everywhere if you know what to look for. Flipkart's Big Billion Days dashboard in 2017 was Lambda-shaped — Storm fed Redis for the live counters, Hive batch jobs fed the post-event reconciliation. Razorpay's payment analytics had a Spark Streaming + Spark Batch split through 2019. By 2024 both teams had migrated to Flink-on-Iceberg topologies, with the speed/batch distinction surviving only as a watermark configuration. The architecture diagrams got simpler; the underlying problems — late data, exactly-once, windowing — stayed exactly as hard.

Common confusions

Going deeper

Why Marz emphasised immutability so hard

Re-read How to beat the CAP theorem and the framing becomes clear. Marz's argument is that mutability is the cause of most distributed-systems pain — partial failures, conflicting writes, the entire CAP-theorem trade-off space. If your system of record is append-only immutable events, then any view derived from those events is recomputable; any inconsistency is eventual; any bug is a recompute away from being fixed. This insight is independently true and has propagated far beyond Lambda — event sourcing, CQRS, lakehouses, Datomic, even the "log-structured merge tree" inside RocksDB share the same DNA.

Summingbird and the heroism of metric-DSL approaches

Twitter's Summingbird (2013, open-sourced 2014) was the most ambitious attempt to fix the duplication tax. You wrote a metric in Scala using a typed monoid algebra:

def metric: Producer[Platform, (Date, MerchantId), Long] =
  source.filter(_.status == "captured")
        .filter(_.amount > 0)
        .map(e => (e.date, e.merchantId) -> 1L)
        .sumByKey(merchants)

A Platform parameter let the same expression compile to a Storm topology (online) and a Scalding job (batch). Same code, two runtimes, guaranteed semantic match. This was elegant — and required the team to maintain two backends plus the DSL plus the type-system gymnastics that made the monoid-based aggregation work for both. Summingbird was deprecated in favour of Heron (Twitter's Storm successor) and eventually a Flink-based pipeline. The lesson is that even with heroic DSL effort, the runtime fragmentation kept leaking through — Storm's fault tolerance had different failure modes than Scalding's, and the DSL couldn't paper over those.
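The monoid trick translates to any language. Below is a toy Python analogue of the idea — not Summingbird's actual API: express the metric once as key extraction plus an associative, commutative merge, then drive that one definition from either a bounded batch fold or an incremental stream.

```python
from collections import defaultdict

def metric(event):
    # Defined ONCE: filter + key + monoid value (here the monoid is Long addition).
    if event["status"] == "captured" and event["amount"] > 0:
        yield (event["dt"], event["merchant_id"]), 1

def run_batch(events):
    # "Scalding" flavour: fold the whole bounded dataset at once.
    out = defaultdict(int)
    for e in events:
        for k, v in metric(e):
            out[k] += v
    return dict(out)

class StreamRunner:
    # "Storm" flavour: the same metric applied one event at a time.
    def __init__(self):
        self.state = defaultdict(int)

    def on_event(self, e):
        for k, v in metric(e):
            self.state[k] += v

events = [
    {"dt": "2026-04-25", "merchant_id": "m1", "amount": 500, "status": "captured"},
    {"dt": "2026-04-25", "merchant_id": "m1", "amount": 0,   "status": "captured"},
    {"dt": "2026-04-25", "merchant_id": "m2", "amount": 100, "status": "captured"},
]
batch_result = run_batch(events)
stream = StreamRunner()
for e in events:
    stream.on_event(e)
assert batch_result == dict(stream.state)   # associativity ⇒ same answer, any grouping
```

The guarantee comes from the algebra, not the runtimes: because the merge is associative, batching and streaming are just different groupings of the same fold. What the DSL could not unify — and what leaked through — was everything outside the fold: delivery guarantees, state recovery, deploy tooling.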

The CAP angle: why "beat" is too strong

Marz's "How to beat the CAP theorem" is a good read but the title overpromises. Lambda doesn't beat CAP — it sidesteps the consistency/availability question by accepting that the speed layer is eventually consistent and the batch layer is strongly consistent on its own snapshot. At any moment, reading the merged view is reading a stale-but-consistent batch result joined with a fresh-but-approximate streaming result. There's no global consistency point. CAP says that during a partition a system cannot offer both consistency and availability; Lambda says "I'll have both, but not for the same query at the same time." It's a useful trick, not a theorem violation.

Where Lambda's serving-layer merge breaks: out-of-order events that span the boundary

The boundary between batch view (e.g., closed at 08:00) and speed view (covers 08:00 onwards) assumes events arrive in event-time order, which they don't. Suppose a delayed payment event with event_time = 07:55:00 arrives at the streaming layer at processing_time = 08:30:00. The batch view for the 00:00–08:00 partition is already closed. The speed layer will count this event, but the next batch run (covering 00:00–09:00) will also count it — unless you carefully track watermarks across the boundary. Real Lambda deployments addressed this with allowed-lateness windows in the speed layer plus idempotent merge in the serving layer (e.g., distinct counts via HLL, sums via per-event-id keys). This boundary problem is exactly what the Dataflow model (/wiki/the-dataflow-model-batch-as-bounded-streams) fixed by treating the boundary as a watermark, not a wall.
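The double count can be reproduced in a few lines. A toy sketch with hypothetical event ids: a count-based merge counts the 07:55 straggler twice, while a merge that is idempotent over event ids (the per-event-id-key trick; HLL union behaves the same way for distincts) does not.

```python
speed_ids  = {"e3", "late"}            # speed layer's 08:00+ window — it counted the
                                       # straggler (event_time 07:55, arrived 08:30)
next_batch = {"e1", "e2", "late"}      # 00:00-09:00 batch rerun picks it up too

# Naive merge: add the two layers' counts — the straggler lands in both.
naive_total = len(next_batch) + len(speed_ids)   # 5: "late" counted twice

# Idempotent merge keyed by event id: set union deduplicates across the boundary.
true_total = len(next_batch | speed_ids)         # 4: "late" counted once
```

The fix works because union, like PFADD, is idempotent: counting the same event id again is a no-op. Metrics that are plain sums have no such escape hatch, which is why the Dataflow model replaced the wall with a watermark instead of patching the merge.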

How Indian fintech moved past Lambda

PhonePe's transaction analytics platform was Lambda-shaped through 2019 — Spark Batch for the daily settlement reports, Spark Streaming for the live dashboard. The migration to a unified Flink-on-Iceberg platform took about 18 months and was driven less by engineering elegance and more by a single operational pain: every metric definition change required two PRs, two reviewers, two deploys, and a 24-hour soak test where engineers stared at both pipelines to verify no drift. The new platform reduced that to a single PR. The savings showed up not in compute cost (which actually grew, slightly) but in engineering throughput — the analytics team could ship 5× more metric changes per quarter.

Where this leads next

Lambda's intellectual successors are the next four chapters in this build:

The pattern to internalise: Lambda was the right answer to a specific 2011 problem (weak streaming engines + slow batch engines, no shared abstraction). The principles it surfaced — immutability, recomputability, latency-correctness separation — are still load-bearing. The specific three-layer topology is a museum piece. Read Marz's original post for the principles; do not copy the architecture.

References