Wall: daily batches are too slow for the business

It is 13:42 IST on a Saturday at a Bengaluru payments company. Card-not-present fraud on a single BIN range starts spiking — twenty fraudulent ₹19,999 transactions per minute, all routed through a freshly compromised checkout flow. The fraud-ops dashboard, fed by the nightly Iceberg pipeline that everyone built in Build 6, will refresh at 02:00 the next morning. By the time analyst Aditi sees the spike on Sunday's stand-up, the issuer has already reported ₹3.4 crore in chargebacks, the acquirer has slapped a hold on the merchant account, and the engineering team is in a Sunday war room rebuilding the rule that should have fired in the first 90 seconds. The lakehouse is correct. The columnar layout is fast. Compaction ran cleanly at 03:15. None of that mattered, because the business needed the answer in seconds and the architecture handed it back in more than twelve hours.

Build 6 made storage cheap and queries scan-efficient, but it left the cadence problem untouched: a pipeline that runs once a day or once an hour can never answer questions whose value decays in seconds. Fraud detection, dynamic pricing, ride dispatch, ad bidding, and personalisation all need event-time-to-insight measured in seconds, not hours. The wall is the moment the business stops accepting "the dashboard refreshes overnight" as an answer — and the only honest fix is to flip the pipeline from "store first, compute later" to "process events as they arrive". That flip is Build 7.

The shape of the wall — latency is a product feature, not a perf metric

The pipelines you have built so far share one assumption: data lands in a staging area, an orchestrator wakes up on a schedule, the transform reads the staging area, writes the gold table, and the dashboard reads the gold table. Every step is a barrier. The orchestrator does not know that fraud is spiking; it knows that 02:00 has arrived. The staging area does not know an event is urgent; it knows a file appeared. The dashboard does not know the user is waiting; it knows a query was issued. The end-to-end event-time-to-insight — the wall-clock interval between an event happening in the world and the system being able to act on it — is the sum of every barrier.

Even on a well-tuned hourly pipeline, the math is brutal. An event that occurs at 14:07 lands in S3 at 14:08 (1 minute), waits for the next top-of-hour trigger at 15:00 (52 minutes), the transform takes 18 minutes (it scans the last 24 hours for join consistency) and finishes at 15:18, the dashboard cache invalidates, and the user sees the number at 15:19. Total: 72 minutes for one unremarkable event — and that is a mid-hour arrival, not the worst case. Daily pipelines push the worst case past 24 hours.

[Figure: Event-time-to-insight under daily, hourly, and streaming cadences. Three stacked timelines for an event at 14:07. Daily batch (02:00 nightly Iceberg pipeline): insight at 02:00 the next day, ~12 hours wall-clock, with useful work under 1% of the interval. Hourly batch (top-of-hour trigger): insight at 15:19, ~72 minutes, of which ~73% is waiting for the next clock tick. Streaming (event-driven): event → log → operator → sink, insight at 14:07:08, ~8 seconds — the operator wakes when the event arrives, not when the cron fires. The bar isn't "make the transform faster"; it's "remove the wait barrier between the event and the operator".]
The dominant cost in batch isn't the transform — it's the wait between the event and the next scheduled trigger. Streaming removes the wait, not the transform.

Why hourly is not "almost streaming": the median latency of an hourly job is 30 minutes (events arrive uniformly across the hour) and the worst case is 60+ minutes. Most decisions a fraud system or a ride-dispatcher cares about have value that decays exponentially with a half-life of around 30 seconds. By the median latency of an hourly batch, the answer's value has halved 60 times — a factor of roughly 10^18, effectively zero.

Five workloads that broke the batch contract

The wall is concrete. These are the five workloads that, around 2018–2022, forced every serious Indian platform off batch and onto streaming.

| Workload | Decay half-life | What batch returns | What the business needs |
| --- | --- | --- | --- |
| Card fraud detection (Razorpay, PayU) | ~30 s | Yesterday's spike, today's chargeback | Block the transaction in < 200 ms |
| Surge pricing (Ola, Uber, Rapido) | ~60 s | Last hour's average | Per-cell, per-minute fare multiplier |
| Ride dispatch (Swiggy, Zomato delivery) | ~5 s | Driver's location 4 minutes ago | Driver's location now, ETA accurate to 30 s |
| Real-time bidding (InMobi, Jio Ads) | ~10 ms | Yesterday's CTR | Bid value computed during the 100 ms RTB auction |
| Personalisation re-rank (Flipkart, Myntra) | ~5 min | This morning's behaviour | The user's last 30 seconds of clicks |

The pattern across all five: the value of the answer is a function of how recent the input is, and that function decays faster than any reasonable batch schedule can keep up with. You can shrink the schedule from daily to hourly to 15-minute to 5-minute "micro-batch", but each step costs more orchestrator overhead, more cumulative transform startup cost, more compaction churn, and more manifest writes — and you still cannot beat the median wait inside a 5-minute interval, which is 2.5 minutes. For a 30-second half-life, 2.5 minutes is five half-lives — the answer is worth about 3% of its event-time value.

A sixth workload — and the one most batch-trained engineers underweight — is the operational dashboard the on-call uses at 2 a.m. When the payments pipeline alarm fires, the on-call engineer wants to know "what's broken, right now". A 5-minute-stale dashboard that says "everything is fine" while the system is currently on fire is worse than no dashboard at all, because it teaches the on-call to distrust the tool and revert to grepping logs. Real-time observability — Grafana with sub-second freshness on Prometheus or VictoriaMetrics — is itself a streaming-shaped problem. The metrics pipeline is one of the streaming pipelines you'll build.

Building it: measuring how fast value decays

The "value decays" intuition is precise enough to compute. If an answer has half-life h and your pipeline has end-to-end latency L, the value-of-information ratio is 0.5^(L/h). Below is a working simulation that compares daily, hourly, micro-batch, and streaming on a 30-second half-life workload.

# value_decay.py — how much of an answer's value survives at each cadence?
import random, statistics

# Scenarios: name -> (typical latency for reference, hard ceiling), seconds.
# The simulation draws each event's end-to-end latency uniformly in
# [0, ceiling] — a simplification that ignores the run-time floor, but it
# captures the dominant effect: where the event falls inside the interval.
SCHEDULES = {
    "daily_02_00":   (12 * 3600,  24 * 3600),     # nightly batch
    "hourly":        (1800 + 1080, 3600 + 1080),  # 30 min wait + 18 min run
    "micro_5min":    (150 + 90,    300 + 90),     # 2.5 min wait + 90 s run
    "streaming":     (8,           20),           # 8 s typical, 20 s p99
}
HALF_LIFE_S = 30.0  # fraud-detection-style decay
N_EVENTS    = 100_000

def voi(latency_seconds, half_life=HALF_LIFE_S):
    # Fraction of an answer's value that survives `latency_seconds` of delay.
    return 0.5 ** (latency_seconds / half_life)

random.seed(42)
print(f"{'cadence':<14} {'p50_lat':>10} {'p99_lat':>10} {'voi_p50':>10} "
      f"{'voi_mean':>10} {'voi_p99':>10}")
for name, (_typical_s, max_s) in SCHEDULES.items():
    latencies = [random.uniform(0, max_s) for _ in range(N_EVENTS)]
    vois = [voi(l) for l in latencies]
    p50 = statistics.median(latencies)
    p99 = sorted(latencies)[int(N_EVENTS * 0.99)]
    voi_p50  = voi(p50)
    voi_mean = statistics.mean(vois)
    voi_p99  = voi(p99)
    print(f"{name:<14} {p50:>10.1f} {p99:>10.1f} "
          f"{voi_p50:>10.6f} {voi_mean:>10.6f} {voi_p99:>10.6f}")

# Representative output (Monte Carlo — exact digits vary with the seed):
# cadence           p50_lat    p99_lat    voi_p50   voi_mean    voi_p99
# daily_02_00      ~43200     ~85500    0.000000    ~0.0005   0.000000
# hourly            ~2340      ~4630    0.000000    ~0.009    0.000000
# micro_5min         ~195       ~386    ~0.011      ~0.11     ~0.0001
# streaming          ~10.0      ~19.8   ~0.79       ~0.80     ~0.63

The lines that matter:

Why the daily and hourly columns show 0.000000 at p50: with a 30-second half-life, even the median latency of the hourly cadence is 39 minutes, which is 78 half-lives, so the surviving value is 0.5^78 ≈ 3 × 10^-24 — it prints as zero at six decimal places. The number isn't literally zero; it is too small to matter, which is the entire point.

The takeaway is the gap in the voi_mean column between micro-batch and streaming: micro-batch retains roughly a tenth of the answer's value on average, streaming roughly four-fifths — close to an order of magnitude apart. The point is not "streaming is twice as fast as batch" — it is "for a 30-second half-life workload, streaming preserves nearly an order of magnitude more value than even a 5-minute micro-batch can". And micro-batch already fires 12× more orchestrator triggers per day than hourly. Past a certain decay rate, the next batch-shrinking step stops being available at any price, because shrinking the schedule below the transform's startup cost is impossible — you can't run a Spark job in 30 seconds.

Why "just run the batch every minute" doesn't work

Every team that hits this wall first asks: can we just run the existing pipeline every minute? Three things break.

Startup cost dominates. A typical Spark or Trino job spends 30–90 seconds spinning up executors, fetching the manifest, planning the query, and only then begins to read data. A 60-second schedule with a 60-second startup cost has zero useful runtime. Even on warm clusters with reused executors, the per-trigger overhead of the orchestrator (DAG scheduling, sensor polling, lineage emission) takes 3–10 seconds — and you pay it once per minute, 1440 times a day. The orchestrator's own database starts to bend under the trigger frequency; Airflow's metadata DB on a 1-minute schedule across 200 DAGs commits roughly 12,000 task-state writes per minute.

Compaction storms. Each micro-batch run produces a new commit. Iceberg or Delta now has 1440 commits a day instead of 24, each with its own manifest and small data file. The compaction job that was a once-a-day chore is now a continuous background process, and the small-file count compounds faster than compaction can keep up. (You read /wiki/compaction-small-files-hell-and-how-to-avoid-it — that problem multiplies by 60×.) The S3 PUT cost alone goes from a few hundred per day to tens of thousands.

State doesn't fit in a single batch. "Number of failed login attempts in the last 5 minutes" is trivial in streaming (a 5-minute window with a counter) and a nightmare in micro-batch — every run has to read the previous batch's output to know the running count, and the read-modify-write race between consecutive batches is the kind of bug that loses ₹2 crore on a busy Saturday. Sliding-window aggregations like "p99 latency over the last 60 seconds, updated every 5 seconds" require state that survives across micro-batch invocations, and re-loading that state from a warehouse table every 5 seconds is its own performance disaster.
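The read-modify-write race above is exactly what per-key operator state eliminates. A minimal sketch of the failed-login counter as streaming state (hypothetical names; a real engine would keep this in a checkpointed state store, not a plain in-memory dict):

```python
# Per-key sliding-window state held by a long-running operator: the
# streaming answer to "failed logins in the last 5 minutes".
from collections import defaultdict, deque

class FailedLoginCounter:
    def __init__(self, window_s=300):
        self.window_s = window_s
        self.state = defaultdict(deque)  # user_id -> event times in window

    def on_event(self, user_id, event_time_s):
        """Ingest one failure event; return the windowed count for the user."""
        q = self.state[user_id]
        q.append(event_time_s)
        while q and q[0] <= event_time_s - self.window_s:
            q.popleft()  # evict events that have slid out of the window
        return len(q)

op = FailedLoginCounter()
counts = [op.on_event("u1", t) for t in (0, 10, 20, 400)]
# counts == [1, 2, 3, 1]: by t=400 the first three events have expired.
```

No batch run ever has to re-read a previous batch's output: the count lives with the operator, updated once per event.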

Out-of-order events break the bucket. Batch processing assumes "events that landed in this trigger window belong to this trigger window". A late-arriving event (the rider's GPS ping that took 90 seconds to upload because their phone was on a flaky 4G signal) lands in the wrong batch and produces wrong aggregates. Streaming systems treat this as a first-class problem with watermarks (Build 8) — but a batch pipeline has no language to express it.

[Figure: Useful runtime per trigger as you shrink the schedule. X-axis: schedule period, log-spaced from 1 day down to 10 seconds; y-axis: useful runtime per trigger, 0–100%. The curve collapses as the period shrinks and hits zero at the ~60 s startup-cost floor — below it, batch is dead.]
As the schedule shrinks, useful runtime per trigger collapses; below the ~60 s startup-cost floor, every trigger is pure overhead. Streaming moves to a different model: one long-running operator, not many short triggers.
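The floor in the figure is easy to reproduce numerically. A small sketch, assuming a fixed 60 s startup cost per trigger (the figure's floor; real startup costs vary by engine and cluster state):

```python
# Useful-runtime fraction per trigger as the schedule period shrinks,
# assuming a fixed 60 s startup cost per trigger.
STARTUP_S = 60.0

def useful_fraction(period_s, startup_s=STARTUP_S):
    """Share of each trigger interval left for useful work."""
    return max(0.0, period_s - startup_s) / period_s

periods_s = {"1 day": 86_400, "1 hour": 3_600, "5 min": 300,
             "1 min": 60, "10 s": 10}
fractions = {name: useful_fraction(p) for name, p in periods_s.items()}
# 1 day: ~99.9% useful; 5 min: 80%; at or below 60 s: 0% — pure overhead.
```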

The streaming model gives up on "trigger periodically" entirely. Instead: one long-running operator, fed events one at a time (or in tiny micro-batches of 50–500 events), keeping its own state in memory or on local SSD, emitting outputs continuously. There is no startup cost because the operator never stops. There are no compaction storms because the operator is the one writing to the sink, and it can batch writes at whatever cadence the sink wants. And state is a first-class citizen — windows, joins, deduplication all live in the operator's local state store.

Why "long-running operator" is a fundamentally different cost shape: a batch trigger has fixed startup cost S regardless of payload size, so cost-per-event is S/n and shrinks toward zero as n grows. Below a critical n, S/n exceeds the per-event work and the trigger is mostly overhead. Streaming flattens this: the operator pays S once at startup (and again only on restart/failover), so cost-per-event approaches the irreducible per-event cost asymptotically — there is no minimum-batch-size threshold below which the model breaks.

The "freshness SLO" reframed

Build 5 introduced the freshness SLO — a contract that says "the gold table will be at most N minutes stale". The Build-5 SLO was a batch freshness SLO; it constrained the worst-case wall-clock age of the latest fact. The streaming-era reframing changes the unit. Instead of "how stale is the table", the new SLO is "how stale is the answer the user just received". The two are not the same.

A table can be 5 seconds fresh and the user's answer can be 5 minutes stale, if the user's query was answered from a cache, an aggregate, or a materialised view that hasn't refreshed yet. A table can be 5 minutes stale and the user's answer can be 1 second fresh, if the user's query bypasses the table and reads directly from the streaming layer. The SLO has to follow the user's read path, not the table's write path. Build 14 (real-time analytics) makes this explicit: the freshness SLO is t_event_arrives_at_user_screen - t_event_happened_in_world, not t_event_visible_in_table - t_event_happened_in_world.

Why this matters when you negotiate the SLO with the business: a 30-second-fresh table is technically "fresh", but if the dashboard polls it every 5 minutes, the user sees data up to 5 minutes 30 seconds old. The SLO has to budget the entire chain — capture, transport, store, serve, render — and any 95th-percentile contributor that exceeds the budget breaks the contract.
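A back-of-envelope version of that chain budget, with assumed per-stage p95 figures, shows how a fresh table still hands the user a stale answer:

```python
# Hypothetical chain budget (assumed p95 figures) showing how the answer's
# staleness diverges from the table's staleness once the read path counts.
chain_p95_s = {
    "capture":   1.0,  # event happens -> producer emits
    "transport": 2.0,  # producer -> log -> operator
    "store":     3.0,  # operator -> serving table commit
    "serve":     0.5,  # query execution on the serving layer
    "render":    0.5,  # dashboard paint
}
poll_interval_s = 300.0  # the dashboard only re-queries every 5 minutes

# "How stale is the table": write path only.
table_staleness_s = sum(chain_p95_s[k] for k in ("capture", "transport", "store"))
# "How stale is the answer the user just received": whole chain plus the
# worst-case wait for the next dashboard poll.
answer_staleness_s = sum(chain_p95_s.values()) + poll_interval_s
# A 6-second-fresh table; a ~5-minute-stale answer.
```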

A war story: how Swiggy hit the wall

In 2019, Swiggy's "delivery ETA shown to the customer" was computed by a 5-minute micro-batch that aggregated rider GPS pings, restaurant prep status, and the city's traffic state into a per-order ETA estimate. The system was correct — the model had been validated, the lineage was clean, and the freshness SLO was 5 minutes. On a Tuesday lunch rush in Bengaluru, a thunderstorm hit Koramangala at 12:48. Riders slowed, restaurants started running 20 minutes late, and the ETA on the customer's screen — the one frozen at the 12:45 micro-batch tick — went from "accurate" to "lying" inside 90 seconds. By 12:50, the next batch tick fired. The new batch read 5 minutes of stale GPS pings (some still showing pre-storm speeds because of GPS lag), produced an ETA that was 7 minutes optimistic, and showed it to every customer for the next 5 minutes. Cancellations went up 4×. CX got slammed. The fix was not a faster batch; it was a streaming pipeline that updated the ETA every time a GPS ping came in for a rider on an active order. After the migration, p99 ETA-update latency dropped from 5 minutes to 4 seconds, and cancellation rates during weather events dropped 60%. Swiggy's then-VP of engineering, Dale Vaz, described the migration in a 2021 talk; the takeaway was that "we did not need a faster batch — we needed a different shape".

The shape difference is the point. The old pipeline was triggered by a clock. The new pipeline was triggered by an event. Every rider GPS ping became an event on a Kafka topic; a stateful operator joined that ping with the open orders for that rider, recomputed the ETA, and pushed the result into the customer's screen via a websocket. The pipeline never sleeps; the operator never restarts; the cost model is "per event" instead of "per trigger". This is the shape of every chapter in Build 7 and Build 8.
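The per-event shape can be sketched in a few lines. This is illustrative code for the pattern only, not Swiggy's actual system — the names and the naive distance-over-speed ETA model are invented:

```python
# Event-driven operator: one GPS ping in -> a fresh ETA out for each of
# that rider's open orders. State (open_orders) lives with the operator.
open_orders = {"rider-7": [{"order_id": "o-42", "dest_km": 2.4}]}

def on_gps_ping(rider_id, speed_kmh, etas_out):
    """Join the ping against the rider's open orders; recompute each ETA."""
    for order in open_orders.get(rider_id, []):
        eta_min = order["dest_km"] / max(speed_kmh, 1e-6) * 60
        etas_out.append((order["order_id"], round(eta_min, 1)))

etas = []
on_gps_ping("rider-7", speed_kmh=12.0, etas_out=etas)  # pre-storm speed
on_gps_ping("rider-7", speed_kmh=4.0, etas_out=etas)   # storm hits: ETA triples
```

The crucial property: the second ping corrects the ETA immediately, not at the next 5-minute tick.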

What the next 7 chapters give you (and what they don't)

Build 7 builds the foundation: a partitioned, replicated message log — the data structure that streaming is built on. The chapters that follow (/wiki/why-logs-the-one-data-structure-streaming-is-built-on, /wiki/a-single-partition-log-in-python, /wiki/partitions-and-parallelism, /wiki/consumer-groups-and-offset-management, /wiki/replication-and-isr-how-kafka-stays-up, /wiki/retention-compaction-tiered-storage, /wiki/kafka-vs-pulsar-vs-kinesis-vs-redpanda) build the substrate.

Build 7 does not yet give you stateful stream processing — that's Build 8 (windows, watermarks, joins). It does not give you exactly-once — that's Build 9. It does not give you a unified batch+stream model — that's Build 10. The wall you are crossing now is the substrate wall: you cannot do any of the higher-level work until you have a partitioned, replicated, append-only log that producers can write to and consumers can read from in real time. That log is the next seven chapters.

What the team has to relearn

Every team that crosses this wall pays an internal-skills tax. Writing a batch transform is, to a 2026 data engineer, second nature: a SQL query, a dbt model, a scheduled trigger. Writing a streaming operator is a different muscle — you have to think about per-key state, watermarks, checkpoints, restart semantics, backpressure, and exactly-once delivery from day one. The DAG view of the world (Build 4) is replaced by a topology view, where operators are long-lived and the data flows continuously through them rather than in discrete trigger ticks.

The on-call's mental model also shifts. With batch, "the pipeline is broken" usually means "a job failed and the retry didn't help" — a discrete event with a clear timestamp. With streaming, "the pipeline is broken" can mean "the lag on partition 17 is climbing", "the operator's state store is approaching its disk limit", "a re-balance is taking longer than the consumer's session timeout", or "watermarks have stalled because one upstream producer went silent". The failure modes are continuous, not discrete, and the alerts have to be calibrated to the new shape. The first three months after a Kafka migration are usually the hardest months an SRE team has had — the old runbooks no longer apply, and the new ones have to be written from production incidents.

How to recognise the wall in your own pipeline

Three signals tell you which workloads have hit the wall, before the war room does. Signal one: the ops dashboard shows a spike, but the engineer drilling in finds the underlying table is 23 minutes stale and "the spike was actually 23 minutes ago" — meaning the on-call is reacting to the past, not the present. Signal two: a product manager keeps asking for "real-time" without naming a number, and every quarter the answer "the dashboard is daily" loses a feature request. Signal three: the team has shrunk a batch schedule three times in eighteen months (daily → hourly → 15-minute → 5-minute) and is debating shrinking it again — that is the curve hitting the startup-cost floor. When two of these three are true, the workload has hit the wall and the only honest answer is to migrate the architecture, not the schedule.

The migration is not all-or-nothing. The standard pattern at Indian platforms is to run the streaming pipeline alongside the batch pipeline for 4–8 weeks, comparing outputs row-for-row, and cut over only after the streaming output has matched the batch output for a full business cycle (a fortnight covers most weekly seasonality, a month covers month-end accounting). The cutover itself is just a feature-flag flip — but the parallel run is what gives you the confidence that the new pipeline is correct.

Going deeper

The half-life calculation, derived

For a decision whose value at delay L is v(L) = v_0 · 2^(-L/h), the expected value-of-information at a uniform-random latency in [0, T] is E[v] = v_0 · (h / (T · ln 2)) · (1 - 2^(-T/h)). For T ≫ h this collapses to v_0 · h / (T · ln 2) — the value scales as 1/T. Halving the schedule period doubles the value retained. This is why every step from daily → hourly → 5-minute is roughly an order-of-magnitude business win, until startup cost cuts the curve off. The streaming regime exits the formula entirely: there is no T, only the operator's per-event latency.
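The closed form is easy to sanity-check against a Monte Carlo draw. A quick sketch — the 5-minute schedule and 30-second half-life are illustrative:

```python
# Check E[v] = v0 * (h / (T ln 2)) * (1 - 2^(-T/h)) against simulation
# for a latency drawn uniformly from [0, T].
import math
import random

def expected_voi_closed(T, h, v0=1.0):
    """Closed-form mean value-of-information for latency ~ Uniform(0, T)."""
    return v0 * (h / (T * math.log(2))) * (1 - 2 ** (-T / h))

def expected_voi_mc(T, h, v0=1.0, n=100_000, seed=7):
    """Monte Carlo estimate of the same expectation."""
    rng = random.Random(seed)
    return sum(v0 * 2 ** (-rng.uniform(0, T) / h) for _ in range(n)) / n

h, T = 30.0, 300.0                  # 30 s half-life, 5-minute schedule
closed = expected_voi_closed(T, h)  # ~0.144
mc     = expected_voi_mc(T, h)      # agrees to a few decimal places
approx = h / (T * math.log(2))      # the T >> h limit: the 1/T law
```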

The wall in numbers: a real Indian-platform decision matrix

Across 2020–2024, a representative sample of Indian platforms migrated specific workloads off batch and onto streaming. The decisions were not architectural ideology; they were per-workload latency budgets meeting business pressure.

| Platform | Workload | Old cadence | New cadence | What forced the change |
| --- | --- | --- | --- | --- |
| Razorpay | Card-not-present fraud rules | 5-min micro-batch | Streaming (Flink + Kafka) | A single ₹3.4 crore loss in one day |
| Zerodha | Order-book pre-trade risk check | Every 30 s | Per-event (microsecond budget) | SEBI margin-rule enforcement |
| Swiggy | Delivery ETA on customer screen | 5-min micro-batch | Per-GPS-ping streaming | Cancellation rate during weather events |
| Flipkart | Personalisation re-rank | Hourly batch | 30-second incremental | Conversion lift from "last 30 seconds of clicks" |
| Dream11 | Match contest leaderboard | 60-second polling | 1-second push | User experience during peak overs |
| PhonePe | UPI velocity-check fraud | 1-min batch | Streaming | RBI's day-zero fraud-block mandate |

In every row, the "new cadence" column is not "just faster batch" — it is a different architecture. The substrate is a partitioned message log; the operator is long-running; the state is local; the sink is incremental. That substrate is what Build 7 builds.

What "lambda architecture" was and why it died

Around 2014, Nathan Marz proposed running batch and streaming pipelines side by side: batch for correctness, streaming for freshness, merged in a serving layer. It worked, but the operational tax was brutal — every business logic change had to be implemented twice, in two different languages, and any divergence between the two implementations produced a "speed layer / batch layer" reconciliation bug. By 2018, the industry had landed on "kappa": one streaming pipeline, with the batch layer reduced to "the streaming pipeline replayed from the beginning of the log". Build 10 covers this in detail.

The latency the network gives you for free

Across an Indian fibre backbone, Mumbai to Bengaluru is around 25 ms RTT; AWS Mumbai (ap-south-1) to AWS Singapore (ap-southeast-1) is around 70 ms; cross-Pacific to AWS US-East is around 230 ms. The streaming pipeline's wire latency is bounded by these — you cannot publish an event in Mumbai and act on it in Bengaluru faster than the fibre can carry it. Anything you build has to budget for the network as a hard floor, not a tunable. This is also why latency-sensitive systems (ad bidding, HFT, fraud-block) deploy region-local — the network alone exceeds their decision budget across regions.

A useful exercise before Build 7: write the latency budget for one of your decisions, end to end. For Razorpay's transaction-block decision (target: 200 ms total from user-tap to allow/deny), a realistic split is 60 ms device-to-edge network, 15 ms ingress proxy, 25 ms rule-engine evaluation, 10 ms response serialisation, and 90 ms margin for cross-region failover. The rule-engine's 25 ms budget is what dictates the streaming architecture downstream of it. Why this matters: every layer's budget descends from the user-visible decision deadline. The streaming layer's job is to keep the feature store warm so the rule-engine's lookups stay on the fast in-memory path, never on a cold fetch.
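The budget arithmetic is worth writing down explicitly; the figures below are the ones from the Razorpay example above:

```python
# The 200 ms transaction-block budget as plain arithmetic: every stage's
# p95 allocation must sum to no more than the user-visible deadline.
BUDGET_MS = {
    "device_to_edge_network": 60,
    "ingress_proxy":          15,
    "rule_engine_eval":       25,
    "response_serialisation": 10,
    "failover_margin":        90,
}
DEADLINE_MS = 200

total_ms = sum(BUDGET_MS.values())
headroom_ms = DEADLINE_MS - total_ms  # 0 here: the budget is exactly spent
assert total_ms <= DEADLINE_MS, "budget exceeds the user-visible deadline"
```

Any stage that overruns its allocation has to steal it from another stage or break the contract — which is the whole point of writing the budget down.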

When batch is still the right answer

Not every workload needs streaming. Quarterly financial close, regulatory reporting (RBI's monthly NPA filings), SCD-Type-2 dimensions on a slowly-changing master table, ML-model retraining on six months of training data — all of these are batch workloads, and bolting streaming onto them adds complexity without value. The wall is workload-specific: it crashes the latency-sensitive pipelines, and the latency-insensitive ones are happy to keep running at 02:00. A mature platform runs both — that's the lesson of Build 10. The cost model also matters: a batch pipeline's cost is roughly (triggers/day) × (cluster_cost × runtime); a streaming pipeline's cost is roughly (events/day) × (per_event_cost) + (operator_uptime × cluster_cost). For low-volume workloads with relaxed latency needs — a handful of triggers a day — batch wins. For high-volume latency-sensitive workloads, streaming is dramatically cheaper at equal latency targets — Razorpay's payments-fraud team reported their streaming fraud pipeline cost ₹6 lakh/month vs ₹14 lakh/month for the equivalent 5-minute micro-batch pipeline at the same end-to-end latency.
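A sketch of the breakeven between those two cost models — the functions follow the formulas above, but every unit cost here is invented for illustration, not a Razorpay figure:

```python
# Illustrative breakeven between the batch and streaming cost models.
def batch_cost_per_day(triggers_per_day, cluster_cost_per_hr, runtime_hr):
    """Batch: you pay for a cluster for every trigger's runtime."""
    return triggers_per_day * cluster_cost_per_hr * runtime_hr

def streaming_cost_per_day(events_per_day, per_event_cost,
                           cluster_cost_per_hr, uptime_hr=24):
    """Streaming: per-event work plus an always-on operator."""
    return events_per_day * per_event_cost + uptime_hr * cluster_cost_per_hr

# Low volume, relaxed latency (hourly triggers): batch is cheaper.
low_batch  = batch_cost_per_day(24, cluster_cost_per_hr=500, runtime_hr=0.25)
low_stream = streaming_cost_per_day(1_000_000, 0.0002, cluster_cost_per_hr=500)

# High volume at a tight latency target (5-min triggers): streaming is cheaper.
hot_batch  = batch_cost_per_day(288, cluster_cost_per_hr=500, runtime_hr=0.25)
hot_stream = streaming_cost_per_day(50_000_000, 0.0002, cluster_cost_per_hr=500)
```

The crossover moves with the workload: the always-on operator is a fixed cost that low-volume workloads cannot amortise, while high trigger counts make the batch line grow without improving latency.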

Where this leads next

Build 7 starts with the substrate. Read /wiki/why-logs-the-one-data-structure-streaming-is-built-on first — it argues that the partitioned append-only log is the one data structure every streaming system converges on, and explains why. Then /wiki/a-single-partition-log-in-python builds one in 80 lines, so the abstraction is concrete before the production tools enter the picture. The rest of Build 7 (/wiki/partitions-and-parallelism, /wiki/consumer-groups-and-offset-management, /wiki/replication-and-isr-how-kafka-stays-up, /wiki/retention-compaction-tiered-storage, /wiki/kafka-vs-pulsar-vs-kinesis-vs-redpanda) takes that primitive to production scale, with each chapter answering a question this chapter raised: how do partitions enable parallelism without losing per-key ordering, how do consumer groups coordinate, how does replication survive node loss, how does retention bound storage growth.

For the lakehouse view of where you've just been, return to /wiki/vacuum-retention-and-the-gdpr-delete-problem. The streaming pipeline you'll build in Build 7 still writes its outputs into that lakehouse — Build 7 is upstream of Build 6, not a replacement for it.

The transition is not a tear-down; it is a layering. The lakehouse keeps doing what it does — large historical scans, time travel, regulatory archive. The streaming layer adds a new shape on top — sub-second freshness, event-driven operators, stateful windows. Build 10 (unified batch and stream) is the chapter that finally reconciles the two; until then, treat them as complementary tools handling different ends of the latency spectrum, with Build 7's message log as the bridge.

A small mental shift helps. Stop thinking of the pipeline as "data flows through stages". Start thinking of it as "events trigger operators". The orchestrator goes from being the heartbeat of the system to being a fallback for the rare batch-shaped workloads. The lineage graph doesn't go away — it gets richer, because every operator emits its lineage continuously instead of once a day. The data contracts don't go away — they tighten, because the contract is now enforced per-event instead of per-batch. Build 5's discipline carries forward; Build 6's storage layer carries forward. What changes is the cadence of execution and the substrate that carries events between operators. That substrate is the next chapter.

References