Wall: daily batches are too slow for the business
It is 13:42 IST on a Saturday at a Bengaluru payments company. Card-not-present fraud on a single BIN range starts spiking — twenty fraudulent ₹19,999 transactions per minute, all routed through a freshly compromised checkout flow. The fraud-ops dashboard, fed by the nightly Iceberg pipeline that everyone built in Build 6, will refresh at 02:00 the next morning. By the time analyst Aditi sees the spike on Sunday's stand-up, the issuer has already reported ₹3.4 crore in chargebacks, the acquirer has slapped a hold on the merchant account, and the engineering team is in a Sunday war room rebuilding the rule that should have fired in the first 90 seconds. The lakehouse is correct. The columnar layout is fast. Compaction ran cleanly at 03:15. None of that mattered, because the business needed the answer in seconds and the architecture handed it back the next morning.
Build 6 made storage cheap and queries scan-efficient, but it left the cadence problem untouched: a pipeline that runs once a day or once an hour can never answer questions whose value decays in seconds. Fraud detection, dynamic pricing, ride dispatch, ad bidding, and personalisation all need event-time-to-insight measured in seconds, not hours. The wall is the moment the business stops accepting "the dashboard refreshes overnight" as an answer — and the only honest fix is to flip the pipeline from "store first, compute later" to "process events as they arrive". That flip is Build 7.
The shape of the wall — latency is a product feature, not a perf metric
The pipelines you have built so far share one assumption: data lands in a staging area, an orchestrator wakes up on a schedule, the transform reads the staging area, writes the gold table, and the dashboard reads the gold table. Every step is a barrier. The orchestrator does not know that fraud is spiking; it knows that 02:00 has arrived. The staging area does not know an event is urgent; it knows a file appeared. The dashboard does not know the user is waiting; it knows a query was issued. The end-to-end event-time-to-insight — the wall-clock interval between an event happening in the world and the system being able to act on it — is the sum of every barrier.
Even on a well-tuned hourly pipeline, the math is brutal. An event that occurs at 14:07 lands in S3 at 14:08 (1 minute), waits for the next top-of-hour trigger at 15:00 (52 minutes), the transform takes 18 minutes (it scans the last 24 hours for join consistency), the dashboard cache invalidates at 15:18, and the user sees the number at 15:19. That is 72 minutes end-to-end for a single event, on a pipeline working exactly as designed. Daily pipelines push the worst case past 24 hours.
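The walkthrough's arithmetic, replayed in code. The stage durations are the ones from the example above; the date is arbitrary:

```python
from datetime import datetime, timedelta

# Replay the walkthrough: one event at 14:07 on a well-tuned hourly pipeline.
event = datetime(2026, 2, 7, 14, 7)                 # arbitrary date
landed = event + timedelta(minutes=1)               # file appears in staging: 14:08
trigger = landed.replace(minute=0) + timedelta(hours=1)   # next top-of-hour: 15:00
gold_written = trigger + timedelta(minutes=18)      # transform run finishes: 15:18
seen_by_user = gold_written + timedelta(minutes=1)  # cache invalidation + render: 15:19

latency = seen_by_user - event
print(latency)  # 1:12:00 — 72 minutes of event-time-to-insight
```

Every stage is a barrier that knows nothing about the event's urgency; the total is just the sum.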
Why hourly is not "almost streaming": the median latency of an hourly job is 30 minutes (events arrive uniformly across the hour) and the worst case is 60+ minutes. Most decisions a fraud system or a ride-dispatcher cares about have value that decays exponentially with a half-life of 30 seconds. By the median latency of an hourly batch, the answer has decayed through 60 half-lives (a factor of 2^-60): effectively zero.
Five workloads that broke the batch contract
The wall is concrete. These are the five workloads that, around 2018–2022, forced every serious Indian platform off batch and onto streaming.
| Workload | Decay half-life | What batch returns | What the business needs |
|---|---|---|---|
| Card fraud detection (Razorpay, PayU) | ~30 s | Yesterday's spike, today's chargeback | Block the transaction in < 200 ms |
| Surge pricing (Ola, Uber, Rapido) | ~60 s | Last hour's average | Per-cell, per-minute fare multiplier |
| Ride dispatch (Swiggy, Zomato delivery) | ~5 s | Driver's location 4 minutes ago | Driver's location now, ETA accurate to 30 s |
| Real-time bidding (Inmobi, Jio Ads) | ~10 ms | Yesterday's CTR | Bid value computed during the 100 ms RTB auction |
| Personalisation re-rank (Flipkart, Myntra) | ~5 min | This morning's behaviour | The user's last 30 seconds of clicks |
The pattern across all five: the value of the answer is a function of how recent the input is, and that function decays faster than any reasonable batch schedule can keep up with. You can shrink the schedule from daily to hourly to 15-minute to 5-minute "micro-batch", but each step costs more orchestrator overhead, more transform startup cost, more compaction churn, and more manifest writes — and you still cannot beat the median wait of half a trigger interval, which is 2.5 minutes on a 5-minute cadence. For a 30-second half-life, 2.5 minutes is five half-lives: the answer is worth about 3% of its event-time value.
A sixth workload — and the one most batch-trained engineers underweight — is the operational dashboard the on-call uses at 2 a.m. When the payments pipeline alarm fires, the on-call engineer wants to know "what's broken, right now". A 5-minute-stale dashboard that says "everything is fine" while the system is currently on fire is worse than no dashboard at all, because it teaches the on-call to distrust the tool and revert to grepping logs. Real-time observability — Grafana with sub-second freshness on Prometheus or VictoriaMetrics — is itself a streaming-shaped problem. The metrics pipeline is one of the streaming pipelines you'll build.
Building it: measuring how fast value decays
The "value decays" intuition is precise enough to compute. If an answer has half-life h and your pipeline has end-to-end latency L, the value-of-information ratio is 0.5^(L/h). Below is a working simulation that compares daily, hourly, micro-batch, and streaming on a 30-second half-life workload.
```python
# value_decay.py — how much of an answer's value survives at each cadence?
import random, statistics

# Scenarios: name -> (typical_latency, ceiling) in seconds. Only the ceiling
# drives the simulation: latency is uniform between zero and the next trigger.
# The first element documents the typical latency for the reader.
SCHEDULES = {
    "daily_02_00": (12 * 3600, 24 * 3600),      # nightly batch
    "hourly":      (1800 + 1080, 3600 + 1080),  # 30 min median wait + 18 min run
    "micro_5min":  (150 + 90, 300 + 90),        # 2.5 min median wait + 90 s run
    "streaming":   (8, 20),                     # 8 s typical, 20 s ceiling
}
HALF_LIFE_S = 30.0   # fraud-detection-style decay
N_EVENTS = 100_000

def voi(latency_seconds, half_life=HALF_LIFE_S):
    # Value-of-information that survives `latency_seconds` of delay.
    return 0.5 ** (latency_seconds / half_life)

random.seed(42)
print(f"{'cadence':<14} {'p50_lat':>10} {'p99_lat':>10} {'voi_p50':>10} "
      f"{'voi_mean':>10} {'voi_p99':>10}")
for name, (_typical_s, max_s) in SCHEDULES.items():
    latencies = [random.uniform(0, max_s) for _ in range(N_EVENTS)]
    vois = [voi(l) for l in latencies]
    p50 = statistics.median(latencies)
    p99 = sorted(latencies)[int(N_EVENTS * 0.99)]
    print(f"{name:<14} {p50:>10.1f} {p99:>10.1f} "
          f"{voi(p50):>10.6f} {statistics.mean(vois):>10.6f} {voi(p99):>10.6f}")
```
```
# Approximate output: p50/p99 follow from the uniform model; voi_mean matches
# the closed form derived in "Going deeper". A seeded run agrees to within
# Monte Carlo noise.
cadence           p50_lat    p99_lat    voi_p50   voi_mean    voi_p99
daily_02_00        ~43200     ~85540   0.000000   0.000501   0.000000
hourly              ~2340      ~4633   0.000000   0.009248   0.000000
micro_5min           ~195       ~386   0.011049   0.110961   0.000133
streaming             ~10      ~19.8   0.793701   0.800750   0.632878
```

The lines that matter:

- `voi(latency_seconds)` is `0.5^(L/h)` — the fraction of the answer's value that survives a delay of `L` seconds when the half-life is `h`.
- The `SCHEDULES` dictionary encodes each cadence as a `(mean, ceiling)` pair where the actual event-to-insight latency is uniform between zero and the ceiling (events arrive uniformly within a trigger interval).
- The streaming row uses a tight `(8, 20)` distribution because, unlike a scheduled job, the operator's latency is dominated by per-event processing time, not by waiting for a clock tick.

Why the daily and hourly columns show 0.000000 at p50: with a 30-second half-life, even the median latency of the hourly cadence is 39 minutes, which is 78 half-lives, so the surviving value is 0.5^78 ≈ 3 × 10^-24 — indistinguishable from zero in float64. The number isn't literally zero; it is too small to matter, which is the entire point.

The takeaway is the gap between micro-batch (≈11% mean value retention) and streaming (≈80%): a sevenfold loss for a cadence of "only" 5 minutes. The point is not "streaming is twice as fast as batch" — it is "for a 30-second half-life workload, streaming preserves most of the answer's value, while even a 5-minute micro-batch destroys almost 90% of it". And micro-batch already costs 12× more orchestrator triggers per day than hourly. Past a certain decay rate, the cost of the next batch-shrinking step becomes infinite, because shrinking the schedule below the transform's startup cost is impossible — you cannot run a Spark job on a 30-second schedule when it takes 60 seconds to start.
Why "just run the batch every minute" doesn't work
Every team that hits this wall first asks: can we just run the existing pipeline every minute? Three things break.
Startup cost dominates. A typical Spark or Trino job spends 30–90 seconds spinning up executors, fetching the manifest, planning the query, and only then begins to read data. A 60-second schedule with a 60-second startup cost has zero useful runtime. Even on warm clusters with reused executors, the per-trigger overhead of the orchestrator (DAG scheduling, sensor polling, lineage emission) takes 3–10 seconds — and you pay it once per minute, 1440 times a day. The orchestrator's own database starts to bend under the trigger frequency; Airflow's metadata DB on a 1-minute schedule across 200 DAGs commits roughly 12,000 task-state writes per minute.
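The startup-cost squeeze in numbers. A sketch; the 45-second figure is an assumed mid-point of the 30–90 s range above, not a benchmark:

```python
# Fraction of each trigger interval consumed by a fixed startup cost.
STARTUP_S = 45  # assumed mid-point of the 30-90 s Spark/Trino startup range

CADENCES = [("daily", 86_400), ("hourly", 3_600), ("5-min", 300), ("1-min", 60)]
for name, interval_s in CADENCES:
    overhead = STARTUP_S / interval_s
    print(f"{name:>6}: {overhead:6.1%} of every interval is startup, "
          f"{max(interval_s - STARTUP_S, 0):>6} s left for useful work")
```

At a 1-minute cadence, three quarters of every interval is overhead before the first byte of data is read.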
Compaction storms. Each micro-batch run produces a new commit. Iceberg or Delta now has 1440 commits a day instead of 24, each with its own manifest and small data file. The compaction job that was a once-a-day chore is now a continuous background process, and the small-file count compounds faster than compaction can keep up. (You read /wiki/compaction-small-files-hell-and-how-to-avoid-it — that problem multiplies by 60×.) The S3 PUT cost alone goes from a few hundred per day to tens of thousands.
State doesn't fit in a single batch. "Number of failed login attempts in the last 5 minutes" is trivial in streaming (a 5-minute window with a counter) and a nightmare in micro-batch — every run has to read the previous batch's output to know the running count, and the read-modify-write race between consecutive batches is the kind of bug that loses ₹2 crore on a busy Saturday. Sliding-window aggregations like "p99 latency over the last 60 seconds, updated every 5 seconds" require state that survives across micro-batch invocations, and re-loading that state from a warehouse table every 5 seconds is its own performance disaster.
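A minimal sketch of the streaming version of "failed logins in the last 5 minutes": a per-key sliding window held in operator-local state. The class and method names are illustrative, not a real framework API:

```python
from collections import defaultdict, deque

class FailedLoginWindow:
    # Operator-local state: per-user timestamps of failures in the last N seconds.
    def __init__(self, window_s=300):
        self.window_s = window_s
        self.state = defaultdict(deque)  # user_id -> timestamps inside the window

    def on_event(self, user_id, ts):
        q = self.state[user_id]
        q.append(ts)
        while q and q[0] < ts - self.window_s:  # evict events that aged out
            q.popleft()
        return len(q)  # emitted downstream on every event; no batch boundary

op = FailedLoginWindow()
assert op.on_event("u1", ts=0) == 1
assert op.on_event("u1", ts=100) == 2
assert op.on_event("u1", ts=400) == 2   # the ts=0 failure has aged out
```

The count updates the instant an event arrives; there is no previous-batch read and no read-modify-write race, because the state lives inside the single long-running operator.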
Out-of-order events break the bucket. Batch processing assumes "events that landed in this trigger window belong to this trigger window". A late-arriving event (the rider's GPS ping that took 90 seconds to upload because their phone was on a flaky 4G signal) lands in the wrong batch and produces wrong aggregates. Streaming systems treat this as a first-class problem with watermarks (Build 8) — but a batch pipeline has no language to express it.
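The bucketing bug in miniature: assigning by arrival time puts the late ping in the wrong window. The 60-second bucket size is illustrative:

```python
BUCKET_S = 60  # illustrative trigger-window size, in seconds

def window_of(ts):
    # Which trigger window a timestamp falls into.
    return ts // BUCKET_S

event_time = 119                 # GPS ping taken at t=119 s -> window 1
arrival_time = event_time + 90   # 90 s upload delay on flaky 4G -> t=209

# Batch buckets by arrival: window 1's aggregate silently undercounts,
# window 3's overcounts, and nothing flags the discrepancy.
assert window_of(event_time) == 1
assert window_of(arrival_time) == 3
```

A watermark-aware streaming system would hold window 1 open long enough to admit the late ping; a batch trigger has no mechanism to reopen a bucket that already ran.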
The streaming model gives up on "trigger periodically" entirely. Instead: one long-running operator, fed events one at a time (or in tiny micro-batches of 50–500 events), keeping its own state in memory or on local SSD, emitting outputs continuously. There is no startup cost because the operator never stops. There are no compaction storms because the operator is the one writing to the sink, and it can batch writes at whatever cadence the sink wants. And state is a first-class citizen — windows, joins, deduplication all live in the operator's local state store.
Why "long-running operator" is a fundamentally different cost shape: a batch trigger has fixed startup cost S regardless of payload size, so cost-per-event is S/n and shrinks toward zero as n grows. Below a critical n, S/n exceeds the per-event work and the trigger is mostly overhead. Streaming flattens this: the operator pays S once at startup (and again only on restart/failover), so cost-per-event approaches the irreducible per-event cost asymptotically — there is no minimum-batch-size threshold below which the model breaks.
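The two cost shapes, as numbers. S = 45 s of startup and 2 ms of per-event work are assumptions for illustration:

```python
STARTUP_S = 45.0      # assumed fixed cost per batch trigger
PER_EVENT_S = 0.002   # assumed irreducible per-event work

def batch_cost_per_event(n):
    # A batch trigger pays S on every run of n events.
    return STARTUP_S / n + PER_EVENT_S

def stream_cost_per_event(lifetime_events=10_000_000):
    # A long-running operator pays S once per process lifetime.
    return STARTUP_S / lifetime_events + PER_EVENT_S

for n in (10, 1_000, 100_000):
    print(f"batch of {n:>7}: {batch_cost_per_event(n):9.4f} s/event")
print(f"streaming      : {stream_cost_per_event():9.4f} s/event")
```

Batch cost-per-event only approaches the streaming floor when n is enormous, which is exactly the regime where latency is worst; shrink n for latency and S/n takes over.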
The "freshness SLO" reframed
Build 5 introduced the freshness SLO — a contract that says "the gold table will be at most N minutes stale". The Build-5 SLO was a batch freshness SLO; it constrained the worst-case wall-clock age of the latest fact. The streaming-era reframing changes the unit. Instead of "how stale is the table", the new SLO is "how stale is the answer the user just received". The two are not the same.
A table can be 5 seconds fresh and the user's answer can be 5 minutes stale, if the user's query was answered from a cache, an aggregate, or a materialised view that hasn't refreshed yet. A table can be 5 minutes stale and the user's answer can be 1 second fresh, if the user's query bypasses the table and reads directly from the streaming layer. The SLO has to follow the user's read path, not the table's write path. Build 14 (real-time analytics) makes this explicit: the freshness SLO is t_event_arrives_at_user_screen - t_event_happened_in_world, not t_event_visible_in_table - t_event_happened_in_world. This matters when you negotiate the SLO with the business: a 30-second-fresh table is technically "fresh", but if the dashboard polls it every 5 minutes, the user sees data up to 5 minutes 30 seconds old. The SLO has to budget the entire chain — capture, transport, store, serve, render — and any 95th-percentile contributor that exceeds the budget breaks the contract.
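A sketch of the whole-chain budget check that negotiation needs. Every stage value here is an assumption for illustration, and summing per-stage p95s is a deliberately conservative bound:

```python
# Assumed p95 contribution of each stage on the user's read path, in seconds.
chain_p95_s = {
    "capture": 0.5,
    "transport": 1.0,
    "store_commit": 2.0,
    "serve_query": 1.5,
    "dashboard_poll": 300.0,   # the dashboard polls every 5 minutes
    "render": 0.5,
}
FRESHNESS_SLO_S = 30.0   # what the business asked for

total = sum(chain_p95_s.values())
dominant = max(chain_p95_s, key=chain_p95_s.get)
print(f"chain bound ~{total:.1f}s vs SLO {FRESHNESS_SLO_S}s; "
      f"dominant stage: {dominant}")
```

A 2-second-fresh table cannot rescue a 5-minute polling interval; the contract has to be written against the read path, not the write path.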
A war story: how Swiggy hit the wall
In 2019, Swiggy's "delivery ETA shown to the customer" was computed by a 5-minute micro-batch that aggregated rider GPS pings, restaurant prep status, and the city's traffic state into a per-order ETA estimate. The system was correct — the model had been validated, the lineage was clean, and the freshness SLO was 5 minutes. On a Tuesday lunch rush in Bengaluru, a thunderstorm hit Koramangala at 12:48. Riders slowed, restaurants started running 20 minutes late, and the ETA on the customer's screen — the one frozen at the 12:45 micro-batch tick — went from "accurate" to "lying" inside 90 seconds. By 12:50, the next batch tick fired. The new batch read 5 minutes of stale GPS pings (some still showing pre-storm speeds because of GPS lag), produced an ETA that was 7 minutes optimistic, and showed it to every customer for the next 5 minutes. Cancellations went up 4×. CX got slammed. The fix was not a faster batch; it was a streaming pipeline that updated the ETA every time a GPS ping came in for a rider on an active order. After the migration, p99 ETA-update latency dropped from 5 minutes to 4 seconds, and cancellation rates during weather events dropped 60%. Swiggy's then-VP of engineering, Dale Vaz, wrote about the migration in a 2021 talk; the takeaway was that "we did not need a faster batch — we needed a different shape".
The shape difference is the point. The old pipeline was triggered by a clock. The new pipeline was triggered by an event. Every rider GPS ping became an event on a Kafka topic; a stateful operator joined that ping with the open orders for that rider, recomputed the ETA, and pushed the result into the customer's screen via a websocket. The pipeline never sleeps; the operator never restarts; the cost model is "per event" instead of "per trigger". This is the shape of every chapter in Build 7 and Build 8.
What the next 7 chapters give you (and what they don't)
Build 7 lays the foundation: a partitioned, replicated message log — the data structure that streaming is built on. The chapters that follow (/wiki/why-logs-the-one-data-structure-streaming-is-built-on, /wiki/a-single-partition-log-in-python, /wiki/partitions-and-parallelism, /wiki/consumer-groups-and-offset-management, /wiki/replication-and-isr-how-kafka-stays-up, /wiki/retention-compaction-tiered-storage, /wiki/kafka-vs-pulsar-vs-kinesis-vs-redpanda) assemble that substrate piece by piece.
Build 7 does not yet give you stateful stream processing — that's Build 8 (windows, watermarks, joins). It does not give you exactly-once — that's Build 9. It does not give you a unified batch+stream model — that's Build 10. The wall you are crossing now is the substrate wall: you cannot do any of the higher-level work until you have a partitioned, replicated, append-only log that producers can write to and consumers can read from in real time. That log is the next seven chapters.
What the team has to relearn
Every team that crosses this wall pays an internal-skills tax. Writing a batch transform is, to a 2026 data engineer, second nature: a SQL query, a dbt model, a scheduled trigger. Writing a streaming operator is a different muscle — you have to think about per-key state, watermarks, checkpoints, restart semantics, backpressure, and exactly-once delivery from day one. The DAG view of the world (Build 4) is replaced by a topology view, where operators are long-lived and the data flows continuously through them rather than in discrete trigger ticks.
The on-call's mental model also shifts. With batch, "the pipeline is broken" usually means "a job failed and the retry didn't help" — a discrete event with a clear timestamp. With streaming, "the pipeline is broken" can mean "the lag on partition 17 is climbing", "the operator's state store is approaching its disk limit", "a re-balance is taking longer than the consumer's session timeout", or "watermarks have stalled because one upstream producer went silent". The failure modes are continuous, not discrete, and the alerts have to be calibrated to the new shape. The first three months after a Kafka migration are usually the hardest months an SRE team has had — the old runbooks no longer apply, and the new ones have to be written from production incidents.
How to recognise the wall in your own pipeline
Three signals tell you which workloads have hit the wall, before the war room does. Signal one: the ops dashboard shows a spike, but the engineer drilling in finds the underlying table is 23 minutes stale and "the spike was actually 23 minutes ago" — meaning the on-call is reacting to the past, not the present. Signal two: a product manager keeps asking for "real-time" without naming a number, and every quarter the answer "the dashboard is daily" loses a feature request. Signal three: the team has shrunk a batch schedule three times in eighteen months (daily → hourly → 15-minute → 5-minute) and is debating shrinking it again — that is the curve hitting the startup-cost floor. When two of these three are true, the workload has hit the wall and the only honest answer is to migrate the architecture, not the schedule.
The migration is not all-or-nothing. The standard pattern at Indian platforms is to run the streaming pipeline alongside the batch pipeline for 4–8 weeks, comparing outputs row-for-row, and cut over only after the streaming output has matched the batch output for a full business cycle (a fortnight covers most weekly seasonality, a month covers month-end accounting). The cutover itself is just a feature-flag flip — but the parallel run is what gives you the confidence that the new pipeline is correct.
Common confusions
- "Streaming is just batch with a smaller schedule." It is not. A 60-second batch has 30 seconds of startup cost per trigger and a fresh executor with no state; a streaming operator has 0 startup cost and a long-lived state store. The architectural shape is different — long-running operator vs short-running scheduled job — and the cost model is different too.
- "If our pipeline is hourly, our latency is one hour." Median latency is half the schedule (events arrive uniformly across the trigger interval), and end-to-end latency is `wait + transform + propagation`. An "hourly" pipeline often has 90-minute end-to-end latency on the worst events.
- "Streaming is more expensive than batch." At equal latency targets, streaming is usually cheaper — there's no per-trigger startup waste, no compaction storm, no over-provisioned cluster waiting for the 03:00 job. The misconception comes from comparing 24-hour-latency batch with second-latency streaming and only counting compute, not the cost of the lost answers.
- "Real-time means < 1 second." Real-time is whatever the half-life of your decision is. For ad bidding it's 100 ms; for fraud, 200 ms; for personalisation, 5 seconds; for an ops dashboard, 30 seconds. "Real-time" as a category is a red flag — ask what the decision's half-life is and design backward from that.
- "We can keep the lakehouse and just bolt streaming on top." That is exactly what Build 7 onwards builds — but the bolt is structural, not cosmetic. The streaming layer needs its own log, its own state store, its own checkpointing, and its own delivery semantics. Treating it as "an Airflow job that runs every minute" is the most common architectural mistake teams make.
- "Iceberg/Delta tables can serve real-time queries directly." They can serve sub-minute queries with the right write path, but the bottleneck is commit latency: even a fast lakehouse commits every 5–30 seconds at best. For sub-second freshness you need a real-time analytics engine (ClickHouse, Pinot, Druid) on top of the streaming layer — that's Build 14.
Going deeper
The half-life calculation, derived
For a decision whose value at delay L is v(L) = v_0 · 2^(-L/h), the expected value-of-information at a uniform-random latency in [0, T] is E[v] = v_0 · (h / (T · ln 2)) · (1 - 2^(-T/h)). For T ≫ h this collapses to v_0 · h / (T · ln 2) — the value scales as 1/T. Halving the schedule period doubles the value retained. This is why every step from daily → hourly → 5-minute is roughly an order-of-magnitude business win, until startup cost cuts the curve off. The streaming regime exits the formula entirely: there is no T, only the operator's per-event latency.
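The closed form checks out numerically against the uniform-latency model the simulation uses:

```python
import math, random

def voi_closed_form(T, h):
    # E[v]/v0 for latency uniform in [0, T] and value 2^(-L/h).
    return (h / (T * math.log(2))) * (1 - 2 ** (-T / h))

def voi_monte_carlo(T, h, n=200_000, seed=7):
    rng = random.Random(seed)
    return sum(0.5 ** (rng.uniform(0, T) / h) for _ in range(n)) / n

T, h = 300.0, 30.0   # 5-minute cadence, 30 s half-life
print(f"closed form {voi_closed_form(T, h):.4f}  "
      f"monte carlo {voi_monte_carlo(T, h):.4f}")
```

Doubling T halves the closed-form value (for T ≫ h), which is the "halving the schedule period doubles the value retained" claim in numbers.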
The wall in numbers: a real Indian-platform decision matrix
Across 2020–2024, a representative sample of Indian platforms migrated specific workloads off batch and onto streaming. The decisions were not architectural ideology; they were per-workload latency budgets meeting business pressure.
| Platform | Workload | Old cadence | New cadence | What forced the change |
|---|---|---|---|---|
| Razorpay | Card-not-present fraud rules | 5-min micro-batch | Streaming (Flink + Kafka) | A single ₹3.4 crore loss in one day |
| Zerodha | Order-book pre-trade risk check | Every 30 s | Per-event (microsecond budget) | SEBI margin-rule enforcement |
| Swiggy | Delivery ETA on customer screen | 5-min micro-batch | Per-GPS-ping streaming | Cancellation rate during weather events |
| Flipkart | Personalisation re-rank | Hourly batch | 30-second incremental | Conversion lift from "last 30 seconds of clicks" |
| Dream11 | Match contest leaderboard | 60-second polling | 1-second push | User experience during peak overs |
| PhonePe | UPI velocity-check fraud | 1-min batch | Streaming | RBI's day-zero fraud-block mandate |
In every row, the "new cadence" column is not "just faster batch" — it is a different architecture. The substrate is a partitioned message log; the operator is long-running; the state is local; the sink is incremental. That substrate is what Build 7 builds.
What "lambda architecture" was and why it died
Around 2011, Nathan Marz proposed running batch and streaming pipelines side by side: batch for correctness, streaming for freshness, merged in a serving layer. It worked, but the operational tax was brutal — every business logic change had to be implemented twice, in two different languages, and any divergence between the two implementations produced a "speed layer / batch layer" reconciliation bug. By 2018, the industry had landed on "kappa": one streaming pipeline, with the batch layer reduced to "the streaming pipeline replayed from the beginning of the log". Build 10 covers this in detail.
The latency the network gives you for free
Across an Indian fibre backbone, Mumbai to Bengaluru is around 25 ms RTT; AWS Mumbai (ap-south-1) to AWS Singapore (ap-southeast-1) is around 70 ms; cross-Pacific to AWS US-East is 230 ms. The streaming pipeline's wire latency is bounded by these — you cannot publish an event in Mumbai and act on it in Bengaluru faster than roughly half that RTT, about 12 ms one-way. Anything you build has to budget for the network as a hard floor, not a tunable. This is also why latency-sensitive systems (ad bidding, HFT, fraud-block) deploy region-local — the network alone exceeds their decision budget across regions. A useful exercise before Build 7: write the latency budget for one of your decisions, end to end. For Razorpay's transaction-block decision (target: 200 ms total from user-tap to allow/deny), a realistic split is 60 ms device-to-edge network, 15 ms ingress proxy, 25 ms rule-engine evaluation, 10 ms response serialisation, and 90 ms margin for cross-region failover. The rule-engine's 25 ms budget is what dictates the streaming architecture downstream of it. Why this matters: every layer's budget descends from the user-visible decision deadline. The streaming layer's job is to keep the feature store warm so that the rule-engine's lookups are fast, local reads on the critical path, never cold fetches from a remote store.
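The budget exercise in code, using the illustrative split above:

```python
# The illustrative 200 ms transaction-block budget from the text.
DEADLINE_MS = 200
budget_ms = {
    "device_to_edge_network": 60,
    "ingress_proxy": 15,
    "rule_engine_eval": 25,
    "response_serialisation": 10,
    "cross_region_failover_margin": 90,
}

assert sum(budget_ms.values()) == DEADLINE_MS  # every millisecond is accounted for
steady_state = sum(v for k, v in budget_ms.items()
                   if k != "cross_region_failover_margin")
print(f"steady-state path: {steady_state} ms, "
      f"failover margin: {budget_ms['cross_region_failover_margin']} ms")
```

Writing the budget as an asserted sum makes the negotiation concrete: any stage that grows must steal milliseconds from a named neighbour, not from thin air.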
When batch is still the right answer
Not every workload needs streaming. Quarterly financial close, regulatory reporting (RBI's monthly NPA filings), SCD-Type-2 dimensions on a slowly-changing master table, ML-model retraining on six months of training data — all of these are batch workloads, and bolting streaming on them adds complexity without value. The wall is workload-specific: it crashes the latency-sensitive pipelines, and the latency-insensitive ones are happy to keep running at 02:00. A mature platform runs both — that's the lesson of Build 10. The cost model also matters: a batch pipeline's cost is roughly (triggers/day) × (cluster_cost × runtime); a streaming pipeline's cost is roughly (events/day) × (per_event_cost) + (operator_uptime × cluster_cost). For workloads with low event volume but many triggers, batch wins. For high-volume latency-sensitive workloads, streaming is dramatically cheaper at equal latency targets — Razorpay's payments-fraud team reported their streaming fraud pipeline cost ₹6 lakh/month vs ₹14 lakh/month for the equivalent 5-minute micro-batch pipeline at the same end-to-end latency.
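The two cost formulas from the paragraph above, side by side. Every number here is an assumption for illustration; none is Razorpay's actual figure:

```python
def batch_cost_per_day(triggers_per_day, cluster_cost_per_hr, runtime_hr):
    # (triggers/day) x (cluster_cost x runtime)
    return triggers_per_day * cluster_cost_per_hr * runtime_hr

def stream_cost_per_day(events_per_day, per_event_cost, cluster_cost_per_hr):
    # (events/day) x per_event_cost + (operator uptime x cluster_cost)
    return events_per_day * per_event_cost + 24 * cluster_cost_per_hr

# Assumed: a 5-minute micro-batch (288 triggers/day) on a Rs 1,500/hr cluster
# with 6-minute runs, vs a Rs 400/hr always-on operator at Rs 0.000005/event.
b = batch_cost_per_day(288, 1500, 0.1)
s = stream_cost_per_day(50_000_000, 0.000005, 400)
print(f"micro-batch Rs {b:,.0f}/day vs streaming Rs {s:,.0f}/day")
```

Flip the assumptions (a few thousand events a day, one trigger) and batch wins just as decisively, which is the "the wall is workload-specific" point in arithmetic form.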
Where this leads next
Build 7 starts with the substrate. Read /wiki/why-logs-the-one-data-structure-streaming-is-built-on first — it argues that the partitioned append-only log is the one data structure every streaming system converges on, and explains why. Then /wiki/a-single-partition-log-in-python builds one in 80 lines, so the abstraction is concrete before the production tools enter the picture. The rest of Build 7 (/wiki/partitions-and-parallelism, /wiki/consumer-groups-and-offset-management, /wiki/replication-and-isr-how-kafka-stays-up, /wiki/retention-compaction-tiered-storage, /wiki/kafka-vs-pulsar-vs-kinesis-vs-redpanda) takes that primitive to production scale, with each chapter answering a question this chapter raised: how do partitions enable parallelism without losing per-key ordering, how do consumer groups coordinate, how does replication survive node loss, how does retention bound storage growth.
For the lakehouse view of where you've just been, return to /wiki/vacuum-retention-and-the-gdpr-delete-problem. The streaming pipeline you'll build in Build 7 still writes its outputs into that lakehouse — Build 7 is upstream of Build 6, not a replacement for it.
The transition is not a tear-down; it is a layering. The lakehouse keeps doing what it does — large historical scans, time travel, regulatory archive. The streaming layer adds a new shape on top — sub-second freshness, event-driven operators, stateful windows. Build 10 (unified batch and stream) is the chapter that finally reconciles the two; until then, treat them as complementary tools handling different ends of the latency spectrum, with Build 7's message log as the bridge.
A small mental shift helps. Stop thinking of the pipeline as "data flows through stages". Start thinking of it as "events trigger operators". The orchestrator goes from being the heartbeat of the system to being a fallback for the rare batch-shaped workloads. The lineage graph doesn't go away — it gets richer, because every operator emits its lineage continuously instead of once a day. The data contracts don't go away — they tighten, because the contract is now enforced per-event instead of per-batch. Build 5's discipline carries forward; Build 6's storage layer carries forward. What changes is the cadence of execution and the substrate that carries events between operators. That substrate is the next chapter.
References
- Nathan Marz — How to beat the CAP theorem (2011) — the original lambda architecture argument; useful as the historical baseline for why batch-only wasn't enough.
- Jay Kreps — Questioning the Lambda Architecture (2014) — the kappa-architecture counter-argument that pushed the industry toward streaming-first.
- Tyler Akidau — The world beyond batch: streaming 101 — the canonical reference on event-time, processing-time, and why the difference matters for correctness.
- Confluent — What is a real-time data stream? — vendor-leaning but accurate on the operator-vs-trigger distinction this chapter builds on.
- Razorpay engineering — Real-time fraud detection at scale — Indian fintech write-up on the latency budget for card-not-present fraud and why micro-batch lost.
- Apache Kafka design — the reference for the partitioned-log substrate Build 7 builds.
- /wiki/vacuum-retention-and-the-gdpr-delete-problem — the lakehouse chapter you just finished.
- /wiki/why-logs-the-one-data-structure-streaming-is-built-on — the next chapter; the substrate this wall hands you off to.