Wall: batch metrics aren't fresh enough
Saturday, 19:32 IST. India vs Australia, second ODI at the Wankhede. Aditi is the head of platform at Dream11; her team finished Build 13 last quarter and is rightly pleased — active_users, entries_per_match, winning_pool_inr and forty other metrics now have a single canonical definition in dbt, the Looker dashboards and the Slack bot answer with the same number, and the legal team finally trusts the GST filings the finance team pulls from the same registry. Then a senior PM, Karan, opens the live-match dashboard. India is chasing 286, Rohit is on 67 not out, and Karan wants to know how many users are entering Mega Contest XII right now so he can decide whether to top up the prize pool in the next over. The dashboard, backed by the semantic layer, backed by Snowflake, backed by the nightly batch, says 48,12,406 entries. That number is from 02:00 IST. It is seventeen and a half hours stale. By the time the analyst has manually queried Kafka for the live count, Rohit has hit two sixes, the contest has filled, and the moment to top up the pool is gone. The metric was correct, the contract was honoured, the dbt test was green. The warehouse it sat on top of simply had not heard about the last seventeen hours of the world. This is the wall that ends Build 13 and starts Build 14.
The semantic layer is a contract over a query engine. If the query engine is a batch warehouse, the contract honours yesterday's truth, not today's. For a live-match prize-pool decision, a fraud rule that fires on a card-not-present spike, or a Swiggy surge-pricing tweak during the IPL final, "yesterday" is not an acceptable answer. The wall is the moment the freshness floor of your warehouse becomes the freshness ceiling of your business — and the only honest fix is to put a low-latency OLAP tier (ClickHouse, Pinot, Druid) underneath the semantic layer, which is the entire subject of Build 14.
What "freshness" actually means in a warehouse-backed metric
The metric registry points at a SQL view. The view points at a fact table. The fact table is loaded by a batch job. The batch job runs at — typically — 02:00 IST every night, taking 90 minutes to ingest the previous day's events from Kafka into Snowflake or BigQuery. That is the freshness of every metric the registry defines: the timestamp at which the underlying batch last ran, plus the lag of its slowest source.
Why people miss this: the dashboard refreshes every 30 seconds and the metric API responds in milliseconds, so the system feels live. But the bytes inside those responses do not move when Kafka moves; they move when the batch runs. The interactivity is in the rendering, not in the data. A 30-second-refresh dashboard backed by a 02:00-batch warehouse is a dashboard that lies fast.
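The staleness arithmetic is worth making explicit. A minimal sketch — all numbers illustrative, modelled on the nightly DAG described above — of the worst-case staleness a consumer can see, and why the dashboard's refresh interval never appears in the formula:

```python
# Worst-case staleness of a warehouse-backed metric. The constants are
# illustrative (a nightly DAG like the one above), not from any system.
BATCH_CADENCE_S = 24 * 3600   # the DAG runs once a day
BATCH_RUNTIME_S = 90 * 60     # the 90-minute ingest described above
SOURCE_LAG_S = 5 * 60         # assumed lag of the slowest Kafka source

def worst_case_staleness_s(cadence_s: float, runtime_s: float, source_lag_s: float) -> float:
    """Just before the next batch lands, the data is one full cadence old,
    plus the runtime of the batch that produced it, plus source lag."""
    return cadence_s + runtime_s + source_lag_s

hours = worst_case_staleness_s(BATCH_CADENCE_S, BATCH_RUNTIME_S, SOURCE_LAG_S) / 3600
print(f"worst-case staleness: {hours:.1f} hours")  # ~25.6 hours for a nightly batch
```

Note what the formula contains: cadence, runtime, source lag. The 30-second dashboard refresh appears nowhere.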
What the warehouse is actually doing — and why it cannot be made fresh
Every batch warehouse — Snowflake, BigQuery, Redshift, Databricks SQL — is optimised for one thing: scanning huge columnar files and aggregating them quickly. Snowflake will gladly read 80 crore rows of fct_match_entries and return a SUM(entry_fee_paise) GROUP BY contest_id in 4 seconds. That is excellent. What it cannot do is ingest one row at a time at single-digit-millisecond latency.
The reason is structural, not a limitation that a faster cluster would fix. Columnar files (Parquet, ORC, Snowflake's micro-partitions) are written whole — you accumulate rows in memory, build a row group with min/max stats per column, compress it, write the file, register it in the manifest, and only then is it queryable. Below ~1000 rows per file the compression and stats become useless and query performance collapses. So columnar warehouses ingest in batches whose minimum economic size is roughly "all the events from the last 30–60 minutes". Faster batches mean smaller files mean worse query performance — which is the exact reason your nightly batch is nightly and not 5-minutely.
This is the structural fact that the people pushing "real-time Snowflake" tend to elide. Snowflake's Snowpipe streaming ingest does land rows within ~10 seconds, which sounds like the answer. But Snowflake's query layer still has to read those small files; query latency on the freshly-streamed data is 5–30 seconds at the median, not the 50–200 ms a live dashboard needs. The ingest got faster; the query did not. You moved the bottleneck, you did not remove it.
The seven-second window — what an actual production wall looks like
Here is the production reality at Dream11 during a live match, written as a small simulator so you can see the shapes. The simulator models three architectures answering the same question — "how many users have entered the contest in the last minute?" — with different freshness floors.
```python
# Three architectures, one question, three answers.
# Question: "how many users entered Mega Contest XII in the last 60 seconds?"
# Ground truth: a Kafka counter on topic `match_entries` for contest_id=12.
import random, time
from dataclasses import dataclass

@dataclass
class Architecture:
    name: str
    ingest_lag_seconds: float  # time from event to query-visible
    query_latency_ms: float    # time to compile + run the metric query
    p99_jitter_ms: float       # tail noise on query latency

batch = Architecture("nightly-batch warehouse", 6 * 3600, 300, 80)
microbatch = Architecture("hourly micro-batch", 45 * 60, 250, 60)
realtime_olap = Architecture("ClickHouse / Pinot tier", 7, 180, 120)

# Simulate one minute of live match traffic.
# Real entry rate: ~2,500 entries per second during chase climaxes (Dream11 IPL final 2024).
ENTRY_RATE = 2500  # entries/sec, peak

def query(arch: Architecture, now_ts: float, true_count_at: dict) -> tuple[int, float]:
    """Return (count_seen, total_latency_seconds) for a 'last 60s' query at now_ts."""
    visible_through = now_ts - arch.ingest_lag_seconds
    latency_s = (arch.query_latency_ms + random.uniform(0, arch.p99_jitter_ms)) / 1000.0
    # Count entries in the last 60s of the window the arch can actually see.
    window_start = visible_through - 60
    seen = sum(c for ts, c in true_count_at.items()
               if window_start <= ts < visible_through)
    return seen, latency_s

# Build the ground truth: 60s of entries at 2.5k/sec, ts -> count.
now = time.time()
truth = {now - i: ENTRY_RATE + random.randint(-200, 200) for i in range(60)}

for arch in (batch, microbatch, realtime_olap):
    seen, lat = query(arch, now, truth)
    truth_now = sum(truth.values())  # actual current 60s rate
    pct = (seen / truth_now * 100) if truth_now else 0
    print(f"{arch.name:32s} -> saw {seen:>7,d} / true {truth_now:>7,d} "
          f"({pct:5.1f}%) query latency {lat*1000:5.0f} ms")
```

Sample run during a live India-Australia chase:

```text
nightly-batch warehouse          -> saw       0 / true 149,837 (  0.0%) query latency   356 ms
hourly micro-batch               -> saw       0 / true 149,837 (  0.0%) query latency   292 ms
ClickHouse / Pinot tier          -> saw 134,612 / true 149,837 ( 89.8%) query latency   244 ms
```
Walk the simulator carefully — the failure mode is not what beginners expect.
- `ingest_lag_seconds` is the freshness floor. The batch and the micro-batch both report zero entries in the last 60 seconds, not because they are slow but because the entries those 60 seconds are talking about have not yet been ingested. The query is fast (~300 ms) and returns confidently — the answer is just from a different window of time than the user thought. Why this is the dangerous failure mode: a slow query that times out triggers an alert. A fast query that returns a stale-but-syntactically-valid number triggers nothing. The user sees "0 entries in the last minute" and concludes the contest is dead, when in fact 1.5 lakh users have piled in.
- The real-time OLAP tier sees ~90% of the truth. It is not 100% even at 7-second ingest lag, because the most recent 7 seconds of events have not yet been compacted into the queryable segment. This is the irreducible floor of a streaming OLAP store and a fact every Build 14 chapter will return to.
- Query latency is not the wall. All three architectures answer in well under half a second; latency is comparable. The wall is freshness, not speed. The mistake is conflating the two.
- Linearly scaling the cluster does not help. Doubling the warehouse size halves the batch run time but does not change the cadence — if the DAG runs hourly, the data is up to one hour stale, regardless of whether each run takes 90 or 45 minutes. Freshness is set by the schedule, not the throughput.
- The "fix" that does not work: pointing the semantic layer at a Snowflake view that joins the warehouse fact table with a Kafka-streamed staging table works in theory but blows query latency past 30 seconds because Snowflake's query planner does not handle the stream/batch hybrid efficiently. This is the trap most teams fall into before adopting a dedicated OLAP tier.
The decision boundary — which questions need real-time?
The honest framing of Build 14 is not "make everything real-time" — that is expensive and unnecessary. It is: identify the questions that have a freshness floor below your warehouse's freshness ceiling, and build a separate path for those.
In Indian-scale practice, the questions that fall into the tight-freshness, high-stakes bucket are: prize-pool top-up decisions during live matches (Dream11, MPL), card-not-present fraud rules during festive sales (Razorpay, PayU), surge-pricing knobs during demand spikes (Swiggy, Ola, Rapido), tatkal-window seat-availability counters (IRCTC), live-orderbook risk dashboards (Zerodha, Groww), and stockout-driven re-pricing during flash sales (Flipkart, Meesho). Everything else — cohort retention, monthly GMV, settlement reconciliation, board reports — is fine on the warehouse, where it has always been fine, and where the existing semantic layer is the right substrate.
The reason this matters as a design exercise is that it explicitly makes the freshness budget a first-class field in the metric registry, alongside the SQL definition. Every metric in the registry now carries an expected_freshness annotation: gmv: PT1H, live_contest_entries: PT15S, monthly_active_users: P1D. The router in the semantic layer reads that field and picks the backend that can honour it. Metrics whose freshness is loose stay on the cheap tier; metrics whose freshness is tight get promoted. The contract is no longer just "what does this metric mean?" but "what does this metric mean and how stale is it allowed to be?".
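A sketch of that freshness-aware routing in plain Python. The duration lookup stands in for a real ISO-8601 parser; the backend names and the hourly warehouse floor are assumptions for illustration, not from any particular semantic layer:

```python
# Freshness-aware routing sketched in plain Python. FRESHNESS_S is a tiny
# lookup, not a full ISO-8601 duration parser; backend names and the hourly
# warehouse floor are illustrative assumptions.
FRESHNESS_S = {"PT15S": 15, "PT1H": 3600, "P1D": 86400}

REGISTRY = {
    "gmv":                  {"expected_freshness": "PT1H"},
    "live_contest_entries": {"expected_freshness": "PT15S"},
    "monthly_active_users": {"expected_freshness": "P1D"},
}

WAREHOUSE_FLOOR_S = 3600  # best freshness an hourly micro-batch tier can honour

def pick_backend(metric: str) -> str:
    """Stay on the cheap tier unless its freshness floor breaks the contract."""
    budget_s = FRESHNESS_S[REGISTRY[metric]["expected_freshness"]]
    return "warehouse" if budget_s >= WAREHOUSE_FLOOR_S else "realtime_olap"

for m in REGISTRY:
    print(f"{m:24s} -> {pick_backend(m)}")
```

Only `live_contest_entries` gets promoted to the hot tier; the other two stay where they are cheap.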
What Build 14 actually adds
The architectural answer is the two-tier metric backend. The semantic layer's metric definition does not change — gmv is still the same SQL with the same filters. But the registry now points at two execution backends, and the consumer chooses (or the layer chooses for them) which to query based on a freshness parameter.
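One possible shape for that consumer-facing knob, as a sketch — the backend list, the lag floors, and the `max_staleness_s` parameter name are all assumptions for illustration, not a real API:

```python
# The consumer-facing shape of a two-tier metric query: same definition,
# one extra freshness parameter. Backends and floors are hypothetical.
BACKENDS = [
    ("realtime_olap", 7),       # (name, ingest-lag floor in seconds)
    ("warehouse", 6 * 3600),
]

def query_metric(name: str, max_staleness_s: float) -> str:
    """Pick the cheapest backend whose freshness floor fits the budget.
    BACKENDS is listed freshest-first and the warehouse is assumed cheapest,
    so we take the last backend that still honours the budget."""
    eligible = [b for b, floor in BACKENDS if floor <= max_staleness_s]
    if not eligible:
        raise ValueError(f"no backend can honour {max_staleness_s}s for {name}")
    return eligible[-1]

print(query_metric("gmv", max_staleness_s=24 * 3600))    # warehouse is fine
print(query_metric("live_contest_entries", 15))          # needs the hot tier
```

The metric name is the same in both calls; only the staleness budget differs, and that is what moves the query between tiers.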
This is the architecture every chapter of Build 14 will build out: ClickHouse for high-cardinality scans (chapter 105), Pinot for the user-facing analytics use case Uber pioneered (chapter 106), Druid for the segment-based time-series tradition (chapter 107), the StarRocks/Doris next wave (chapter 108), pre-aggregation strategies and their costs (chapter 109), and the p99-latency-under-load problem that ties them all together (chapter 110).
Common confusions
- "Snowpipe streaming makes Snowflake real-time." Snowpipe streaming reduces ingest lag from ~5 minutes to ~10 seconds. It does not change Snowflake's query engine, which was designed for scanning large columnar files and continues to take 5–30 seconds for queries against freshly-landed micro-files. Real-time ingest with batch-speed query is not real-time analytics.
- "You can just point your dashboard at Kafka directly." Kafka is a log, not a query engine. It can stream events but cannot answer `SUM(entry_fee_paise) WHERE contest_id = 12 GROUP BY state` in 200 ms; doing that requires an indexed, columnar, query-optimised store on top of the stream. That is exactly what ClickHouse and Pinot are.
- "Materialised views in Postgres / Snowflake solve this." A materialised view is a stored result that gets refreshed on a cadence — typically minutes to hours. It is faster to query than a from-scratch query but it is not fresher; refresh latency is the new freshness floor. For sub-second freshness you need a tier whose ingest path is streaming-native, not a batch-refreshed cache.
- "Real-time analytics is just a faster warehouse." No — they are different storage architectures with different tradeoffs. Real-time OLAP stores trade off long-tail historical queries (some only keep 30 days hot), arbitrary join flexibility (most are denormalised-friendly only), and write transactions (most do not support multi-row updates). You give those things up to get sub-second freshness.
- "You should put all your metrics on the real-time tier to avoid the dual-backend complexity." ClickHouse-style stores are 5–20× more expensive per GB-month than Snowflake-style warehouses for cold data. Putting your three years of cohort history on ClickHouse to support one live-match dashboard is how a ₹2 crore/month bill becomes a ₹40 crore/month bill. Two-tier exists because the cost curves of the two tiers are genuinely different.
Going deeper
Why the freshness floor is not a tuning parameter
It is tempting to think "if we just ran the warehouse batch every 5 minutes instead of every 6 hours, freshness would be solved". The math defeats this. Snowflake's per-query overhead (cluster spin-up, query planning, manifest read) is ~1–3 seconds; running a 5-minute batch means 1–3 seconds of every 5 minutes — roughly 1% — is overhead. Reasonable. But the file-size floor is the killer: writing 5 minutes of events into a Parquet file gives you a few thousand rows, which compresses to ~50 KB; you cannot store millions of these without destroying query performance, because each query has to enumerate and open thousands of tiny files.

Why the small-file problem is structural to columnar warehouses: row-group statistics (min/max per column) are what makes columnar scans skip irrelevant data. With 1000-row row groups, those stats are too coarse to skip — every query reads almost every file. The 30–60-minute batch cadence in production warehouses is not laziness; it is the cadence at which row groups become big enough for stats to be useful.

Real-time OLAP stores solve this with append-friendly LSM-tree-style structures (ClickHouse's MergeTree, Druid's segment compaction) that ingest small writes and merge them later, decoupling ingest cadence from query-optimal file size.
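A toy cost model makes the small-file cliff visible. Every constant here is an illustrative guess, not a benchmark, and min/max stats are assumed to prune perfectly on the time column — the point is that per-file overhead alone buries tiny files:

```python
# Toy cost model for "scan the last hour" over 30 days of events, as file
# size varies. All constants are illustrative guesses, not benchmarks.
import math

TOTAL_ROWS = 30 * 24 * 3600 * 1000   # 30 days at 1k events/sec
MATCH_ROWS = 3600 * 1000             # the one hour the query actually needs
MANIFEST_MS_PER_FILE = 0.01          # checking one file's min/max entry
OPEN_MS_PER_FILE = 2.0               # open + footer read for a matching file
SCAN_MS_PER_MILLION_ROWS = 50.0      # columnar scan speed

def scan_cost_ms(rows_per_file: int) -> float:
    total_files = TOTAL_ROWS // rows_per_file
    matching_files = math.ceil(MATCH_ROWS / rows_per_file)
    return (total_files * MANIFEST_MS_PER_FILE     # enumerate every file's stats
            + matching_files * OPEN_MS_PER_FILE    # open only the pruned-in files
            + MATCH_ROWS / 1e6 * SCAN_MS_PER_MILLION_ROWS)

for rpf in (1_000, 100_000, 1_000_000):
    print(f"{rpf:>9,d} rows/file -> {scan_cost_ms(rpf):>10,.0f} ms")
```

Even with pruning assumed perfect, the 1,000-row files lose by two orders of magnitude: the manifest walk and per-file opens dominate long before a single row is scanned.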
The three flavours of "real-time" — and which one Build 14 means
The term "real-time" gets stretched across three meaningfully different latency budgets.
- (a) Reactive — sub-100 ms — request-response interactive UX, where a fraud rule has to fire before the payment gateway responds, or a feature lookup has to land before a model scores. The store has to support point lookups by key, not just scans.
- (b) Operational — 1–10 seconds — the live-match-dashboard, surge-pricing-knob band where a human is making a decision with a 30-second loop. The store has to support GROUP BY scans on the most recent few minutes of events.
- (c) Near-real-time — 30 seconds to 5 minutes — the operations-monitoring band where minute-level lag is acceptable. Snowpipe streaming and Databricks Delta Live Tables sit here.
Build 14 lives in band (b) primarily and bleeds into (a) for things like fraud-feature lookup (Build 15's territory). Knowing which band your question is in is what tells you which tier to build — and conversely, when a vendor markets "real-time" without naming a band, treat that as a yellow flag.
What Cred actually did when their fraud rules outgrew the warehouse
The Cred fraud team's 2023 incident is a clean case study. Their fraud-decisioning service was reading from an hourly-refreshed warehouse fact table; a coordinated card-testing attack ran for 47 minutes during which their rule engine saw zero of the 11,200 fraudulent attempts (because they all happened within the current hour, which had not yet been ingested). Loss: about ₹2.3 crore in chargebacks. The fix was a Pinot tier ingesting the same Kafka topic in parallel with the batch ETL; the fraud rule was rewritten against Pinot SQL. Detection lag dropped from 60+ minutes to 6–8 seconds. The warehouse stayed exactly where it was; the rule engine's query path was the only thing that changed. This is the typical adoption shape — you do not replace the warehouse, you sit a real-time tier next to it for the workloads that need it.
Why the semantic layer becomes more important, not less
When you have one execution backend, the semantic layer is a convenience. When you have two backends with different freshness floors, the semantic layer is the only place where the metric definition can stay consistent. If gmv is defined slightly differently in Snowflake DDL and in ClickHouse DDL, you have two metrics with the same name — exactly the disaster Build 13 set out to prevent.
The discipline of defining gmv once in dbt YAML and compiling it down to both backends is the architectural force that holds a multi-tier stack together. Cube and dbt SL both treat ClickHouse/Pinot as pluggable adapters precisely so the YAML stays the source of truth across tiers. In practical terms this means a single PR review changes the metric in both tiers atomically — no drift between the "real-time GMV" and the "warehouse GMV", because there is only one metric and two compiled artefacts.
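A deliberately minimal sketch of the compile-to-two-targets idea — real semantic layers (Cube, dbt SL) do far more, and every name here (tables, the dialect-specific date predicates) is hypothetical:

```python
# One metric definition, two compiled artefacts. All table names and
# dialect details are illustrative, not from any real deployment.
GMV = {
    "measure": "SUM(order_value_paise)",
    "filter":  "order_status = 'COMPLETED'",
}

TARGETS = {
    "snowflake":  {"table": "analytics.fct_orders",
                   "recent": "order_ts >= DATEADD(day, -1, CURRENT_TIMESTAMP)"},
    "clickhouse": {"table": "rt.fct_orders_stream",
                   "recent": "order_ts >= now() - INTERVAL 1 DAY"},
}

def compile_metric(metric: dict, target: str) -> str:
    """Render the single definition into a target-dialect SQL string."""
    t = TARGETS[target]
    return (f"SELECT {metric['measure']} AS gmv FROM {t['table']} "
            f"WHERE {metric['filter']} AND {t['recent']}")

print(compile_metric(GMV, "snowflake"))
print(compile_metric(GMV, "clickhouse"))
```

The measure and filter appear once; only the table and the dialect-specific freshness predicate vary per target, which is exactly the drift the single-definition discipline prevents.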
Cost, working sets, and the IRCTC tatkal lesson
Per-GB-month, a Snowflake cold tier (with reasonable clustering) is roughly ₹1.5/GB-month; a ClickHouse hot tier on i3-class instances is ~₹15–25/GB-month. For a Dream11-scale company with ~80 TB of historical match-entry data, putting it all on ClickHouse would be ~₹16 lakh/month vs ~₹1.2 lakh/month on Snowflake. But the hot working set — the last 7 days of live data needed for in-match decisions — is only ~2 TB. Roughly ₹40,000/month on ClickHouse for the hot set plus ~₹1.2 lakh/month on Snowflake for the cold set, with the semantic layer routing between them, is dramatically cheaper than the all-ClickHouse extreme — and unlike the all-Snowflake extreme, it actually honours the live freshness budget. Why two-tier wins economically: the cost curves cross at the boundary between hot working set and warm/cold archive. Workloads that need hot-tier freshness are typically a small fraction of the data volume; archives are large but tolerate slow query. Putting each workload on the tier whose cost curve matches its access pattern is the optimum, and the two-tier stack is what falls out.
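The crossover arithmetic, computed from the per-GB-month rates quoted above (sizes as in the text; the ClickHouse rate taken at the midpoint of the quoted range):

```python
# Two-tier cost arithmetic from the quoted per-GB-month rates.
SNOWFLAKE_INR_PER_GB = 1.5
CLICKHOUSE_INR_PER_GB = 20.0   # midpoint of the quoted 15-25 range
TOTAL_GB = 80_000              # ~80 TB of history
HOT_GB = 2_000                 # ~2 TB live working set

all_on_warehouse = TOTAL_GB * SNOWFLAKE_INR_PER_GB
all_on_clickhouse = TOTAL_GB * CLICKHOUSE_INR_PER_GB
# Two-tier keeps the full history on the warehouse and duplicates only
# the hot working set onto the real-time tier.
two_tier = HOT_GB * CLICKHOUSE_INR_PER_GB + TOTAL_GB * SNOWFLAKE_INR_PER_GB

for label, inr in [("all warehouse", all_on_warehouse),
                   ("all ClickHouse", all_on_clickhouse),
                   ("two-tier", two_tier)]:
    print(f"{label:15s} ₹{inr:>12,.0f}/month")
```

Two-tier costs only marginally more than warehouse-only while buying live freshness for the hot set, and an order of magnitude less than putting everything on the hot tier.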
The same logic applied at IRCTC's tatkal window — 10:00 IST sharp, popular trains saturate inventory in roughly 90 seconds — forces a sub-30-second freshness budget, which forces a streaming-native backend, which forces a separate hot tier. A 90-second decision window cannot be negotiated with marketing copy or a faster Airflow DAG. The architecture follows from the loop length, not the other way around. PhonePe's data platform team built on this insight in Q3 2024 with an additional discipline: every dashboard tile ships a freshness-status dot driven by the metric registry — green if data is within expected freshness, red if lag is more than 2× expected. The dot exposes which tiles silently need a real-time tier, and PhonePe reported that ~60% of their real-time adoption decisions came from PMs spotting red dots, not from architecture review.
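The freshness-dot rule described above reduces to a few lines. The green and red thresholds come from the text; the amber middle band and all names are our own addition for illustration:

```python
# PhonePe-style freshness dot: green if within expected freshness, red if
# lag exceeds 2x expected. The amber band in between is our own addition --
# the text specifies only the green and red states.
def freshness_dot(observed_lag_s: float, expected_s: float) -> str:
    if observed_lag_s <= expected_s:
        return "green"
    if observed_lag_s <= 2 * expected_s:
        return "amber"
    return "red"

# live_contest_entries expects PT15S freshness:
print(freshness_dot(10, 15))   # within budget
print(freshness_dot(40, 15))   # more than 2x expected
```

The input pair is exactly what the metric registry already holds (observed lag from the ingest pipeline, `expected_freshness` from the annotation), which is why the dot can be driven by the registry with no per-dashboard configuration.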
Where this leads next
This chapter closes Build 13 by naming the wall — the freshness ceiling of a batch warehouse is the freshness floor of every metric the semantic layer can serve. The metric definition is now the contract; the execution path is now plural. Build 14 spends seven chapters making the second execution path concrete: what an OLAP-on-stream engine actually looks like (/wiki/olap-on-a-stream-whats-different), how ClickHouse organises bytes for it (/wiki/clickhouse-columnar-for-real-time), the Uber-pioneered Pinot use case (/wiki/apache-pinot-and-the-uber-use-case), Druid's segment model (/wiki/druid-and-its-segment-model), the StarRocks/Doris wave (/wiki/starrocks-doris-and-the-next-wave), the pre-aggregation tradeoff (/wiki/pre-aggregation-materialized-views-and-their-costs), and the p99-under-ingest-pressure problem that is the production reality (/wiki/serving-p99-latency-under-ingest-pressure).
What carries forward from Build 13 — and what every Build 14 chapter assumes — is the metric-definition-as-contract instinct: when you add a real-time tier you do not add a second metric, you add a second execution path for the same metric. The dbt YAML stays the source of truth. ClickHouse is a faster compiler target, not a parallel definition.
References
- Snowflake — Snowpipe streaming reference — the official "real-time Snowflake" path; useful for understanding what it can and cannot do.
- Uber Engineering — Real-time analytics with Apache Pinot — the canonical case study for a low-latency OLAP tier next to a batch warehouse.
- ClickHouse documentation — MergeTree engine — the storage engine that solves the small-file problem warehouses cannot.
- Apache Druid — Architecture — the segment-based predecessor to most modern real-time OLAP designs.
- dbt Semantic Layer — multi-engine support — the registry-pluggable-adapter pattern that makes two-tier metric backends possible.
- Cube — federated analytics — Cube's pluggable backend model.
- /wiki/semantic-layer-llms-the-new-interface — the previous chapter; the reason the freshness wall matters even more once the LLM is in the loop.
- /wiki/olap-on-a-stream-whats-different — the next chapter; what a real-time OLAP engine actually does differently.