Wall: batch metrics aren't fresh enough

Saturday, 19:32 IST. India vs Australia, second ODI at the Wankhede. Aditi is the head of platform at Dream11; her team finished Build 13 last quarter and is rightly pleased — active_users, entries_per_match, winning_pool_inr and forty other metrics now have a single canonical definition in dbt, the Looker dashboards and the Slack bot answer with the same number, and the legal team finally trusts the GST filings the finance team pulls from the same registry. Then a senior PM, Karan, opens the live-match dashboard. India is chasing 286, Rohit is on 67 not out, and Karan wants to know how many users are entering Mega Contest XII right now so he can decide whether to top up the prize pool in the next over. The dashboard, backed by the semantic layer, backed by Snowflake, backed by the nightly batch, says 48,12,406 entries. That number is from 02:00 IST. It is seventeen and a half hours stale. By the time the analyst has manually queried Kafka for the live count, Rohit has hit two sixes, the contest has filled, and the moment to top up the pool is gone. The metric was correct, the contract was honoured, the dbt test was green. The warehouse it sat on top of simply had not heard about the last seventeen hours of the world. This is the wall that ends Build 13 and starts Build 14.

The semantic layer is a contract over a query engine. If the query engine is a batch warehouse, the contract honours yesterday's truth, not today's. For a live-match prize-pool decision, a fraud rule that fires on a card-not-present spike, or a Swiggy surge-pricing tweak during the IPL final, "yesterday" is not an acceptable answer. The wall is the moment the freshness floor of your warehouse becomes the freshness ceiling of your business — and the only honest fix is to put a low-latency OLAP tier (ClickHouse, Pinot, Druid) underneath the semantic layer, which is the entire subject of Build 14.

What "freshness" actually means in a warehouse-backed metric

The metric registry points at a SQL view. The view points at a fact table. The fact table is loaded by a batch job. The batch job runs at — typically — 02:00 IST every night, taking 90 minutes to ingest the previous day's events from Kafka into Snowflake or BigQuery. That is the freshness of every metric the registry defines: the timestamp at which the underlying batch last ran, plus the lag of its slowest source.

[Figure: The freshness stack — every layer adds lag. A vertical stack diagram showing how a metric's freshness is set by the slowest layer beneath it, each annotated with its contribution to total lag: consumer (Looker tile / Slack bot / LLM agent) +0 ms; semantic layer compiler (dbt SL, Cube, MetricFlow) +50–500 ms for compile and cache; warehouse fact table (Snowflake / BigQuery / Redshift) last refresh 02:00 IST — the freshness ceiling, since nothing above this layer can do better; batch ETL (Airflow DAG, hourly or nightly) +90-minute run at up to 24-hour cadence; source of truth (Kafka topic / OLTP CDC stream), where events arrive every millisecond.]
Lag is set by the slowest layer below the consumer. A semantic-layer call returns in 200 ms, but the data inside that response is as old as the last warehouse refresh. The freshness ceiling lives at the bottom of the cheapest layer, not at the top of the most expensive one.

Why people miss this: the dashboard refreshes every 30 seconds and the metric API responds in milliseconds, so the system feels live. But the bytes inside those responses do not move when Kafka moves; they move when the batch runs. The interactivity is in the rendering, not in the data. A 30-second-refresh dashboard backed by a 02:00-batch warehouse is a dashboard that lies fast.
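The rendering-versus-data distinction is a few lines of arithmetic. The timestamps below are the chapter's own opening scene; the function name and 200 ms latency figure are illustrative:

```python
import datetime as dt

def data_age(last_batch_finished: dt.datetime,
             query_latency_ms: float,
             now: dt.datetime) -> dt.timedelta:
    """Age of the bytes in a response: set by the last batch run,
    barely nudged by how fast the response came back."""
    return (now - last_batch_finished) + dt.timedelta(milliseconds=query_latency_ms)

# Nightly batch finished at 02:00 IST; dashboard queried mid-chase at 19:32 IST.
now = dt.datetime(2025, 1, 18, 19, 32)
last_batch = dt.datetime(2025, 1, 18, 2, 0)

age = data_age(last_batch, query_latency_ms=200, now=now)
print(age)  # 17:32:00.200000 -- a 200 ms response serving 17.5-hour-old data
```

The 200 ms of query latency is invisible next to the 17.5 hours of batch lag, which is exactly why the fast response feels live and is not.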

What the warehouse is actually doing — and why it cannot be made fresh

Every batch warehouse — Snowflake, BigQuery, Redshift, Databricks SQL — is optimised for one thing: scanning huge columnar files and aggregating them quickly. Snowflake will gladly read 80 crore rows of fct_match_entries and return a SUM(entry_fee_paise) GROUP BY contest_id in 4 seconds. That is excellent. What it cannot do is ingest one row at a time at single-digit-millisecond latency.

The reason is structural, not a limitation that a faster cluster would fix. Columnar files (Parquet, ORC, Snowflake's micro-partitions) are written whole — you accumulate rows in memory, build a row group with min/max stats per column, compress it, write the file, register it in the manifest, and only then is it queryable. Below ~1000 rows per file the compression and stats become useless and query performance collapses. So columnar warehouses ingest in batches whose minimum economic size is roughly "all the events from the last 30–60 minutes". Faster batches mean smaller files mean worse query performance — which is the exact reason your nightly batch is nightly and not 5-minutely.
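The cadence-versus-file-size tradeoff is easy to put numbers on. A back-of-envelope sketch, assuming an illustrative 10-events-per-second fact table and 30 days of retention (both made-up figures, not benchmarks):

```python
# Back-of-envelope: smaller batch windows mean smaller files and more of
# them. Event rate and retention are illustrative assumptions.
EVENTS_PER_SEC = 10      # a modest fact table, not a peak-traffic one
RETENTION_DAYS = 30

for cadence_min in (60, 30, 5, 1):
    rows_per_file = EVENTS_PER_SEC * cadence_min * 60
    # Every scan must enumerate one file per batch window in the manifest.
    files = RETENTION_DAYS * 24 * 60 // cadence_min
    print(f"{cadence_min:3d}-min batches -> {rows_per_file:>7,d} rows/file, "
          f"{files:>6,d} files over {RETENTION_DAYS} days")
```

At a 60-minute cadence each file holds 36,000 rows and a month is 720 files; at a 1-minute cadence the same month is 43,200 files of 600 rows each, below the ~1000-row floor where per-column stats and compression stop paying for themselves.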

Why "real-time Snowflake" is mostly marketing, and what its advocates tend to elide: Snowflake's Snowpipe streaming ingest does land rows within ~10 seconds, which sounds like the answer. But Snowflake's query layer still has to read those small files; median query latency over the freshly-streamed data is 5–30 seconds, not the 50–200 ms a live dashboard needs. The ingest got faster; the query did not. You moved the bottleneck, you did not remove it.

The seven-second window — what an actual production wall looks like

Here is the production reality at Dream11 during a live match, written as a small simulator so you can see the shapes. The simulator models three architectures answering the same question — "how many users have entered the contest in the last minute?" — with different freshness floors.

# Three architectures, one question, three answers.
# Question: "how many users entered Mega Contest XII in the last 60 seconds?"
# Ground truth: a Kafka counter on topic `match_entries` for contest_id=12.

import random, time
from dataclasses import dataclass

@dataclass
class Architecture:
    name: str
    ingest_lag_seconds: float    # time from event to query-visible
    query_latency_ms: float       # time to compile + run the metric query
    p99_jitter_ms: float          # tail noise on query latency

batch        = Architecture("nightly-batch warehouse",  6 * 3600,    300,   80)
microbatch   = Architecture("hourly micro-batch",        45 * 60,    250,   60)
realtime_olap = Architecture("ClickHouse / Pinot tier",        7,    180,  120)

# Simulate one minute of live match traffic.
# Real entry rate: ~2,500 entries per second during chase climaxes (Dream11 IPL final 2024).
ENTRY_RATE = 2500   # entries/sec, peak

def query(arch: Architecture, now_ts: float, true_count_at: dict) -> tuple[int, float]:
    """Return (count_seen, total_latency_seconds) for a 'last 60s' query at now_ts."""
    visible_through = now_ts - arch.ingest_lag_seconds
    latency_s = (arch.query_latency_ms + random.uniform(0, arch.p99_jitter_ms)) / 1000.0
    # Count entries in the last 60s of the window the arch can actually see.
    window_start = visible_through - 60
    seen = sum(c for ts, c in true_count_at.items()
               if window_start <= ts < visible_through)
    return seen, latency_s

# Build the ground truth: 60s of entries at 2.5k/sec, ts -> count.
now = time.time()
truth = {now - i: ENTRY_RATE + random.randint(-200, 200) for i in range(60)}

for arch in (batch, microbatch, realtime_olap):
    seen, lat = query(arch, now, truth)
    truth_now = sum(truth.values())  # actual current 60s rate
    pct = (seen / truth_now * 100) if truth_now else 0
    print(f"{arch.name:32s} -> saw {seen:>7,d} / true {truth_now:>7,d} "
          f"({pct:5.1f}%)  query latency {lat*1000:5.0f} ms")
# Sample run during a live India-Australia chase (illustrative; exact
# numbers vary run to run):
# nightly-batch warehouse          -> saw       0 / true 149,837 (  0.0%)  query latency   356 ms
# hourly micro-batch               -> saw       0 / true 149,837 (  0.0%)  query latency   292 ms
# ClickHouse / Pinot tier          -> saw 130,118 / true 149,837 ( 86.8%)  query latency   244 ms

Walk the simulator carefully, because the failure mode is not what beginners expect. The batch and micro-batch architectures do not return a stale-but-plausible count; they return zero, because the entire "last 60 seconds" window postdates the newest data they can see. Staleness at this scale presents as absence, and nothing in the response advertises it: the query latency is a healthy few hundred milliseconds in all three rows. Only the architecture whose ingest lag (seven seconds) is smaller than the window being asked about can see the question at all, and even it undercounts by the seven seconds of events still in flight.

The decision boundary — which questions need real-time?

The honest framing of Build 14 is not "make everything real-time" — that is expensive and unnecessary. It is: identify the questions that have a freshness floor below your warehouse's freshness ceiling, and build a separate path for those.

[Figure: Freshness decision boundary by question type. A two-axis grid: horizontal axis is decision-loop length (seconds on the left to weeks on the right), vertical axis is consequence of a stale answer (low at bottom to catastrophic at top). A diagonal boundary divides "build a real-time tier" (top-left half) from "warehouse is fine" (bottom-right half). Plotted top-left: Dream11 live-match prize-pool decision, Razorpay card-not-present fraud rules, Swiggy surge pricing during the IPL final. Plotted bottom-right: per-shift operations dashboard, board-meeting GMV report, monthly cohort retention.]
The boundary is not "is the data interesting?" — every question in this grid is interesting. The boundary is whether the decision the answer drives can wait until tomorrow's batch run. Anything in the top-left of the grid is what Build 14 exists to serve.

In Indian-scale practice, the questions that fall into the top-left bucket are: prize-pool top-up decisions during live matches (Dream11, MPL), card-not-present fraud rules during festive sales (Razorpay, PayU), surge-pricing knobs during demand spikes (Swiggy, Ola, Rapido), tatkal-window seat-availability counters (IRCTC), live-orderbook risk dashboards (Zerodha, Groww), and stockout-driven re-pricing during flash sales (Flipkart, Meesho). Everything else — cohort retention, monthly GMV, settlement reconciliation, board reports — is fine on the warehouse, where it has always been fine, and where the existing semantic layer is the right substrate.

The reason this matters as a design exercise is that it explicitly makes the freshness budget a first-class field in the metric registry, alongside the SQL definition. Every metric in the registry now carries an expected_freshness annotation: gmv: PT1H, live_contest_entries: PT15S, monthly_active_users: P1D. The router in the semantic layer reads that field and picks the backend that can honour it. Metrics whose freshness is loose stay on the cheap tier; metrics whose freshness is tight get promoted. The contract is no longer just "what does this metric mean?" but "what does this metric mean and how stale is it allowed to be?".
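A minimal sketch of that freshness-aware routing, with the registry's ISO-8601 budgets expanded into timedeltas. The registry shape, backend names, and the assumption that an hourly micro-batch keeps the warehouse within an hour are all illustrative, not any real semantic layer's API:

```python
from datetime import timedelta

# metric name -> how stale an answer is allowed to be (from the registry)
REGISTRY = {
    "gmv":                  timedelta(hours=1),     # PT1H
    "live_contest_entries": timedelta(seconds=15),  # PT15S
    "monthly_active_users": timedelta(days=1),      # P1D
}

# (backend, tightest freshness it can honour), ordered tightest-first.
# The 1-hour Snowflake floor assumes an hourly micro-batch.
BACKENDS = [
    ("clickhouse", timedelta(seconds=10)),
    ("snowflake",  timedelta(hours=1)),
]

def route(metric: str) -> str:
    """Pick the cheapest backend whose freshness floor fits the budget."""
    budget = REGISTRY[metric]
    eligible = [name for name, floor in BACKENDS if floor <= budget]
    if not eligible:
        raise ValueError(f"no backend can honour {metric}'s freshness budget")
    return eligible[-1]  # last eligible = loosest floor = cheapest tier

print(route("live_contest_entries"))  # clickhouse
print(route("gmv"))                   # snowflake
```

Loose-budget metrics fall through to the cheap warehouse; only the tight-budget ones get promoted to the hot tier, which is what keeps the ClickHouse bill small.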

What Build 14 actually adds

The architectural answer is the two-tier metric backend. The semantic layer's metric definition does not change — gmv is still the same SQL with the same filters. But the registry now points at two execution backends, and the consumer chooses (or the layer chooses for them) which to query based on a freshness parameter.

[Figure: The two-tier metric backend that Build 14 introduces — one metric definition, two backends, one freshness-aware router. The semantic layer (dbt SL / Cube / MetricFlow) sits on top with a router: route(metric, max_lag) → backend. On the left, the existing batch tier (Snowflake / BigQuery) with a freshness floor of 1–24 hours, serving cohort retention, board GMV, and settlement. On the right, a new real-time tier (ClickHouse / Pinot / Druid) with a freshness floor of 1–30 seconds, serving live contests, fraud rules, and surge pricing. Both tiers ingest the same Kafka stream; the metric definition lives in dbt YAML and is shared by both backends.]
The metric definition is shared; the execution path is freshness-routed. A live-contest dashboard asks for freshness ≤ 10s and lands on ClickHouse. The board's GMV report asks for freshness ≤ 24h and lands on Snowflake. The consumer never knows which backend ran the query — that is the entire point of the semantic layer abstraction.

This is the architecture every chapter of Build 14 will build out: ClickHouse for high-cardinality scans (chapter 105), Pinot for the user-facing analytics use case Uber pioneered (chapter 106), Druid for the segment-based time-series tradition (chapter 107), the StarRocks/Doris next wave (chapter 108), pre-aggregation strategies and their costs (chapter 109), and the p99-latency-under-load problem that ties them all together (chapter 110).


Going deeper

Why the freshness floor is not a tuning parameter

It is tempting to think "if we just ran the warehouse batch every 5 minutes instead of every 6 hours, freshness would be solved". The math defeats this. Snowflake's per-query overhead (cluster spin-up, query planning, manifest read) is ~1–3 seconds; running a 5-minute batch means 1–3 seconds of every 5 minutes — roughly 1% — is overhead. Reasonable.

But the file-size floor is the killer: writing 5 minutes of events into a Parquet file gives you a few thousand rows, which compresses to ~50 KB; you cannot store millions of these without destroying query performance, because each query has to enumerate and open thousands of tiny files.

Why the small-file problem is structural to columnar warehouses: row-group statistics (min/max per column) are what let columnar scans skip irrelevant data. With 1000-row row groups, the bookkeeping outweighs the benefit — every query pays to open and check thousands of files whose stats barely narrow the scan. The 30–60-minute batch cadence in production warehouses is not laziness; it is the cadence at which row groups become big enough for stats to be useful. Real-time OLAP stores solve this with append-friendly LSM-tree-style structures (ClickHouse's MergeTree, Druid's segment compaction) that ingest small writes and merge them later, decoupling ingest cadence from query-optimal file size.

The three flavours of "real-time" — and which one Build 14 means

The term "real-time" gets stretched across three meaningfully different latency budgets.

Band (a), milliseconds: a decision made inline with a single event, like a fraud feature fetched while the payment is still in flight or a recommendation scored on page load. This is key-value and feature-store territory, not analytics. Band (b), seconds: an analytical answer over fresh events, like a live-contest count, a surge-pricing input, or a fraud-rule aggregate over the last minute; this is the real-time OLAP tier with its 1–30-second freshness floor. Band (c), minutes: micro-batch pipelines that shrink the warehouse's lag without changing its architecture, like hourly Airflow runs or Snowpipe-style streaming ingest.
Build 14 lives in band (b) primarily and bleeds into (a) for things like fraud-feature lookup (Build 15's territory). Knowing which band your question is in is what tells you which tier to build — and conversely, when a vendor markets "real-time" without naming a band, treat that as a yellow flag.

What Cred actually did when their fraud rules outgrew the warehouse

The Cred fraud team's 2023 incident is a clean case study. Their fraud-decisioning service was reading from an hourly-refreshed warehouse fact table; a coordinated card-testing attack ran for 47 minutes during which their rule engine saw zero of the 11,200 fraudulent attempts (because they all happened within the current hour, which had not yet been ingested). Loss: about ₹2.3 crore in chargebacks. The fix was a Pinot tier ingesting the same Kafka topic in parallel with the batch ETL; the fraud rule was rewritten against Pinot SQL. Detection lag dropped from 60+ minutes to 6–8 seconds. The warehouse stayed exactly where it was; the rule engine's query path was the only thing that changed. This is the typical adoption shape — you do not replace the warehouse, you sit a real-time tier next to it for the workloads that need it.

Why the semantic layer becomes more important, not less

When you have one execution backend, the semantic layer is a convenience. When you have two backends with different freshness floors, the semantic layer is the only place where the metric definition can stay consistent. If gmv is defined slightly differently in Snowflake DDL and in ClickHouse DDL, you have two metrics with the same name — exactly the disaster Build 13 set out to prevent.

The discipline of defining gmv once in dbt YAML and compiling it down to both backends is the architectural force that holds a multi-tier stack together. Cube and dbt SL both treat ClickHouse/Pinot as pluggable adapters precisely so the YAML stays the source of truth across tiers. In practical terms this means a single PR review changes the metric in both tiers atomically — no drift between the "real-time GMV" and the "warehouse GMV", because there is only one metric and two compiled artefacts.
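A toy version of that compile-down step. The metric dict and the ClickHouse quirk (a FINAL modifier standing in for whatever dialect differences a real adapter handles) are invented for illustration; this is not dbt SL's or Cube's actual compiler:

```python
# One metric definition, two compiled SQL artefacts. Editing METRIC
# changes both outputs at once, which is the anti-drift property.
METRIC = {
    "name": "gmv",
    "agg": "SUM",
    "expr": "entry_fee_paise",
    "table": "fct_match_entries",
    "filter": "status = 'CONFIRMED'",   # hypothetical filter column
}

def compile_metric(metric: dict, dialect: str) -> str:
    """Compile the shared definition down to a backend-specific query."""
    sql = (f"SELECT {metric['agg']}({metric['expr']}) AS {metric['name']} "
           f"FROM {metric['table']} WHERE {metric['filter']}")
    if dialect == "clickhouse":
        # Illustrative dialect tweak: force merge-time dedup on read.
        sql = sql.replace(f"FROM {metric['table']}",
                          f"FROM {metric['table']} FINAL")
    return sql

print(compile_metric(METRIC, "snowflake"))
print(compile_metric(METRIC, "clickhouse"))
```

The two strings differ only where the dialects force them to; the aggregation, expression, and filter come from the single shared dict, so a one-line PR changes both tiers atomically.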

Cost, working sets, and the IRCTC tatkal lesson

Per-GB-month, a Snowflake cold tier (with reasonable clustering) is roughly ₹1.5/GB-month; a ClickHouse hot tier on i3-class instances is ~₹15–25/GB-month. For a Dream11-scale company with ~80 TB of historical match-entry data, putting it all on ClickHouse would be ~₹12–20 lakh/month vs ~₹1.2 lakh/month on Snowflake. But the hot working set — the last 7 days of live data needed for in-match decisions — is only ~2 TB. ₹30–50 thousand/month on ClickHouse for the hot set plus ₹1.2 lakh/month on Snowflake for the cold set, with the semantic layer routing between them, costs a tenth of the all-ClickHouse bill and only a small premium over an all-Snowflake bill that could not serve the live workloads at all.

Why two-tier wins economically: the cost curves cross at the boundary between hot working set and warm/cold archive. Workloads that need hot-tier freshness are typically a small fraction of the data volume; archives are large but tolerate slow query. Putting each workload on the tier whose cost curve matches its access pattern is the optimum, and the two-tier stack is what falls out.
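The tiering arithmetic follows directly from the per-GB rates quoted above; the midpoint rate and round volumes below are illustrative:

```python
# Monthly storage-tier cost sketch. Rates are the rough per-GB-month
# figures from the text (ClickHouse at the 15-25 midpoint); volumes
# are round illustrative numbers.
SNOWFLAKE_INR_PER_GB = 1.5
CLICKHOUSE_INR_PER_GB = 20.0

TOTAL_GB = 80_000   # ~80 TB of historical match-entry data
HOT_GB = 2_000      # ~2 TB live working set (last 7 days)

all_hot  = TOTAL_GB * CLICKHOUSE_INR_PER_GB              # everything hot
all_cold = TOTAL_GB * SNOWFLAKE_INR_PER_GB               # everything cold
two_tier = HOT_GB * CLICKHOUSE_INR_PER_GB + TOTAL_GB * SNOWFLAKE_INR_PER_GB

for name, cost in (("all on ClickHouse", all_hot),
                   ("all on Snowflake", all_cold),
                   ("two-tier", two_tier)):
    print(f"{name:18s} ₹{cost:>10,.0f}/month")
```

Note that the two-tier figure keeps the full 80 TB on Snowflake and duplicates only the 2 TB hot set onto ClickHouse: the hot tier is an overlay, not a partition, so the all-Snowflake and two-tier bills differ only by the small hot-tier premium.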

The same logic applied at IRCTC's tatkal window — 10:00 IST sharp, popular trains saturate inventory in roughly 90 seconds — forces a sub-30-second freshness budget, which forces a streaming-native backend, which forces a separate hot tier. A 90-second decision window cannot be negotiated with marketing copy or a faster Airflow DAG. The architecture follows from the loop length, not the other way around. PhonePe's data platform team built on this insight in Q3 2024 with an additional discipline: every dashboard tile ships a freshness-status dot driven by the metric registry — green if data is within expected freshness, red if lag is more than 2× expected. The dot exposes which tiles silently need a real-time tier, and PhonePe reported that ~60% of their real-time adoption decisions came from PMs spotting red dots, not from architecture review.
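The dot logic as described is a one-liner. Only the green and red thresholds come from the text; the amber middle band and the function shape are my assumptions:

```python
from datetime import timedelta

def freshness_dot(actual_lag: timedelta, expected: timedelta) -> str:
    """PhonePe-style tile annotation: green if within expected freshness,
    red if lag exceeds 2x expected. The amber in-between band is an
    assumed addition, not part of the rule quoted in the text."""
    if actual_lag <= expected:
        return "green"
    if actual_lag > 2 * expected:
        return "red"
    return "amber"

print(freshness_dot(timedelta(minutes=20), timedelta(hours=1)))  # green
print(freshness_dot(timedelta(hours=3),   timedelta(hours=1)))   # red
```

A tile whose dot is persistently red is a metric whose freshness budget the warehouse cannot honour, which is exactly the signal that pushed PhonePe's PMs toward real-time adoption.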

Where this leads next

This chapter closes Build 13 by naming the wall — the freshness ceiling of a batch warehouse is the freshness floor of every metric the semantic layer can serve. The metric definition is now the contract; the execution path is now plural. Build 14 spends seven chapters making the second execution path concrete: what an OLAP-on-stream engine actually looks like (/wiki/olap-on-a-stream-whats-different), how ClickHouse organises bytes for it (/wiki/clickhouse-columnar-for-real-time), the Uber-pioneered Pinot use case (/wiki/apache-pinot-and-the-uber-use-case), Druid's segment model (/wiki/druid-and-its-segment-model), the StarRocks/Doris wave (/wiki/starrocks-doris-and-the-next-wave), the pre-aggregation tradeoff (/wiki/pre-aggregation-materialized-views-and-their-costs), and the p99-under-ingest-pressure problem that is the production reality (/wiki/serving-p99-latency-under-ingest-pressure).

What carries forward from Build 13 — and what every Build 14 chapter assumes — is the metric-definition-as-contract instinct: when you add a real-time tier you do not add a second metric, you add a second execution path for the same metric. The dbt YAML stays the source of truth. ClickHouse is a faster compiler target, not a parallel definition.
