OLAP on a stream: what's different
It is 21:47 IST on a Saturday. RCB are chasing 198 against CSK at the Chinnaswamy and Aditi's team at Dream11 has just deployed the new contest-pool dashboard onto the real-time tier. The dbt YAML for live_contest_entries is the same one Build 13 shipped — same SQL, same filters, same entry_fee_paise definition. But the query plan that runs against ClickHouse looks nothing like the one Snowflake compiled last quarter. The same metric, the same answer, but a different storage layout, a different index structure, a different flush cadence, a different memory model. This chapter is the part-by-part comparison: what an OLAP engine has to give up about the warehouse world, and what it gets back, when it agrees to answer questions about a stream that has not stopped moving.
A batch OLAP warehouse and a streaming OLAP engine share a SQL surface but disagree on every layer underneath. The warehouse optimises for huge scans over closed columnar files; the streaming engine optimises for sub-second answers over an open, growing index. The four axes that diverge are ingest cadence (whole-file batch vs continuous append), index residency (compute-time vs always-live), storage layout (one big file vs many small segments merged in the background), and query lifecycle (one-shot scan vs incremental over the latest segment). Everything Build 14 covers is a specific engineering response to one of those four axes.
The four axes where the two universes diverge
Start with what is the same: both worlds are columnar (column-major byte layout for fast scans), both speak SQL, both run aggregations as parallel work over partitioned data. If you read a Snowflake EXPLAIN and a ClickHouse EXPLAIN side by side, the operators have the same names — Filter, Aggregate, HashJoin, Sort. That surface similarity is what makes the marketing copy ("ClickHouse is just a fast warehouse!") plausible. It is also what makes the underlying physics confusing.
The four axes where the two universes meaningfully diverge are: ingest cadence, index residency, storage layout, and query lifecycle. Every concrete difference in the rest of this chapter — and the rest of Build 14 — is a working out of one of these four.
Why these four and not, say, "memory vs disk" or "OLTP vs OLAP": the four axes are the ones a streaming engine has to build differently from a warehouse to honour the freshness floor of the previous chapter. Memory-vs-disk is a tactic; OLTP-vs-OLAP is the wrong dichotomy (both worlds are OLAP). The axis split is what produces the chapter list of Build 14 — ClickHouse's MergeTree (axis 3), Pinot's segment-based ingest (axis 1), Druid's segment compaction (axis 3), pre-aggregation (axis 2), p99 under load (axis 4).
Ingest cadence — why a streaming engine cannot wait for the file
The previous chapter showed why Snowflake's batch ingest cadence is structural — row-group statistics need ~1000+ rows to be useful, so writing a Parquet file every 5 minutes destroys query performance. The streaming engine has the same underlying problem (small files are expensive to scan) but solves it on a different timescale: write small segments fast, merge them in the background, query the union.
Concretely: ClickHouse's MergeTree engine, Pinot's offline-and-realtime segments, and Druid's incremental + handoff segments all do the same dance. A new event arrives, gets buffered in memory for a few seconds (the configurable flush interval), is written as a small immutable segment, becomes immediately queryable. Behind the scenes a background process picks pairs (or N-tuples) of small segments and merges them into bigger ones — exactly as an LSM-tree compactor does for a key-value store. The merge does not block ingest; the merge does not block queries; the query just has to know how to read both old big segments and new small ones in the same SELECT.
That is the entire trick. The reader who has used a Postgres BRIN index, a RocksDB instance, or a Lucene segment-merging Elasticsearch cluster has already seen this pattern. Why the same pattern keeps surfacing: append-only writes plus background merges is the unique architecture that decouples ingest cadence from query-optimal file size. If you want to ingest faster than you can compact, the segments stay small and the query slows down. If you want to query faster than you can compact, you cap how small segments can be — which is the freshness floor. ClickHouse, Pinot and Druid all expose flush_interval knobs that control exactly this tradeoff; the default values (1–10 s) reflect the band where most operational analytics lives.
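To see the tradeoff as a mechanism, here is a toy model of append-then-merge — the numbers are invented, not any engine's defaults. Each tick, one flush appends a small immutable segment; a background merger with fixed throughput folds small segments into bigger ones. Hold the merge rate constant and spike ingest 4×, and the segment count grows linearly — the operational signature the storage-layout section below comes back to.

```python
# Toy append-then-merge cadence. Assumed numbers, not engine defaults.
import heapq

FLUSH_ROWS = 12_500       # one 5 s flush at 2,500 rows/s
MERGE_FANIN = 2           # segments consumed per merge
MERGES_PER_TICK = 2       # fixed merge throughput

segments: list[int] = []  # each entry = row count of one immutable segment

def tick(ingest_multiplier: int) -> None:
    """One flush interval: append new small segments, then let the merger run."""
    for _ in range(ingest_multiplier):
        heapq.heappush(segments, FLUSH_ROWS)
    for _ in range(MERGES_PER_TICK):
        if len(segments) >= MERGE_FANIN:
            merged = sum(heapq.heappop(segments) for _ in range(MERGE_FANIN))
            heapq.heappush(segments, merged)  # small parts become one bigger part

for t in range(1, 121):
    tick(ingest_multiplier=1 if t <= 60 else 4)  # ingest spikes 4x at t=60
    if t % 30 == 0:
        print(f"t={t:3d}  segments={len(segments):3d}  largest={max(segments):,} rows")
```

Before the spike the merger keeps the segment count flat; after it, the count climbs every tick while ingest stays "fine" — which is exactly why segment count, not ingest rate, is the number to watch.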
Index residency — the index is alive while you query
In Snowflake, every micro-partition has its zone-map (per-column min/max, distinct-count estimate, null-count) computed at the moment the partition is finalised and then never changed. Querying is "look up zone-map metadata, decide which micro-partitions to read, scan them". The index is dead — frozen at write time and only rebuilt by full-table operations. This is correct for batch: if the data does not change, the index does not need to.
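A sketch of that dead-index model — the zone map is reduced to one column's min/max for illustration. Pruning is a pure metadata comparison, computed once at write time and never touched again:

```python
from dataclasses import dataclass

@dataclass(frozen=True)           # frozen: the zone map never changes after write
class ZoneMap:
    min_contest_id: int
    max_contest_id: int

def partitions_to_scan(zone_maps: list[ZoneMap], contest_id: int) -> list[int]:
    """Prune at plan time: a partition is skipped iff the target value falls
    outside its [min, max] — no row is read to make this decision."""
    return [i for i, z in enumerate(zone_maps)
            if z.min_contest_id <= contest_id <= z.max_contest_id]

zms = [ZoneMap(1, 9), ZoneMap(5, 30), ZoneMap(12, 12), ZoneMap(40, 90)]
print(partitions_to_scan(zms, contest_id=12))   # -> [1, 2]
```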
In a streaming OLAP engine, that does not work. New segments arrive every second; queries land continuously. The index must be alive — receiving updates as segments flush, getting trimmed as old segments expire, supporting concurrent reads while it is being mutated. ClickHouse exposes this as the primary key (a sparse index where every Nth row's key is held in memory) plus per-segment min/max files; Pinot exposes it as star-tree, sorted, range, and inverted indices that are rebuilt per segment and held in memory across all segments. Druid exposes a bitmap index per dimension column, also held in memory.
Memory residency matters because in a 200 ms p99 budget you cannot pay one disk seek per query. Pinot's user-facing analytics use case at Uber is the reference here — 100k QPS at p99 < 50 ms — and the only way you hit that is by holding every dimension's bitmap in RAM and doing the AND/OR of bitmaps in CPU. Druid's segment cache, ClickHouse's mark cache, Pinot's index handler all converge on the same idea: the index is a working set, not a metadata blob. The numbers below come from the public Pinot configuration guide and the Druid operations docs; they are not idealised.
| engine | what's resident in RAM | typical RAM per 100 GB of data |
|---|---|---|
| Snowflake | zone-map metadata only (KB-MB per micro-partition) | negligible |
| ClickHouse | mark cache + primary-key sparse index + uncompressed cache | ~5–10 GB |
| Pinot | per-column inverted/range/sorted indices, dictionaries | ~15–25 GB |
| Druid | bitmap indices, dictionary encodings, segment cache | ~10–20 GB |
The price of keeping the index alive is paid in RAM. ClickHouse's lighter footprint is why operators often pick it for high-cardinality scan-heavy workloads (Cred's fraud rules); Pinot's heavier footprint is why it dominates dashboard-style point-and-aggregate (Uber's UberEats analytics, LinkedIn's profile views, Stripe's merchant dashboards).
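To make "the index is a working set" concrete, here is a toy inverted bitmap index in the Pinot/Druid style, with plain Python ints standing in for the compressed roaring bitmaps real engines hold in RAM. Once the bitmaps are resident, a filter is a bitwise AND plus a popcount — no row data is touched:

```python
import random

N = 1_000_000
STATES = ["MH", "KA", "TN", "DL", "UP"]
rows = [(random.choice(STATES), random.randint(1, 20)) for _ in range(N)]

def build_bitmap(match) -> int:
    """One bitset per (dimension, value): bit i is set iff row i matches."""
    buf = bytearray((N + 7) // 8)
    for i, row in enumerate(rows):
        if match(row):
            buf[i >> 3] |= 1 << (i & 7)
    return int.from_bytes(buf, "little")

# Built once at ingest time, then kept resident: the "alive" working set.
state_ka = build_bitmap(lambda r: r[0] == "KA")
contest_12 = build_bitmap(lambda r: r[1] == 12)

# WHERE state = 'KA' AND contest_id = 12: one AND plus a popcount, zero I/O.
hits = state_ka & contest_12
print(f"matching rows: {hits.bit_count():,}")   # int.bit_count needs Python 3.10+
```

The RAM column in the table above is the sum of bitsets like these across every indexed dimension and every live segment — linear in cardinality, which is why the high-cardinality discussion later in this chapter matters.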
A small simulator: the same query, two universes
Here is the cleanest way to feel the difference — a single Python simulator that models both architectures answering the same question: "how many entries in the last 60 seconds, grouped by state, where contest_id = 12?" The numbers are tuned to public production data: Snowflake's typical query overhead and scan rate per warehouse-credit, ClickHouse's typical merge cadence and primary-index lookup speed.
```python
# Two storage engines, one query, one stream.
# Question: "SELECT state, COUNT(*) FROM entries
#            WHERE contest_id = 12 AND ts > now() - INTERVAL 60 SECOND
#            GROUP BY state"
import random, time
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Engine:
    name: str
    flush_interval_s: float         # how often new data becomes queryable
    primary_index_lookup_us: float  # microseconds to find segments matching contest_id=12
    scan_rate_rows_per_ms: float    # how fast it can scan resolved rows
    overhead_ms: float              # query planning + cluster spin

# Tuned to public benchmarks (rough orders of magnitude, not precise).
warehouse = Engine("Snowflake (batch)",
                   flush_interval_s=3600,         # hourly batch
                   primary_index_lookup_us=400,   # zone-map filtering, mid
                   scan_rate_rows_per_ms=400,     # large warehouse, parallel
                   overhead_ms=1200)              # planner + spinup
streaming = Engine("ClickHouse (MergeTree)",
                   flush_interval_s=5,            # 5-second segments
                   primary_index_lookup_us=8,     # sparse primary key in RAM
                   scan_rate_rows_per_ms=900,     # mark-cache hot path
                   overhead_ms=12)                # no spin, no planner walk

# Stream: 2,500 entries/sec across 10 Indian states, contest_id=12 dominant.
def make_stream(seconds: int = 600) -> list[tuple[float, int, str]]:
    out = []
    states = ["MH", "KA", "TN", "DL", "UP", "WB", "GJ", "TS", "KL", "RJ"]
    for i in range(seconds):
        for _ in range(2500):
            out.append((time.time() - (seconds - i),
                        12 if random.random() < 0.7 else random.randint(1, 20),
                        random.choice(states)))
    return out

def query(engine: Engine, stream: list, now_ts: float) -> tuple[dict, float]:
    # Freshness floor: rows younger than the flush interval are not queryable
    # yet, so each engine answers over its latest *visible* 60-second window.
    visible_through = now_ts - engine.flush_interval_s
    relevant = [r for r in stream
                if r[1] == 12 and visible_through - 60 <= r[0] < visible_through]
    counts = defaultdict(int)
    for _, _, st in relevant:
        counts[st] += 1
    # Cost model: fixed overhead + index lookup + linear scan of the matches.
    scan_ms = max(1, len(relevant) / engine.scan_rate_rows_per_ms)
    total_ms = engine.overhead_ms + engine.primary_index_lookup_us / 1000 + scan_ms
    return dict(counts), total_ms

stream = make_stream(seconds=600)
now = time.time()
for eng in (warehouse, streaming):
    result, latency_ms = query(eng, stream, now)
    total = sum(result.values())
    print(f"{eng.name:32s} -> {total:>7,d} entries "
          f"top: {dict(sorted(result.items(), key=lambda x: -x[1])[:3])} "
          f"latency {latency_ms:6.1f} ms")

# Sample run during a live RCB-CSK chase:
#   Snowflake (batch)      ->       0 entries top: {} latency 1200.4 ms
#   ClickHouse (MergeTree) -> 105,317 entries top: {'MH': 18203, 'KA': 14987, 'TN': 12211} latency 97.6 ms
```
Walk the simulator carefully — six lines do most of the work.
- `flush_interval_s` is the freshness floor on each engine. Snowflake's hourly batch means the "last 60 seconds" window simply has not been ingested; ClickHouse's 5-second segments have. Same SQL, same filter, completely different visible window.
- `primary_index_lookup_us` is where ClickHouse's RAM-resident sparse primary key beats Snowflake's zone-map. 8 microseconds vs 400 — both are fast, but the streaming engine pays nothing for the lookup because the index is already hot.
- `scan_rate_rows_per_ms` is roughly comparable; columnar scans are columnar scans. The wall is not scan rate; the wall is everything around the scan.
- `overhead_ms` is the kicker. Snowflake's warehouse spinup, query planner traversal, and metadata RPC dominate the budget — 1.2 seconds before any actual work. ClickHouse skips most of that because the engine is one process, not a service mesh. Why this is structural, not tuning: Snowflake separates compute from storage to support multi-tenant elastic scaling, which means every query crosses an RPC boundary. ClickHouse colocates compute and storage on each node, which means a query is a function call. The architecture choice is downstream of the product target — Snowflake serves thousands of customers per cluster; ClickHouse serves one workload per cluster. Neither is wrong; they answer different questions.
- The grouped output (`MH: 18203, KA: 14987, TN: 12211`) is the actual business answer that drives the prize-pool top-up decision. Maharashtra and Karnataka entries are coming in fast; the PM has 30 seconds to decide whether to extend the pool by ₹40 lakh. With the warehouse, that decision was being made on yesterday's number; with the streaming engine, on the actual number.
- The simulator is honest about what it is not modelling: concurrent queries (Pinot's QPS profile is its real strength), memory pressure under load (the p99 ballooning that chapter 110 will dissect), late-arriving events past the watermark (chapter 109 territory), and the cost-per-GB-month tradeoff of keeping 7 days hot in ClickHouse vs 7 years hot in Snowflake.
Storage layout — segments, marks, and the merge tree
Look at one ClickHouse MergeTree table on disk and you will see, for a single date partition, dozens to hundreds of small directories named `202604_NNN_NNN_K`, following the pattern `<partition>_<min_block>_<max_block>_<merge_level>`. Each directory is a part — a complete columnar slice with its own primary index, mark file, column data files, and per-column min/max. A background merge daemon inside the server process watches these parts grow in count and decides when to merge.
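A toy parser for that naming convention (illustrative only — real ClickHouse part names carry a few more variants, such as mutation suffixes):

```python
def parse_part_name(name: str) -> dict:
    """Split <partition>_<min_block>_<max_block>_<merge_level> from the right,
    so partition keys containing underscores still parse."""
    partition, min_block, max_block, level = name.rsplit("_", 3)
    return {"partition": partition, "min_block": int(min_block),
            "max_block": int(max_block), "merge_level": int(level)}

print(parse_part_name("202604_118_123_2"))
# merge_level 0 = fresh flush, never merged; higher levels = products of merges
```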
The merge policy controls the tradeoff between query speed (fewer, bigger parts scan faster) and ingest speed (smaller flush intervals produce parts faster). ClickHouse exposes `parts_to_throw_insert` (default 300, after which inserts are rejected — the backpressure signal) and `min_bytes_for_wide_part` (below which a part is written in the "compact" format, all columns in one file, to cut per-file overhead for tiny parts). Pinot does the same dance with its segment generation and segment relocation. Druid does it with its "incremental" segments handed off to "historical" nodes.
The merge process is what gives streaming OLAP its specific operational signature. If ingest spikes 4× (a Dream11 IPL final, a Flipkart Big Billion Days), the merge daemon falls behind, level-0 segment count grows, query latency degrades, and eventually the engine starts rejecting writes. The signal is parts_count per partition — Cred's on-call playbook for their Pinot tier has parts_count > 200 as a P2 alert and > 280 as a P1, because at 300 the engine starts dropping inserts and the fraud-detection lag balloons. PhonePe's ClickHouse cluster has the same alert tuned at 250.
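That playbook as a sketch — the thresholds are the Cred numbers quoted above, the function is hypothetical, and the right values are cluster-specific:

```python
PARTS_TO_THROW_INSERT = 300   # ClickHouse default: inserts beyond this fail

def parts_alert(parts_count: int) -> str:
    """Map a per-partition part count onto the escalation ladder above."""
    if parts_count >= PARTS_TO_THROW_INSERT:
        return "REJECTING: inserts now fail with 'Too many parts'"
    if parts_count > 280:
        return "P1: merge daemon badly behind; writes about to be rejected"
    if parts_count > 200:
        return "P2: merge daemon falling behind; investigate before it pages"
    return "ok"

for n in (150, 230, 290, 310):
    print(f"parts_count={n}: {parts_alert(n)}")
```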
Common confusions
- "ClickHouse is just Snowflake without the warehouse spin-up." It is not. The merge-tree storage layout, the always-resident sparse primary index, and the per-segment min/max files are different physics, not just different startup time. A Snowflake table copied byte-for-byte into ClickHouse would not query the same way; the layout itself is different.
- "Druid and Pinot do the same thing." They share the segment-based model but differ on the index strategy. Druid pre-computes per-dimension bitmap indices optimised for filter-then-aggregate. Pinot adds star-tree indices optimised for materialised-aggregation lookups (the dashboard-tile use case). Druid is older, Pinot's user-facing-analytics workload is its differentiator.
- "You can put OLTP on a streaming OLAP engine." No — these engines do not support multi-row transactions, point updates by primary key (Pinot has limited UPSERT, ClickHouse has
ReplacingMergeTreefor eventual dedup, neither is OLTP), or the read-your-writes guarantee an OLTP store provides. Use Postgres or DynamoDB for OLTP, period. - "Streaming OLAP replaces the warehouse." It serves a different workload — sub-second freshness for in-loop decisions. Long-tail historical queries (cohort analysis over 3 years), arbitrary joins, cost-per-TB-month archive: still warehouse territory. Two-tier exists because the cost curves cross — see chapter 103 for the IRCTC/Dream11 economics.
- "The merge tree is just LSM-tree compaction renamed." Mechanically yes, semantically no — LSM-tree compaction in RocksDB is to support point lookups by key; merge-tree compaction in ClickHouse is to support range scans by sort key. Both are append-then-merge but the index built on top is structured for different access patterns.
- "More RAM = faster queries — just throw hardware at it." Up to a point, yes. Past the working set size (the index plus the hottest segments), more RAM does nothing — you are CPU-bound on the bitmap AND/OR or scan rate. PhonePe's 2024 incident report noted that doubling cluster RAM from 1 TB to 2 TB improved p50 by 8% but did nothing for p99. The fix was reducing dimension cardinality, not adding hardware.
Going deeper
Why the merge daemon's backpressure is the operational signal
If you only watch one number on a streaming OLAP cluster, watch this one: the count of unmerged level-0 segments per partition. ClickHouse exposes it as `system.parts` filtered by `level = 0`; Pinot exposes it as `realtime_segment_count`. If this count grows linearly while ingest is steady, your merge daemon is falling behind — either disk IO is saturated (cheap fix: faster disks), or merge concurrency is too low (cheap fix: bump `background_pool_size`), or your sort key has too many distinct values for efficient merging (expensive fix: redesign the schema). Why this is the right canary: query latency, ingest rate, and disk fullness are all consequences of the merge daemon's health. By the time query latency spikes, the merge daemon has been falling behind for tens of minutes. Watching the segment count is watching the leading indicator. Cred's incident postmortem (Q2 2024) traced a 47-minute fraud-detection blackout to a misconfigured merge daemon — the segment count had been climbing for 35 minutes before query latency was visibly bad.
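A minimal polling loop for this signal, assuming the clickhouse-driver Python package, a local server, and the (hypothetical) table name from the opening scene:

```python
# Poll the leading indicator every 30 s. Assumes `pip install clickhouse-driver`.
import time
from clickhouse_driver import Client

client = Client("localhost")
SQL = """
    SELECT partition, count() AS level0_parts
    FROM system.parts
    WHERE active AND level = 0 AND table = 'live_contest_entries'
    GROUP BY partition
    ORDER BY level0_parts DESC
    LIMIT 5
"""

while True:
    for partition, level0_parts in client.execute(SQL):
        # The trend matters more than any single sample: linear growth under
        # steady ingest means the merge daemon is falling behind.
        print(f"{partition}: {level0_parts} unmerged level-0 parts")
    time.sleep(30)
```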
Pinot's segment generation — offline vs realtime
Pinot pioneered an explicit split: a single table has both OFFLINE segments (built from batch sources, S3 Parquet) and REALTIME segments (built from a Kafka topic). Queries hit both transparently. The trick is that the realtime segments get converted into offline segments after a configured retention window — typically a day — at which point the realtime resources (Kafka offsets, in-memory buffers) are reclaimed. This is how Pinot keeps a 7-year history queryable without holding it in the realtime tier's RAM. ClickHouse and Druid do the same thing through different mechanisms (MaterializedView plus MergeTree for ClickHouse, segment handoff to historicals for Druid), but Pinot's split is the cleanest illustration of the underlying pattern: every byte of data eventually transitions from realtime to offline storage; the transition is when freshness stops mattering and storage cost starts mattering.
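A sketch of the broker-side split, under an assumed single time boundary. The names are illustrative, not Pinot's internals, and the real broker also rewrites the time predicate on each side so overlapping rows are not double counted:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    tier: str      # "OFFLINE" or "REALTIME"
    min_ts: int    # epoch seconds covered by this segment
    max_ts: int

def route(segments: list[Segment], q_from: int, q_to: int,
          time_boundary: int) -> list[Segment]:
    """Answer [q_from, q_to] as a union: OFFLINE segments serve rows at or
    before the boundary, REALTIME segments serve rows after it."""
    picked = []
    for s in segments:
        if s.max_ts < q_from or s.min_ts > q_to:
            continue                          # no overlap with the query window
        if s.tier == "OFFLINE" and s.min_ts <= time_boundary:
            picked.append(s)
        elif s.tier == "REALTIME" and s.max_ts > time_boundary:
            picked.append(s)
    return picked

segs = [Segment("OFFLINE", 0, 86_400),
        Segment("OFFLINE", 86_400, 172_800),
        Segment("REALTIME", 172_000, 172_900)]
print(route(segs, q_from=80_000, q_to=172_850, time_boundary=172_800))
```

When the retention window lapses and a realtime segment is converted, the only thing that changes in this picture is `time_boundary` moving forward — the query shape never changes.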
What "exactly-once" looks like at the segment boundary
Both ClickHouse and Pinot consume from Kafka by tracking per-partition offsets. The standard "at-least-once" pattern is: read a batch from Kafka, append it to the active segment, commit the offset. If the writer crashes mid-segment, the offset has not advanced and the events get re-read on restart — which means they are appended again, producing duplicates. ClickHouse's solution is block-content deduplication on inserts (duplicate blocks hash to the same id and are silently dropped). Pinot's is the `segmentName` derived from the Kafka offset range, plus a pauseless consumer that handles handoff cleanly. Both are workable but neither is "free exactly-once" — they are exactly-once for the OLAP store given idempotent producer semantics on the Kafka side. Build 9's exactly-once chapters spell out the chain of guarantees needed end-to-end.
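The dedup idea at the segment boundary, as a toy — in-memory, with SHA-256 standing in for whatever block hash the engine actually uses. Replaying a batch after a crash produces the same block id, so the append is skipped instead of duplicated:

```python
import hashlib

segment: list[bytes] = []          # the active in-memory segment
seen_block_ids: set[str] = set()   # dedup table, keyed by block content hash
committed_offset = 0

def block_id(events: list[bytes]) -> str:
    h = hashlib.sha256()
    for e in events:
        h.update(e)
    return h.hexdigest()

def ingest(events: list[bytes], next_offset: int) -> None:
    """At-least-once Kafka reads made idempotent: a replayed block hashes to
    the same id and is dropped instead of being appended twice."""
    global committed_offset
    bid = block_id(events)
    if bid not in seen_block_ids:
        seen_block_ids.add(bid)
        segment.extend(events)
    committed_offset = next_offset   # commit only after the append is durable

batch = [b"entry:12:MH", b"entry:12:KA"]
ingest(batch, next_offset=2)
ingest(batch, next_offset=2)         # simulated crash-replay of the same batch
print(len(segment), committed_offset)  # -> 2 2: no duplicates in the segment
```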
Why high-cardinality dimensions wreck a streaming OLAP engine
A "dimension" in OLAP vocabulary is a column you GROUP BY. Streaming engines build per-dimension indices; the size of those indices is roughly linear in cardinality. A dimension with 50 distinct values (Indian states) is fine. A dimension with 10 crore distinct values (user_id) is a disaster — the bitmap indices are bigger than the data. PhonePe's UPI dashboard team learned this the painful way in 2023: they tried to put payee_vpa (high-cardinality string) as a Druid dimension, watched their Druid cluster's RAM grow to 11 TB, and had to redesign the schema to use a hashed bucket of payee_vpa instead. The general rule is: dimensions ≤ 10,000 distinct values are cheap, 10k–1M is workable with care, > 1M needs a different strategy (sketch types like HyperLogLog, or pushing the high-cardinality column out of the dimension axis entirely).
Where this leads next
Now that the four-axis split is concrete, the next chapters of Build 14 each take one engine and walk through how it actually handles the four axes:
- /wiki/clickhouse-columnar-for-real-time — ClickHouse's MergeTree, the sparse primary key, and why mark caches matter.
- /wiki/apache-pinot-and-the-uber-use-case — why Pinot dominates user-facing analytics dashboards and what star-tree indices actually buy you.
- /wiki/druid-and-its-segment-model — Druid's segment-handoff architecture and why it is still the canonical reference design.
- /wiki/starrocks-doris-and-the-next-wave — the newer Chinese-origin engines reframing the same physics.
- /wiki/pre-aggregation-materialized-views-and-their-costs — when to compute aggregations at write time instead of read time, and what you give up.
- /wiki/serving-p99-latency-under-ingest-pressure — the production reality where ingest spikes meet query SLOs.
The rest of Build 14 is, in effect, a working out of the four axes against three engines and one set of tradeoffs. Once you have the axis vocabulary, the engine differences become small.
References
- ClickHouse — MergeTree storage engine — the canonical merge-tree implementation, with on-disk layout details.
- Apache Pinot — Architecture and segment generation — Pinot's offline vs realtime segment model.
- Apache Druid — Segments and ingestion — Druid's segment lifecycle and handoff.
- Uber Engineering — Operating Apache Pinot at Uber — the case study that shaped the user-facing-analytics vocabulary.
- O'Neil et al. — The Log-Structured Merge Tree (1996) — the foundational paper, useful for grounding the merge-tree intuition.
- ClickHouse Performance benchmarks — public, reproducible benchmarks across engines.
- /wiki/wall-batch-metrics-arent-fresh-enough — the previous chapter, the Build 13/14 boundary that motivates this entire section.
- /wiki/clickhouse-columnar-for-real-time — the next chapter, ClickHouse's specific take on these four axes.