OLAP on a stream: what's different

It is 21:47 IST on a Saturday. RCB are chasing 198 against CSK at the Chinnaswamy and Aditi's team at Dream11 has just deployed the new contest-pool dashboard onto the real-time tier. The dbt YAML for live_contest_entries is the same one Build 13 shipped — same SQL, same filters, same entry_fee_paise definition. But the query plan that runs against ClickHouse looks nothing like the one Snowflake compiled last quarter. The same metric, the same answer, but a different storage layout, a different index structure, a different flush cadence, a different memory model. This chapter is the part-by-part comparison: what an OLAP engine has to give up about the warehouse world, and what it gets back, when it agrees to answer questions about a stream that has not stopped moving.

A batch OLAP warehouse and a streaming OLAP engine share a SQL surface but disagree on every layer underneath. The warehouse optimises for huge scans over closed columnar files; the streaming engine optimises for sub-second answers over an open, growing index. The four axes that diverge are ingest cadence (whole-file batch vs continuous append), index residency (compute-time vs always-live), storage layout (one big file vs many small segments merged in the background), and query lifecycle (one-shot scan vs incremental over the latest segment). Everything Build 14 covers is a specific engineering response to one of those four axes.

The four axes where the two universes diverge

Start with what is the same: both worlds are columnar (column-major byte layout for fast scan), both speak SQL, both run aggregations as parallel work over partitioned data. If you walk a Snowflake EXPLAIN and a ClickHouse EXPLAIN next to each other, the operators have the same names — Filter, Aggregate, HashJoin, Sort. That surface similarity is what makes the marketing copy ("ClickHouse is just a fast warehouse!") plausible. It is also what makes the underlying physics confusing.

The four axes where the two universes meaningfully diverge are: ingest cadence, index residency, storage layout, and query lifecycle. Every concrete difference in the rest of this chapter — and the rest of Build 14 — is a working out of one of these four.

The four axes where batch warehouse and streaming OLAP diverge

axis              what's being decided          batch warehouse (Snowflake / BigQuery)   streaming OLAP (ClickHouse / Pinot / Druid)
ingest cadence    how often new data appears    whole files, 30–60 min                   continuous append, 1–10 s
index residency   when the index gets built     at write time, frozen                    always-live, mutable
storage layout    how files are organised       few large micro-partitions               many small segments + background merge
query lifecycle   how a query runs              one-shot scan over closed files          historical + open segment, incremental
Same SQL surface, different physics. The right column is what every Build 14 chapter is unpacking: how to keep an index live as a stream lands, how to merge small segments without blocking queries, how to query an open segment that is still being written to.

Why these four and not, say, "memory vs disk" or "OLTP vs OLAP": the four axes are the ones a streaming engine has to design differently from a warehouse to honour the freshness floor of the previous chapter. Memory-vs-disk is a tactic; OLTP-vs-OLAP is the wrong dichotomy (both worlds are OLAP). The axis split is what produces the chapter list of Build 14 — ClickHouse's MergeTree (axis 3), Pinot's segment-based ingest (axis 1), Druid's segment compaction (axis 3), pre-aggregation (axis 2), p99 under load (axis 4).

Ingest cadence — why a streaming engine cannot wait for the file

The previous chapter showed why Snowflake's batch ingest cadence is structural — row-group statistics need ~1000+ rows to be useful, so writing a Parquet file every 5 minutes destroys query performance. The streaming engine has the same underlying problem (small files are expensive to scan) but solves it on a different timescale: write small segments fast, merge them in the background, query the union.

Concretely: ClickHouse's MergeTree engine, Pinot's offline-and-realtime segments, and Druid's incremental + handoff segments all do the same dance. A new event arrives, gets buffered in memory for a few seconds (the configurable flush interval), is written as a small immutable segment, becomes immediately queryable. Behind the scenes a background process picks pairs (or N-tuples) of small segments and merges them into bigger ones — exactly as an LSM-tree compactor does for a key-value store. The merge does not block ingest; the merge does not block queries; the query just has to know how to read both old big segments and new small ones in the same SELECT.

That is the entire trick. The reader who has used a Postgres BRIN index, a RocksDB instance, or a Lucene segment-merging Elasticsearch cluster has already seen this pattern. Why the same pattern keeps surfacing: append-only writes plus background merges is the one architecture that decouples ingest cadence from query-optimal file size. If you ingest faster than you can compact, the segments stay small and queries slow down. If you want to query faster than you can compact, you cap how small segments can be — which is the freshness floor. ClickHouse, Pinot and Druid all expose flush-interval knobs (under different names) that control exactly this tradeoff; the default values (1–10 s) reflect the band where most operational analytics lives.
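The tradeoff can be made concrete with a toy cost model. Everything here is illustrative — the per-segment overhead and merge window are made-up numbers, not any engine's actual accounting — but the shape is the point: staleness equals the flush interval, and query overhead grows as the flush interval shrinks.

```python
# Toy model of the flush-interval tradeoff (illustrative numbers, not engine code).
# Assumption: the merger keeps up, so the number of unmerged level-0 segments
# a query must touch is merge_window / flush_interval, and each segment costs
# a fixed per-segment overhead to open and filter.

def query_overhead_ms(flush_interval_s: float,
                      merge_window_s: float = 300.0,
                      per_segment_ms: float = 0.5) -> float:
    """Per-query cost of the level-0 tail for a given flush cadence."""
    level0_segments = merge_window_s / flush_interval_s
    return level0_segments * per_segment_ms

for flush_s in (0.5, 1, 5, 10, 60):
    print(f"flush every {flush_s:>5}s -> staleness {flush_s:>5}s, "
          f"level-0 query overhead {query_overhead_ms(flush_s):6.1f} ms")
```

Read the two columns against each other: flushing every 500 ms buys half-second freshness but hundreds of milliseconds of per-query segment overhead; flushing every minute inverts the pain. The 1–10 s defaults sit where both costs are tolerable.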

Index residency — the index is alive while you query

In Snowflake, every micro-partition has its zone-map (per-column min/max, distinct-count estimate, null-count) computed at the moment the partition is finalised and then never changed. Querying is "look up zone-map metadata, decide which micro-partitions to read, scan them". The index is dead — frozen at write time and only rebuilt by full-table operations. This is correct for batch: if the data does not change, the index does not need to.

In a streaming OLAP engine, that does not work. New segments arrive every second; queries land continuously. The index must be alive — receiving updates as segments flush, getting trimmed as old segments expire, supporting concurrent reads while it is being mutated. ClickHouse exposes this as the primary key (a sparse index where every Nth row's key is held in memory) plus per-segment min/max files; Pinot exposes it as star-tree, sorted, range, and inverted indices that are rebuilt per segment and held in memory across all segments. Druid exposes a bitmap index per dimension column, also held in memory.
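The per-dimension bitmap idea is small enough to sketch. This is a toy using Python integers as bitmaps — real engines use compressed roaring-style bitmaps and per-segment structures — but it shows the two properties the text describes: the index mutates one bit at a time as rows land, and a filter is answered with a bitwise AND in CPU.

```python
from collections import defaultdict

class BitmapIndex:
    """One bitmap (a Python int) per distinct value of one dimension."""
    def __init__(self):
        self.bitmaps = defaultdict(int)
        self.n_rows = 0

    def append(self, value):
        # Live index: each new row flips exactly one bit; no rebuild needed.
        self.bitmaps[value] |= 1 << self.n_rows
        self.n_rows += 1

    def rows_matching(self, value) -> int:
        return self.bitmaps.get(value, 0)

state_idx, contest_idx = BitmapIndex(), BitmapIndex()
rows = [("KA", 12), ("MH", 12), ("KA", 7), ("KA", 12), ("TN", 12)]
for st, cid in rows:
    state_idx.append(st)
    contest_idx.append(cid)

# WHERE state='KA' AND contest_id=12 -> AND the two bitmaps, count set bits.
hits = state_idx.rows_matching("KA") & contest_idx.rows_matching(12)
print(bin(hits), bin(hits).count("1"))  # 0b1001 2 -> rows 0 and 3 match
```

The AND itself is pure CPU over in-memory words — no disk seek per query — which is exactly why holding these bitmaps resident is the price of a 200 ms p99.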

Memory residency matters because in a 200 ms p99 budget you cannot pay one disk seek per query. Pinot's user-facing analytics use case at Uber is the reference here — 100k QPS at p99 < 50 ms — and the only way you hit that is by holding every dimension's bitmap in RAM and doing the AND/OR of bitmaps in CPU. Druid's segment cache, ClickHouse's mark cache, Pinot's index handler all converge on the same idea: the index is a working set, not a metadata blob. The numbers below come from the public Pinot configuration guide and the Druid operations docs; they are not idealised.

engine       what's resident in RAM                                        typical RAM per 100 GB of data
Snowflake    zone-map metadata only (KB–MB per micro-partition)            negligible
ClickHouse   mark cache + primary-key sparse index + uncompressed cache    ~5–10 GB
Pinot        per-column inverted/range/sorted indices, dictionaries        ~15–25 GB
Druid        bitmap indices, dictionary encodings, segment cache           ~10–20 GB

The price of keeping the index alive is paid in RAM. ClickHouse's lighter footprint is why operators often pick it for high-cardinality scan-heavy workloads (Cred's fraud rules); Pinot's heavier footprint is why it dominates dashboard-style point-and-aggregate (Uber's UberEats analytics, LinkedIn's profile views, Stripe's merchant dashboards).

A small simulator: the same query, two universes

Here is the cleanest way to feel the difference — a single Python simulator that models both architectures answering the same question: "how many entries in the last 60 seconds, grouped by state, where contest_id = 12?" The numbers are tuned to public production data: Snowflake's typical query overhead and scan rate per warehouse-credit, ClickHouse's typical merge cadence and primary-index lookup speed.

# Two storage engines, one query, one stream.
# Question: "SELECT state, COUNT(*) FROM entries
#            WHERE contest_id=12 AND ts > now() - INTERVAL 60 SECOND
#            GROUP BY state"

import random, time
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Engine:
    name: str
    flush_interval_s: float       # how often new data becomes queryable
    primary_index_lookup_us: float # microseconds to find segments matching contest_id=12
    scan_rate_rows_per_ms: float  # how fast it can scan resolved rows
    overhead_ms: float            # query planning + cluster spin

# Tuned to public benchmarks (rough orders, not precise).
warehouse = Engine("Snowflake (batch)",
                   flush_interval_s=3600,        # hourly batch
                   primary_index_lookup_us=400,  # zone-map filtering, mid
                   scan_rate_rows_per_ms=400,    # large warehouse, parallel
                   overhead_ms=1200)              # planner + spinup

streaming = Engine("ClickHouse (MergeTree)",
                   flush_interval_s=5,           # 5-second segments
                   primary_index_lookup_us=8,    # sparse primary key in RAM
                   scan_rate_rows_per_ms=900,    # mark-cache hot path
                   overhead_ms=12)                # no spin, no planner walk

# Stream: 2,500 entries/sec across 10 Indian states, contest_id=12 dominant.
def make_stream(seconds: int = 600) -> list[tuple[float,int,str]]:
    out = []
    states = ["MH","KA","TN","DL","UP","WB","GJ","TS","KL","RJ"]
    for i in range(seconds):
        for _ in range(2500):
            out.append((time.time() - (seconds-i),
                        12 if random.random() < 0.7 else random.randint(1,20),
                        random.choice(states)))
    return out

def query(engine: Engine, stream: list, now_ts: float) -> tuple[dict, float]:
    visible_through = now_ts - engine.flush_interval_s
    relevant = [r for r in stream if r[1] == 12 and visible_through-60 <= r[0] < visible_through]
    counts = defaultdict(int)
    for _, _, st in relevant: counts[st] += 1
    scan_ms = max(1, len(relevant) / engine.scan_rate_rows_per_ms)
    total_ms = engine.overhead_ms + engine.primary_index_lookup_us/1000 + scan_ms
    return dict(counts), total_ms

stream = make_stream(seconds=600)
now = time.time()
for eng in (warehouse, streaming):
    result, latency_ms = query(eng, stream, now)
    total = sum(result.values())
    print(f"{eng.name:32s} -> {total:>7,d} entries  "
          f"top: {dict(sorted(result.items(), key=lambda x:-x[1])[:3])}  "
          f"latency {latency_ms:6.1f} ms")
# Sample run during a live RCB-CSK chase (exact counts vary run to run):
# Snowflake (batch)                ->       0 entries  top: {}  latency 1201.4 ms
# ClickHouse (MergeTree)           -> 105,317 entries  top: {'MH': 18203, 'KA': 14987, 'TN': 12211}  latency  129.0 ms

Walk the simulator carefully — two lines do most of the work. visible_through = now_ts - engine.flush_interval_s is the freshness floor: for the hourly-batch warehouse, the last hour of the stream does not exist yet, so a question about "the last 60 seconds" returns zero rows. And the total_ms sum shows why the warehouse loses even when it does have data: 1,200 ms of fixed planning and spin-up overhead dwarfs both the index lookup and the scan at this scale.

Storage layout — segments, marks, and the merge tree

Look at one ClickHouse MergeTree table on disk and you will see, for a single date partition, dozens to hundreds of small directories named 202604_NNN_NNN_K — that is <partition>_<min_block>_<max_block>_<merge_level>. Each directory is a part — a complete columnar slice with its own primary index, mark file, column data files, and per-column min/max. The MergeTree engine runs an in-process background merge pool that watches these parts grow in count and decides when to merge.
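The naming scheme is mechanical enough to parse. A small sketch against the <partition>_<min_block>_<max_block>_<merge_level> pattern described above — the concrete names here are invented examples of that pattern, not paths from a real cluster:

```python
from typing import NamedTuple

class Part(NamedTuple):
    partition: str
    min_block: int
    max_block: int
    level: int

def parse_part_name(name: str) -> Part:
    # e.g. "202604_1_17_2": partition 202604, blocks 1..17, merged twice.
    partition, lo, hi, level = name.rsplit("_", 3)
    return Part(partition, int(lo), int(hi), int(level))

p = parse_part_name("202604_1_17_2")
print(p)  # Part(partition='202604', min_block=1, max_block=17, level=2)

# A freshly flushed part covers exactly one block and has level 0.
print(parse_part_name("202604_18_18_0").level)  # 0
```

The block range is what makes merges auditable from the filesystem alone: a level-2 part covering blocks 1..17 tells you exactly which level-0 flushes it absorbed.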

The merge policy is what controls the tradeoff between query speed (fewer, bigger parts is faster to scan) and ingest speed (smaller flush intervals produce parts faster). ClickHouse exposes parts_to_throw_insert (default 300, after which inserts get rejected — the backpressure signal) and min_bytes_for_wide_part (below which a part is written in the "compact" format, all columns in one file, which is cheaper for tiny parts). Pinot does the same dance with its segment generation and segment relocation. Druid does it with its "incremental" segments handed off to "historical" nodes.

A streaming OLAP merge tree at one moment in time

Kafka → memory buffer → flush every 5 s → level 0 (1–10 MB per segment, hot)
background merge: level 0 pairs → level 1 (50–100 MB per segment, warm)
background merge: level 1 pairs → level 2 (500 MB+ per segment, cold)
A SELECT scans levels 0+1+2 in parallel and unions the results — the index lives across all three.
Level 0 segments are seconds old, levels 1 and 2 are minutes to hours old. The merge tree is what turns a continuous stream of small writes into a query-optimal layout, asynchronously. Pinot's "realtime" vs "offline" segments and Druid's "incremental" vs "historical" segments are the same idea wearing different vocabulary.

The merge process is what gives streaming OLAP its specific operational signature. If ingest spikes 4× (a Dream11 IPL final, a Flipkart Big Billion Days), the merge daemon falls behind, level-0 segment count grows, query latency degrades, and eventually the engine starts rejecting writes. The signal is parts_count per partition — Cred's on-call playbook for their Pinot tier has parts_count > 200 as a P2 alert and > 280 as a P1, because at 300 the engine starts dropping inserts and the fraud-detection lag balloons. PhonePe's ClickHouse cluster has the same alert tuned at 250.
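That failure arithmetic is easy to reproduce in a toy simulation. All the rates below are made up for illustration; the 300-part cap mirrors the parts_to_throw_insert-style rejection threshold described above.

```python
# Toy backpressure simulation: ingest creates level-0 parts, a merge daemon
# retires them at a fixed rate, and inserts past a hard cap are rejected.
# All rates are illustrative, not tuned to any real cluster.

def simulate(minutes: int, spike_at: int, spike_factor: int = 4,
             base_parts_per_min: int = 12, merge_per_min: int = 20,
             cap: int = 300):
    parts, rejected = 0, 0
    history = []
    for t in range(minutes):
        incoming = base_parts_per_min * (spike_factor if t >= spike_at else 1)
        if parts + incoming > cap:           # the parts_to_throw_insert moment
            rejected += parts + incoming - cap
            incoming = cap - parts
        parts = max(0, parts + incoming - merge_per_min)
        history.append(parts)
    return history, rejected

history, rejected = simulate(minutes=60, spike_at=20)
print("steady-state parts before the spike:", history[19])  # 0
print("parts 20 minutes into the spike:", history[39])      # 280
print("rejected inserts over the hour:", rejected)          # 840
```

The shape matches the incident pattern in the text: the merger comfortably absorbs baseline ingest, the 4× spike makes part count climb linearly for minutes, and only then does the hard cap start rejecting writes — which is why the part count, not the rejection rate, is the alert to page on.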

Going deeper

Why the merge daemon's backpressure is the operational signal

A streaming OLAP cluster has one number that matters more than any other — if you only watch one number, watch this: the count of unmerged level-0 segments per partition. ClickHouse exposes it via the system.parts table filtered to active parts with level = 0; Pinot exposes it as realtime_segment_count. If this count grows linearly while ingest is steady, your merge daemon is falling behind — either disk IO is saturated (cheap fix: faster disks), or merge concurrency is too low (cheap fix: bump background_pool_size), or your sort key has too many distinct values for efficient merging (expensive fix: redesign the schema). Why this is the right canary: query latency, ingest rate, and disk fullness are all consequences of the merge daemon's health. By the time query latency spikes, the merge daemon has been falling behind for tens of minutes. Watching the segment count is watching the leading indicator. Cred's incident postmortem (Q2 2024) traced a 47-minute fraud-detection blackout to a misconfigured merge daemon — the segment count had been climbing for 35 minutes before query latency was visibly bad.
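The "grows linearly while ingest is steady" check can be sketched as a slope test. The least-squares fit and the 1-part-per-minute threshold are illustrative choices of mine, not from any engine or any real playbook:

```python
# Leading-indicator sketch: alert when the level-0 part count has a sustained
# positive slope over the sampling window. Threshold is illustrative.

def merge_falling_behind(samples: list[int], min_slope: float = 1.0) -> bool:
    """samples: level-0 part counts, one per minute, oldest first."""
    if len(samples) < 2:
        return False
    n = len(samples)
    mean_t = (n - 1) / 2
    mean_y = sum(samples) / n
    # Least-squares slope in parts per minute.
    num = sum((t - mean_t) * (y - mean_y) for t, y in enumerate(samples))
    den = sum((t - mean_t) ** 2 for t in range(n))
    return num / den >= min_slope

print(merge_falling_behind([40, 41, 39, 42, 40, 41]))   # False -> flat, healthy
print(merge_falling_behind([40, 52, 63, 77, 90, 104]))  # True  -> climbing
```

A slope test rather than an absolute threshold is what makes this a leading indicator: the second series trips the alert at a part count of ~100, long before a 300-part rejection cap is in sight.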

Pinot's segment generation — offline vs realtime

Pinot pioneered an explicit split: a single table has both OFFLINE segments (built from batch sources, S3 Parquet) and REALTIME segments (built from a Kafka topic). Queries hit both transparently. The trick is that the realtime segments get converted into offline segments after a configured retention window — typically a day — at which point the realtime resources (Kafka offsets, in-memory buffers) are reclaimed. This is how Pinot keeps a 7-year history queryable without holding it in the realtime tier's RAM. ClickHouse and Druid do the same thing through different mechanisms (MaterializedView plus MergeTree for ClickHouse, segment handoff to historicals for Druid), but Pinot's split is the cleanest illustration of the underlying pattern: every byte of data eventually transitions from realtime to offline storage; the transition is when freshness stops mattering and storage cost starts mattering.
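The retention-driven transition reduces to a one-pass age check. A minimal sketch, assuming a one-day realtime window — the Segment shape and run_retention name are mine, not Pinot's API:

```python
import time
from dataclasses import dataclass

RETENTION_S = 86_400   # illustrative one-day realtime retention window

@dataclass
class Segment:
    created_at: float
    tier: str = "REALTIME"

def run_retention(segments: list["Segment"], now: float) -> int:
    """Move realtime segments past the retention window to the offline tier."""
    moved = 0
    for seg in segments:
        if seg.tier == "REALTIME" and now - seg.created_at > RETENTION_S:
            # In Pinot this is where Kafka offsets and in-memory
            # consuming buffers for the segment get reclaimed.
            seg.tier = "OFFLINE"
            moved += 1
    return moved

now = time.time()
segs = [Segment(now - 2 * 86_400), Segment(now - 3600)]
print(run_retention(segs, now), [s.tier for s in segs])
# 1 ['OFFLINE', 'REALTIME']
```

The only decision in the loop is the one the text names: age past the window means freshness has stopped mattering, so the byte moves to the tier where storage cost matters instead.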

What "exactly-once" looks like at the segment boundary

Both ClickHouse and Pinot consume from Kafka by tracking per-partition offsets. The standard "at-least-once" pattern is: read a batch from Kafka, append it to the active segment, commit the offset. If the segment crashes mid-write, the offset has not advanced and the events get re-read on restart — which means they are appended again, producing duplicates. ClickHouse's solution is the _block_id deduplication on inserts (a 64-bit hash of the block content; duplicate blocks are silently dropped). Pinot's is the segmentName derived from the Kafka offset range, plus a pauseless consumer that handles handoff cleanly. Both are workable but neither is "free exactly-once" — they are exactly-once for the OLAP store given idempotent producer semantics on the Kafka side. Build 9's exactly-once chapters spell out the chain of guarantees needed end-to-end.
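The block-hash idea is small enough to sketch. This is a toy version of content-hash insert deduplication — the real ClickHouse mechanism hashes data blocks server-side and keeps a bounded window of recent block ids, and the class and window size here are illustrative:

```python
import hashlib
from collections import OrderedDict

class DedupingStore:
    """Append-only store that silently drops blocks it has already seen."""
    def __init__(self, window: int = 100):
        self.seen = OrderedDict()   # bounded window of recent block ids
        self.window = window
        self.rows = []

    def insert_block(self, rows: list) -> bool:
        # Block id = hash of the block's content, so a replayed batch
        # (same rows, same order) produces the same id.
        block_id = hashlib.sha256(repr(rows).encode()).hexdigest()
        if block_id in self.seen:
            return False            # replayed batch: dropped, no duplicates
        self.seen[block_id] = True
        if len(self.seen) > self.window:
            self.seen.popitem(last=False)   # forget the oldest id
        self.rows.extend(rows)
        return True

store = DedupingStore()
batch = [(1, "KA"), (2, "MH")]
print(store.insert_block(batch))  # True  -- first delivery
print(store.insert_block(batch))  # False -- Kafka re-delivery after a crash
print(len(store.rows))            # 2, not 4
```

Note what the bounded window implies: a replay arriving after the window has rotated is no longer caught, which is one reason this is exactly-once for the OLAP store only under well-behaved replay semantics upstream.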

Why high-cardinality dimensions wreck a streaming OLAP engine

A "dimension" in OLAP vocabulary is a column you GROUP BY. Streaming engines build per-dimension indices; the size of those indices is roughly linear in cardinality. A dimension with 50 distinct values (Indian states) is fine. A dimension with 10 crore distinct values (user_id) is a disaster — the bitmap indices are bigger than the data. PhonePe's UPI dashboard team learned this the painful way in 2023: they tried to put payee_vpa (high-cardinality string) as a Druid dimension, watched their Druid cluster's RAM grow to 11 TB, and had to redesign the schema to use a hashed bucket of payee_vpa instead. The general rule is: dimensions ≤ 10,000 distinct values are cheap, 10k–1M is workable with care, > 1M needs a different strategy (sketch types like HyperLogLog, or pushing the high-cardinality column out of the dimension axis entirely).
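Back-of-envelope arithmetic shows why. This uses a deliberately naive uncompressed-bitmap model — real engines use compressed roaring-style bitmaps that shrink sparse bitmaps dramatically — but the linear-in-cardinality shape is what survives compression, and the shape is the problem:

```python
def naive_bitmap_index_bytes(n_rows: int, cardinality: int) -> int:
    # One bitmap of n_rows bits per distinct value: linear in cardinality.
    return cardinality * (n_rows // 8)

rows = 1_000_000_000   # 100 crore rows
for name, card in [("state", 50), ("city", 10_000),
                   ("payee_vpa", 100_000_000)]:
    gib = naive_bitmap_index_bytes(rows, card) / 2**30
    print(f"{name:10s} cardinality {card:>11,} -> {gib:16,.1f} GiB of bitmaps")
```

Even with compression knocking orders of magnitude off these numbers, the hundred-million-value dimension stays hopeless while the 50-value dimension stays free — which is why the fix is structural (hashed buckets, sketches, or moving the column out of the dimension axis), not a tuning knob.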

Where this leads next

Now that the four-axis split is concrete, the next chapters of Build 14 each take one engine and walk through how it actually handles the four axes.

The rest of Build 14 is, in effect, a working out of the four axes against three engines and one set of tradeoffs. Once you have the axis vocabulary, the engine differences become small.
