Pre-aggregation, materialized views, and their costs
Riya runs analytics at Cred. The "weekly active members" tile on the leadership dashboard answers in 90 ms; the dashboard refreshes every 30 seconds across 200 tabs open in Bengaluru and Gurugram. The query underneath that tile, if she ran it raw, would scan 4.8 billion event rows over a sliding 7-day window and group by member tier — about 14 seconds on a hot ClickHouse cluster, 90 seconds cold. The 90 ms answer comes from a pre-aggregated table, refreshed every two minutes by a scheduled job, that holds 1,400 rows: one per (tier, day) pair. The dashboard hits the small table; the small table hits the large table; the large table is rebuilt by a job Riya has to keep alive. Pre-aggregation buys the latency. It also creates four new problems she did not have before.
A materialised view is a query whose result is stored as a table and refreshed when the inputs change; the optimiser can rewrite future queries to read the small result instead of the large input. Every OLAP engine implements this pattern with different trade-offs — ClickHouse projections, StarRocks async MVs, Pinot star-trees, Druid rollup — but the costs are the same: refresh latency that staleness-bounds the dashboard, write amplification that bills your S3, and a query-rewrite layer that fails silently when it cannot match.
What pre-aggregation actually is
Riya's raw fact table is card_events with columns (member_id, tier, event_type, event_time, amount_paise). The dashboard tile asks: for the last 7 days, by tier, how many distinct members were active and what was the total reward GMV? Written as SQL:
SELECT tier,
COUNT(DISTINCT member_id) AS wam,
SUM(amount_paise) / 100 AS gmv_rupees
FROM card_events
WHERE event_time >= now() - INTERVAL 7 DAY
GROUP BY tier;
Run against the 4.8-billion-row fact table, this is a wide scan: 7 days × ~700M events/day × ~80 bytes/row of touched columns ≈ 380 GB. Even a vectorised engine on hot NVMe scans ~10 GB/sec/node; on an 8-node ClickHouse cluster that's ~5 seconds at the absolute floor, and 14 seconds in practice once predicate evaluation, distinct-count merging, and result materialisation are added.
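The scan-cost arithmetic above is worth making explicit. A minimal sketch, using the chapter's assumed figures (rows/day, bytes/row, per-node scan rate) rather than measured numbers:

```python
# Back-of-envelope scan-cost model for the raw dashboard query.
# All inputs are the chapter's assumed figures, not benchmarks.
ROWS_PER_DAY = 700_000_000      # events/day
BYTES_PER_ROW = 80              # touched columns only
WINDOW_DAYS = 7
SCAN_GBPS_PER_NODE = 10         # hot-NVMe vectorised scan rate
NODES = 8

def scan_floor_seconds(rows_per_day=ROWS_PER_DAY, days=WINDOW_DAYS,
                       bytes_per_row=BYTES_PER_ROW,
                       gbps_per_node=SCAN_GBPS_PER_NODE, nodes=NODES):
    """Lower bound: bytes touched divided by aggregate scan bandwidth."""
    gb_touched = rows_per_day * days * bytes_per_row / 1e9
    return gb_touched / (gbps_per_node * nodes)

gb_touched = ROWS_PER_DAY * WINDOW_DAYS * BYTES_PER_ROW / 1e9  # ~392 GB
floor = scan_floor_seconds()                                   # ~4.9 s
```

The gap between the ~5-second floor and the observed 14 seconds is the per-query overhead the text names: predicate evaluation, distinct-count merging, result materialisation.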
The pre-aggregated table is a smaller table that has already done most of the work:
CREATE TABLE card_events_daily_by_tier (
event_date Date,
tier LowCardinality(String),
wam_state AggregateFunction(uniq, UInt64), -- HLL state
gmv_paise UInt64
)
ENGINE = SummingMergeTree
ORDER BY (event_date, tier);
Each row holds one day's pre-aggregated state for one tier. The AggregateFunction(uniq, ...) column stores a HyperLogLog sketch — not a count, but a state that can be merged across days to produce the 7-day distinct count without re-scanning the fact table. Why an HLL state and not just a number: COUNT DISTINCT is not additive. If Monday has 1.2M distinct members and Tuesday has 1.3M distinct members, you cannot get the 2-day distinct count by adding — many members appear on both days. HLL sketches can be merged: union two sketches, get the sketch of the union, read off the cardinality. This is the only way pre-aggregation works for distinct counts.
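The non-additivity argument can be shown in a few lines, using exact Python sets as a stand-in for HLL states (a real HLL sketch supports the same merge operation but returns an approximate cardinality in a few KB of state):

```python
# Why daily distinct counts cannot be summed, demonstrated with exact
# sets standing in for HLL sketches. Member IDs are invented.
monday = {101, 102, 103, 104}        # members active Monday
tuesday = {103, 104, 105}            # members active Tuesday

naive = len(monday) + len(tuesday)   # 7: wrong, double-counts 103 and 104
merged = monday | tuesday            # the sketch "merge" is set union here
correct = len(merged)                # 5: the true 2-day distinct count

assert naive != correct
```

This is exactly the operation the 7-day dashboard query performs: merge seven per-day states, then read one cardinality off the merged state.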
The dashboard query against the pre-aggregated table reads ~7 × 5 = 35 rows (7 days × 5 tiers), merges the HLL sketches, sums the paise column, and answers in 90 ms. The fact-table scan it replaced read 380 GB. The compression ratio of pre-aggregation on this workload is roughly 5 billion rows scanned to 35 rows scanned — an eight-orders-of-magnitude reduction.
The fundamental shift is moving cost from read time to write time. The 14-second scan still happens, but it happens once every 2 minutes inside the refresh job, not 200 times per minute inside the dashboard fan-out. If the dashboard is read 12,000 times an hour and the refresh runs 30 times an hour, the cost saving is roughly 400× on read computation and the trade is up-front: a bounded refresh budget per hour. Whether the trade is worth it depends on whether the dashboard's read fan-out is high enough.
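The 400× figure is just the ratio of how often the 14-second scan runs with and without the pre-aggregated table. A sketch with the paragraph's numbers:

```python
# How often the expensive scan actually runs, with vs. without the MV.
# Figures are the paragraph's hypothetical workload.
reads_per_hour = 12_000      # dashboard tile loads per hour
refreshes_per_hour = 30      # MV refresh cadence (every 2 minutes)
SCAN_SECONDS = 14            # cost of one full fact-table scan

# Without the MV, every read pays the scan; with it, only the refresh does.
scans_without_mv = reads_per_hour
scans_with_mv = refreshes_per_hour
saving_ratio = scans_without_mv / scans_with_mv               # 400.0
scan_seconds_avoided = (scans_without_mv - scans_with_mv) * SCAN_SECONDS
```

The refresh budget is the bounded up-front cost: 30 scans/hour happen no matter how many, or how few, people look at the dashboard.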
How four engines spell the same idea
The pattern is universal; the spelling is engine-specific. The same logical concept — "store this aggregation, refresh it, use it transparently" — has four distinct mechanisms in the four OLAP engines from the chapters before this one. Each spelling reflects a different bet about the workload.
-- 1. ClickHouse: PROJECTION inside the table
ALTER TABLE card_events ADD PROJECTION daily_by_tier (
SELECT toDate(event_time) AS event_date,
tier,
uniqState(member_id) AS wam_state,
sum(amount_paise) AS gmv_paise
GROUP BY event_date, tier
);
ALTER TABLE card_events MATERIALIZE PROJECTION daily_by_tier;
-- 2. StarRocks: async materialised view
CREATE MATERIALIZED VIEW mv_card_daily
REFRESH ASYNC EVERY (INTERVAL 2 MINUTE)
AS
SELECT date_trunc('day', event_time) AS event_date,
tier,
bitmap_union(to_bitmap(member_id)) AS wam_bitmap,
sum(amount_paise) AS gmv_paise
FROM card_events
GROUP BY 1, 2;
-- 3. Pinot: star-tree index inside the segment
{
"starTreeIndexConfigs": [{
"dimensionsSplitOrder": ["event_date", "tier"],
"skipStarNodeCreationForDimensions": [],
"functionColumnPairs": ["DISTINCTCOUNTHLL__member_id",
"SUM__amount_paise"],
"maxLeafRecords": 10000
}]
}
-- 4. Druid: rollup at ingestion time
{
"type": "index_parallel",
"spec": {
"dataSchema": {
"granularitySpec": { "queryGranularity": "DAY", "rollup": true },
"dimensionsSpec": { "dimensions": ["tier"] },
"metricsSpec": [
{ "type": "thetaSketch", "name": "wam_sketch", "fieldName": "member_id" },
{ "type": "longSum", "name": "gmv_paise", "fieldName": "amount_paise" }
]
}
}
}
# Output: dashboard p99 latency on 4.8B-row fact, after pre-agg
Engine     | Mechanism             | Refresh model         | Dashboard p99
ClickHouse | PROJECTION            | Synchronous w/ writes | 110 ms
StarRocks  | Async MV              | Every 2 min           | 90 ms
Pinot      | Star-tree index       | Built per-segment     | 45 ms
Druid      | Ingestion-time rollup | At ingestion only     | 30 ms
Cold-cache p99 (cluster restart, 0% local cache):
ClickHouse : 1.2 s | StarRocks : 4.0 s
Pinot : 80 ms | Druid : 50 ms
Walk through what each engine actually does:
- ClickHouse PROJECTION stores the pre-aggregation as part of the fact table itself. Every INSERT into card_events simultaneously updates the projection. The optimiser, at SELECT time, checks whether any defined projection covers the query and rewrites to scan the projection instead. The cost is on the writer side: every INSERT is now ~30% slower because it computes the aggregation state alongside the row write, but the reader pays nothing extra and there is zero refresh lag. This is the right choice when ingest rate is moderate (under 100k rows/sec/node) and dashboard freshness must be exact.
- StarRocks async MV is a separate table, refreshed on a schedule. The refresh job is incremental: it scans only the partitions of card_events that changed since the last refresh (tracked via partition-level timestamps), recomputes those slices' aggregations, and merges them. The cost is the refresh budget: at a 2-minute cadence over a 4.8B-row fact ingesting ~700M rows/day, each run re-scans roughly the last 2 minutes of arrivals — about 1M rows, a second or so of work — but it runs forever. The dashboard sees data up to 2 minutes stale. This is the right choice when ingest is high and a small staleness window is acceptable.
- Pinot star-tree is a per-segment auxiliary index. When a segment is built (typically every 10 minutes during real-time ingestion, or once per batch), the star-tree is built alongside the column data, holding pre-aggregated values for every prefix of the dimensions. Queries that match the dimension order traverse the star-tree directly; queries that don't fall back to the column scan. Why a star-tree and not a flat aggregation table: a star-tree handles the cross-product of pre-aggregations efficiently. With dimensions (event_date, tier), the star-tree has nodes for every prefix — (event_date), then (event_date, tier) — plus a "star node" that wildcards a dimension. A query asking only by tier (not by date) hits the star-aggregated path; a query asking by (date, tier) hits the leaf path; a query asking by an unrelated dimension falls through. One index, many query shapes.
- Druid rollup happens at ingestion. Druid does not store the raw card_events; it stores the already-aggregated (event_date, tier) rows, with HLL sketches and sums in their place. The raw rows are gone — Druid's bet is that you will only query at the aggregation grain you declared. If you later need a query at finer grain, you re-ingest from the source. The compression is brutal: 4.8B raw events compress to 1,400 rolled-up rows, about three million to one. The trade is no drill-down past the rollup grain.
The thing all four spellings share is that the optimiser must know how to rewrite. A user submits a query against the fact table; the engine has to recognise that some pre-aggregated structure can answer it. The recognition is non-trivial: the projection/MV/star-tree must cover the query's dimensions exactly (or be a superset where a residual filter can be applied), the aggregation functions must be compatible (SUM is additive, COUNT DISTINCT is not without HLL, MIN/MAX work, MEDIAN does not), and the time window must be representable in the pre-agg's grain.
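A toy version of that recognition check makes the three conditions concrete. Everything here (names, rules) is invented and deliberately simplified; no real engine's planner looks like this:

```python
# Illustrative sketch of the optimiser's rewrite check: dimension
# coverage, aggregate mergeability, and time-grain alignment.
MERGEABLE = {"SUM", "MIN", "MAX", "COUNT"}   # combinable from partials
NEEDS_SKETCH = {"COUNT_DISTINCT"}            # mergeable only via HLL state
NOT_MERGEABLE = {"MEDIAN"}                   # no finite mergeable state

def can_rewrite(query_dims, query_aggs, mv_dims, mv_has_sketch_states,
                query_grain_minutes, mv_grain_minutes):
    # 1. Dimension coverage: the MV's group-by must be a superset of the
    #    query's, so a residual re-aggregation can collapse the extras.
    if not set(query_dims) <= set(mv_dims):
        return False
    # 2. Aggregate compatibility: each aggregate must be re-derivable
    #    from the MV's per-group partial states.
    for agg in query_aggs:
        if agg in NOT_MERGEABLE:
            return False
        if agg in NEEDS_SKETCH and not mv_has_sketch_states:
            return False
    # 3. Grain: the query window must be expressible in the MV's grain.
    return query_grain_minutes % mv_grain_minutes == 0

hit = can_rewrite(["tier"], ["SUM", "COUNT_DISTINCT"],
                  ["event_date", "tier"], True, 1440, 1440)   # rewrite hits
miss = can_rewrite(["tier"], ["MEDIAN"],
                   ["event_date", "tier"], True, 1440, 1440)  # falls through
```

Note what a miss means in production: not an error, just a silent fall-through to the full fact-table scan, which is Cost 3 below.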
The four hidden costs
Pre-aggregation looks free. The dashboard goes from 14 seconds to 90 ms; the user is happy; the bill stays the same. Then the costs show up over the next quarter, and they show up in places nobody warned the team about.
Cost 1: refresh latency becomes staleness on the dashboard
Riya's StarRocks MV refreshes every 2 minutes. That sounds tight. But the refresh itself takes 30–60 seconds on a busy cluster (longer if the fact table is being written to concurrently and the snapshot isolation has to wait for in-flight writes to commit). Add the 2-minute sleep between runs, and the worst-case staleness on the dashboard tile is 3 minutes 30 seconds. For a "weekly active members" tile this is fine. For a fraud-detection tile that should reflect the last 30 seconds of card events, it is a blocker. The team's first instinct is to drop the refresh interval to 30 seconds; the second instinct, after watching the cluster CPU stay pinned at 80% on refresh runs, is to revert. The refresh budget is a hard physical constraint set by how fast the underlying scan can read.
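The worst-case arithmetic, with the text's figures plus an assumed 30-second commit-wait for in-flight writes (the text mentions the wait but does not put a number on it):

```python
# Worst-case staleness: an event can land just after a refresh snapshot,
# then wait out the full interval plus the next run. Interval and run
# time are from the text; the 30 s snapshot wait is an assumption.
REFRESH_INTERVAL_S = 120    # 2-minute sleep between runs
REFRESH_RUN_S = 60          # busy-cluster upper bound on the refresh itself
SNAPSHOT_WAIT_S = 30        # assumed wait for in-flight writes to commit

worst_case_staleness_s = REFRESH_INTERVAL_S + REFRESH_RUN_S + SNAPSHOT_WAIT_S
# 210 s, i.e. the 3 minutes 30 seconds quoted for the dashboard tile
```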
Cost 2: write amplification you don't see in the storage bill until it's too late
Every projection, MV, and rollup is more bytes written. For Cred's card_events with one daily-by-tier projection, write amplification is ~1.05× (the HLL state is small). Add 8 more projections — one per dimension permutation the dashboards need — and write amplification climbs to 1.7×. At 700M events/day × 80 bytes ≈ 56 GB raw daily, the projections add another 40 GB/day. At ₹2.30/GB on S3 standard tier, this is ₹92/day, ₹2,800/month. Forgettable. Now imagine Razorpay, with 200M payment events/day across 50 dashboards each requiring 3 projections — at the same write amplification ratios, that's ~1.1 TB/day of pre-aggregated state, ₹76,000/month, before the refresh-job compute cost.
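The Cred-side arithmetic as a sketch, using the paragraph's assumed price and amplification figures:

```python
# Write-amplification storage cost. Price and amplification ratios are
# the chapter's assumptions, not a quoted AWS bill.
RAW_GB_PER_DAY = 56      # 700M events/day * 80 bytes
WRITE_AMP = 1.7          # after 9 projections (1 original + 8 added)
INR_PER_GB = 2.30        # assumed S3 standard-tier figure from the text

extra_gb_per_day = RAW_GB_PER_DAY * (WRITE_AMP - 1)   # ~39 GB of projections
inr_per_day = extra_gb_per_day * INR_PER_GB           # ~Rs 90/day
inr_per_month = inr_per_day * 30                      # ~Rs 2,700/month
```

The point of the sketch is the scaling behaviour, not the rupee figure: the extra bytes scale linearly with both the raw ingest volume and the number of projections, which is how the same arithmetic turns "forgettable" at Cred's scale into a real line item at Razorpay's.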
Cost 3: silent rewrite failure when the query shape drifts
The biggest production hazard: the query rewrite is best-effort, not guaranteed. A user runs the dashboard query and it answers in 90 ms — pre-agg hit. A different user changes one filter (WHERE tier IN ('Platinum', 'Gold') instead of WHERE tier = 'Platinum') and the query takes 14 seconds — pre-agg miss, fell through to fact-table scan. Neither user sees a warning; the query just runs. The dashboard team adds a new "events per hour" tile assuming the same speed and watches it answer in 18 seconds, then files a JIRA against the data-engineering team. The fix is usually "add another projection" — but every added projection costs Cost 2.
The query patterns that miss vary by engine: ClickHouse fails to rewrite when the query has a JOIN even if the JOIN side is a tiny dim that could be applied as a residual filter; StarRocks fails when an aggregation is wrapped in coalesce() or a CASE; Pinot fails when the query asks by a dimension not in the star-tree's split order. Each engine documents these limitations, but every team learns them by hitting them in production. Why no engine guarantees rewrite: query rewrite is a search problem. Given a query Q and a set of pre-agg structures M1, M2, ..., the optimiser must check whether some Mi (or a combination) can answer Q with at most a residual filter and an aggregation step. The check is computationally bounded by the optimiser's planning time budget — typically 100–500 ms — and complex query shapes (multiple joins, nested aggregations, expressions in the GROUP BY) can push the rewrite-search beyond budget. The engine then falls back to the original query; correctness is preserved, latency is not.
Cost 4: the operational graveyard of refresh jobs
A year into production, the team has 47 materialised views, 6 of which were built for dashboards that were deprecated 8 months ago. Every one of them is still refreshing. The refresh jobs are invisible until you look — they don't show up in the BI tool, they don't have an owner column on the cluster's job page, and the cost-attribution layer (Build 16) was not configured to label them with a project. Cleaning up unused MVs is a quarterly task that nobody enjoys; missing the cleanup means the refresh budget gets eaten by dead jobs and the live dashboards start missing their staleness SLAs.
The Cred team's eventual fix is a metadata.materialised_views table that records, for each MV, who created it, which dashboard URL queries it, and the last time the rewrite layer matched a query against it. MVs unmatched for 30 days are flagged for deletion. This solves the operational graveyard but does not solve Costs 1–3.
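A minimal sketch of the sweep over such a metadata table, with illustrative field names (the real metadata.materialised_views schema is whatever the team defined):

```python
# Dead-MV sweep: flag MVs whose rewrite layer has matched no query in
# the last N days. Records and field names are illustrative.
from datetime import date, timedelta

def flag_for_deletion(mvs, today, unmatched_days=30):
    """Return names of MVs with no recent rewrite match."""
    cutoff = today - timedelta(days=unmatched_days)
    return [mv["name"] for mv in mvs
            if mv["last_rewrite_match"] is None
            or mv["last_rewrite_match"] < cutoff]

mvs = [
    {"name": "mv_card_daily", "owner": "riya",
     "last_rewrite_match": date(2024, 6, 1)},
    {"name": "mv_old_growth_tile", "owner": "unknown",
     "last_rewrite_match": date(2024, 1, 10)},  # dashboard long deprecated
]
dead = flag_for_deletion(mvs, today=date(2024, 6, 10))
# dead == ["mv_old_growth_tile"]
```

The hard part in practice is not this query; it is populating last_rewrite_match at all, which requires the attribution plumbing from Build 16.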
When pre-aggregation is the wrong answer
Three workload shapes where adding a materialised view makes things worse, not better.
High-cardinality dimensions where the pre-agg is nearly the same size as the fact. If card_events is grouped by member_id (which has 60M distinct values), the pre-aggregated table has 60M rows × 7 days ≈ 420M rows — almost 10% of the fact table. The compression ratio is now 10:1, not 3,000,000:1, and the read-side savings are marginal while the write-side cost is full.
Ad-hoc exploration. If the analyst pool runs unpredictable queries — different filters, different group-bys, different time windows — no fixed set of MVs covers the workload. Every new query is a coin flip on whether the rewrite hits. The right architecture is a fast scan engine (ClickHouse, StarRocks) without MVs, sized for the worst-case full scan.
Fact tables that change retroactively. If a row in card_events from 30 days ago gets updated (a chargeback, a refund), every MV that aggregates over that day must be rebuilt for that day. ClickHouse's projection model does not handle this gracefully (the projection is computed on insert, not on update); StarRocks' MV must do a partition-level refresh of the affected partition. For OLAP workloads where the past is mutable (financial reconciliation, GST filings), the refresh cost can dominate.
Common confusions
- "A materialised view is the same as a view." A view is a saved query; the rows do not exist until you select from the view, at which point the underlying query runs. A materialised view stores the rows; selects from it read storage, not recompute. The two have nearly opposite cost profiles: views are free at definition and expensive at read; MVs are expensive at definition and refresh, free at read.
- "Pre-aggregation always speeds up queries." Only if the query rewrite hits. A miss runs the original full-table scan — which is now slightly slower because the engine spent 100ms in the rewrite-search before falling through. The hit rate matters; below ~70% hit rate the maintenance cost typically exceeds the read savings.
- "COUNT DISTINCT can be pre-aggregated by storing daily distinct counts." It cannot. Distinct counts are not additive across days. The only way to pre-aggregate distincts is to store a sketch (HLL, Theta, KMV) that supports merge operations. The exact-distinct alternative requires storing the full set of distinct keys per partition, which often exceeds the size of the fact table.
- "Druid rollup and ClickHouse SummingMergeTree are the same thing." Both pre-aggregate at write time, but Druid rollup discards the raw rows entirely while ClickHouse SummingMergeTree merges at background-merge time and can fall back to row-level data for queries that need it. The Druid approach gives smaller storage but no drill-down; the ClickHouse approach gives drill-down at the cost of bigger storage and merge-load.
- "You can refresh an MV more often by adding more compute." Up to a point. The bottleneck is usually the source-table scan throughput, which is limited by storage IO. Adding more BEs/workers helps until you saturate the storage bandwidth (S3 per-prefix throttling at 5,500 reads/sec is a common ceiling). Past the saturation point, more compute waits on storage.
- "Star-trees in Pinot are just compressed materialised views." They are different data structures with different access patterns. An MV is a flat table; you scan it. A star-tree is a tree; you traverse it from the root following the dimensions in the query, terminating at a leaf or aggregated node. The traversal is logarithmic in the dimension cardinality, not linear in the row count.
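The break-even claim in the hit-rate bullet above can be explored with a toy cost model. All figures are assumptions, and the ~70% figure in the text also folds in human maintenance cost, which this compute-only sketch omits; the point is the shape of the curve, not the numbers:

```python
# Toy break-even model for rewrite hit rate. All inputs are assumed.
SCAN_S = 14.0              # full fact-table scan (a miss, or no MV)
MV_READ_S = 0.09           # pre-agg read on a hit
REWRITE_OVERHEAD_S = 0.1   # planner search paid on every query
REFRESH_S_PER_HOUR = 30 * 14.0   # 30 refreshes/hour, each ~one scan

def net_saving_per_hour(hit_rate, queries_per_hour=12_000):
    """Scan-seconds saved per hour by having the MV, at a given hit rate."""
    hits = queries_per_hour * hit_rate
    misses = queries_per_hour - hits
    cost_with_mv = (hits * MV_READ_S + misses * SCAN_S
                    + queries_per_hour * REWRITE_OVERHEAD_S
                    + REFRESH_S_PER_HOUR)
    cost_without = queries_per_hour * SCAN_S
    return cost_without - cost_with_mv

# Saving grows with hit rate; at 0% the MV is pure overhead.
assert net_saving_per_hour(0.9) > net_saving_per_hour(0.3) > 0
assert net_saving_per_hour(0.0) < 0
```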
Going deeper
Incremental view maintenance and the math of "additive"
The theoretical foundation for materialised views is incremental view maintenance (IVM): given a base table T and a view V = f(T), when T changes by ΔT, can V be updated to f(T ∪ ΔT) without recomputing f from scratch? The answer depends on the structure of f. SUM, COUNT, and MIN/MAX over insert-only streams are straightforwardly incremental: V_new = V_old + f(ΔT). MEDIAN, PERCENTILE, and COUNT DISTINCT are not — they require a sketch or full state.
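The additive/holistic split in code, assuming an insert-only stream: SUM folds each delta batch in time proportional to the batch, while MEDIAN is forced to keep the full state.

```python
# Incremental view maintenance for an additive aggregate vs. a holistic
# one, over an insert-only stream of delta batches.
import statistics

view_sum = 0        # the maintained view for SUM
history = []        # the full state MEDIAN cannot avoid keeping

def apply_delta(rows):
    """rows: amounts in one insert-only delta batch (the dT of the text)."""
    global view_sum
    view_sum += sum(rows)    # V_new = V_old + f(dT): no rescan of history
    history.extend(rows)     # holistic aggregates need everything so far

apply_delta([100, 250])
apply_delta([50])
assert view_sum == 400                    # maintained in O(|delta|)
assert statistics.median(history) == 100  # recomputed from full state
```

Under updates and deletes even SUM stops being this simple, which is the problem differential dataflow exists to solve.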
The Materialize project (a streaming database company) productised this distinction by computing a "differential dataflow" of every change to T and propagating it through the view definition. Their engine maintains the view exactly under inserts, updates, and deletes, with worst-case latency proportional to the size of the change rather than the size of T. The trade-off is a higher per-row maintenance cost than ClickHouse's batch projections, balanced by the guarantee of consistency. For Indian fintechs that need exact reconciliation views (Razorpay's settlement dashboard, Zerodha's end-of-day P&L), differential dataflow is increasingly the right answer.
The Star-tree paper and the n-ary cube
Pinot's star-tree is a generalisation of the data cube concept from OLAP literature (Gray et al., "Data Cube: A Relational Aggregation Operator", 1996). The classical cube pre-computes every cell of the n-dimensional aggregation hypercube; for d dimensions each with cardinality c, the cube has c^d cells, which is intractable. The star-tree's compromise is to pre-compute only along a chosen split-order: dimension 1 first, then dimension 2 inside each dim-1 bucket, etc. The "star" suffix comes from a wildcard node at every split that aggregates across all values of that dimension — letting queries skip a dimension at low cost. The result is a tree whose size is bounded by the split order and the maxLeafRecords cap, far below the c^d cells of the full cube.
The split-order choice is a design decision. Put the most-filtered dimension first so its filter is tight high in the tree; put the least-cardinality dimension last so the leaves don't explode. Pinot's ingestion pipeline lets the user override the auto-chosen order via the table config; for Riya's dashboard, the auto-order picked (event_date, tier) correctly because date filters are tight and tier has low cardinality.
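A miniature star-tree makes the star-node idea concrete. This is a toy flattened to a dict, not Pinot's actual structure; the point is that a wildcard entry per dimension lets a query skip that dimension cheaply.

```python
# Toy star-tree for split order (event_date, tier): alongside every real
# value, a "*" entry holds the aggregate across all values of that
# dimension. Rows are invented.
STAR = "*"

rows = [  # (event_date, tier, amount)
    ("d1", "Gold", 10), ("d1", "Platinum", 20),
    ("d2", "Gold", 5),  ("d2", "Platinum", 15),
]

tree = {}
for d, t, amt in rows:
    for dk in (d, STAR):          # real value and wildcard, per dimension
        for tk in (t, STAR):
            tree[(dk, tk)] = tree.get((dk, tk), 0) + amt

def query(event_date=STAR, tier=STAR):
    """Follow the split order; STAR skips a dimension via its star node."""
    return tree[(event_date, tier)]

by_tier = query(tier="Gold")          # 15: star-date path, no date scan
leaf = query("d1", "Platinum")        # 20: leaf path
total = query()                       # 50: grand total at the root star
```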
Concurrent writes and refresh isolation
When ingestion is high (Flipkart Big Billion Days writing 2M events/sec into the fact table) and the MV refresh runs every 2 minutes, the refresh job sees a snapshot of the fact table that is already 2 minutes behind by the time it finishes. Worse, the snapshot itself can be inconsistent if the engine doesn't support snapshot isolation on the source. StarRocks, when the source is an Iceberg-backed table, uses Iceberg snapshots as the consistency point — the refresh reads at snapshot N, ingestion writes snapshot N+1, the next refresh reads N+1. ClickHouse's SummingMergeTree does not have explicit snapshots and relies on append-only insert semantics; if the inserts are not append-only (UPDATE or REPLACE), the projection state can diverge from the row state until a background merge reconciles them.
The Indian fintech operational pattern for this is: lock the schema, accept "append-only with corrections via tombstone" (insert a negative-amount row to cancel a positive one), and refresh MVs with explicit REFRESH MATERIALIZED VIEW ... ALL PARTITIONS once a day overnight to catch any drift. The 2-minute MV is for live dashboards; the overnight refresh is the source of truth for the morning leadership review.
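The tombstone pattern in miniature (illustrative rows, not a real schema). Note it only keeps additive aggregates correct; an HLL sketch cannot subtract a member, which is one more reason the overnight full refresh stays the source of truth.

```python
# Append-only with corrections via tombstone: a chargeback is recorded
# as a negative-amount row rather than an UPDATE on history, so additive
# pre-aggregations stay correct on the next merge.
events = [
    ("m1", "Gold", 50_000),   # original card event, amount in paise
    ("m2", "Gold", 30_000),
]
events.append(("m1", "Gold", -50_000))  # tombstone: cancels the first row

gmv_paise = sum(amount for _, _, amount in events)
# gmv_paise == 30_000: SUM absorbs the correction without touching history
```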
The cost of forgetting an MV
Riya's team eventually built a query-attribution layer that tagged every refresh job with the dashboard URL it served and the last time that dashboard was loaded. They found 6 of 47 MVs were serving zero queries. The cumulative refresh cost across those 6 MVs was ₹4.2 lakh/year; killing them took an afternoon.
The pattern generalises: in any team that has been on an OLAP engine for more than 18 months, 10–20% of the materialised views are dead. Detecting them requires query-attribution (linking the query to the rewrite hit), which most teams do not set up until they have already paid the bill. Build 16 (multi-tenancy and cost) covers the attribution mechanics; this chapter is the workload-side reason to care about it.
Where this leads next
- /wiki/serving-p99-latency-under-ingest-pressure — what happens to dashboard latency when ingestion competes with refresh and read traffic on the same cluster.
- /wiki/cost-attribution-which-team-paid-for-that-query — Build 16; how to track the refresh-job cost back to the dashboard that needs it.
- /wiki/incremental-view-maintenance-the-math — Build 10; the theory of differential dataflow that explains which aggregations can be incrementalised exactly.
The pre-aggregation pattern is older than any of the engines that implement it — Codd coined the term OLAP in 1993, Gray's data cube paper followed in 1996, OLAP cubes shipped in Hyperion and Microsoft Analysis Services in the late 1990s, and every BI tool has had some form of "drill-down with stored aggregates" for thirty years. What has changed is that the storage is cheap enough now that you can afford 10–20 MVs without thinking about it, and the query engines are smart enough to rewrite for you. What has not changed is the underlying physics: every MV is a contract with the future that says "this query shape will keep being asked in this form." When the dashboard changes, the contract breaks, and the cost lands on the data engineer's desk at 2 a.m.
References
- Data Cube: A Relational Aggregation Operator (Gray, Bosworth, Layman, Pirahesh, 1996) — the foundational paper for OLAP pre-aggregation. Defines the n-dimensional cube and the GROUPING SETS / ROLLUP / CUBE operators.
- HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm (Flajolet et al., 2007) — the sketch that makes pre-aggregated COUNT DISTINCT possible.
- Differential Dataflow (McSherry, Murray, Isaacs, Isard, CIDR 2013) — the modern incremental-view-maintenance foundation, productised by Materialize.
- Apache Pinot Star-tree Index documentation — the implementation reference and split-order tuning guide.
- ClickHouse Projections documentation — the synchronous-with-write pre-aggregation model.
- StarRocks Asynchronous Materialized Views — the scheduled-refresh model with query rewrite.
- /wiki/druid-and-its-segment-model — Druid's ingestion-time rollup, where pre-aggregation is the default and raw rows are not stored.
- /wiki/starrocks-doris-and-the-next-wave — the engine-side context for the StarRocks async-MV mechanism described above.