Pre-aggregation, materialized views, and their costs

Riya runs analytics at Cred. The "weekly active members" tile on the leadership dashboard answers in 90 ms; the dashboard refreshes every 30 seconds across 200 tabs open in Bengaluru and Gurugram. The query underneath that tile, if she ran it raw, would scan 4.8 billion event rows over a sliding 7-day window and group by member tier — about 14 seconds on a hot ClickHouse cluster, 90 seconds cold. The 90 ms answer comes from a pre-aggregated table, refreshed every two minutes by a scheduled job, that holds 1,400 rows: one per (tier, day) pair. The dashboard hits the small table; the small table hits the large table; the large table is rebuilt by a job Riya has to keep alive. Pre-aggregation buys the latency. It also creates four new problems she did not have before.

A materialised view is a query whose result is stored as a table and refreshed when the inputs change; the optimiser can rewrite future queries to read the small result instead of the large input. Every OLAP engine implements this pattern with different trade-offs — ClickHouse projections, StarRocks async MVs, Pinot star-trees, Druid rollup — but the costs are the same: refresh latency that puts a staleness bound on the dashboard, write amplification that inflates the storage bill, and a query-rewrite layer that fails silently when it cannot match.

What pre-aggregation actually is

Riya's raw fact table is card_events with columns (member_id, tier, event_type, event_time, amount_paise). The dashboard tile asks: for the last 7 days, by tier, how many distinct members were active and what was the total reward GMV? Written as SQL:

SELECT  tier,
        COUNT(DISTINCT member_id)        AS wam,
        SUM(amount_paise) / 100          AS gmv_rupees
FROM    card_events
WHERE   event_time >= now() - INTERVAL 7 DAY
GROUP   BY tier;

Run against the 4.8-billion-row fact table, this is a wide scan: 7 days × ~700M events/day × ~80 bytes/row of touched columns ≈ 380 GB. Even a vectorised engine on hot NVMe scans ~10 GB/s per node; on an 8-node ClickHouse cluster that's ~5 seconds at the absolute floor, and 14 seconds in practice once predicate evaluation, distinct-count merging, and result materialisation are added.
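The arithmetic behind those numbers is worth making explicit. A quick back-of-envelope sketch, using the figures from the paragraph above (all of them assumptions from the text, not measurements):

```python
# Back-of-envelope for the raw scan, using the chapter's figures.
rows_per_day = 700e6
days = 7
bytes_per_row = 80          # touched columns only, not the full row
nodes = 8
scan_gbps_per_node = 10     # hot NVMe, vectorised engine

scanned_gb = rows_per_day * days * bytes_per_row / 1e9
floor_seconds = scanned_gb / (nodes * scan_gbps_per_node)

print(f"{scanned_gb:.0f} GB scanned, {floor_seconds:.1f} s floor")
```

The floor is pure scan bandwidth; the observed 14 seconds is that floor plus predicate evaluation, distinct-count merging, and result materialisation.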

The pre-aggregated table is a smaller table that has already done most of the work:

CREATE TABLE card_events_daily_by_tier (
    event_date  Date,
    tier        LowCardinality(String),
    wam_state   AggregateFunction(uniq, UInt64),  -- HLL state
    gmv_paise   UInt64
)
ENGINE = SummingMergeTree
ORDER BY (event_date, tier);

Each row holds one day's pre-aggregated state for one tier. The AggregateFunction(uniq, ...) column stores a HyperLogLog sketch — not a count, but a state that can be merged across days to produce the 7-day distinct count without re-scanning the fact table. Why an HLL state and not just a number: COUNT DISTINCT is not additive. If Monday has 1.2M distinct members and Tuesday has 1.3M distinct members, you cannot get the 2-day distinct count by adding — many members appear on both days. HLL sketches can be merged: union two sketches, get the sketch of the union, read off the cardinality. This is the only way pre-aggregation works for distinct counts.
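The non-additivity is easy to demonstrate. In the sketch below, plain Python sets stand in for HLL sketches; real HLLs trade exactness for a few KB of fixed-size state, but the merge semantics are the same:

```python
# Why COUNT DISTINCT needs a mergeable state: plain counts don't add,
# but set (or HLL-sketch) union does. Sets stand in for HLLs here.
monday = {1, 2, 3, 4}        # member IDs active Monday
tuesday = {3, 4, 5}          # member IDs active Tuesday (3 and 4 repeat)

wrong = len(monday) + len(tuesday)   # naive addition double-counts repeats
right = len(monday | tuesday)        # union-then-count, like merging HLL states

assert wrong == 7 and right == 5
```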

The dashboard query against the pre-aggregated table reads ~7 × 5 = 35 rows (7 days × 5 tiers), merges the HLL sketches, sums the paise column, and answers in 90 ms. The fact-table scan it replaced read 380 GB. The compression ratio of pre-aggregation on this workload is roughly 5 billion rows scanned to 35 rows scanned — an eight-orders-of-magnitude reduction.

[Figure: Two read paths to the same answer. Without pre-aggregation, the dashboard query scans the 4.8 B-row card_events fact table (380 GB over the 7-day window) and renders in 14 s. With pre-aggregation, it reads 35 of the 1,400 rows in card_events_daily_by_tier (~2 KB) and renders in 90 ms; both paths return 5 rows by tier. A refresh job runs every 2 minutes, scanning the fact table incrementally, writing the new daily slice, and merging existing daily HLL sketches, at ~120 GB/day of writes.]
The dashboard reads from the pre-aggregated table; the refresh job pays the cost of building it. The 14-second scan does not vanish — it gets paid up-front, asynchronously, by the writer rather than the reader.

The fundamental shift is moving cost from read time to write time. The 14-second scan still happens, but it happens once every 2 minutes inside the refresh job, not 200 times per minute inside the dashboard fan-out. If the dashboard is read 12,000 times an hour and the refresh runs 30 times an hour, the cost saving is roughly 400× on read computation and the trade is up-front: a bounded refresh budget per hour. Whether the trade is worth it depends on whether the dashboard's read fan-out is high enough.
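The 400× figure is just the ratio of reads to refreshes, since each side pays the same scan:

```python
# The trade in the paragraph above: reads per hour vs refreshes per hour.
reads_per_hour = 12_000
refreshes_per_hour = 30
scan_seconds = 14            # cost of one raw fact-table scan

without_preagg = reads_per_hour * scan_seconds      # every read pays the scan
with_preagg = refreshes_per_hour * scan_seconds     # only the refresh job pays

print(without_preagg / with_preagg)   # -> 400.0
```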

How four engines spell the same idea

The pattern is universal; the spelling is engine-specific. The same logical concept — "store this aggregation, refresh it, use it transparently" — has four distinct mechanisms in the four OLAP engines from the chapters before this one. Each spelling reflects a different bet about the workload.

-- 1. ClickHouse: PROJECTION inside the table
ALTER TABLE card_events ADD PROJECTION daily_by_tier (
    SELECT  toDate(event_time) AS event_date,
            tier,
            uniqState(member_id) AS wam_state,
            sum(amount_paise)    AS gmv_paise
    GROUP BY event_date, tier
);
ALTER TABLE card_events MATERIALIZE PROJECTION daily_by_tier;

-- 2. StarRocks: async materialised view
CREATE MATERIALIZED VIEW mv_card_daily
REFRESH ASYNC EVERY (INTERVAL 2 MINUTE)
AS
SELECT  date_trunc('day', event_time) AS event_date,
        tier,
        bitmap_union(to_bitmap(member_id)) AS wam_bitmap,
        sum(amount_paise) AS gmv_paise
FROM    card_events
GROUP BY 1, 2;

-- 3. Pinot: star-tree index inside the segment
{
  "starTreeIndexConfigs": [{
    "dimensionsSplitOrder":  ["event_date", "tier"],
    "skipStarNodeCreationForDimensions": [],
    "functionColumnPairs":   ["DISTINCTCOUNTHLL__member_id",
                              "SUM__amount_paise"],
    "maxLeafRecords":        10000
  }]
}

-- 4. Druid: rollup at ingestion time
{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "granularitySpec": { "queryGranularity": "DAY", "rollup": true },
      "dimensionsSpec":  { "dimensions": ["tier"] },
      "metricsSpec": [
        { "type": "thetaSketch", "name": "wam_sketch", "fieldName": "member_id" },
        { "type": "longSum",     "name": "gmv_paise",  "fieldName": "amount_paise" }
      ]
    }
  }
}

# Output: dashboard p99 latency on 4.8B-row fact, after pre-agg

Engine     | Mechanism             | Refresh model           | Dashboard p99
ClickHouse | PROJECTION            | Synchronous with writes | 110 ms
StarRocks  | Async MV              | Every 2 min             | 90 ms
Pinot      | Star-tree index       | Built per segment       | 45 ms
Druid      | Ingestion-time rollup | At ingestion only       | 30 ms

Cold-cache p99 (cluster restart, 0% local cache):
ClickHouse: 1.2 s | StarRocks: 4.0 s | Pinot: 80 ms | Druid: 50 ms

What all four spellings share is that the optimiser must know how to rewrite. A user submits a query against the fact table; the engine has to recognise that some pre-aggregated structure can answer it. The recognition is non-trivial: the projection/MV/star-tree must cover the query's dimensions exactly (or be a superset to which a residual filter can be applied), the aggregation functions must be compatible (SUM is additive, COUNT DISTINCT is not without HLL, MIN/MAX work, MEDIAN does not), and the time window must be representable in the pre-agg's grain.
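A toy sketch of that matching check, with illustrative names rather than any engine's actual API (real optimisers also handle residual filters and expression matching):

```python
# Toy version of the three rewrite conditions: dimension coverage,
# re-aggregatable functions, and grain alignment. Illustrative only.
ADDITIVE = {"sum", "count", "min", "max"}      # safe to re-aggregate directly
MERGEABLE = ADDITIVE | {"uniq_hll"}            # HLL state makes distinct mergeable

def can_rewrite(query_dims, query_aggs, query_grain_s,
                mv_dims, mv_aggs, mv_grain_s):
    # 1. The MV must carry every dimension the query groups or filters by.
    if not set(query_dims) <= set(mv_dims):
        return False
    # 2. Every aggregation must be re-aggregatable from the stored state.
    if not set(query_aggs) <= MERGEABLE or not set(query_aggs) <= set(mv_aggs):
        return False
    # 3. The query's time window must be expressible in the MV's grain.
    return query_grain_s % mv_grain_s == 0

# 7-day-by-tier tile against the daily-by-tier MV: hit.
assert can_rewrite(["tier"], ["sum", "uniq_hll"], 86400,
                   ["event_date", "tier"], ["sum", "uniq_hll"], 86400)
# Hourly tile against a daily MV: miss -- the grain is too coarse.
assert not can_rewrite(["tier"], ["sum"], 3600,
                       ["event_date", "tier"], ["sum"], 86400)
```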

The four hidden costs

Pre-aggregation looks free. The dashboard goes from 14 seconds to 90 ms; the user is happy; the bill stays the same. Then the costs show up over the next quarter, and they show up in places nobody warned the team about.

Cost 1: refresh latency becomes staleness on the dashboard

Riya's StarRocks MV refreshes every 2 minutes. That sounds tight. But the refresh itself takes 30–60 seconds on a busy cluster (longer if the fact table is being written to concurrently and the snapshot isolation has to wait for in-flight writes to commit). Add the 2-minute sleep between runs, and the worst-case staleness on the dashboard tile is 3 minutes 30 seconds. For a "weekly active members" tile this is fine. For a fraud-detection tile that should reflect the last 30 seconds of card events, it is a blocker. The team's first instinct is to drop the refresh interval to 30 seconds; the second instinct, after watching the cluster CPU stay pinned at 80% on refresh runs, is to revert. The refresh budget is a hard physical constraint set by how fast the underlying scan can read.
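One way to reconcile the 3-minute-30-second figure: an event that lands just after a refresh snapshots the fact table misses that refresh entirely, then waits out the full sleep and the whole next refresh. Taking the mid-range 45-second refresh duration from the text:

```python
# Worst-case tile staleness: an event arriving just after a refresh
# snapshot waits out the rest of that refresh, the sleep, and the
# entirety of the next refresh before it is visible to the dashboard.
def worst_case_staleness(interval_s, refresh_s):
    return refresh_s + interval_s + refresh_s

print(worst_case_staleness(120, 45))   # -> 210 seconds, i.e. 3 min 30 s
```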

Cost 2: write amplification you don't see in the storage bill until it's too late

Every projection, MV, and rollup is more bytes written. For Cred's card_events with one daily-by-tier projection, write amplification is ~1.05× (the HLL state is small). Add 8 more projections — one per dimension permutation the dashboards need — and write amplification climbs to 1.7×. At 700M events/day × 80 bytes ≈ 56 GB raw daily, the projections add another 40 GB/day. At ₹2.30/GB on S3 standard tier, this is ₹92/day, ₹2,800/month. Forgettable. Now imagine Razorpay, with 200M payment events/day across 50 dashboards each requiring 3 projections — at the same write amplification ratios, that's ~1.1 TB/day of pre-aggregated state, ₹76,000/month, before the refresh-job compute cost.
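The storage arithmetic, with the text's assumed ₹2.30/GB rate (the text rounds the result to 40 GB/day and ₹92/day):

```python
# Write-amplification cost sketch using the chapter's figures.
raw_gb_per_day = 700e6 * 80 / 1e9          # 56 GB of raw events per day
amplification = 1.7                        # with 9 projections in place
projection_gb_per_day = raw_gb_per_day * (amplification - 1)

rupees_per_gb = 2.30                       # assumed S3 standard-tier rate
daily_cost = projection_gb_per_day * rupees_per_gb

print(f"{projection_gb_per_day:.0f} GB/day extra, about Rs {daily_cost:.0f}/day")
```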

Cost 3: silent rewrite failure when the query shape drifts

The biggest production hazard: the query rewrite is best-effort, not guaranteed. A user runs the dashboard query and it answers in 90 ms — pre-agg hit. A different user changes one filter (WHERE tier IN ('Platinum', 'Gold') instead of WHERE tier = 'Platinum') and the query takes 14 seconds — pre-agg miss, fell through to fact-table scan. Neither user sees a warning; the query just runs. The dashboard team adds a new "events per hour" tile assuming the same speed and watches it answer in 18 seconds, then files a JIRA against the data-engineering team. The fix is usually "add another projection" — but every added projection costs Cost 2.

The query patterns that miss vary by engine: ClickHouse fails to rewrite when the query has a JOIN even if the JOIN side is a tiny dim that could be applied as a residual filter; StarRocks fails when an aggregation is wrapped in coalesce() or a CASE; Pinot fails when the query asks by a dimension not in the star-tree's split order. Each engine documents these limitations, but every team learns them by hitting them in production. Why no engine guarantees rewrite: query rewrite is a search problem. Given a query Q and a set of pre-agg structures M1, M2, ..., the optimiser must check whether some Mi (or a combination) can answer Q with at most a residual filter and an aggregation step. The check is computationally bounded by the optimiser's planning time budget — typically 100–500 ms — and complex query shapes (multiple joins, nested aggregations, expressions in the GROUP BY) can push the rewrite-search beyond budget. The engine then falls back to the original query; correctness is preserved, latency is not.

Cost 4: the operational graveyard of refresh jobs

A year into production, the team has 47 materialised views, 6 of which were built for dashboards that were deprecated 8 months ago. Every one of them is still refreshing. The refresh jobs are invisible until you look — they don't show up in the BI tool, they don't have an owner column on the cluster's job page, and the cost-attribution layer (Build 16) was not configured to label them with a project. Cleaning up unused MVs is a quarterly task that nobody enjoys; missing the cleanup means the refresh budget gets eaten by dead jobs and the live dashboards start missing their staleness SLAs.

The Cred team's eventual fix is a metadata.materialised_views table that records, for each MV, who created it, which dashboard URL queries it, and the last time the rewrite layer matched a query against it. MVs unmatched for 30 days are flagged for deletion. This solves the operational graveyard but does not solve Costs 1–3.
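The staleness check behind that metadata table can be sketched in a few lines; the names here are illustrative, not a real schema:

```python
# Flag MVs whose rewrite layer hasn't matched a query within the grace
# window -- the "unmatched for 30 days" rule from the text.
import datetime

def flag_dead_mvs(mvs, today, grace_days=30):
    """Return MV names with no rewrite match in the last grace_days."""
    cutoff = today - datetime.timedelta(days=grace_days)
    return [name for name, last_matched in mvs.items() if last_matched < cutoff]

mvs = {
    "mv_card_daily":   datetime.date(2024, 6, 28),   # live dashboard
    "mv_old_campaign": datetime.date(2024, 1, 3),    # deprecated tile
}
print(flag_dead_mvs(mvs, datetime.date(2024, 7, 1)))   # -> ['mv_old_campaign']
```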

[Figure: Refresh interval vs staleness vs CPU. X axis: refresh interval on a log scale, 30 s to 1 hr; Y axis: dashboard staleness. A shaded "cluster overhead" region shows refresh CPU climbing steeply (60-80%) as the interval shortens. Marked operating points: ClickHouse synchronous projection (zero staleness, high write CPU), StarRocks 2-minute async MV (moderate staleness and refresh CPU), and an overnight batch rollup (high staleness, low CPU). A dotted line marks the typical 5-minute SaaS dashboard staleness SLA.]
Dashboard staleness drops as refresh interval shortens, but cluster overhead climbs faster than linearly. Most teams settle near the 2-minute knee where staleness is acceptable and refresh CPU is bounded.

When pre-aggregation is the wrong answer

Three workload shapes where adding a materialised view makes things worse, not better.

High-cardinality dimensions where the pre-agg is nearly the same size as the fact. If card_events is grouped by member_id (which has 60M distinct values), the pre-aggregated table has 60M rows × 7 days ≈ 420M rows — almost 10% of the fact table. The compression ratio is now 10:1, not 3,000,000:1, and the read-side savings are marginal while the write-side cost is full.

Ad-hoc exploration. If the analyst pool runs unpredictable queries — different filters, different group-bys, different time windows — no fixed set of MVs covers the workload. Every new query is a coin flip on whether the rewrite hits. The right architecture is a fast scan engine (ClickHouse, StarRocks) without MVs, sized for the worst-case full scan.

Fact tables that change retroactively. If a row in card_events from 30 days ago gets updated (a chargeback, a refund), every MV that aggregates over that day must be rebuilt for that day. ClickHouse's projection model does not handle this gracefully (the projection is computed on insert, not on update); StarRocks' MV must do a partition-level refresh of the affected partition. For OLAP workloads where the past is mutable (financial reconciliation, GST filings), the refresh cost can dominate.
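The first of those shapes is easy to quantify. Grouping by tier versus grouping by member_id, using the chapter's numbers (the text rounds the second ratio to 10:1):

```python
# Compression ratio collapses when the group-by key is high-cardinality.
fact_rows = 4.8e9

rows_by_tier = 7 * 5             # 7 days x 5 tiers = 35 rows read
rows_by_member = 60e6 * 7        # 7 days x 60M distinct members

print(f"by tier:   {fact_rows / rows_by_tier:.1e}:1")
print(f"by member: {fact_rows / rows_by_member:.0f}:1")
```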

Going deeper

Incremental view maintenance and the math of "additive"

The theoretical foundation for materialised views is incremental view maintenance (IVM): given a base table T and a view V = f(T), when T changes by ΔT, can V be updated to f(T ∪ ΔT) without recomputing f from scratch? The answer depends on the structure of f. SUM, COUNT, and MIN/MAX over insert-only streams are straightforwardly incremental: V_new = V_old ⊕ f(ΔT), where ⊕ is addition for SUM and COUNT and min/max for MIN and MAX. MEDIAN, PERCENTILE, and COUNT DISTINCT are not — they require a sketch or full state.
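The distinction shows up immediately in code: SUM folds a delta into the old value, while MEDIAN cannot be updated from the old median alone:

```python
# Additive vs non-additive under inserts: SUM is incremental,
# MEDIAN needs the full multiset (or a quantile sketch).
import statistics

base = [10, 20, 30]
delta = [40]

# SUM: incremental update equals full recompute.
assert sum(base) + sum(delta) == sum(base + delta)

# MEDIAN: the old median alone does not determine the new one.
assert statistics.median(base) == 20
assert statistics.median(base + delta) == 25   # needed all four values
```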

The Materialize project (a streaming database company) productised this distinction by computing a "differential dataflow" of every change to T and propagating it through the view definition. Their engine maintains the view exactly under inserts, updates, and deletes, with worst-case latency proportional to the size of the change rather than the size of T. The trade-off is a higher per-row maintenance cost than ClickHouse's batch projections, balanced by the guarantee of consistency. For Indian fintechs that need exact reconciliation views (Razorpay's settlement dashboard, Zerodha's end-of-day P&L), differential dataflow is increasingly the right answer.

The Star-tree paper and the n-ary cube

Pinot's star-tree is a generalisation of the data cube concept from OLAP literature (Gray et al., "Data Cube: A Relational Aggregation Operator", 1996). The classical cube pre-computes every cell of the n-dimensional aggregation hypercube; for d dimensions each with cardinality c, the cube has c^d cells, which is intractable. The star-tree's compromise is to pre-compute only along a chosen split order: dimension 1 first, then dimension 2 inside each dim-1 bucket, and so on. The "star" suffix comes from a wildcard node at every split that aggregates across all values of that dimension — letting queries skip a dimension at low cost. The result is a tree whose size is bounded by the split order and the maxLeafRecords threshold, far below the c^d of the full cube.
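The intractability of the full cube is plain arithmetic (cardinalities here are illustrative):

```python
# Cell count of the full data cube grows as c^d.
c = 1000          # cardinality per dimension
for d in (2, 3, 5):
    print(f"d={d}: {c**d:.1e} cells")
```

At five dimensions of cardinality 1,000, the full cube already has 10^15 cells, which is why every practical system pre-computes only a chosen slice of it.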

The split-order choice is a design decision. Put the most-filtered dimension first so its filter is tight high in the tree; put the least-cardinality dimension last so the leaves don't explode. Pinot's ingestion pipeline lets the user override the auto-chosen order via the table config; for Karan's growth dashboard, the auto-order picked (event_date, tier) correctly because date filters are tight and tier has low cardinality.

Concurrent writes and refresh isolation

When ingestion is high (Flipkart Big Billion Days writing 2M events/sec into the fact table) and the MV refresh runs every 2 minutes, the refresh job sees a snapshot of the fact table that is already 2 minutes behind by the time it finishes. Worse, the snapshot itself can be inconsistent if the engine doesn't support snapshot isolation on the source. StarRocks and Iceberg-backed StarRocks both use Iceberg snapshots as the consistency point — the refresh reads at snapshot N, ingestion writes snapshot N+1, the next refresh reads N+1. ClickHouse's SummingMergeTree does not have explicit snapshots and relies on append-only insert semantics; if the inserts are not append-only (UPDATE or REPLACE), the projection state can diverge from the row state until a background merge reconciles them.

The Indian fintech operational pattern for this is: lock the schema, accept "append-only with corrections via tombstone" (insert a negative-amount row to cancel a positive one), and refresh MVs with explicit REFRESH MATERIALIZED VIEW ... ALL PARTITIONS once a day overnight to catch any drift. The 2-minute MV is for live dashboards; the overnight refresh is the source of truth for the morning leadership review.
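The tombstone pattern in miniature: a negative-amount row cancels the original, so SUM stays additive and every downstream MV absorbs the correction without a rebuild:

```python
# Append-only corrections via tombstone: no UPDATE, just a cancelling row.
events = [
    ("txn_1",  50_000),    # original payment, in paise
    ("txn_2",  30_000),
    ("txn_1", -50_000),    # chargeback: tombstone cancels txn_1
]

gmv_paise = sum(amount for _, amount in events)
print(gmv_paise)   # -> 30000
```

Distinct-count metrics do not cancel this way, which is one reason the overnight full refresh remains the source of truth.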

The cost of forgetting an MV

Riya's team eventually built a query-attribution layer that tagged every refresh job with the dashboard URL it served and the last time that dashboard was loaded. They found 6 of 47 MVs were serving zero queries. The cumulative refresh cost across those 6 MVs was ₹4.2 lakh/year. Killing them paid for two SDE-2 hires.

The pattern generalises: in any team that has been on an OLAP engine for more than 18 months, 10–20% of the materialised views are dead. Detecting them requires query-attribution (linking the query to the rewrite hit), which most teams do not set up until they have already paid the bill. Build 16 (multi-tenancy and cost) covers the attribution mechanics; this chapter is the workload-side reason to care about it.

Where this leads next

The pre-aggregation pattern is older than any of the engines that implement it — Codd's OLAP white paper is from 1993, OLAP cubes shipped in Hyperion and Microsoft Analysis Services in the late 1990s, every BI tool has had some form of "drill-down with stored aggregates" for thirty years. What has changed is that the storage is cheap enough now that you can afford 10–20 MVs without thinking about it, and the query engines are smart enough to rewrite for you. What has not changed is the underlying physics: every MV is a contract with the future that says "this query shape will keep being asked in this form." When the dashboard changes, the contract breaks, and the cost lands on the data engineer's desk at 2 a.m.

References