StarRocks, Doris, and the next wave

Aditi runs the data platform at a Mumbai-based logistics company. Her team's BI workload had grown into something the previous-generation OLAP stores could not hold: 80 dashboards, 400 daily-active analysts, a star schema with one fact table at 12 billion rows and twenty dimension tables, and queries that joined fact-to-dim five or six ways with predicates pushed through three of the dims. ClickHouse choked on the joins — its distributed JOIN spilled to disk and the dashboard p99 sat at 14 seconds. Pinot rejected the schema outright because star-tree pre-aggregation cannot encode arbitrary dim-on-dim filters. Druid's lookups topped out at one million rows; her largest dim was 38 million. She moved the warehouse to StarRocks, kept Iceberg as the storage format, and the same dashboards now answer in 800 ms. This chapter is about the engineering choices that made that swap possible.

StarRocks (open-sourced in 2021, forked from Apache Doris, whose lineage dates to 2017) is a vectorised MPP analytical engine built around three bets: a cost-based optimiser that re-plans joins instead of pinning a star-tree, a SIMD-vectorised execution engine that processes column batches instead of one row at a time, and a separated compute-storage architecture that queries Parquet on S3 at speeds comparable to native columnar files. Apache Doris, the upstream lineage, makes the same bets with a different commercial sponsor (SelectDB) and a stronger China-market footprint. Together they define the post-ClickHouse generation of OLAP engines: lakehouse-native, join-friendly, cost-based.

What the previous generation got wrong

ClickHouse, Pinot, and Druid each won their corner of the OLAP triangle: ClickHouse for single-table scans, Pinot for pre-aggregated fixed-shape dashboards, Druid for unbounded ad-hoc dimensions. But the production workload that broke all three was the multi-table star-schema query with dim filters: a fact table joined to several dimension tables, with WHERE predicates against columns on the dim side, and the join key cardinality high enough that no pre-aggregation can reasonably enumerate the dimension cross-product.

Concretely: SELECT city.name, brand.category, SUM(impressions) FROM fact JOIN city ON fact.city_id = city.id JOIN brand ON fact.brand_id = brand.id WHERE city.tier = 'metro' AND brand.gst_state = 'MH' GROUP BY 1, 2. Both filters live on dimension columns; the fact table has no tier or gst_state column. The pre-aggregation engines (Pinot, Druid) cannot evaluate this query against their pre-aggregated structures because the filters are not on dimensions they pre-aggregated. ClickHouse can, but its distributed join model (broadcast the smaller side, scan the larger) falls over when the smaller side is 38 million rows and the cluster has 6 shards: broadcast cost = 38M × 6 = 228M rows shipped, plus the hash-table build cost on each shard.
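The broadcast arithmetic can be checked in a few lines. The inputs (38M dim rows, 6 shards) come from the scenario above; treating one hash-table build per shard as the only extra cost is a deliberate simplification.

```python
# Back-of-envelope cost of a broadcast join: every shard receives a full
# copy of the build side, then builds its own hash table over it.
dim_rows = 38_000_000   # smaller join side (the brand dim)
shards = 6              # ClickHouse cluster shards in the scenario

rows_shipped = dim_rows * shards   # network cost of the broadcast
hash_builds = shards               # one 38M-entry hash table per shard

print(f"rows shipped over the network: {rows_shipped:,}")
print(f"hash tables built: {hash_builds}")
```

The 228M shipped rows are per query, not per day, which is why the dashboard p99 degrades under concurrency rather than only on the first run.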

The fix the next-generation engines made was structural, not incremental. Three things changed at once.

[Figure: Three structural bets, what changed between waves. Previous generation: rule-based or heuristic plans (ClickHouse left-deep with no reorder, Pinot's star-tree declared up front, Druid bitmap-only with no joins); row or small-block execution (ClickHouse 65k-row vectorised blocks, Pinot/Druid per-row predicate evaluation); compute coupled to storage (local NVMe holds the primary copy, scaling means re-shard and re-balance, S3 is cold/archive only). Next wave (StarRocks, Doris): a cost-based optimiser driven by NDV, histogram, and correlation statistics, with bushy joins, runtime filters, and colocation, re-planning per query rather than per schema; SIMD-vectorised execution in 4096-row batches with AVX2/AVX-512 column ops and code-generated expression kernels; compute-storage separation with S3/OSS holding Parquet (Iceberg/Hudi), compute scaling independently, and local NVMe serving as a block cache only. Arrows show the lineage from each previous-generation engine to the new design choice.]
Three changes — optimiser, execution model, storage layout — together let the next-wave engines handle multi-table joins at sub-second latency on the lakehouse, which the previous generation could not.

The arrow on each row hides the actual engineering. The cost-based optimiser is the hardest of the three and the one that took StarRocks the longest to ship (the 2.0 release in 2021 was the first version usable on multi-table queries). The vectorised engine is the most observable in benchmarks (TPC-H ran 3–5× faster on the same hardware than ClickHouse's older block engine, until ClickHouse caught up in v23). The compute-storage separation is the most strategic: it is what lets a startup's StarRocks cluster scale from ₹100/day on AWS to ₹50,000/day on a Diwali sale day without re-shuffling 12 billion rows.

Vectorised execution — what 4096 rows at a time actually buys

A vectorised engine is one where every operator works on a column batch (typically 1024 or 4096 values) instead of one row. The operator's inner loop is a tight for (int i = 0; i < 4096; i++) result[i] = a[i] + b[i], which the compiler turns into AVX2 or AVX-512 instructions: 8 doubles or 16 floats per cycle. The CPU's branch predictor learns the loop in two iterations; the L1 cache holds the column slice; SIMD throughput approaches the theoretical 32–64 GB/sec per core.
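The difference between the two loop shapes can be felt even from Python, using NumPy's vectorised ops as a stand-in for the engine's SIMD kernels. This is a sketch, not the engine's code; the 4096-row batch size mirrors the text, and the measured ratio will vary by machine.

```python
import time

import numpy as np

BATCH = 4096
a = np.random.rand(BATCH)
b = np.random.rand(BATCH)

# Row-at-a-time: one interpreted dispatch per element.
t0 = time.perf_counter()
row_result = [a[i] + b[i] for i in range(BATCH)]
row_time = time.perf_counter() - t0

# Batch-at-a-time: one call, compiled down to a SIMD loop over the column slice.
t0 = time.perf_counter()
batch_result = a + b
batch_time = time.perf_counter() - t0

print(f"row loop:  {row_time * 1e6:8.1f} us")
print(f"batch op:  {batch_time * 1e6:8.1f} us")
```

The two results are identical; only the number of dispatches changes, which is the whole point of batching.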

The contrast with the older engines is in predicate evaluation. ClickHouse pre-v22 used 65k-row blocks, but some operators still processed one column-row pair at a time inside the block, a hidden interpretation overhead. Pinot's filter loop walked rows and dispatched on the predicate type per row. StarRocks emits machine-code-equivalent column kernels and evaluates WHERE city = 'Mumbai' AND amount > 1000 as two SIMD masks ANDed and applied to the surviving rows in one pass.
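The two-mask evaluation can be sketched with NumPy boolean arrays. The toy columns below are illustrative (a real engine works on dictionary-encoded columns, not Python strings), but the shape of the computation is the same: two masks, one AND, one selection.

```python
import numpy as np

# A 4096-row column batch: city names and amounts (toy data).
rng = np.random.default_rng(7)
city = rng.choice(np.array(["Mumbai", "Pune", "Delhi"]), size=4096)
amount = rng.integers(0, 5000, size=4096)

# WHERE city = 'Mumbai' AND amount > 1000, as two SIMD-style masks
# ANDed in one pass: no per-row branch, no per-row type dispatch.
mask = (city == "Mumbai") & (amount > 1000)
surviving = np.flatnonzero(mask)   # selection vector of qualifying row indices

print(f"{len(surviving)} of 4096 rows survive the fused predicate")
```

Downstream operators then receive the selection vector rather than a copy of the rows, so the filter's cost stays proportional to the batch, not to the table.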

The other half of vectorised execution is runtime filters. When StarRocks plans a hash join, it ships a Bloom filter from the build side (the smaller dim table) to the probe side (the fact table) at execution time. The fact-table scan applies the Bloom filter as a SIMD-accelerated check before it reads the rest of the fact row's columns. For Aditi's 12-billion-row fact joined to a 38-million-row dim filtered by gst_state='MH' (which leaves about 4 million dim rows), the runtime filter on brand_id cuts the fact scan to roughly 12B × (4M/38M) = 1.26 billion rows before any join probe runs. Why the runtime filter beats predicate pushdown alone: predicate pushdown only pushes WHERE clauses on the fact table itself. The dim filter brand.gst_state='MH' cannot push to the fact table because the fact table has no gst_state column. The runtime filter is the bridge — it computes "which brand_id values survive the dim filter" at execution time and ships that set as a Bloom filter to the fact scan, effectively converting a join-side filter into a fact-side filter.
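The runtime-filter mechanics can be sketched with a toy Bloom filter. The sizes and hash scheme here are deliberately simple and are not StarRocks' implementation; the property that matters is the one the comment states: false positives are possible, false negatives are not, so the filter is always safe to apply before the join.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash probes into a bit array. False positives
    are possible, false negatives are not, so it is safe as a pre-join filter."""
    def __init__(self, nbits=1 << 16, k=3):
        self.nbits, self.k = nbits, k
        self.bits = bytearray(nbits // 8)

    def _probes(self, key):
        h = hashlib.sha256(str(key).encode()).digest()
        for i in range(self.k):
            yield int.from_bytes(h[4 * i:4 * i + 4], "big") % self.nbits

    def add(self, key):
        for p in self._probes(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._probes(key))

# Build side: the brand ids that survive gst_state = 'MH' (toy ids).
surviving_brand_ids = {10, 42, 77, 300}
bf = BloomFilter()
for bid in surviving_brand_ids:
    bf.add(bid)

# Probe side: the fact scan checks the filter BEFORE reading other columns.
fact_brand_ids = [10, 11, 42, 99, 300, 301]
passed = [b for b in fact_brand_ids if bf.might_contain(b)]
print(passed)
```

Every true match survives the filter; the occasional false positive is caught later by the exact join probe, which is why the Bloom filter can be small and fast.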

import requests, json, time

# StarRocks accepts MySQL-protocol queries; this uses the HTTP query endpoint
# for clarity. In production you'd use mysql-connector-python or pymysql.

FE = "http://starrocks-fe:8030"  # frontend coordinator
USER, PWD = "analyst", "secret"

# 1. The query — fact table joined to two dims with dim-side filters.
sql = """
SELECT city.tier              AS tier,
       brand.category         AS category,
       SUM(fact.gmv_paise)/100 AS gmv_rupees,
       COUNT(DISTINCT fact.shipment_id) AS shipments
FROM   shipments       AS fact
JOIN   cities          AS city  ON fact.city_id  = city.id
JOIN   brands          AS brand ON fact.brand_id = brand.id
WHERE  fact.event_date BETWEEN '2026-04-01' AND '2026-04-24'
  AND  city.tier   = 'metro'
  AND  brand.gst_state = 'MH'
GROUP BY tier, category
ORDER BY gmv_rupees DESC
LIMIT 20;
"""

t0 = time.time()
r = requests.post(f"{FE}/api/query/v1",
                  auth=(USER, PWD),
                  json={"stmt": sql, "format": "json"})
data = r.json()
elapsed = (time.time() - t0) * 1000

# 2. Profile the same query to see runtime filter activity.
r2 = requests.post(f"{FE}/api/query/profile",
                   auth=(USER, PWD),
                   json={"query_id": data["query_id"]})
profile = r2.json()

print(f"rows returned: {len(data['rows'])}")
print(f"latency: {elapsed:.0f} ms")
print("---")
for row in data["rows"][:5]:
    print(row)
print("---")
print(f"fact-table rows scanned:    {profile['fact_scan_rows']:>12,}")
print(f"runtime-filter rejected:    {profile['rt_filter_rejected']:>12,}")
print(f"join probe input rows:      {profile['join_probe_rows']:>12,}")
print(f"local cache hit %:          {profile['local_cache_pct']:>11.1f}%")
rows returned: 14
latency: 812 ms
---
['metro', 'electronics', 4823914.21, 1284910]
['metro', 'apparel',     2914382.40, 2491032]
['metro', 'grocery',     1842910.00, 4910482]
['metro', 'pharma',       912481.10,  481294]
['metro', 'home-goods',   814910.20,  812441]
---
fact-table rows scanned:    12,418,392,041
runtime-filter rejected:    11,158,294,210
join probe input rows:       1,260,097,831
local cache hit %:                  78.4%

Walk what the engine did:

  1. The fact scan read 12.4 billion rows: the full 24-day date range, which is the only predicate that lives on the fact table itself.
  2. The runtime filters built from the filtered city and brand dims rejected 11.16 billion of those rows before any join probe ran.
  3. Only 1.26 billion rows reached the hash-join probe, matching the 4M/38M selectivity estimate for the gst_state filter.
  4. 78.4% of the scanned data came from the local NVMe cache rather than S3, which is most of why the query finished in roughly 800 ms.

The cost-based optimiser — re-planning per query

The CBO is what actually changes when a workload changes shape. StarRocks' optimiser is Cascades-style — a top-down search that explores logically-equivalent plans, scoring each by an estimated cost computed from column statistics. The statistics are collected by ANALYZE TABLE, stored in the FE metadata DB, and refreshed automatically when ingestion volume crosses a threshold or daily on a schedule.

The five core stats per column are:

  1. row_count of the table or partition.
  2. ndv (number of distinct values), computed exactly for small columns and via HyperLogLog for high-cardinality ones.
  3. histogram — typically equi-height with 256 buckets — capturing the value distribution so range predicates can be estimated.
  4. null_count, used for IS NULL predicate selectivity.
  5. avg_size, used for memory and shuffle cost estimation.

Cost is a weighted combination of CPU rows (input rows × per-row cost), network rows (shuffle volume × per-byte cost), and memory peak (build-side hash table size). The cost model is calibrated against benchmarks and tuned per cluster; default values come from Apache Doris' calibration on AWS c5 instances and most Indian deployments do not need to retune.
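Those statistics are enough to score a join order under the textbook independence assumptions. The sketch below uses the standard equi-join estimate, |R ⋈ S| ≈ |R| · |S| / max(ndv_R, ndv_S), which is the usual starting point for a CBO cost model, not StarRocks' exact formula; the row counts come from Aditi's workload.

```python
def join_cardinality(rows_l, rows_r, ndv_key_l, ndv_key_r):
    # Classic equi-join estimate: matches spread over the larger key domain.
    return rows_l * rows_r / max(ndv_key_l, ndv_key_r)

def filter_rows(rows, selectivity):
    return rows * selectivity

FACT = 12_000_000_000      # shipments fact table
BRAND = 38_000_000         # brand dim, join key ndv = row count

# brand.gst_state = 'MH' keeps ~4M of 38M brands (selectivity ~0.105).
brands_mh = filter_rows(BRAND, 4 / 38)

# Probe the fact with the filtered brand side: ~1.26B fact rows survive,
# which is exactly what the runtime-filter profile reported earlier.
after_brand = join_cardinality(FACT, brands_mh, BRAND, brands_mh)
print(f"estimated fact rows surviving brand join: {after_brand:,.0f}")
```

The optimiser runs this arithmetic for every candidate join order and keeps the cheapest, which is why stale statistics translate directly into bad plans.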

The CBO's biggest practical win — and the one most often missed in evaluations — is colocation. If two tables are bucketed on the same key (say, shipments and shipment_events both bucketed by shipment_id), a join on shipment_id does not need to shuffle either side. Each BE holds matching buckets of both tables locally; the join is a local hash join. The CBO recognises this from the table DDL and the colocate_with clause and avoids the shuffle stage entirely. For Aditi's logistics workload this turned a 4-second event-stitching query into 200 ms.
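A sketch of what the colocation DDL looks like, assuming StarRocks' DISTRIBUTED BY HASH clause and colocate_with table property. The schemas are the chapter's; the bucket count is illustrative, and you should check the current DDL reference before copying.

```python
# Both tables hash-bucketed on shipment_id into the same colocation group,
# so a join on shipment_id needs no shuffle stage: matching buckets of
# both tables live on the same BE and the join runs as a local hash join.
ddl_shipments = """
CREATE TABLE shipments (
    shipment_id BIGINT,
    city_id     INT,
    gmv_paise   BIGINT
)
DISTRIBUTED BY HASH(shipment_id) BUCKETS 48
PROPERTIES ("colocate_with" = "shipment_group");
"""

ddl_events = """
CREATE TABLE shipment_events (
    shipment_id BIGINT,
    event_type  VARCHAR(32),
    event_ts    DATETIME
)
DISTRIBUTED BY HASH(shipment_id) BUCKETS 48
PROPERTIES ("colocate_with" = "shipment_group");
"""
# Same distribution key, same bucket count, same group name: the three
# conditions the CBO checks before planning the shuffle-free local join.
print("both tables in group:", "shipment_group" in ddl_shipments and "shipment_group" in ddl_events)
```

If any of the three conditions is violated (say, a different bucket count after a re-shard), the CBO silently falls back to a shuffle join, so the colocation group is worth monitoring.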

Compute-storage separation — the lakehouse-native bet

The older OLAP engines store data in their own columnar format on local NVMe; the engine and the data are physically coupled. Adding a node means re-shuffling shards. Querying yesterday's data after a week means promoting it from S3 archive to local NVMe before queries are fast. Sharing the same data across two query engines (say, Trino for ad-hoc + ClickHouse for dashboards) means double-storing or building a translation layer.

StarRocks 3.0 (2023) shipped a separated architecture where the BE's local disk is just a cache. The primary copy of the data lives in object storage as Parquet (or, increasingly, native StarRocks segment files in S3). The cache fills opportunistically; cache misses go to S3 in parallel; cache evictions are LRU. From the operator's point of view, "scaling compute" means adding BE pods that come up empty and warm in 5–10 minutes; "scaling storage" means doing nothing because S3 scales by itself. From the cost point of view, the BE fleet can be scaled to zero overnight and brought up at 9 a.m. — for an Indian SaaS company whose dashboard load is 9-to-9, this halves the compute bill.
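The cache behaviour described above is an ordinary LRU over storage blocks. A minimal in-memory sketch follows; a real block cache lives on NVMe, tracks bytes rather than block counts, and fetches from S3 in parallel, none of which is modelled here.

```python
from collections import OrderedDict

class BlockCache:
    """LRU cache over object-storage blocks. A miss 'fetches from S3'
    (simulated) and may evict the least-recently-used block."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()
        self.hits = self.misses = 0

    def read(self, block_id):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)   # refresh recency on a hit
            self.hits += 1
            return self.blocks[block_id]
        self.misses += 1
        data = f"s3-fetch:{block_id}"           # stand-in for the S3 GET
        self.blocks[block_id] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)     # evict the LRU block
        return data

cache = BlockCache(capacity=3)
for bid in [1, 2, 3, 1, 2, 4, 1]:   # block 3 is evicted when 4 arrives
    cache.read(bid)
print(f"hit rate: {cache.hits / (cache.hits + cache.misses):.0%}")
```

A freshly added BE starts with an empty cache and a 0% hit rate, which is exactly the 5–10 minute warm-up the paragraph describes.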

The deeper integration is with Iceberg, Hudi, and Delta Lake as external tables. StarRocks reads Iceberg manifests directly, applies its own optimiser and runtime filters to the Parquet scan, and treats the Iceberg snapshot as the consistency point. The same lakehouse table can be queried by Spark, Trino, and StarRocks — three engines on one source of truth. Why this matters operationally: the previous generation's "load into the OLAP cluster" step was where 70% of pipeline failures and 50% of cost lived. Iceberg-native query removes that step. The data lands in Iceberg from your ingestion pipeline; StarRocks queries it directly. The single biggest reason teams are migrating from ClickHouse + dbt to StarRocks + Iceberg in 2025 is not query speed — it's eliminating the ETL hop into ClickHouse.

[Figure: StarRocks 3.x, compute and storage separated. At the top, a three-replica Frontend (FE) cluster with Raft consensus holds the parser, CBO, metadata, statistics, and query coordination. Below it, a stateless Backend (BE) compute fleet runs the vectorised execution engine, each node with a local NVMe cache (hit rates of 76–81% in the illustration; a freshly spun-up BE starts cold). The bottom layer is object storage (S3 or OSS) holding two kinds of data: native StarRocks segments (the primary copy, in the proprietary columnar format, where cache evictions land) and Iceberg/Hudi/Delta external tables as open Parquet, written by Spark/Flink and queried in place with the same CBO and runtime filters. Queries flow FE to BE to object storage, with the local cache intercepting most reads.]
Backends are stateless and scale independently of storage. The local NVMe is a cache, not the source of truth. Iceberg tables in object storage are first-class citizens, queryable with the same optimiser and execution engine as native segments.

Doris vs StarRocks — same root, different sponsors

Apache Doris started at Baidu in 2017 (originally as "Palo") and was donated to Apache in 2018. StarRocks forked from Doris around 2020 when a group of Doris committers — led by Stan Zheng — left to form a startup that became the StarRocks commercial entity. The two codebases have diverged since: StarRocks has been more aggressive on the CBO and on compute-storage separation; Doris has stayed closer to the on-prem MPP shape and has stronger China-domestic vendor support via SelectDB. Both run vectorised execution; both speak MySQL protocol; both can read Parquet on S3.

For an Indian team picking between them in 2026: StarRocks is the safer bet for greenfield lakehouse deployments because the separated-storage mode is more mature, and the open-source CBO has had two more years of large-cluster shake-out. Doris is the better choice if your org prefers Apache Foundation governance end-to-end and wants the SelectDB managed offering, or if you need certain China-region capabilities (OceanBase integration, first-class Aliyun OSS support) that StarRocks treats as second priority. The query semantics are nearly identical; migrations between them are mostly a DDL rewrite, not a query rewrite.

Common confusions

Going deeper

The Cascades framework and why bushy joins matter

StarRocks' optimiser is built on the Cascades framework — a top-down memo-based search designed by Goetz Graefe as the successor to Volcano and first productised in Tandem's NonStop SQL and Microsoft SQL Server. The core idea: for each logical sub-tree of the query, generate all equivalent physical plans, score them with the cost model, and propagate the best plan upward. The "memo" is a deduplicated graph of explored sub-trees that prevents re-evaluating shared subqueries. The pruning rule is: any plan whose cost lower bound exceeds the current best is pruned without exploration. For a 6-table join there are already 42 bushy tree shapes (Catalan number C(5)) and over 30,000 ordered bushy plans before physical alternatives multiply the space; pruning keeps the search under 100 ms on real schemas.
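The search-space arithmetic can be made concrete. The counts below are the standard combinatorics (Catalan tree shapes times leaf orderings) and deliberately ignore the physical alternatives, join algorithm, distribution, and access path, each of which multiplies the space further.

```python
from math import comb, factorial

def catalan(n):
    # Number of binary tree shapes with n internal nodes (n+1 leaves).
    return comb(2 * n, n) // (n + 1)

def bushy_plans(tables):
    # Tree shapes over `tables` leaves, times all orderings of the leaves.
    return catalan(tables - 1) * factorial(tables)

for n in (4, 6, 8):
    print(f"{n} tables: {catalan(n - 1):>6,} shapes, "
          f"{bushy_plans(n):>15,} ordered bushy plans")
```

Even at 8 tables the ordered-plan count is in the tens of millions, which is why the memo's deduplication and cost-bound pruning are load-bearing rather than optimisations.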

The bushy-vs-left-deep distinction matters most when two dim-on-fact joins have very different selectivity. Aditi's query has city.tier='metro' (2% selective) and brand.gst_state='MH' (10% selective). A left-deep plan picks one and applies the other on top: ((fact JOIN city) JOIN brand) or ((fact JOIN brand) JOIN city). A bushy plan does both filters in parallel: build cities_filtered and brands_filtered, then probe the fact in a single pass with both runtime filters active. The bushy plan halves the join probe count when both filters reject independent rows; the left-deep plan only catches the first filter's rejections and then the second filter applies after the first join's output expansion.
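The probe-count arithmetic behind that claim, using the chapter's selectivities (2% of cities, roughly 10.5% of brands) and assuming the two filters reject independent rows:

```python
FACT = 12_000_000_000
city_sel = 0.02       # city.tier = 'metro'
brand_sel = 4 / 38    # brand.gst_state = 'MH'

# Bushy plan: both runtime filters are active on the single fact pass,
# so only rows passing BOTH filters reach the join probes.
bushy_probe = FACT * city_sel * brand_sel

# Left-deep plan (brand first): the city filter only applies after the
# first join's output has already been materialised.
leftdeep_join1_input = FACT * brand_sel
leftdeep_join2_input = leftdeep_join1_input * city_sel

print(f"bushy, single probe pass:  {bushy_probe:,.0f} rows")
print(f"left-deep, join #1 input:  {leftdeep_join1_input:,.0f} rows")
print(f"left-deep, join #2 input:  {leftdeep_join2_input:,.0f} rows")
```

The bushy plan probes roughly 25 million rows; the left-deep plan drags about 1.26 billion rows through its first join before the second filter gets a chance to act.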

Aliyun, ByteDance, and the China-scale benchmarks

StarRocks' largest published deployment is at Tencent's advertising arm: 4 PB of data, 2,000 BEs, 30,000 daily-active analysts. Doris' largest published deployment is at JD.com (Beijing logistics): 8 PB, 3,500 BEs, used as the primary OLAP store across 200 product teams. Both engines have been hardened by China-scale workloads in ways the Indian or US market has not yet reached — by an order of magnitude or two. The implication for an Indian team is that scaling to a few hundred TB (Razorpay's typical analytics footprint, or Flipkart's BBD-day analytics) is far below where these engines have been tested; you should not hit any architectural wall, and you'll only meet the operational rough edges that Tencent already filed bug reports for two years ago.

Materialised views and async refresh

Both StarRocks and Doris support materialised views with automatic query rewriting. The MV is defined as a SELECT (typically a pre-aggregation or pre-join) and refreshed either on every base-table mutation (sync, slow) or on a schedule (async, the production default). When a query arrives, the optimiser checks whether any defined MV could satisfy it — exactly or with a residual filter — and rewrites the query to scan the MV instead. For Karan's growth dashboard from the Druid chapter, this is the equivalent of Druid's rollup: declared once, applied transparently. The difference is that the MV definition is SQL (no proprietary spec) and the base table is still queryable; you can ask drill-down questions that fall through the MV and hit the base.
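A sketch of what an async MV looks like for the dashboard workload, assuming StarRocks' REFRESH ASYNC syntax; the schedule and schema here are illustrative, and the exact clause forms should be checked against the current DDL reference.

```python
# Pre-join + pre-aggregate MV, refreshed on a schedule rather than on
# every base-table mutation. Queries that match it (exactly or with a
# residual filter) are rewritten by the optimiser to scan the MV;
# everything else falls through to the still-queryable base tables.
mv_ddl = """
CREATE MATERIALIZED VIEW daily_gmv_by_tier_category
REFRESH ASYNC EVERY (INTERVAL 1 HOUR)
AS
SELECT fact.event_date,
       city.tier,
       brand.category,
       SUM(fact.gmv_paise) AS gmv_paise
FROM   shipments AS fact
JOIN   cities    AS city  ON fact.city_id  = city.id
JOIN   brands    AS brand ON fact.brand_id = brand.id
GROUP BY fact.event_date, city.tier, brand.category;
"""
print("async MV defined:", "REFRESH ASYNC" in mv_ddl)
```

Because the MV is plain SQL over still-queryable base tables, a drill-down that the MV cannot answer simply misses the rewrite and runs against the base, with no separate "rollup spec" to maintain.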

When StarRocks is the wrong answer

Three workload shapes where the previous generation still wins. First, single-table point lookups by primary key: SELECT * FROM users WHERE id = 12345. StarRocks is OLAP; this query against a row store like Postgres or a key-value store like Redis is faster by 100×. Second, highly skewed dim cardinality where one dim value covers 50% of fact rows: Pinot's star-tree handles this gracefully through pre-aggregation by that dim, while StarRocks' runtime filter is irrelevant because the filter rejects almost nothing. Third, streaming with sub-100 ms ingest-to-query latency: StarRocks' Stream Load takes 1–10 seconds to commit; Druid's real-time ingestion is sub-second.

Where this leads next

The thread from ClickHouse → Pinot → Druid → StarRocks/Doris is not a story of incremental speedups. Each generation reframed the OLAP problem differently. ClickHouse said "scan very fast and joins are someone else's problem." Pinot and Druid said "pre-compute the answers users will ask for." StarRocks and Doris said "answer arbitrary multi-table queries fast, and let object storage hold the data." Each is right for the workload that motivated it. The next-wave bet is that more and more analytical workloads look like the third shape — dashboards on a star schema, served from a lakehouse — and the engine should be designed for that.
