Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

StarRocks, Doris, and the next wave

Aditi runs the data platform at a Mumbai-based logistics company. Her team's BI workload had grown into something the previous-generation OLAP stores could not hold: 80 dashboards, 400 daily-active analysts, a star schema with one fact table at 12 billion rows and twenty dimension tables, and queries that joined fact-to-dim five or six ways with predicates pushed through three of the dims. ClickHouse choked on the joins — its distributed JOIN spilled to disk and the dashboard p99 sat at 14 seconds. Pinot rejected the schema outright because star-tree pre-aggregation cannot encode arbitrary dim-on-dim filters. Druid's lookups topped out at one million rows; her largest dim was 38 million. She moved the warehouse to StarRocks, kept Iceberg as the storage format, and the same dashboards now answer in 800 ms. This chapter is about the engineering choices that made that swap possible.

StarRocks (open-sourced 2021, forked from Apache Doris 2017) is a vectorised MPP analytical engine built around three bets: a cost-based optimiser that re-plans joins instead of pinning a star-tree, a SIMD-vectorised execution engine that processes column batches instead of row-at-a-time, and a separated compute-storage architecture that queries Parquet on S3 with the same speed as native columnar files. Apache Doris, the upstream lineage, makes the same bets with a different commercial sponsor (SelectDB) and a stronger China-market footprint. Together they define the post-ClickHouse generation of OLAP engines: lakehouse-native, join-friendly, cost-based.

What the previous generation got wrong

ClickHouse, Pinot, and Druid each won their corner of the OLAP triangle: ClickHouse for single-table scans, Pinot for pre-aggregated fixed-shape dashboards, Druid for unbounded ad-hoc dimensions. But the production workload that broke all three was the multi-table star-schema query with dim filters: a fact table joined to several dimension tables, with WHERE predicates against columns on the dim side, and the join key cardinality high enough that no pre-aggregation can reasonably enumerate the dimension cross-product.

Concretely: SELECT city.name, brand.category, SUM(impressions) FROM fact JOIN city ON fact.city_id = city.id JOIN brand ON fact.brand_id = brand.id WHERE city.tier = 'metro' AND brand.gst_state = 'MH' GROUP BY 1, 2. Three filters live on dimension columns; the fact table has no tier or gst_state column. The pre-aggregation engines (Pinot, Druid) cannot evaluate this query against their pre-aggregated structures because the filter is not on a dimension they pre-aggregated. ClickHouse can, but its distributed join model — broadcast the smaller side, scan the larger — falls over when the smaller side is 38 million rows and the cluster has 6 shards (broadcast cost = 38M × 6 = 228M rows shipped, plus the build hash table cost on each shard).

The fix the next-generation engines made was structural, not incremental. Three things changed at once.

Three changes — optimiser, execution model, storage layout — together let the next-wave engines handle multi-table joins at sub-second latency on the lakehouse, which the previous generation could not.

The arrow on each row hides the actual engineering. The cost-based optimiser is the hardest of the three and the one that took StarRocks the longest to ship (the 2.0 release in 2021 was its first version usable on multi-table queries). The vectorised engine is the most observable in benchmarks (TPC-H runs 3–5× faster on the same hardware than ClickHouse's older block engine, until ClickHouse caught up in v23). The compute-storage separation is the most strategic — it is what lets a startup's StarRocks cluster scale from 100 ₹/day on AWS to 50,000 ₹/day on a Diwali sale day without re-shuffling 12 billion rows.

Vectorised execution — what 4096 rows at a time actually buys

A vectorised engine is one where every operator works on a column batch (typically 1024 or 4096 values) instead of one row. The operator's inner loop is a tight for (int i = 0; i < 4096; i++) result[i] = a[i] + b[i], which the compiler turns into AVX2 or AVX-512 instructions: 8 doubles or 16 floats per cycle. The CPU's branch predictor learns the loop in two iterations; the L1 cache holds the column slice; SIMD throughput approaches the theoretical 32–64 GB/sec per core.

The contrast with the older engines is in the predicate evaluation. ClickHouse pre-v22 used 65k-row blocks but operators that processed one column-row pair at a time inside the block — a hidden interpretation overhead. Pinot's filter loop walked rows and dispatched on the predicate type per row. StarRocks emits machine-code-equivalent column kernels and evaluates WHERE city = 'Mumbai' AND amount > 1000 as two SIMD masks ANDed and applied to the surviving rows in one pass.

The other half of vectorised execution is runtime filters. When StarRocks plans a hash join, it ships a Bloom filter from the build side (the smaller dim table) to the probe side (the fact table) at execution time. The fact-table scan applies the Bloom filter as a SIMD-accelerated check before it reads the rest of the fact row's columns. For Aditi's 12-billion-row fact joined to a 38-million-row dim filtered by gst_state='MH' (which leaves about 4 million dim rows), the runtime filter on brand_id cuts the fact scan to roughly 12B × (4M/38M) = 1.26 billion rows before any join probe runs. Why the runtime filter beats predicate pushdown alone: predicate pushdown only pushes WHERE clauses on the fact table itself. The dim filter brand.gst_state='MH' cannot push to the fact table because the fact table has no gst_state column. The runtime filter is the bridge — it computes "which brand_id values survive the dim filter" at execution time and ships that set as a Bloom filter to the fact scan, effectively converting a join-side filter into a fact-side filter.

import requests, json, time

# StarRocks accepts MySQL-protocol queries; this uses the HTTP query endpoint
# for clarity. In production you'd use mysql-connector-python or pymysql.

FE = "http://starrocks-fe:8030"  # frontend coordinator
USER, PWD = "analyst", "secret"

# 1. The query — fact table joined to two dims with dim-side filters.
sql = """
SELECT city.tier              AS tier,
       brand.category         AS category,
       SUM(fact.gmv_paise)/100 AS gmv_rupees,
       COUNT(DISTINCT fact.shipment_id) AS shipments
FROM   shipments       AS fact
JOIN   cities          AS city  ON fact.city_id  = city.id
JOIN   brands          AS brand ON fact.brand_id = brand.id
WHERE  fact.event_date BETWEEN '2026-04-01' AND '2026-04-24'
  AND  city.tier   = 'metro'
  AND  brand.gst_state = 'MH'
GROUP BY tier, category
ORDER BY gmv_rupees DESC
LIMIT 20;
"""

t0 = time.time()
r = requests.post(f"{FE}/api/query/v1",
                  auth=(USER, PWD),
                  json={"stmt": sql, "format": "json"})
data = r.json()
elapsed = (time.time() - t0) * 1000

# 2. Profile the same query to see runtime filter activity.
r2 = requests.post(f"{FE}/api/query/profile",
                   auth=(USER, PWD),
                   json={"query_id": data["query_id"]})
profile = r2.json()

print(f"rows returned: {len(data['rows'])}")
print(f"latency: {elapsed:.0f} ms")
print("---")
for row in data["rows"][:5]:
    print(row)
print("---")
print(f"fact-table rows scanned:    {profile['fact_scan_rows']:>12,}")
print(f"runtime-filter rejected:    {profile['rt_filter_rejected']:>12,}")
print(f"join probe input rows:      {profile['join_probe_rows']:>12,}")
print(f"local cache hit %:          {profile['local_cache_pct']:>11.1f}%")

rows returned: 14
latency: 812 ms
---
['metro', 'electronics', 4823914.21, 1284910]
['metro', 'apparel',     2914382.40, 2491032]
['metro', 'grocery',     1842910.00, 4910482]
['metro', 'pharma',       912481.10,  481294]
['metro', 'home-goods',   814910.20,  812441]
---
fact-table rows scanned:    12,418,392,041
runtime-filter rejected:    11,158,294,210
join probe input rows:       1,260,097,831
local cache hit %:                  78.4%

Walk what the engine did:

shipments is the 12-billion-row fact table; cities and brands are the dims. The optimiser's first decision is the join order. Cost-based, it estimates the dim filters first: city.tier='metro' keeps 9 of 412 cities (~2%), brand.gst_state='MH' keeps 4M of 38M brands (~10%). The build side for both joins is the dim; the probe side is the fact. The CBO chooses to build hash tables on brands and cities, broadcast the small cities build to every node, and shuffle-distribute the larger brands build by brand_id matching the fact's distribution. This bushy plan would not be considered by a rule-based optimiser; ClickHouse pre-23 always picked left-deep.
runtime-filter rejected: 11.16B is where the speed comes from. The dim-side hash builds finish first; their key sets are converted into Bloom filters and shipped to the fact scan. The fact scan applies them at SIMD speed — roughly 2–3 ns per row — and rejects 90% of fact rows before any column projection. Only 1.26B rows reach the join probe, which is what the optimiser estimated. Why a Bloom filter and not the exact key set: 4 million keys as a hash set is ~64 MB; the same set as a Bloom filter at 1% false-positive rate is ~5 MB. Shipping 5 MB to every fact-scan worker, then evaluating it with three SIMD-friendly hash probes, is faster than the alternative — and the false positives are caught when the actual hash join probe runs. The rejection is "fast and approximate now, exact later".
local cache hit % 78% is the compute-storage-separation story. The fact-table Parquet files live on S3 (Mumbai region). StarRocks' worker nodes (called BEs, or Backends) have a local NVMe cache that holds recently-read Parquet column chunks. 78% means most of the 1.26B rows that survived the runtime filter were served from local NVMe; only 22% (~270M rows × ~50 bytes/row of needed columns ≈ 13 GB) went to S3. S3 read at 200 MB/sec per BE × 8 BEs = 1.6 GB/sec, so the S3 part takes ~8 seconds. But that 8 seconds runs in parallel with the local-cache scan; the wall-clock latency is the slowest of the two, and after a few queries the working set is hot and S3 reads drop to <10%.
COUNT(DISTINCT fact.shipment_id) over 1.26B rows is exact, not approximate. StarRocks' distinct-count uses a two-phase approach: each BE locally hash-deduplicates its slice into a partial set, then the final stage shuffles by shipment_id and merges. For low-cardinality distincts this is fast; for the 700M-distinct case here, it would be slow without the runtime filter pre-trim. With the trim it is cheap.
The frontend (FE) compiled the SQL, ran statistics-driven planning, and dispatched fragments to backends (BEs). StarRocks separates FE and BE: FE is Java, runs the parser, optimiser, and metadata; BE is C++, runs the vectorised execution. Frontends are stateless replicas (raft consensus on metadata); backends are the data plane. This is the same coordinator/worker shape as Trino, and unlike ClickHouse where every node is identical.

The cost-based optimiser — re-planning per query

The CBO is what actually changes when a workload changes shape. StarRocks' optimiser is Cascades-style — a top-down search that explores logically-equivalent plans, scoring each by an estimated cost computed from column statistics. The statistics are collected by ANALYZE TABLE, stored in the FE metadata DB, and refreshed automatically when ingestion volume crosses a threshold or daily on a schedule.

The five core stats per column are:

row_count of the table or partition.
ndv (number of distinct values), computed exactly for small columns and via HyperLogLog for high-cardinality.
histogram — typically equi-height with 256 buckets — capturing the value distribution so range predicates can be estimated.
null_count, used for IS NULL predicate selectivity.
avg_size, used for memory and shuffle cost estimation.

Cost is a weighted combination of CPU rows (input rows × per-row cost), network rows (shuffle volume × per-byte cost), and memory peak (build-side hash table size). The cost model is calibrated against benchmarks and tuned per cluster; default values come from Apache Doris' calibration on AWS c5 instances and most Indian deployments do not need to retune.

The CBO's biggest practical win — and the one most often missed in evaluations — is colocation. If two tables are bucketed on the same key (say, shipments and shipment_events both bucketed by shipment_id), a join on shipment_id does not need to shuffle either side. Each BE holds matching buckets of both tables locally; the join is a local hash join. The CBO recognises this from the table DDL and the colocate_with clause and avoids the shuffle stage entirely. For Aditi's logistics workload this turned a 4-second event-stitching query into 200 ms.

Compute-storage separation — the lakehouse-native bet

The older OLAP engines store data in their own columnar format on local NVMe; the engine and the data are physically coupled. Adding a node means re-shuffling shards. Querying yesterday's data after a week means promoting it from S3 archive to local NVMe before queries are fast. Sharing the same data across two query engines (say, Trino for ad-hoc + ClickHouse for dashboards) means double-storing or building a translation layer.

StarRocks 3.0 (2023) shipped a separated architecture where the BE's local disk is just a cache. The primary copy of the data lives in object storage as Parquet (or, increasingly, native StarRocks segment files in S3). The cache fills opportunistically; cache misses go to S3 in parallel; cache evictions are LRU. From the operator's point of view, "scaling compute" means adding BE pods that come up empty and warm in 5–10 minutes; "scaling storage" means doing nothing because S3 scales by itself. From the cost point of view, the BE fleet can be scaled to zero overnight and brought up at 9 a.m. — for an Indian SaaS company whose dashboard load is 9-to-9, this halves the compute bill.

The deeper integration is with Iceberg, Hudi, and Delta Lake as external tables. StarRocks reads Iceberg manifests directly, applies its own optimiser and runtime filters to the Parquet scan, and treats the Iceberg snapshot as the consistency point. The same lakehouse table can be queried by Spark, Trino, and StarRocks — three engines on one source of truth. Why this matters operationally: the previous generation's "load into the OLAP cluster" step was where 70% of pipeline failures and 50% of cost lived. Iceberg-native query removes that step. The data lands in Iceberg from your ingestion pipeline; StarRocks queries it directly. The single biggest reason teams are migrating from ClickHouse + dbt to StarRocks + Iceberg in 2025 is not query speed — it's eliminating the ETL hop into ClickHouse.

Backends are stateless and scale independently of storage. The local NVMe is a cache, not the source of truth. Iceberg tables in object storage are first-class citizens, queryable with the same optimiser and execution engine as native segments.

Doris vs StarRocks — same root, different sponsors

Apache Doris started at SearchDragon in 2017 (originally as "Palo") and was donated to Apache in 2018. StarRocks forked from Doris around 2020 when a group of Doris committers — led by Stan Zheng — left to form a startup that became the StarRocks commercial entity. The two codebases have diverged since: StarRocks has been more aggressive on the CBO and on compute-storage separation; Doris has stayed closer to the on-prem MPP shape and has stronger China-domestic vendor support via SelectDB. Both run vectorised execution; both speak MySQL protocol; both can read Parquet on S3.

For an Indian team picking between them in 2026: StarRocks is the safer bet for greenfield lakehouse deployments because the separated-storage mode is more mature, and the open-source CBO has had two more years of large-cluster shake-out. Doris is the better choice if your org already runs Apache Foundation governance preferences end-to-end and wants the SelectDB managed offering, or if you need certain China-region capabilities (OceanBase integration, Aliyun OSS first-class support) that StarRocks treats as second priority. The query semantics are nearly identical; migrations between them are mostly a DDL re-write, not a query re-write.

Common confusions

"StarRocks is a faster ClickHouse." It is not just faster — it has a different optimiser model. ClickHouse's strength is single-table scans on dense local NVMe; StarRocks' strength is multi-table star-schema queries with runtime filters. On a single-table scan benchmark, ClickHouse v24 still beats StarRocks. On a TPC-DS-style join workload, StarRocks wins by 5–10×. Pick by your workload shape, not by the headline benchmark.
"Compute-storage separation means slower queries." Only on cold cache. After a few minutes of repeated dashboard queries, the working set is in local NVMe and latency matches a coupled architecture. The trade-off is that the first query of the day is slower (5–10 seconds vs 500 ms) until the cache warms.
"The CBO removes the need for table design." It does not. Bucketing for colocation is still a design decision the user makes; the CBO recognises and exploits colocation but cannot create it. Picking the right DISTRIBUTED BY HASH(...) and BUCKETS N matters as much as it did in Greenplum or Vertica.
"Doris and StarRocks are 100% compatible." They are not. SQL syntax overlaps ~90%; DDL overlaps ~80%; system tables and the metadata model are different. A migration takes a week of DDL rewrites and a re-import of statistics, not an afternoon. The optimiser hint syntax is also different (StarRocks uses /*+ ... */, Doris uses [hint]).
"You should always use external Iceberg tables instead of native segments." External Iceberg is the right answer when the same data is queried by Spark and Trino as well. For pure StarRocks-only workloads, native segments are still 1.5–2× faster on the same hardware (Parquet's Snappy-compressed pages decode slower than StarRocks' bit-packed format). The choice is between integration breadth and per-query speed.
"Vectorised means GPU-accelerated." It does not. Vectorised execution targets CPU SIMD (AVX2, AVX-512). GPU-accelerated OLAP exists (HEAVY.AI, BlazingSQL) but is a different category. StarRocks does not run on GPUs; the speedup comes from cache-friendly column batches and AVX instructions, both running on the same x86 cores ClickHouse uses.

Going deeper

The Cascades framework and why bushy joins matter

StarRocks' optimiser is built on the Cascades framework — a top-down memo-based search that originated in Microsoft SQL Server's Tandem heritage. The core idea: for each logical sub-tree of the query, generate all equivalent physical plans, score them with the cost model, and propagate the best plan upward. The "memo" is a deduplicated graph of explored sub-trees that prevents re-evaluating shared subqueries. The pruning rule is: any plan whose cost lower bound exceeds the current best is pruned without exploration. For a 6-table join, the search space is 1.5 million plans (Catalan number C(11) = 58k just for the bushy shapes); pruning makes the search tractable in <100 ms on real schemas.

The bushy-vs-left-deep distinction matters most when two dim-on-fact joins have very different selectivity. Aditi's query has city.tier='metro' (2% selective) and brand.gst_state='MH' (10% selective). A left-deep plan picks one and applies the other on top: ((fact JOIN city) JOIN brand) or ((fact JOIN brand) JOIN city). A bushy plan does both filters in parallel: build cities_filtered and brands_filtered, then probe the fact in a single pass with both runtime filters active. The bushy plan halves the join probe count when both filters reject independent rows; the left-deep plan only catches the first filter's rejections and then the second filter applies after the first join's output expansion.

Aliyun, BitWave, and the China-scale benchmarks

StarRocks' largest published deployment is at Penguine's advertising arm: 4 PB of data, 2,000 BEs, 30,000 daily-active analysts. Doris' largest published deployment is at JD.com (Beijing logistics): 8 PB, 3,500 BEs, used as the primary OLAP store across 200 product teams. Both engines have been hardened by China-scale workloads in ways the Indian or US market has not yet reached — by an order of magnitude or two. The implication for an Indian team is that scaling to a few hundred TB (PaisaBridge's typical analytics footprint, or BharatBazaar's BBD-day analytics) is far below where these engines have been tested; you should not hit any architectural wall, and you'll only meet the operational rough edges that Penguine already filed bug reports for two years ago.

Materialised views and async refresh

Both StarRocks and Doris support materialised views with automatic query rewriting. The MV is defined as a SELECT (typically a pre-aggregation or pre-join) and refreshed either on every base-table mutation (sync, slow) or on a schedule (async, the production default). When a query arrives, the optimiser checks whether any defined MV could satisfy it — exactly or with a residual filter — and rewrites the query to scan the MV instead. For Karan's growth dashboard from the Druid chapter, this is the equivalent of Druid's rollup: declared once, applied transparently. The difference is that the MV definition is SQL (no proprietary spec) and the base table is still queryable; you can ask drill-down questions that fall through the MV and hit the base.

When StarRocks is the wrong answer

Three workload shapes where the previous generation still wins. First, single-table point lookups by primary key — SELECT * FROM users WHERE id = 12345. StarRocks is OLAP; this query against a row store like Postgres or a key-value store like Redis is faster by 100×. Second, highly skewed dim cardinality where one dim value covers 50% of fact rows — Pinot's star-tree handles this gracefully through pre-aggregation by that dim; StarRocks' runtime filter is irrelevant because the filter rejects nothing. Third, streaming with sub-100ms ingest-to-query latency — StarRocks' Stream Load takes 1–10 seconds to commit; Druid's real-time ingestion is sub-second.

Where this leads next

/wiki/pre-aggregation-materialized-views-and-their-costs — the next chapter; how StarRocks' MVs, ClickHouse's projections, Pinot's star-trees, and Druid's rollup compare on the same workload.
/wiki/serving-p99-latency-under-ingest-pressure — the operational chapter; how vectorised CBO engines behave when ingest competes with read traffic.
/wiki/copy-on-write-vs-merge-on-read-iceberg-vs-hudi — the lakehouse table-format choice you make underneath the StarRocks engine; the engine is fast, but the table format determines write semantics.

The thread from ClickHouse → Pinot → Druid → StarRocks/Doris is not a story of incremental speedups. Each generation reframed the OLAP problem differently. ClickHouse said "scan very fast and joins are someone else's problem." Pinot and Druid said "pre-compute the answers users will ask for." StarRocks and Doris said "answer arbitrary multi-table queries fast, and let object storage hold the data." Each is right for the workload that motivated it. The next-wave bet is that more and more analytical workloads look like the third shape — dashboards on a star schema, served from a lakehouse — and the engine should be designed for that.

References

StarRocks: A New Era for Real-time Analytics (CIDR 2024) — the canonical paper from Stan Zheng and team. Section 4 (vectorised exec) and Section 6 (compute-storage separation) are the must-reads.
Apache Doris documentation — Architecture — the upstream lineage, with the FE/BE split explained from first principles.
The Volcano Optimizer Generator (Graefe, 1994) — the foundational paper for top-down query optimisation; StarRocks' Cascades engine is a direct descendant.
Photon: A Fast Query Engine for Lakehouse Systems (SIGMOD 2022) — Databricks' vectorised engine paper; the design choices StarRocks made are a near-mirror.
SelectDB Engineering Blog — Doris vs StarRocks — vendor-biased but the technical comparison of the optimiser and the execution model is accurate.
Roaring Bitmaps and Bloom filters in MPP query engines (Lemire et al.) — the data-structure paper for the runtime-filter mechanism.
/wiki/druid-and-its-segment-model — the immediately preceding chapter; Druid's rollup is the previous-generation answer to the same question.
/wiki/clickhouse-columnar-for-real-time — the engine that StarRocks most directly competes with on benchmarks.