In short

The last chapter said: for the analytical query a single human runs once on a laptop, you do not need a server — DuckDB in your Python process is the right answer. This chapter is the other half of that argument. For the analytical query that ten thousand users hit per second against a table with three trillion rows, you very much do need a server, and the server you almost certainly want is ClickHouse.

ClickHouse was born inside Yandex around 2009, open-sourced in 2016, and spent the next decade quietly becoming the fastest open-source OLAP database on the planet. Single-node scan throughput crosses a billion rows per second on a modern Xeon when the columns are well-encoded and the filter is selective — about 50 to 200 times faster than Postgres for the same analytical query, and roughly an order of magnitude faster than Spark on the same hardware. It scales horizontally to hundreds of nodes; Cloudflare ingests tens of millions of HTTP request events per second into it; Yandex.Metrica runs petabyte-scale aggregation queries on it; Posthog, Sentry, Plausible and GitLab all use it as their analytics engine.

The thing that makes ClickHouse fast is not one trick — it is the same vectorised columnar machinery the previous two chapters described, plus aggressive SIMD, plus an exceptionally well-engineered storage layer called MergeTree. MergeTree stores each table as a collection of immutable parts, each sorted by a user-chosen primary key, each carrying its own sparse index and column statistics. Inserts always create a new part; existing parts are never mutated. A background merger continuously combines small parts into bigger ones, exactly like an LSM tree. Queries scan all relevant parts in parallel, using the sparse index to skip ranges that cannot match.

On top of plain MergeTree, ClickHouse layered an entire family of variant engines — ReplacingMergeTree keeps the newest version of duplicate keys, SummingMergeTree sums numeric columns on merge, AggregatingMergeTree maintains arbitrary pre-aggregations, CollapsingMergeTree uses a sign column to cancel rows for soft deletes, ReplicatedMergeTree adds multi-replica coordination via ZooKeeper / Keeper. Each is a small twist on the base merge logic. Together they cover the bulk of real-time analytical and time-series workloads. This chapter walks through the storage model, the family tree, the sparse-index trick, and how you actually design a ClickHouse table for a workload like "an Indian startup ingesting one billion clickstream events a day."

The previous chapter ended with the line "in the next chapter you'll trade the embedded model for the server model and look at ClickHouse's MergeTree family." That trade is not a cosmetic one. The embedded model assumes one process, one user, data on one machine. The server model assumes the opposite of all three: dozens of concurrent ingest streams, thousands of concurrent dashboards, hundreds of terabytes of data spread across a cluster. The architecture that wins one of those workloads loses the other badly, and ClickHouse is the canonical answer to the second.

To see why, take a concrete workload. Cloudflare's HTTP analytics: every HTTP request that touches Cloudflare's edge — currently around 70 million per second globally — produces an event with maybe 100 columns (URL, status code, country, ASN, cache status, response time, bot signals). All of those events are ingested into ClickHouse. Customers query them with arbitrary WHERE and GROUP BY clauses through a UI, and expect dashboards to render in under a second. There is no precomputed cube that covers the query space; it is too high-dimensional. The query engine has to actually scan the events. At the volumes Cloudflare runs, "scan the events" means sifting a corpus of trillions of stored rows down to the queried slice and scanning it in well under a second. A row store cannot do this. A general-purpose columnar warehouse like Snowflake can do it but at substantial cost and with multi-second latency. ClickHouse does it on commodity hardware, in milliseconds, at a fraction of the cost. Why this matters: there is a class of analytical product — observability, real-time dashboards, product analytics — where the latency requirement is interactive (under one second) and the data is too large for a per-tenant cube. ClickHouse is what you build those products on.

What "the fastest" actually means

The phrase "fastest OLAP database" gets thrown around a lot, so it is worth being precise about what ClickHouse is fast at.

Single-table scans with filters and aggregations. This is the bread-and-butter analytical query: SELECT country, count() FROM events WHERE timestamp > now() - INTERVAL 1 DAY AND status = 500 GROUP BY country ORDER BY count() DESC LIMIT 20. ClickHouse can scan and aggregate on the order of a billion rows per second per CPU core when the columns are well-encoded and the filter is selective. On a 64-core machine that is 60+ billion rows per second of aggregation throughput. Why this is plausible: a column of Int32s is 4 bytes per row, so a billion rows is 4 GB. Modern DDR5 memory delivers 50+ GB/s of sequential read throughput per socket; a single AVX-512 instruction sums sixteen Int32s at once. A billion rows per second per core is not a marketing number — it is what falls out of streaming a column through SIMD.
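
Written out as runnable ClickHouse SQL, the same query looks like this (the events table shape and its status column are illustrative; the worked example later in the chapter uses a similar schema):

SELECT country, count() AS hits
FROM events
WHERE event_time > now() - INTERVAL 1 DAY
  AND status = 500
GROUP BY country
ORDER BY hits DESC
LIMIT 20;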

Time-range queries on time-ordered data. When the table's sort key is (timestamp, ...), a query for "the last 5 minutes" only touches the parts whose timestamp range overlaps that window. The other parts — possibly representing years of history — are skipped without being opened.

Pre-aggregated queries via materialised views. When you can express your dashboard as "give me the count of events per country per minute," you can materialise that into a SummingMergeTree updated continuously by a materialised view. The dashboard then queries a thousand-row table instead of a billion-row table. Sub-millisecond latency.

Where ClickHouse is not the fastest: point lookups by primary key (Postgres or any indexed row store wins), updates and deletes (ClickHouse can do them but they are expensive and asynchronous), high-concurrency transactional writes (it is not an OLTP system at all). If you need to update individual rows millions of times a day, you are using the wrong tool. Why: the immutable-parts model that makes scans fast is structurally bad at point updates. Picking the right system for the access pattern, again.

The MergeTree storage model

Everything in ClickHouse comes back to MergeTree. You should think of a MergeTree table as a directory of immutable, sorted, columnar files, plus a background process that periodically combines small files into bigger ones. That sentence has four words doing a lot of work — immutable, sorted, columnar, combining — and each is worth unpacking.

[Figure: MergeTree storage layout. Immutable parts sorted by primary key, merged in the background. INSERT batch 1 (100k rows), batch 2 (200k rows) and batch 3 (150k rows) each arrive and land as their own sorted, on-disk part (part_0001 through part_0003); a background merge k-way merges them into part_0001_3 (450k rows), which a later merge folds into a still larger part.]

Each part contains:
- one file per column (.bin): compressed columnar bytes
- a sparse primary index (.idx): one entry per 8192 rows, fits in RAM
- min/max stats per column, optional bloom filters, optional skip indexes
- the partition key value (e.g. month) and the min/max of the sort key
- row count, byte size, mark count

The picture is a literal one — a MergeTree table is a directory on disk, and inside that directory you can ls and see exactly the parts shown above. A part is a self-contained mini-table. It has one binary file per column (the column's values, compressed with LZ4 or ZSTD), an index file, and a small columns.txt describing the schema. Crucially, every part is sorted by the table's primary key. Why sorted: when the data inside a part is sorted by, say, (event_time, user_id), a query that filters on event_time > X can binary-search the sparse index to find the first matching row and stream forward, skipping everything before. Sorting transforms a full scan into a range scan for free.
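
You can see the parts from inside the server as well: the system.parts table lists one row per active part, with its row count, mark (granule) count and on-disk size. A quick inspection query, assuming a table named events:

SELECT partition, name, rows, marks, bytes_on_disk
FROM system.parts
WHERE table = 'events' AND active
ORDER BY name;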

Inserts work like this: a batch of rows arrives, ClickHouse sorts them by the primary key in memory, encodes each column to its compressed binary form, writes the resulting files to disk as a new part directory, and atomically renames it into place. The existing parts are never touched. This is the LSM-tree trick — writes are pure appends, never in-place mutations — and it is what gives MergeTree its very high ingest throughput. Why immutability matters: in-place mutation requires you to lock the data, find the row, decompress the column block it lives in, modify it, recompress, write back. None of that is needed when you only ever append. The cost is that you accumulate small parts, which is exactly what the background merger fixes.

The background merger runs continuously. It picks small adjacent parts (by some size and time heuristic) and k-way merges them into one larger part — exactly like LSM compaction. Because every part is already sorted by the same primary key, the merge is a streaming merge-sort, linear in the data size. The newly merged part replaces its inputs atomically; queries in flight that were reading the old parts finish reading them, and queries that arrive after see the merged part. Old parts get garbage-collected after their reference count drops to zero. Why merging is essential: without it, after a million inserts you would have a million tiny parts, and a query would have to open and read sparse-index data from all of them. Merging keeps the part count to a few hundred per partition, even after months of continuous ingest.
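
Merging is automatic, but it is observable and forceable. system.merges lists merges in flight, and OPTIMIZE forces one. A sketch (forcing a FINAL merge on a large production table is expensive and rarely necessary):

SELECT table, elapsed, progress, num_parts, result_part_name
FROM system.merges;

-- Force every part in each partition to merge down to one (use sparingly):
OPTIMIZE TABLE events FINAL;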

A part's other significant feature is its sparse primary index. Unlike a B-tree index which stores one entry per row, ClickHouse stores one index entry per N rows (default 8192, configurable). Each entry holds the primary-key value of the first row in that block, plus an offset into the column files. To find rows matching a WHERE clause, ClickHouse binary-searches the sparse index to find the candidate blocks (granules), then scans only those granules from disk. Why sparse: a billion-row table at one entry per row would have a billion index entries — gigabytes of index, far too large to keep in memory. At one entry per 8192 rows, the index has 122,000 entries, fits in a couple of MB of RAM, and still narrows a billion-row scan to one or two granules of 8192 rows each for an equality lookup. The trade-off is that you read 8192 rows even when you only want one — but for analytical workloads where you usually want many rows in a range, that is a great trade.
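
The granule skipping is directly visible with EXPLAIN: run with indexes = 1, it reports how many parts and granules the sparse index selected out of the total. A sketch against the events table from the worked example:

EXPLAIN indexes = 1
SELECT count()
FROM events
WHERE event_time > now() - INTERVAL 5 MINUTE;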

The PRIMARY KEY is the sort order, not a uniqueness constraint

This is the single most important conceptual gotcha for engineers coming from Postgres. In ClickHouse, PRIMARY KEY does not mean "unique identifier." It means "the columns the data is physically sorted by inside each part." There is no uniqueness check. You can insert ten rows with identical primary key values and ClickHouse will happily store all ten — they will just sit next to each other on disk because they sort the same way.
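
A three-statement demonstration of the non-uniqueness (table and values illustrative):

CREATE TABLE kv (k UInt32, v String) ENGINE = MergeTree() ORDER BY k;

INSERT INTO kv VALUES (1, 'a'), (1, 'b'), (1, 'c');

SELECT count() FROM kv WHERE k = 1;  -- returns 3: no uniqueness check, all rows kept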

The right way to read ORDER BY (event_time, user_id) (which is the recommended way to declare it in modern ClickHouse) is: "sort the data by event_time first, then by user_id within each event_time." The choice of sort key is the most consequential schema decision you make, because it controls which queries are cheap and which are expensive. Why: a query that filters on a prefix of the sort key (e.g. WHERE event_time > X when the sort is (event_time, user_id)) can use the sparse index to skip granules. A query that filters only on a non-prefix (WHERE user_id = X alone) cannot — it has to scan every granule.

This is the rule that drives the entire schema design playbook for ClickHouse: figure out the dominant query pattern first, then sort the table by the columns those queries filter on, in the right order. In a time-series workload that is almost always (timestamp, some_dimension). In a per-user analytics workload where most queries filter by user, it might be (user_id, timestamp). There is no one right answer — only "right for your queries."

The MergeTree family

MergeTree is the base. From it ClickHouse derives a small family of engines, each adding a twist to what the background merger does when it combines parts. Every member of the family inherits the immutable-parts, sparse-index, columnar storage of MergeTree — they only differ in what happens when two rows with the same sort key meet during a merge.

[Figure: The MergeTree family. Variants of "what happens when merge sees duplicate sort keys".]

- MergeTree: base, keeps all rows
- ReplacingMergeTree: duplicate sort key → keep newest (by version column)
- SummingMergeTree: duplicate sort key → sum the numeric columns
- AggregatingMergeTree: duplicate sort key → apply a custom AggregateFunction
- CollapsingMergeTree: sign column +1/-1; merge cancels matching pairs (soft delete)
- VersionedCollapsingMergeTree: sign + version; handles out-of-order arrivals, robust under retries
- ReplicatedMergeTree: adds ZooKeeper / Keeper coordination; composes with every variant (ReplicatedReplacingMergeTree, ReplicatedSummingMergeTree, ReplicatedAggregatingMergeTree, ReplicatedCollapsingMergeTree, ...)

Common rule: merge is eventual. Until merge happens, queries must use FINAL or argMax to deduplicate.

ReplacingMergeTree is the simplest variant. Declare an optional version column. When the merger sees two rows with the same sort key, it discards the older one (lowest version) and keeps the newest. Useful for "latest state per entity" tables — fed by a CDC stream from your OLTP database, you get a table where the latest version of each row eventually wins. The catch is eventual: until the merge happens, both versions are on disk, and queries that don't tolerate duplicates must use SELECT ... FINAL or argMax(value, version) GROUP BY id to filter to the latest. Why this is fine in practice: most analytical queries either aggregate (where duplicates roll up cleanly) or use argMax explicitly. The "background dedup" story is good enough for things like a search index keyed by document ID.
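
A sketch of the "latest state per entity" pattern, with illustrative table and column names:

CREATE TABLE user_profile
(
    user_id    UInt64,
    email      String,
    updated_at DateTime
)
ENGINE = ReplacingMergeTree(updated_at)
ORDER BY user_id;

-- Until a merge happens, several versions of a row can coexist;
-- argMax picks the newest deterministically at query time.
SELECT user_id, argMax(email, updated_at) AS email
FROM user_profile
GROUP BY user_id;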

SummingMergeTree sums numeric columns when sort keys collide during merge. The classic use case is a pre-aggregated counter table: you ingest "user X did action Y at time T, count = 1" and after merging you have one row per (user, action, hour) with the total count. The base table effectively becomes a continuously-rolled-up counter. Combined with materialised views (next section), this is one of the most powerful patterns in ClickHouse.
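
A sketch of that counter table (names illustrative; the materialised-view feeding is shown later in the chapter):

CREATE TABLE action_counts
(
    hour    DateTime,
    user_id UInt64,
    action  LowCardinality(String),
    cnt     UInt64
)
ENGINE = SummingMergeTree()
ORDER BY (hour, user_id, action);

-- Ingest raw "count = 1" rows; merges collapse them into running totals.
INSERT INTO action_counts VALUES (toStartOfHour(now()), 42, 'click', 1);

-- Rows with the same key may not have merged yet, so always sum when reading:
SELECT user_id, action, sum(cnt) AS total
FROM action_counts
GROUP BY user_id, action;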

AggregatingMergeTree generalises SummingMergeTree to arbitrary aggregations. Instead of summing, columns hold aggregate function states — partial computations of uniq(), quantile(), avg() and so on, which can be merged associatively. A column might store the HyperLogLog state of uniqState(user_id) rather than a final count. When the merger combines two rows it merges the HLL states; when you query, you finalise with uniqMerge(state) to get the count. This is the engine you build production materialised views on for sub-second dashboards over billions of events.

CollapsingMergeTree uses a sign column that is +1 for "row added" and -1 for "row removed". On merge, matching pairs cancel each other out, leaving zero rows. This is how ClickHouse models row-level changes: to "delete" a row, you insert its mirror with sign = -1 and let the merger eventually collapse them. Aggregations have to weight by sign to be correct between merges. VersionedCollapsingMergeTree adds a version column so that out-of-order arrivals (a -1 arriving before its +1 because of network reordering) still collapse correctly. Why this contortion: ClickHouse fundamentally cannot do efficient in-place row deletes — the parts are immutable and the columns are compressed. Sign-based collapse is the workaround that gives you logical mutability while keeping the physical append-only model.
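
A sketch of the sign mechanics (names illustrative):

CREATE TABLE cart_items
(
    user_id    UInt64,
    product_id UInt64,
    sign       Int8
)
ENGINE = CollapsingMergeTree(sign)
ORDER BY (user_id, product_id);

-- Add a row, then later cancel it by inserting its mirror image:
INSERT INTO cart_items VALUES (42, 1001, 1);
INSERT INTO cart_items VALUES (42, 1001, -1);

-- Between merges both rows exist, so correct results must weight by sign:
SELECT user_id, sum(sign) AS items_in_cart
FROM cart_items
GROUP BY user_id
HAVING items_in_cart > 0;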

ReplicatedMergeTree is not a merger variant — it is the multi-node version of MergeTree. Replicas coordinate via ZooKeeper (or its in-house replacement, ClickHouse Keeper) to agree on what parts exist. Each insert is replicated to all replicas; merges are coordinated so that the same merge runs on every replica and produces byte-identical parts. The Replicated prefix composes with every other variant: ReplicatedReplacingMergeTree, ReplicatedAggregatingMergeTree, and so on. Why ZooKeeper / Keeper: replication needs a consensus log of "which parts exist, in what order, on which replicas." That is a classic Paxos / Raft problem. Yandex used ZooKeeper for years; modern ClickHouse ships Keeper — a Raft-based drop-in replacement — to get rid of the JVM dependency.

The killer feature: incremental materialised views

The MergeTree family becomes much more powerful when you combine it with ClickHouse's materialised views. Unlike Postgres materialised views (which are recomputed from scratch only when you run REFRESH MATERIALIZED VIEW), ClickHouse materialised views are incremental: they fire on each insert into the source table and write the transformation result into a target table. If the target is a SummingMergeTree or AggregatingMergeTree, the result is a continuously-maintained roll-up.

The pattern looks like this. You have a billion-row events table. You want a dashboard showing daily active users per country. You write:

CREATE MATERIALIZED VIEW dau_by_country_mv
TO dau_by_country
AS SELECT
    toDate(event_time) AS day,
    country,
    uniqState(user_id) AS unique_users_state
FROM events
GROUP BY day, country;
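
The view writes into a target table that must already exist. A sketch of that table's DDL, assuming user_id is a UUID as in the worked example later in the chapter:

CREATE TABLE dau_by_country
(
    day                Date,
    country            LowCardinality(String),
    unique_users_state AggregateFunction(uniq, UUID)
)
ENGINE = AggregatingMergeTree()
ORDER BY (day, country);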

dau_by_country is an AggregatingMergeTree table holding uniqStates. Every insert into events fires the view, which writes the per-batch partial HLL states into dau_by_country. The background merger then merges those states. Your dashboard queries SELECT day, country, uniqMerge(unique_users_state) FROM dau_by_country GROUP BY day, country — which touches a table with thousands of rows instead of billions, and returns in single-digit milliseconds. Why this beats traditional materialised views: there is no recompute step, no staleness window, no "the view rebuilt at midnight so the dashboard is wrong until then." The aggregation grows incrementally as data arrives, in roughly the same I/O the insert was already going to do.

This pattern is why ClickHouse is the analytics backend at Posthog, Plausible and many others: dashboards are queries against materialised AggregatingMergeTree rollups, while the raw event table sits underneath for ad-hoc deep-dives.

Real-world scale

ClickHouse's reputation comes from the workloads it actually runs in production, not from benchmarks. The numbers are worth internalising as a calibration of what is possible.

[Figure: The scale ClickHouse routinely operates at in production.]

- Yandex.Metrica (ClickHouse's birthplace, 2009): ~25 trillion rows across hundreds of nodes; petabyte-scale per cluster; web analytics, second-largest globally
- Cloudflare (HTTP analytics, since 2017): tens of millions of HTTP requests per second ingested; trillions of events queryable; customer dashboard p99 < 1s; replaced their old Postgres pipeline
- Posthog / Plausible / Sentry (product / web / error analytics): billions of events per day each; 10-100 node clusters; dashboards in tens of ms; ClickHouse as the core analytics engine

Single-node back-of-the-envelope (commodity 32-core Xeon, NVMe SSDs):
- Ingest: 1-5 million rows per second sustained
- Scan + filter + aggregate: ~1 billion rows per second per core when columns are well-encoded
- Compression ratio: 5x to 20x typical (LZ4 / ZSTD over already low-cardinality columnar data)
- A 10-node cluster handles trillions of rows with sub-second p99 on dashboard queries

The headline that sticks: a 10-node ClickHouse cluster routinely answers dashboard queries with sub-second p99 latency over trillions of rows. That is the operating point of products like Cloudflare's analytics dashboard, Posthog's session insights, and Sentry's error timelines. The same workload on a row store would either take minutes per query or require a multi-million-rupee cube-precomputation pipeline. Why ClickHouse hits this and Snowflake mostly does not: Snowflake is a general-purpose warehouse priced for query elasticity; ClickHouse is a specialised analytics database run by you on your own machines with a tightly engineered hot path. For the specific workload of "interactive dashboards over time-series events," ClickHouse is one to two orders of magnitude cheaper per query. Comparison studies consistently put ClickHouse at the top of price-performance for this workload class.

Worked example: a billion clickstream events a day

Here is the canonical real-world design. You are an engineer at an Indian e-commerce startup growing fast — say a Flipkart-shaped company at scale, or a newer Tata Neu / Meesho. The product team wants every page view, every product click, every add-to-cart, every search, every checkout step tracked. The marketing team wants to slice these events by user, by city, by acquisition channel, by hour. The volume is about a billion events per day, peaking at 30,000 events per second during evening sales. You have to design the ClickHouse schema.


The schema decision boils down to four questions: what is the engine, what is the partition key, what is the sort key, and what columns get specialised types.

CREATE TABLE events
(
    event_time      DateTime CODEC(DoubleDelta, ZSTD(3)),
    user_id         UUID,
    session_id      UUID,
    event_name      LowCardinality(String),
    page_url        String,
    product_id      Nullable(UInt64),
    city            LowCardinality(String),
    channel         LowCardinality(String),
    properties      Map(String, String),
    event_date      Date MATERIALIZED toDate(event_time)
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_time)
ORDER BY (event_time, user_id)
TTL event_time + INTERVAL 18 MONTH DELETE
SETTINGS index_granularity = 8192;

Walk through the choices.

Engine = MergeTree. Plain MergeTree, not a variant. You want every event preserved — no replacement, no summing, no collapsing. The raw event table is the source of truth; you'll layer materialised views on top for the dashboards. Why not ReplacingMergeTree: clickstream events are not a "latest state per entity" table. Each click is an independent fact. ReplacingMergeTree would silently lose duplicate events that happened to share a timestamp with the same user.

PARTITION BY toYYYYMM(event_time). Each month's data lives in its own set of parts on disk. The benefits are operational: dropping last year's data is ALTER TABLE events DROP PARTITION 202404, which is a metadata operation that runs in milliseconds — just unlinking the directory. Without partitioning, the equivalent DELETE would rewrite parts and take hours. Why monthly and not daily: partitioning by day would mean 365 partitions a year. ClickHouse handles small numbers of partitions much better than large numbers because the merger plans within partitions, not across. The rule of thumb is "tens to low hundreds of partitions per table, total." Monthly hits that for a multi-year retention.

ORDER BY (event_time, user_id). This is the most consequential decision. The dominant query patterns at an analytics startup are time-windowed: "events in the last 5 minutes," "yesterday's funnel by hour," "this week's top pages." All of those filter on a time range first. Sorting by event_time first means a 5-minute query reads only the granules whose event_time range overlaps that window — a tiny fraction of the table. The secondary user_id then groups events for the same user within each timestamp tick, which makes per-user sub-queries also efficient. Why not (user_id, event_time): if your most common query is "everything user X did, ever," that order would be better. But for a clickstream where dashboards dominate and per-user queries are rarer, time-first wins.

LowCardinality(String) for event_name, city, channel. These columns have small distinct value counts (maybe 50 event names, 1000 cities, 20 channels). LowCardinality dictionary-encodes them — the column on disk is an integer per row plus a small dictionary. A LowCardinality column is typically 5-10x smaller and 5-10x faster to filter and group than a plain String for low-cardinality data. Why this matters at billion-row scale: a String event_name averaging 12 bytes uses 12 GB per billion rows. A LowCardinality version uses 1 GB plus a 1 KB dictionary. That difference cascades into every scan, every group-by.

CODEC(DoubleDelta, ZSTD(3)) on event_time. Timestamps in clickstream data are nearly monotonically increasing, with small deltas between consecutive values. DoubleDelta encodes the second derivative — the change in the change — which is almost zero for steady streams, then ZSTD compresses the result. Realistic compression ratio: 20-50x for timestamps. Why this matters: timestamps are 8 bytes per row, the largest fixed-size column in your schema. Compressing them aggressively pays back hard.
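
Whether the codec and LowCardinality choices actually pay off is checkable per column: system.columns records compressed and uncompressed byte counts (table name illustrative):

SELECT name,
       formatReadableSize(data_compressed_bytes)   AS compressed,
       formatReadableSize(data_uncompressed_bytes) AS uncompressed,
       round(data_uncompressed_bytes / data_compressed_bytes, 1) AS ratio
FROM system.columns
WHERE table = 'events'
ORDER BY data_compressed_bytes DESC;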

TTL event_time + INTERVAL 18 MONTH DELETE. ClickHouse will automatically drop data older than 18 months in the background. No cron job, no janitor service.

Now the dashboard query that pays for the design:

SELECT count()
FROM events
WHERE user_id = '550e8400-e29b-41d4-a716-446655440000'
  AND event_time > now() - INTERVAL 1 DAY;

What ClickHouse does:

  1. Partition pruning. Only the partition for the current month is touched (and if now() - 1 day crosses a month boundary, also the previous one).
  2. Sparse-index seeking. Within each touched partition, the sparse index is binary-searched on event_time > now() - 1 day to find the first relevant granule. Everything older than 24 hours is skipped without ever opening the column files.
  3. Granule scan. The granules inside the time window are read column-by-column. Only the user_id column is decoded to evaluate the filter; the matching rows (typically tens of thousands for a single user over a day) are all that count() ever sees — the other columns are never touched.
  4. Vectorised aggregation. The count() is a SIMD-summed counter across the matching rows.

End-to-end runtime on a 10-node cluster: on the order of 10 to 50 milliseconds, much of it network and dispatch overhead. Why so fast for what looks like "one user out of a billion": the sort key did not include user_id first, so the database cannot binary-search to the user. But the time-range filter narrows the scan to one day of data (about a billion rows at this ingest volume), and the user_id filter is a SIMD equality check that the cluster's few hundred cores split between them, a few million rows each. At a billion rows per second per core, that is single-digit milliseconds of actual scanning.

The same query in Postgres on a comparable schema, with appropriate indexes, would take seconds even with an index on user_id — because Postgres has to fetch each event row from the heap (random I/O), while ClickHouse just streams the columnar data sequentially.

If you wanted that user-lookup query to be even faster, you'd add a projection — a kind of secondary materialised sort order, also stored as parts but ordered by (user_id, event_time) — and ClickHouse would automatically pick the projection when the query matches. This is the modern alternative to maintaining a separate per-user table.
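
A sketch of that projection (the projection name is illustrative):

ALTER TABLE events ADD PROJECTION by_user
(
    SELECT *
    ORDER BY (user_id, event_time)
);

-- Backfill the projection into parts written before it was declared:
ALTER TABLE events MATERIALIZE PROJECTION by_user;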

Where ClickHouse fits — and where it doesn't

Pulling the threads together, ClickHouse occupies a very specific niche, and being honest about its boundaries makes it more usable.

Use ClickHouse when: your workload is heavy on time-series or event data, you need interactive (sub-second) latency on aggregations over billions to trillions of rows, your queries are predominantly filter-and-aggregate (not point lookups, not multi-table joins on huge tables), and your write pattern is bulk inserts rather than single-row OLTP transactions. This is observability, product analytics, web analytics, ad analytics, IoT telemetry, financial market data — basically any "events arriving fast, dashboards queried fast" workload.

Don't use ClickHouse when: you need OLTP transactions (use Postgres), you need millisecond point updates and deletes (use Postgres or a key-value store), your workload is a star-schema warehouse with many big-to-big joins (Snowflake, BigQuery and Spark are better), or your data is small enough to fit on a laptop (the previous chapter — DuckDB).

Between DuckDB and ClickHouse you have a remarkably wide coverage of the analytical workload space: DuckDB for the embedded, single-process, laptop-scale case; ClickHouse for the server-grade, real-time, cluster-scale case. The chapters that follow go up one level of abstraction — into data warehouses and lakehouses (Snowflake, BigQuery, Iceberg, Delta Lake) where the design priority shifts from raw query speed to multi-tenancy, governance and elastic separation of storage from compute.

What you should take away

  1. ClickHouse is the fastest open-source OLAP database for time-series and event-analytics workloads. A billion rows per second per core scan throughput is a real, measurable number, not marketing. A 10-node cluster routinely answers dashboard queries with sub-second p99 over trillions of rows.

  2. MergeTree is the heart of it. Immutable, sorted, columnar parts; sparse primary index; LSM-style background merging. Inserts append; merges compact; queries skip whole granule ranges via the sparse index.

  3. The "primary key" is the sort order, not a uniqueness constraint. The single most consequential schema decision is the ORDER BY clause: it determines which queries are fast and which are not. For time-series data, (timestamp, dimension) is the default that pays back almost always.

  4. The MergeTree family is variants on the merge step. ReplacingMergeTree dedupes by version, SummingMergeTree sums numeric columns, AggregatingMergeTree maintains aggregate states, CollapsingMergeTree handles soft deletes via a sign column, ReplicatedMergeTree adds multi-replica coordination. The Replicated prefix composes with every other variant.

  5. Materialised views are the killer feature. Combine a base events table with a SummingMergeTree or AggregatingMergeTree target through an incremental materialised view, and you get continuously-maintained sub-millisecond rollups over billions of rows. This is how every ClickHouse-powered analytics product (Posthog, Plausible, Cloudflare dashboards) is built.

  6. Schema design is the whole game. Pick the engine, pick the partition key, pick the sort key, declare the right LowCardinality and codec hints, layer materialised views on top. The rest takes care of itself.

In the next chapter you'll move up the stack from real-time analytical engines to the warehouse era — star and snowflake schemas, the dimensional modelling vocabulary that drives Snowflake, BigQuery and Redshift workloads.

References

  1. ClickHouse — Architecture overview (official docs)
  2. ClickHouse — ClickHouse Keeper: a Raft-based ZooKeeper replacement
  3. Altinity blog — ClickHouse MergeTree engine series
  4. Cloudflare blog — HTTP analytics for 6M requests per second using ClickHouse
  5. Altinity — ClickHouse vs Snowflake for real-time analytics: a benchmark study
  6. Alexey Milovidov — ClickHouse: Yandex Highload talks (architecture and history)