ClickHouse: columnar for real-time
It is 23:14 on a Wednesday. Karan, an SRE at Cred, opens the fraud-rules dashboard and the chart for "auto-debit attempts per minute" updates with data that is 4 seconds old. Three months ago that chart ran on Snowflake and updated every 30 minutes — a window during which a coordinated attack could push through 60 lakh fraudulent debit attempts before anyone saw the spike. The query Cred runs is a 14-line SELECT with three filters and one GROUP BY. Snowflake answered it in 2.8 seconds. ClickHouse answers it in 41 milliseconds. The SQL is identical. The difference is everything underneath: how the bytes are laid out, where the index lives, and what the query plan does not have to do.
ClickHouse is one process that owns the columnar file, the sparse primary key, and the query engine. The MergeTree layout writes small immutable parts every few seconds, holds a sparse index of the sort key resident in RAM, and merges parts asynchronously into bigger ones — which is why a query against a 200-billion-row table can decide which 8 MB to read in 8 microseconds. The whole engine is a working out of one bet: keep the sort-key index hot, keep the columnar parts immutable, and let a background daemon worry about file size.
MergeTree — the part is the unit of everything
A ClickHouse table is a directory of parts. A part is a complete, self-contained columnar slice — every column you defined gets its own .bin file inside the part, plus a .mrk2 mark file that maps row numbers to byte offsets, plus a primary.idx file holding the sparse index for that part's sort key. When you INSERT 1 lakh rows, ClickHouse writes one part. When you insert again, it writes another. Parts are immutable: nothing ever modifies a part after it is written. Updates happen by writing a new part that supersedes (via a merge) the old one. Deletes happen by writing a tombstone marker (or, with ALTER TABLE ... DELETE, by rewriting the affected parts).
The part is the unit of everything. Replication moves parts. Backups copy parts. The merge daemon merges pairs of parts into a bigger part. A query is "find the parts that match the WHERE, scan their relevant columns, union the results." If you understand the part, you understand 80% of ClickHouse.
The directory name 202604_18_18_0 decodes as <partition>_<min_block>_<max_block>_<merge_level>: this part belongs to partition 202604 (April 2026, if we partition by month), came from one INSERT block (block 18 to block 18), and has been merged zero times — it is a level-0 part. After two such parts merge, the result is 202604_18_19_1. After eight level-0 parts merge in a chain, you get 202604_18_25_3. Why the level matters operationally: ClickHouse's merge policy uses the level to decide which parts are eligible to merge with which — only nearby levels merge together, which prevents a single 50 GB part from being merged with every 10 MB part that arrives, an O(n²) blowup. The level is essentially the LSM-tree level applied to columnar storage.
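The decoding rule is mechanical enough to write down. A sketch — `parse_part_name` is a hypothetical helper for illustration, not a ClickHouse API:

```python
def parse_part_name(name: str) -> dict:
    """Decode a MergeTree part directory name like '202604_18_19_1'.

    Format: <partition>_<min_block>_<max_block>_<merge_level>.
    Sketch only — real part names can carry extra suffixes (e.g. mutations).
    """
    partition, min_block, max_block, level = name.rsplit("_", 3)
    return {"partition": partition,
            "min_block": int(min_block),
            "max_block": int(max_block),
            "level": int(level)}

print(parse_part_name("202604_18_18_0"))  # a fresh INSERT, never merged
print(parse_part_name("202604_18_25_3"))  # eight level-0 parts merged in a chain
```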
The sparse primary key — why 200 billion rows is fine
The single most counter-intuitive thing about ClickHouse, coming from a Postgres or MySQL background, is that the "primary key" is not unique and not row-addressed. It is a sparse index that holds the sort key's value for every 8192nd row in sort-key order. If your sort key is (event_time, user_id) and your part has 8.2 lakh rows, the primary.idx has roughly 100 entries — one per "granule" of 8192 rows.
To find rows where event_time BETWEEN '2026-04-25 11:00' AND '2026-04-25 11:05', ClickHouse:
1. Loads `primary.idx` from each part (already in RAM).
2. Binary-searches the index to find the granules whose value range overlaps the query range.
3. For each candidate granule, reads the `.mrk2` file to translate granule index → byte offset in the `.bin` file.
4. Reads exactly those byte ranges.
Step 2 is microseconds. Step 3 is one disk read per column per granule. Step 4 is the only step that scales with data size, and even there, only the matching ranges are read. Why "sparse" is the right design here: a dense per-row index for 200 billion rows would be 1.6 TB of pointer overhead, dwarfing the data. A sparse index over 8192-row granules is 0.012% of that — roughly 200 MB for 200 billion rows, which fits in RAM and gives you binary-search-fast pruning. The cost is granule-level granularity: if a granule has any matching row, you read all 8192 rows of it. For analytical queries with thousands of matching rows that is irrelevant; for OLTP-style point lookups it is catastrophic, which is why ClickHouse is not OLTP.
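Step 2 is small enough to model directly. A sketch of the binary search over a sparse index, assuming a single monotonically increasing numeric sort key — illustration only, not ClickHouse internals:

```python
import bisect

INDEX_GRANULARITY = 8192

# primary.idx analogue: the sort-key value at the start of every granule.
# The sort key here is a monotonically increasing timestamp (epoch seconds).
granule_first_keys = [g * 100 for g in range(1000)]  # 1000 granules

def candidate_granules(lo_key: int, hi_key: int) -> range:
    """Binary-search the sparse index for granules that may overlap
    [lo_key, hi_key]. Granule g covers keys from granule_first_keys[g]
    up to (but not including) granule_first_keys[g + 1]."""
    # first granule whose range could contain lo_key
    first = max(bisect.bisect_right(granule_first_keys, lo_key) - 1, 0)
    # last granule that starts at or before hi_key
    last = bisect.bisect_right(granule_first_keys, hi_key) - 1
    return range(first, last + 1)

g = candidate_granules(25_000, 25_500)
print(list(g))  # granules 250..255
print(len(g) * INDEX_GRANULARITY, "rows read, out of", 1000 * INDEX_GRANULARITY)
```

The search itself never touches row data — it only walks the in-RAM list of granule boundaries, which is why step 2 costs microseconds regardless of table size.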
The granule size of 8192 is index_granularity, configurable per table. Smaller means more precise pruning but bigger index. Larger means coarser pruning but a tiny index. The default is 8192 because empirically, for analytical workloads, that is where the curves cross. Razorpay's payments-events table uses 8192. Cred's auto-debit-attempts table uses 4096 because their queries are sharper-grained. PhonePe's UPI-tx table uses 16384 because their queries scan wider.
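The size side of that tradeoff is one line of arithmetic. A sketch, assuming 8 bytes per index entry (the true entry size depends on the sort-key columns):

```python
ENTRY_BYTES = 8  # assumed per-granule index entry size; varies with the sort key

def index_size_bytes(rows: int, granularity: int) -> int:
    # one sparse-index entry per granule; ceil division for the last partial granule
    granules = -(-rows // granularity)
    return granules * ENTRY_BYTES

for g in (4096, 8192, 16384):
    mb = index_size_bytes(200_000_000_000, g) / 1e6
    print(f"granularity={g:>5}: ~{mb:,.0f} MB of index for 200 billion rows")
```

Halving the granularity doubles the index and doubles pruning precision; the 8192 default is simply where that curve flattens for typical analytical scans.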
A working ClickHouse table — DDL, insert, query, plan
Here is a complete loop: create the table, insert, query, look at the plan. The output is from a single-node ClickHouse 24.x running on a t3.large.
```python
# ClickHouse via clickhouse-connect — runs against a real cluster.
import clickhouse_connect
import time, random, datetime as dt

c = clickhouse_connect.get_client(host="localhost", port=8123,
                                  username="default", database="default")
c.command("DROP TABLE IF EXISTS auto_debits")
c.command("""
CREATE TABLE auto_debits (
    event_time   DateTime64(3),
    user_id      UInt64,
    merchant_id  UInt32,
    amount_paise UInt64,
    state        LowCardinality(String),
    status       LowCardinality(String)
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_time)
ORDER BY (event_time, user_id)
SETTINGS index_granularity = 8192
""")

# Insert 50 lakh rows in 5 batches — like 5 micro-batches landing every 6 s.
states = ["MH","KA","TN","DL","UP","WB","GJ","TS","KL","RJ","HR","PB"]
statuses = ["SUCCESS","DECLINED","TIMEOUT","FRAUD_BLOCK"]
for batch in range(5):
    rows = []
    base = dt.datetime(2026, 4, 25, 11, batch * 6 // 60, batch * 6 % 60)
    for i in range(10_00_000):  # 10 lakh per batch
        rows.append([base + dt.timedelta(milliseconds=i),
                     random.randint(10_00_000, 99_99_999),
                     random.randint(100, 999),
                     random.choice([14900, 49900, 99900, 1_99_900]),
                     random.choice(states),
                     random.choices(statuses, weights=[88, 8, 3, 1])[0]])
    c.insert("auto_debits", rows,
             column_names=["event_time", "user_id", "merchant_id",
                           "amount_paise", "state", "status"])
    parts = c.query("SELECT count() FROM system.parts "
                    "WHERE table = 'auto_debits' AND active").result_rows[0][0]
    print(f"batch {batch}: inserted, parts={parts}")

# Query: state-wise counts of auto-debit attempts in the last 60 seconds, fraud only.
t0 = time.time()
res = c.query("""
SELECT state, count() AS attempts, sum(amount_paise)/100 AS rupees
FROM auto_debits
WHERE event_time >= now() - INTERVAL 60 SECOND
  AND status = 'FRAUD_BLOCK'
GROUP BY state
ORDER BY attempts DESC
LIMIT 5
""")
for row in res.result_rows:
    print(row)
print(f"latency = {(time.time()-t0)*1000:.1f} ms")

# Plan and granule statistics.
plan = c.query("EXPLAIN indexes=1 SELECT count() FROM auto_debits "
               "WHERE event_time >= now() - INTERVAL 60 SECOND")
for r in plan.result_rows:
    print(r[0])
```
```
batch 0: inserted, parts=1
batch 1: inserted, parts=2
batch 2: inserted, parts=2   # background merge ran
batch 3: inserted, parts=3
batch 4: inserted, parts=2   # another merge
('MH', 1247, 1485340.5)
('KA', 983, 1170327.0)
('TN', 812, 967220.5)
('DL', 611, 728990.0)
('UP', 584, 697160.0)
latency = 27.4 ms
Expression
  Filter (WHERE)
    ReadFromMergeTree (default.auto_debits)
    Indexes:
      MinMax
        Keys: event_time
        Condition: (event_time in [TS, +inf))
        Parts: 1/2
        Granules: 81/610
      Primary
        Keys: event_time, user_id
        Condition: (event_time in [TS, +inf))
        Parts: 1/1
        Granules: 8/81
```
The plan is the headline. Walk it carefully.

- `Parts: 1/2` at the MinMax stage means ClickHouse read the per-part `minmax_event_time.idx` and rejected one of the two parts entirely — its event_time range is from before "now − 60s". The query never opens that part's `.bin` files.
- `Granules: 81/610` at the MinMax stage means that within the surviving part, the partition-level filter narrowed 610 granules down to 81 — the granules whose min/max overlaps the time window.
- `Granules: 8/81` at the Primary stage means the sparse primary index further narrowed to 8 granules. Eight granules × 8192 rows ≈ 65,000 rows. Out of 50 lakh ingested, the query scans 65 thousand. Why this two-step pruning matters: MinMax is per-part metadata (already in RAM), Primary is the per-part sparse index (also in RAM). Neither step has touched a `.bin` file yet, and together they have rejected 99% of the data. Only at step 4 do we actually read columnar bytes — and we read 8 granules' worth, roughly 200 KB compressed for the columns the query touches.
- `LowCardinality(String)` on `state` and `status` is doing real work. ClickHouse stores these as a per-part dictionary (one entry per distinct string) plus a column of small integer codes. For a column with 12 distinct states across 50 lakh rows, the dictionary is 12 entries and the column data is 50 lakh × 1 byte = 5 MB, instead of 50 lakh × (8-byte pointer + string bytes) = 80 MB+. The compression matters; the GROUP BY also matters (the engine groups by integer code and translates back to strings only at output time).
- `PARTITION BY toYYYYMM(event_time)` is the partition-level pruning axis. A query for "April 2026" only touches the `202604` parts and never opens `202603` or earlier. Use partition keys that match your typical query range: month for monthly dashboards, day for hourly dashboards. Don't partition by `user_id` or any high-cardinality column — you'll get one part per user and the merge daemon will collapse.
- `ORDER BY (event_time, user_id)` is the sort key, which is also (by default) the primary key — the columns the sparse index covers. Pick the columns that filter your hot queries hardest. `event_time` first means range queries on time prune most aggressively; `user_id` second means that within a time window, queries about a specific user can also use the index.
Merges, mark caches, and the production knobs
ClickHouse's merge policy is what holds the operational signature together. The relevant knobs:
| setting | default | what it controls |
|---|---|---|
| `parts_to_throw_insert` | 300 | If active parts per partition exceed this, INSERT is rejected with `Too many parts` |
| `parts_to_delay_insert` | 150 | Above this, INSERTs are throttled (sleep proportional to how far over) |
| `max_bytes_to_merge_at_max_space_in_pool` | 161 GB | Largest single merge — caps the merge work unit |
| `background_pool_size` | 16 | Concurrent merge threads |
| `mark_cache_size` | 5 GB | RAM held for `.mrk2` files across all parts |
| `uncompressed_cache_size` | 8 GB | RAM held for decompressed column blocks |
The mark cache is a quiet hero. Every query that touches a part needs the part's mark files to translate granule index → byte offset; without the cache, every query pays a small disk read per column-part pair. With the cache, the marks for hot parts are always in RAM and the query goes straight from primary index to .bin byte range. PhonePe's 2024 capacity-planning doc states the rule of thumb: keep mark_cache_size at roughly 0.5% of total table data — so a 1 TB table tier wants 5 GB of mark cache.
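The rule of thumb is one multiplication. A sketch, using the 0.5% figure quoted above:

```python
def mark_cache_size_gb(table_data_tb: float, ratio: float = 0.005) -> float:
    """Rule of thumb from the text: mark_cache_size ≈ 0.5% of on-disk table data."""
    return table_data_tb * 1000 * ratio  # TB → GB, then take the ratio

print(mark_cache_size_gb(1.0))  # 1 TB tier → 5.0 GB of mark cache
print(mark_cache_size_gb(4.0))  # 4 TB tier → 20.0 GB
```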
What the active-parts chart tells on-call: every INSERT adds parts, every merge subtracts, so the count over time draws a sawtooth. If the merge daemon falls behind — disk IO saturated, `background_pool_size` too small, sort key too high-cardinality to merge cheaply — the sawtooth's baseline rises. Cred's playbook: page on `system.parts` active count > 200 per partition. Their P1 incident in October 2024 was a misconfigured merge daemon that let the count climb to 287 over 35 minutes; had they not paged at 200, the cluster would have hit 300 and started rejecting writes — and the auto-debit-fraud detector would have lost 4 minutes of events.
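The paging rule is straightforward to express as a monitor check. A sketch: `ACTIVE_PARTS_SQL` is the query such a monitor might run via clickhouse-connect (table name and threshold taken from the text); the alert logic itself is plain Python:

```python
# Query a monitor might run against system.parts (per-partition active counts).
ACTIVE_PARTS_SQL = """
SELECT partition, count() AS active_parts
FROM system.parts
WHERE table = 'auto_debits' AND active
GROUP BY partition
"""

PAGE_THRESHOLD = 200  # Cred's playbook: page well before the 300 hard limit

def partitions_to_page(rows, threshold: int = PAGE_THRESHOLD):
    """rows: (partition, active_parts) tuples, e.g. result_rows of the query."""
    return [(p, n) for p, n in rows if n > threshold]

# Simulated result_rows — the October 2024 shape: one partition climbing.
print(partitions_to_page([("202604", 287), ("202603", 12)]))
# → [('202604', 287)]
```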
Common confusions
- "ClickHouse
ORDER BYis the same as PostgresORDER BY." Not at all. In Postgres,ORDER BYin a query sorts the result. In ClickHouse,ORDER BYin aCREATE TABLEdefines the physical sort order on disk and the primary index columns. Get it wrong and queries that should use the index do full scans instead. - "The primary key is unique, like Postgres." It is not. Two rows can have identical
(event_time, user_id)and both will be stored. If you want dedup, useReplacingMergeTree(eventually deduplicates on merge) or a separateINSERT ... DEDUPLICATEstrategy. Do not assume uniqueness. - "More partitions = more parallelism = faster." Wrong direction. Each partition has its own merge state machine and each query has to touch every partition that overlaps its time range. Hundreds of small partitions (e.g., partition by day for a 10-year table) creates merge-daemon contention. The PhonePe rule of thumb:
<= 1000 partitions per table, target~50–200. - "
LowCardinality(String)is just a compression flag." It is a type. The column is stored as a dictionary plus integer codes; this changes how GROUP BY works (groups on integer, translates at output), how filters work (compares integers), and how memory is consumed (dictionary lives in RAM). It is the right choice when distinct count is < 10,000; for unbounded cardinality columns it makes things worse. - "ClickHouse can't do JOINs." It can — but it materialises the right side into memory before joining, so a JOIN against a 10 GB right-side table needs 10 GB of RAM per query. Use
JOIN-ed lookups for small dimension tables; for large fact-to-fact joins, denormalise at write time or use thejoinGet/dictionary mechanisms. - "ClickHouse replaces Kafka." It does not. ClickHouse's
Kafkaengine consumes from Kafka into a MergeTree table; it does not store the source-of-truth log. Kafka is the durable log; ClickHouse is the queryable index over that log. Lose ClickHouse, replay from Kafka. Lose Kafka, you have lost the source of truth.
Going deeper
Skip indices — when the primary key is not enough
The primary index covers the sort-key columns. Queries that filter on other columns get no help from it. ClickHouse's answer is skip indices (also called secondary indices, though they are not indices in the Postgres sense). A skip index is a small per-granule summary of a non-sort-key column: minmax (per-granule min and max), set (the distinct values in the granule, capped at N), bloom_filter (probabilistic membership), tokenbf_v1 (bloom over tokenised text). When a query filters on the indexed column, the skip index is consulted granule-by-granule and granules whose summary excludes the value are skipped. The trick: skip indices are advisory — false positives mean extra granules are read, never wrong answers. Cred uses bloom_filter on device_hash because their fraud queries filter "transactions from this specific device fingerprint" and the hash is high-cardinality but each query targets one value. Razorpay uses tokenbf_v1 on the merchant note field because their fraud team greps for token strings.
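The advisory, granule-by-granule consultation is easy to model with the simplest skip-index type, `set`. A Python sketch of the semantics — not the on-disk format:

```python
# Simulate a `set`-type skip index: a per-granule summary of the distinct
# values of a non-sort-key column, capped at MAX_SET_SIZE. Beyond the cap
# the summary is dropped and the granule can never be skipped.
MAX_SET_SIZE = 100

def build_set_index(granules):
    """granules: list of lists of column values, one inner list per granule."""
    index = []
    for g in granules:
        distinct = set(g)
        index.append(distinct if len(distinct) <= MAX_SET_SIZE else None)
    return index

def granules_to_read(index, value):
    """Advisory pruning: skip a granule only when its summary excludes value.
    A missing summary (None) means 'might match' — read it. False positives
    cost extra reads; there are never wrong answers."""
    return [i for i, s in enumerate(index) if s is None or value in s]

idx = build_set_index([["MH", "KA"], ["TN"], ["MH", "DL"], ["KA"]])
print(granules_to_read(idx, "MH"))  # → [0, 2]
```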
ReplacingMergeTree and the dedup-on-merge pattern
The vanilla MergeTree lets duplicates exist. ReplacingMergeTree(version_column) deduplicates on merge — when two parts merge and both contain the same primary-key value, only the row with the highest version_column survives. This is eventual dedup: until a merge fires, both rows are visible. Queries that need point-in-time dedup write SELECT ... FINAL which forces dedup at query time (slow). The pattern is right when (a) duplicates are rare, (b) eventual consistency on dedup is acceptable, (c) you have a monotonic version column (event_time, ingest_time, sequence number). PhonePe uses ReplacingMergeTree for their UPI-tx aggregation tier because the upstream pipeline can replay events on retry — duplicates happen, are rare, and dedup-eventually is fine for the dashboard.
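The merge-time rule itself is tiny. A sketch of ReplacingMergeTree's survivorship semantics — the engine's actual merge is columnar and streaming, but the rule it applies is this:

```python
def merge_with_replace(part_a, part_b, key_cols=2):
    """ReplacingMergeTree semantics, sketched: rows are (k1, k2, ..., version).
    When merged rows share the sort key, only the highest-version row survives."""
    survivors = {}
    for row in part_a + part_b:
        key, version = row[:key_cols], row[-1]
        if key not in survivors or version > survivors[key][-1]:
            survivors[key] = row
    return sorted(survivors.values())  # merged parts stay sort-key ordered

# Same (event_time, user_id) lands twice — an upstream replay — versions 1 and 2.
a = [(100, 7, "DECLINED", 1)]
b = [(100, 7, "SUCCESS", 2), (101, 9, "SUCCESS", 1)]
print(merge_with_replace(a, b))
# → [(100, 7, 'SUCCESS', 2), (101, 9, 'SUCCESS', 1)]
```

Until this merge fires, both versions of the `(100, 7)` row are visible — which is exactly why point-in-time dedup needs `SELECT ... FINAL`.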
Distributed tables and the cluster topology
A single-node ClickHouse handles up to a few TB; past that you shard. The pattern is: each node owns a shard of the data (rows hashed on a sharding key), and a Distributed table is a metadata-only layer that fans queries out to the shards and merges the results. Replication is per-shard via ReplicatedMergeTree engine — every part written on the leader is replicated to followers via Zookeeper-coordinated metadata exchange. The interesting failure mode: if your sharding key is poorly chosen (e.g., user_id modulo N when 70% of your traffic is from 5 power users), one shard becomes hot and the cluster's effective throughput collapses to that one node. The fix is cityHash64(user_id) or similar — uniformly distribute the hash. Cred's payments tier shards on cityHash64(user_id) % 8 across 8 nodes; their dashboards on cityHash64(merchant_id) % 4 across 4. Different shard keys per use case, same cluster.
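The hot-shard failure mode shows up in a few lines of simulation — here `hashlib.md5` stands in for `cityHash64` (an assumption; any well-mixed hash shows the effect):

```python
import hashlib
from collections import Counter

def shard_naive(user_id: int, n: int) -> int:
    return user_id % n

def shard_hashed(user_id: int, n: int) -> int:
    # md5 as a stand-in for cityHash64 — mix the key before taking the modulus
    h = int.from_bytes(hashlib.md5(str(user_id).encode()).digest()[:8], "big")
    return h % n

# 70% of traffic from 5 power users, like the text's example — and, in the
# unlucky case sketched here, their raw ids all collide modulo 8.
traffic = [1000, 1008, 1016, 1024, 1032] * 14_000 + list(range(2000, 32_000))
for fn in (shard_naive, shard_hashed):
    load = Counter(fn(u, 8) for u in traffic)
    print(fn.__name__, "max-shard share =",
          round(max(load.values()) / len(traffic), 2))
```

With the naive scheme one shard absorbs roughly three quarters of all traffic; hashing the key spreads the five power users across shards, so no single node caps the cluster.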
When ClickHouse is the wrong tool
ClickHouse loses to other tools when (a) the workload is point-key OLTP (use Postgres or DynamoDB — ClickHouse's granule-level granularity wastes IO on single-row reads), (b) the workload needs heavy fact-to-fact joins (use Snowflake or Trino — ClickHouse's hash-join materialisation is RAM-hungry), (c) the data is small and queries are infrequent (use DuckDB — running a ClickHouse cluster for 50 GB of data is overkill), (d) the workload needs strong consistency on updates (use Postgres — ClickHouse mutations are async, eventually-consistent). Knowing where it loses is part of using it well.
Where this leads next
- /wiki/apache-pinot-and-the-uber-use-case — Pinot's segment model is ClickHouse's MergeTree wearing different vocabulary; the comparison sharpens both.
- /wiki/druid-and-its-segment-model — Druid's segment-handoff design predates ClickHouse and shows where the older bitmap-index approach still wins.
- /wiki/pre-aggregation-materialized-views-and-their-costs — ClickHouse's `MATERIALIZED VIEW` is how you turn a 41 ms query into a 4 ms query, and what you trade.
- /wiki/serving-p99-latency-under-ingest-pressure — when the merge daemon falls behind, p99 explodes; this chapter dissects the mechanism.
ClickHouse is one specific working-out of the four-axis split from chapter 104 — small parts at the ingest axis, sparse-resident primary index at the index axis, MergeTree at the storage axis, granule-level pruning at the query axis. The next two chapters take Pinot and Druid through the same axes and show where they pick differently.
References
- ClickHouse — MergeTree engine reference — DDL options, settings, index types, all in one canonical page.
- ClickHouse — Sparse primary indices — the official explainer for the granule-and-mark mechanism.
- Aleksei Milovidov — ClickHouse architecture (talk) — the original author walking through the design choices.
- Cloudflare — ClickHouse for HTTP analytics — the production case study that put ClickHouse on the map outside Yandex.
- Uber Engineering — ClickHouse on COSMOS — Uber's logging tier, useful for the merge-policy and storage tradeoff numbers.
- PhonePe Engineering — Real-time analytics for UPI — Indian production reference, particularly on cluster sizing and partition design.
- /wiki/olap-on-a-stream-whats-different — the previous chapter, the four-axis framework that this chapter applies to one engine.
- /wiki/apache-pinot-and-the-uber-use-case — the next chapter, same axes against a different engine.