ClickHouse: columnar for real-time

It is 23:14 on a Wednesday. Karan, an SRE at Cred, opens the fraud-rules dashboard and the chart for "auto-debit attempts per minute" updates with data that is 4 seconds old. Three months ago that chart ran on Snowflake and updated every 30 minutes — a window during which a coordinated attack could push through 60 lakh fraudulent debit attempts before anyone saw the spike. The query Cred runs is a 14-line SELECT with three filters and one GROUP BY. Snowflake answered it in 2.8 seconds. ClickHouse answers it in 41 milliseconds. The SQL is identical. The difference is everything underneath: how the bytes are laid out, where the index lives, and what the query plan does not have to do.

ClickHouse is one process that owns the columnar file, the sparse primary key, and the query engine. The MergeTree layout writes small immutable parts every few seconds, holds a sparse index of the sort key resident in RAM, and merges parts asynchronously into bigger ones — which is why a query against a 200-billion-row table can decide which 8 MB to read in 8 microseconds. The whole engine is a working-out of one bet: keep the sort-key index hot, keep the columnar parts immutable, and let a background daemon worry about file size.

MergeTree — the part is the unit of everything

A ClickHouse table is a directory of parts. A part is a complete, self-contained columnar slice — every column you defined gets its own .bin file inside the part, plus a .mrk2 mark file that maps row numbers to byte offsets, plus a primary.idx file holding the sparse index for that part's sort key. When you INSERT 1 lakh rows, ClickHouse writes one part. When you insert again, it writes another. Parts are immutable: nothing ever modifies a part after it is written. Updates happen by writing a new part that supersedes (via a merge) the old one. Deletes happen by writing a tombstone marker (or, with ALTER TABLE ... DELETE, by rewriting the affected parts).

The part is the unit of everything. Replication moves parts. Backups copy parts. The merge daemon merges pairs of parts into a bigger part. A query is "find the parts that match the WHERE, scan their relevant columns, union the results." If you understand the part, you understand 80% of ClickHouse.

Anatomy of a ClickHouse MergeTree part on disk — a diagram of one part directory, 202604_18_18_0/, holding three groups of files: per-column data files (event_time.bin, user_id.bin, amount_paise.bin, state.bin, merchant_id.bin, device_hash.bin) with compressed columnar bytes, one .bin per column you SELECT; mark files (event_time.mrk2 and so on, one mark every 8192 rows) mapping row indices to byte offsets in the .bin files; and part metadata (primary.idx holding the sparse sort-key index, minmax_event_time.idx for part-level partition pruning, columns.txt, count.txt, checksums.txt), all loaded at attach time and resident in RAM.
One part on disk. The accent-bordered files on the right (`primary.idx`, `minmax_event_time.idx`) are what live in RAM and what every query consults first. The `.bin` files are read only after pruning has decided which row ranges to touch.

The directory name 202604_18_18_0 decodes as <partition>_<min_block>_<max_block>_<merge_level>: this part belongs to partition 202604 (April 2026, if we partition by month), came from one INSERT block (block 18 to block 18), and has been merged zero times — it is a level-0 part. After two such parts merge, the result is 202604_18_19_1. After eight level-0 parts merge in a chain, you get 202604_18_25_3. Why the level matters operationally: ClickHouse's merge policy uses the level to decide which parts are eligible to merge with which — only nearby levels merge together, which prevents a single 50 GB part from being merged with every 10 MB part that arrives, an O(n²) blowup. The level is essentially the LSM-tree level applied to columnar storage.
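
The naming scheme decodes mechanically — a small sketch (the helper is mine, not part of ClickHouse):

```python
def decode_part_name(name: str) -> dict:
    """Decode a MergeTree part directory name:
    <partition>_<min_block>_<max_block>_<merge_level>."""
    partition, min_block, max_block, level = name.rsplit("_", 3)
    return {
        "partition": partition,       # e.g. 202604 = April 2026 (month partition)
        "min_block": int(min_block),  # first insert block covered by this part
        "max_block": int(max_block),  # last insert block covered by this part
        "level": int(level),          # 0 = fresh insert, rises with each merge
    }

# A fresh insert, then merges of blocks 18..19 and 18..25.
for n in ["202604_18_18_0", "202604_18_19_1", "202604_18_25_3"]:
    print(decode_part_name(n))
```

The same decode is what you mentally apply when scanning `system.parts.name` during an incident: a wide min-to-max block range at a high level means the merge chain is keeping up.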

The sparse primary key — why 200 billion rows is fine

The single most counter-intuitive thing about ClickHouse, coming from a Postgres or MySQL background, is that the "primary key" is not unique and not row-addressed. It is a sparse index that holds, for every 8192-th row in sort-key order, the sort key's value. If your sort key is (event_time, user_id) and your part has 8.2 lakh rows, the primary.idx has roughly 100 entries — one per "granule" of 8192 rows.

To find rows where event_time BETWEEN '2026-04-25 11:00' AND '2026-04-25 11:05', ClickHouse:

  1. Loads primary.idx from each part (already in RAM).
  2. Binary-searches the index to find the granules whose value range overlaps the query range.
  3. For each candidate granule, reads the .mrk2 file to translate granule index → byte offset in the .bin file.
  4. Reads exactly those byte ranges.

Step 2 is microseconds. Step 3 is one disk read per column per granule. Step 4 is the only step that scales with data size, and even there, only the matching ranges are read. Why "sparse" is the right design here: a dense per-row index for 200 billion rows would be 1.6 TB of pointer overhead, dwarfing the data. A sparse index over 8192-row granules is 0.012% the size — roughly 200 MB for 200 billion rows, which fits in RAM and gives you binary-search-fast pruning. The cost is granule-level granularity: if a granule has any matching row, you read all 8192 rows of it. For analytical queries with thousands of matching rows that's irrelevant; for OLTP-style point lookups that's catastrophic, which is why ClickHouse is not OLTP.
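
The four steps can be sketched as a toy model — an illustration of the pruning logic, not ClickHouse's code, with one sort-key value per granule boundary (the shape of primary.idx):

```python
import bisect

GRANULE = 8192

def prune_granules(index_keys, lo, hi):
    """index_keys[g] holds the sort-key value at the start of granule g,
    which is what primary.idx stores. Return the half-open granule range
    that may contain rows in [lo, hi] — the candidates that step 3
    translates into byte offsets via the .mrk2 file."""
    # The last granule starting at or before lo may still contain lo.
    first = max(bisect.bisect_right(index_keys, lo) - 1, 0)
    # Granules starting strictly after hi cannot contain anything <= hi.
    last = bisect.bisect_right(index_keys, hi)
    return first, last

# 100 granules, sort-key values 0, 8192, 16384, ... (one part's index).
index_keys = [g * GRANULE for g in range(100)]
print(prune_granules(index_keys, lo=70_000, hi=90_000))  # (8, 11): 3 of 100
```

Granules 8 through 10 are then read in full, matching rows or not — the granule-level cost the paragraph describes.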

The granule size of 8192 is index_granularity, configurable per table. Smaller means more precise pruning but bigger index. Larger means coarser pruning but a tiny index. The default is 8192 because empirically, for analytical workloads, that is where the curves cross. Razorpay's payments-events table uses 8192. Cred's auto-debit-attempts table uses 4096 because their queries are sharper-grained. PhonePe's UPI-tx table uses 16384 because their queries scan wider.

A working ClickHouse table — DDL, insert, query, plan

Here is a complete loop: create the table, insert, query, look at the plan. The output is from a single-node ClickHouse 24.x running on a t3.large.

# ClickHouse via clickhouse-connect — runs against a real cluster.
import clickhouse_connect
import time, random, datetime as dt

c = clickhouse_connect.get_client(host="localhost", port=8123,
                                  username="default", database="default")

c.command("DROP TABLE IF EXISTS auto_debits")
c.command("""
CREATE TABLE auto_debits (
    event_time   DateTime64(3),
    user_id      UInt64,
    merchant_id  UInt32,
    amount_paise UInt64,
    state        LowCardinality(String),
    status       LowCardinality(String)
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_time)
ORDER BY (event_time, user_id)
SETTINGS index_granularity = 8192
""")

# Insert 50 lakh rows in 5 batches — like 5 micro-batches landing every 6 s.
states = ["MH","KA","TN","DL","UP","WB","GJ","TS","KL","RJ","HR","PB"]
statuses = ["SUCCESS","DECLINED","TIMEOUT","FRAUD_BLOCK"]
for batch in range(5):
    rows = []
    base = dt.datetime(2026, 4, 25, 11, batch * 6 // 60, batch * 6 % 60)
    for i in range(10_00_000):                                # 10 lakh per batch
        rows.append([base + dt.timedelta(milliseconds=i),
                     random.randint(10_00_000, 99_99_999),
                     random.randint(100, 999),
                     random.choice([14900, 49900, 99900, 1_99_900]),
                     random.choice(states),
                     random.choices(statuses, weights=[88, 8, 3, 1])[0]])
    c.insert("auto_debits",
             rows,
             column_names=["event_time","user_id","merchant_id",
                           "amount_paise","state","status"])
    part_count = c.query(
        "SELECT count() FROM system.parts WHERE table = %(t)s AND active",
        parameters={"t": "auto_debits"},
    ).result_rows[0][0]
    print(f"batch {batch}: inserted, parts={part_count}")

# Query: state-wise counts of auto-debit attempts in the last 60 seconds, fraud only.
t0 = time.time()
res = c.query("""
SELECT state, count() AS attempts, sum(amount_paise)/100 AS rupees
FROM auto_debits
WHERE event_time >= now() - INTERVAL 60 SECOND
  AND status = 'FRAUD_BLOCK'
GROUP BY state
ORDER BY attempts DESC
LIMIT 5
""")
for row in res.result_rows: print(row)
print(f"latency = {(time.time()-t0)*1000:.1f} ms")

# Plan and granule statistics.
plan = c.query("EXPLAIN indexes=1 SELECT count() FROM auto_debits "
               "WHERE event_time >= now() - INTERVAL 60 SECOND")
for r in plan.result_rows: print(r[0])
batch 0: inserted, parts=1
batch 1: inserted, parts=2
batch 2: inserted, parts=2     # background merge ran
batch 3: inserted, parts=3
batch 4: inserted, parts=2     # another merge

('MH', 1247, 1485340.5)
('KA',  983, 1170327.0)
('TN',  812,  967220.5)
('DL',  611,  728990.0)
('UP',  584,  697160.0)
latency = 27.4 ms

Expression
  Filter (WHERE)
    ReadFromMergeTree (default.auto_debits)
      Indexes:
        MinMax
          Keys: event_time
          Condition: (event_time in [TS, +inf))
          Parts: 1/2
          Granules: 81/610
        Primary
          Keys: event_time, user_id
          Condition: (event_time in [TS, +inf))
          Parts: 1/1
          Granules: 8/81

The plan is the headline. Walk the pruning chain: the MinMax index checked both active parts and discarded one outright (Parts: 1/2) because its per-part min/max of event_time cannot overlap the last 60 seconds, leaving 81 of the table's 610 granules. The primary index then narrowed those 81 granules to 8. Eight granules is 8 × 8192 ≈ 65,000 rows actually read, out of the 50 lakh inserted — everything else was excluded before a single .bin byte was touched.

Merges, mark caches, and the production knobs

ClickHouse's merge policy is what holds the operational signature together. The relevant knobs:

| setting | default | what it controls |
| --- | --- | --- |
| `parts_to_throw_insert` | 300 | If the active part count per partition exceeds this, INSERT is rejected with `Too many parts` |
| `parts_to_delay_insert` | 150 | Above this, INSERTs are throttled (sleep proportional to how far over) |
| `max_bytes_to_merge_at_max_space_in_pool` | 161 GB | Largest single merge — caps the merge work unit |
| `background_pool_size` | 16 | Concurrent merge threads |
| `mark_cache_size` | 5 GB | RAM held for `.mrk2` files across all parts |
| `uncompressed_cache_size` | 8 GB | RAM held for decompressed column blocks |

The mark cache is a quiet hero. Every query that touches a part needs the part's mark files to translate granule index → byte offset; without the cache, every query pays a small disk read per column-part pair. With the cache, the marks for hot parts are always in RAM and the query goes straight from primary index to .bin byte range. PhonePe's 2024 capacity-planning doc states the rule of thumb: keep mark_cache_size at roughly 0.5% of total table data — so a 1 TB table tier wants 5 GB of mark cache.
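
A toy model makes the cache's effect concrete — the function and cache here are illustrative, not the clickhouse_connect or server API. The first query on a part-column pair pays the disk read for its marks; every later query finds them in RAM:

```python
# Toy mark cache: (part, column) -> per-granule byte offsets.
mark_cache = {}

def marks_for(part, column, load_from_disk):
    """Return the granule -> offset table, touching disk only on a miss."""
    key = (part, column)
    if key not in mark_cache:
        mark_cache[key] = load_from_disk(part, column)  # the one disk read
    return mark_cache[key]

reads = []
def fake_disk_read(part, column):
    reads.append((part, column))              # count simulated disk reads
    return [g * 65536 for g in range(81)]     # pretend offsets, 81 granules

marks_for("202604_18_18_0", "event_time.mrk2", fake_disk_read)
marks_for("202604_18_18_0", "event_time.mrk2", fake_disk_read)  # cache hit
print(f"disk reads for two lookups: {len(reads)}")  # 1
```

Multiply the saved read by (columns touched × parts touched × queries per second) and the 0.5% of RAM buys a lot of latency.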

Active part count over a 6-minute window with a healthy merge daemon — a timeline of active parts fluctuating between roughly 8 and 24: INSERTs land every 5 seconds (small upticks) and merges fire roughly every 30 seconds (larger drops). Dashed lines mark parts_to_delay_insert at 150 and parts_to_throw_insert at 300, both far above the working band; the line never approaches either threshold.
The shape you want to see in `system.parts` — a sawtooth between INSERTs (small upticks) and merges (downticks), staying well below the throttle line. When the upticks outpace the downticks, the line drifts up and you have minutes to act before the throw threshold rejects writes.

What the chart tells the on-call engineer: every INSERT adds parts, every merge subtracts. If the merge daemon falls behind — disk IO saturated, background_pool_size too small, sort key too high-cardinality — the sawtooth's baseline rises. Cred's playbook: page on system.parts.active > 200 per partition. Their P1 incident in October 2024 was a misconfigured merge daemon that let the count climb to 287 over 35 minutes; if they had not paged at 200, the cluster would have hit 300 and started rejecting writes — and the auto-debit-fraud detector would have lost 4 minutes of events.
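
The add/subtract dynamics are easy to simulate — a sketch with illustrative numbers (insert every 5 s, up to two 8-way merges every 30 s), not Cred's actual policy:

```python
def simulate_parts(seconds, insert_every=5, merge_every=30,
                   merge_width=8, merges_per_cycle=2):
    """Active part count over time: inserts add one part per tick,
    each merge collapses merge_width parts into one."""
    parts, peak = 1, 1
    for t in range(1, seconds + 1):
        if t % insert_every == 0:
            parts += 1                          # every INSERT writes a part
        if t % merge_every == 0:
            for _ in range(merges_per_cycle):
                if parts >= merge_width:        # 8 parts in, 1 part out
                    parts -= merge_width - 1
        peak = max(peak, parts)
    return parts, peak

print("healthy (final, peak):", simulate_parts(360))
print("stalled (final, peak):", simulate_parts(360, merges_per_cycle=0))
```

With merges disabled the count grows linearly with insert rate — 73 parts after 6 minutes here, and on a real cluster the same slope carries you from 200 to the throw threshold in minutes, which is exactly why the page fires early.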

Common confusions

Going deeper

Skip indices — when the primary key is not enough

The primary index covers the sort-key columns. Queries that filter on other columns get no help from it. ClickHouse's answer is skip indices (also called secondary indices, though they are not indices in the Postgres sense). A skip index is a small per-granule summary of a non-sort-key column: minmax (per-granule min and max), set (the distinct values in the granule, capped at N), bloom_filter (probabilistic membership), tokenbf_v1 (bloom over tokenised text). When a query filters on the indexed column, the skip index is consulted granule-by-granule and granules whose summary excludes the value are skipped. The trick: skip indices are advisory — false positives mean extra granules are read, never wrong answers. Cred uses bloom_filter on device_hash because their fraud queries filter "transactions from this specific device fingerprint" and the hash is high-cardinality but each query targets one value. Razorpay uses tokenbf_v1 on the merchant note field because their fraud team greps for token strings.
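
A toy minmax skip index shows the advisory property — wrong answers are impossible, wasted reads are not (an illustration of the semantics, not ClickHouse internals):

```python
def build_minmax_index(column, granule=4):
    """Per-granule (min, max) summary of a non-sort-key column."""
    return [(min(column[i:i + granule]), max(column[i:i + granule]))
            for i in range(0, len(column), granule)]

def granules_to_read(idx, value):
    """Granules whose summary cannot exclude `value`.
    False positives possible; false negatives never."""
    return [g for g, (lo, hi) in enumerate(idx) if lo <= value <= hi]

amounts = [149, 1999, 499, 149,    # granule 0: range covers 999, no 999 inside
           149, 149, 499, 149,     # granule 1: range excludes 999
           999, 999, 1999, 999]    # granule 2: actually contains 999
idx = build_minmax_index(amounts)
print(granules_to_read(idx, 999))  # [0, 2] — granule 0 is a false positive
```

Granule 1 is safely skipped; granule 0 is read for nothing. That asymmetry is the whole contract: a skip index can only cost you IO, never correctness.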

ReplacingMergeTree and the dedup-on-merge pattern

The vanilla MergeTree lets duplicates exist. ReplacingMergeTree(version_column) deduplicates on merge — when two parts merge and both contain the same primary-key value, only the row with the highest version_column survives. This is eventual dedup: until a merge fires, both rows are visible. Queries that need point-in-time dedup write SELECT ... FINAL which forces dedup at query time (slow). The pattern is right when (a) duplicates are rare, (b) eventual consistency on dedup is acceptable, (c) you have a monotonic version column (event_time, ingest_time, sequence number). PhonePe uses ReplacingMergeTree for their UPI-tx aggregation tier because the upstream pipeline can replay events on retry — duplicates happen, are rare, and dedup-eventually is fine for the dashboard.
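
What the merge does can be sketched as a fold that keeps the highest version per key — a model of the semantics, not the actual merge code:

```python
def merge_with_replace(*parts):
    """Merge parts of (key, version, payload) rows; for duplicate keys keep
    the row with the highest version, as ReplacingMergeTree(version) does."""
    best = {}
    for part in parts:
        for key, version, payload in part:
            if key not in best or version > best[key][0]:
                best[key] = (version, payload)
    return sorted((k, v, p) for k, (v, p) in best.items())

part_a = [("tx-1", 1, "PENDING"), ("tx-2", 1, "SUCCESS")]
part_b = [("tx-1", 2, "SUCCESS")]   # replayed event with a newer version
print(merge_with_replace(part_a, part_b))
```

Until the merge fires, a plain SELECT sees both tx-1 rows; SELECT ... FINAL applies this fold at query time, which is why FINAL is slow.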

Distributed tables and the cluster topology

A single-node ClickHouse handles up to a few TB; past that you shard. The pattern is: each node owns a shard of the data (rows hashed on a sharding key), and a Distributed table is a metadata-only layer that fans queries out to the shards and merges the results. Replication is per-shard via the ReplicatedMergeTree engine — every part written on the leader is replicated to followers via ZooKeeper-coordinated metadata exchange. The interesting failure mode: if your sharding key is poorly chosen (e.g., user_id modulo N when 70% of your traffic is from 5 power users), one shard becomes hot and the cluster's effective throughput collapses to that one node. The fix is cityHash64(user_id) or similar — uniformly distribute the hash. Cred's payments tier shards on cityHash64(user_id) % 8 across 8 nodes; their dashboards on cityHash64(merchant_id) % 4 across 4. Different shard keys per use case, same cluster.
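
A quick simulation shows the collapse — here one tenant's IDs are allocated in strides of 8, so plain modulo pins them all to one shard, while hashing first restores uniformity (blake2b stands in for cityHash64, which is not in the Python stdlib):

```python
import hashlib
from collections import Counter

def shard_modulo(user_id, n=8):
    return user_id % n

def shard_hashed(user_id, n=8):
    # blake2b as a stand-in for cityHash64: any uniform 64-bit hash works.
    h = hashlib.blake2b(str(user_id).encode(), digest_size=8).digest()
    return int.from_bytes(h, "big") % n

# One tenant's IDs, allocated every 8th value: all congruent mod 8.
user_ids = range(10_00_000, 10_80_000, 8)
print("modulo:", sorted(Counter(map(shard_modulo, user_ids)).items()))
print("hashed:", sorted(Counter(map(shard_hashed, user_ids)).items()))
```

Under plain modulo every one of the 10,000 IDs lands on shard 0 and seven nodes sit idle; under the hash the same IDs spread to roughly 1,250 per shard.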

When ClickHouse is the wrong tool

ClickHouse loses to other tools when (a) the workload is point-key OLTP (use Postgres or DynamoDB — ClickHouse's granule-level granularity wastes IO on single-row reads), (b) the workload needs heavy fact-to-fact joins (use Snowflake or Trino — ClickHouse's hash-join materialisation is RAM-hungry), (c) the data is small and queries are infrequent (use DuckDB — running a ClickHouse cluster for 50 GB of data is overkill), (d) the workload needs strong consistency on updates (use Postgres — ClickHouse mutations are async, eventually-consistent). Knowing where it loses is part of using it well.

Where this leads next

ClickHouse is one specific working-out of the four-axis split from chapter 104 — small parts at the ingest axis, sparse-resident primary index at the index axis, MergeTree at the storage axis, granule-level pruning at the query axis. The next two chapters take Pinot and Druid through the same axes and show where they pick differently.

References