Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

ClickHouse: columnar for real-time

It is 23:14 on a Wednesday. Karan, an SRE at KreditClub, opens the fraud-rules dashboard and the chart for "auto-debit attempts per minute" updates with data that is 4 seconds old. Three months ago that chart ran on Snowflake and updated every 30 minutes — a window during which a coordinated attack could push through 60 lakh fraudulent debit attempts before anyone saw the spike. The query KreditClub runs is a 14-line SELECT with three filters and one GROUP BY. Snowflake answered it in 2.8 seconds. ClickHouse answers it in 41 milliseconds. The SQL is identical. The difference is everything underneath: how the bytes are laid out, where the index lives, and what the query plan does not have to do.

ClickHouse is one process that owns the columnar file, the sparse primary key, and the query engine. The MergeTree layout writes small immutable parts every few seconds, holds a sparse index of the sort key resident in RAM, and merges parts asynchronously into bigger ones — which is why a query against a 200-billion-row table can decide which 8 MB to read in 8 microseconds. The whole engine is a working out of one bet: keep the sort-key index hot, keep the columnar parts immutable, and let a background daemon worry about file size.

MergeTree — the part is the unit of everything

A ClickHouse table is a directory of parts. A part is a complete, self-contained columnar slice — every column you defined gets its own .bin file inside the part, plus a .mrk2 mark file that maps row numbers to byte offsets, plus a primary.idx file holding the sparse index for that part's sort key. When you INSERT 1 lakh rows, ClickHouse writes one part. When you insert again, it writes another. Parts are immutable: nothing ever modifies a part after it is written. Updates happen by writing a new part that supersedes (via a merge) the old one. Deletes happen by writing a tombstone marker (or, with ALTER TABLE ... DELETE, by rewriting the affected parts).

The part is the unit of everything. Replication moves parts. Backups copy parts. The merge daemon merges pairs of parts into a bigger part. A query is "find the parts that match the WHERE, scan their relevant columns, union the results." If you understand the part, you understand 80% of ClickHouse.

Anatomy of a ClickHouse MergeTree part on diskA diagram showing the file layout of a single ClickHouse MergeTree part. Inside a part directory there are three groups of files: per-column data files (event_time.bin, user_id.bin, amount_paise.bin, state.bin) holding compressed columnar bytes; corresponding mark files (event_time.mrk2, user_id.mrk2 etc) mapping row indices to byte offsets in the bin files; and metadata files (primary.idx for the sparse primary key, columns.txt listing column names and types, count.txt with the row count, and minmax_event_time.idx with the per-part min and max of the sort key column). One part directory: 202604_18_18_0/ column data (.bin) compressed columnar bytes event_time.bin (3.1 MB) user_id.bin (5.4 MB) amount_paise.bin (2.7 MB) state.bin (0.9 MB) merchant_id.bin (1.8 MB) device_hash.bin (4.2 MB) one .bin per column you SELECT marks (.mrk2) row index → byte offset event_time.mrk2 (12 KB) user_id.mrk2 (12 KB) amount_paise.mrk2 (12 KB) state.mrk2 (12 KB) merchant_id.mrk2 (12 KB) device_hash.mrk2 (12 KB) one mark every 8192 rows part metadata resident in RAM primary.idx (3 KB) sparse sort-key index minmax_event_time.idx part-level partition prune columns.txt count.txt (1.6M rows) checksums.txt all loaded at attach time
One part on disk. The accent-bordered files on the right (`primary.idx`, `minmax_event_time.idx`) are what live in RAM and what every query consults first. The `.bin` files are read only after pruning has decided which row ranges to touch.

The directory name 202604_18_18_0 decodes as <partition>_<min_block>_<max_block>_<merge_level>: this part belongs to partition 202604 (April 2026, if we partition by month), came from one INSERT block (block 18 to block 18), and has been merged zero times — it is a level-0 part. After two such parts merge, the result is 202604_18_19_1. After eight level-0 parts merge in a chain, you get 202604_18_25_3. Why the level matters operationally: ClickHouse's merge policy uses the level to decide which parts are eligible to merge with which — only nearby levels merge together, which prevents a single 50 GB part from being merged with every 10 MB part that arrives, an O(n²) blowup. The level is essentially the LSM-tree level applied to columnar storage.

The sparse primary key — why 200 billion rows is fine

The single most counter-intuitive thing about ClickHouse, coming from a Postgres or MySQL background, is that the "primary key" is not unique and not row-addressed. It is a sparse index that holds, for every 8192-th row in sort-key order, the sort key's value. If your sort key is (event_time, user_id) and your part has 8.2 lakh rows, the primary.idx has roughly 100 entries — one per "granule" of 8192 rows.

To find rows where event_time BETWEEN '2026-04-25 11:00' AND '2026-04-25 11:05', ClickHouse:

  1. Loads primary.idx from each part (already in RAM).
  2. Binary-searches the index to find the granules whose value range overlaps the query range.
  3. For each candidate granule, reads the .mrk2 file to translate granule index → byte offset in the .bin file.
  4. Reads exactly those byte ranges.

Step 2 is microseconds. Step 3 is one disk read per column per granule. Step 4 is the only step that scales with data size, and even there, only the matching ranges are read. Why "sparse" is the right design here: a dense per-row index for 200 billion rows would be 1.6 TB of pointer overhead, dwarfing the data. A sparse index over 8192-row granules is 0.012% the size — still 25 MB for 200 billion rows, fits in RAM, gives you binary-search-fast pruning. The cost is granule-level granularity: if a granule has any matching row, you read all 8192 rows of it. For analytical queries with thousands of matching rows that's irrelevant; for OLTP-style point lookups that's catastrophic, which is why ClickHouse is not OLTP.

The granule size of 8192 is index_granularity, configurable per table. Smaller means more precise pruning but bigger index. Larger means coarser pruning but a tiny index. The default is 8192 because empirically, for analytical workloads, that is where the curves cross. PaisaBridge's payments-events table uses 8192. KreditClub's auto-debit-attempts table uses 4096 because their queries are sharper-grained. DigiPaisa's UPI-tx table uses 16384 because their queries scan wider.

A working ClickHouse table — DDL, insert, query, plan

Here is a complete loop: create the table, insert, query, look at the plan. The output is from a single-node ClickHouse 24.x running on a t3.large.

# ClickHouse via clickhouse-connect — runs against a real cluster.
import clickhouse_connect
import time, random, datetime as dt

c = clickhouse_connect.get_client(host="localhost", port=8123,
                                  username="default", database="default")

c.command("DROP TABLE IF EXISTS auto_debits")
c.command("""
CREATE TABLE auto_debits (
    event_time   DateTime64(3),
    user_id      UInt64,
    merchant_id  UInt32,
    amount_paise UInt64,
    state        LowCardinality(String),
    status       LowCardinality(String)
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_time)
ORDER BY (event_time, user_id)
SETTINGS index_granularity = 8192
""")

# Insert 50 lakh rows in 5 batches — like 5 micro-batches landing every 6 s.
states = ["MH","KA","TN","DL","UP","WB","GJ","TS","KL","RJ","HR","PB"]
statuses = ["SUCCESS","DECLINED","TIMEOUT","FRAUD_BLOCK"]
for batch in range(5):
    rows = []
    base = dt.datetime(2026, 4, 25, 11, batch * 6 // 60, batch * 6 % 60)
    for i in range(10_00_000):                                # 10 lakh per batch
        rows.append([base + dt.timedelta(milliseconds=i),
                     random.randint(10_00_000, 99_99_999),
                     random.randint(100, 999),
                     random.choice([14900, 49900, 99900, 1_99_900]),
                     random.choice(states),
                     random.choices(statuses, weights=[88, 8, 3, 1])[0]])
    c.insert("auto_debits",
             rows,
             column_names=["event_time","user_id","merchant_id",
                           "amount_paise","state","status"])
    print(f"batch {batch}: inserted, parts={c.query('SELECT count() FROM system.parts WHERE table=%(t)s AND active', parameters={'t':'auto_debits'}).result_rows[0][0]}")

# Query: state-wise counts of auto-debit attempts in the last 60 seconds, fraud only.
t0 = time.time()
res = c.query("""
SELECT state, count() AS attempts, sum(amount_paise)/100 AS rupees
FROM auto_debits
WHERE event_time >= now() - INTERVAL 60 SECOND
  AND status = 'FRAUD_BLOCK'
GROUP BY state
ORDER BY attempts DESC
LIMIT 5
""")
for row in res.result_rows: print(row)
print(f"latency = {(time.time()-t0)*1000:.1f} ms")

# Plan and granule statistics.
plan = c.query("EXPLAIN indexes=1 SELECT count() FROM auto_debits "
               "WHERE event_time >= now() - INTERVAL 60 SECOND")
for r in plan.result_rows: print(r[0])
batch 0: inserted, parts=1
batch 1: inserted, parts=2
batch 2: inserted, parts=2     # background merge ran
batch 3: inserted, parts=3
batch 4: inserted, parts=2     # another merge

('MH', 1247, 1485340.5)
('KA',  983, 1170327.0)
('TN',  812,  967220.5)
('DL',  611,  728990.0)
('UP',  584,  697160.0)
latency = 27.4 ms

Expression
  Filter (WHERE)
    ReadFromMergeTree (default.auto_debits)
      Indexes:
        MinMax
          Keys: event_time
          Condition: (event_time in [TS, +inf))
          Parts: 1/2
          Granules: 81/610
        Primary
          Keys: event_time, user_id
          Condition: (event_time in [TS, +inf))
          Parts: 1/1
          Granules: 8/81

The plan is the headline. Walk it carefully.

  • Parts: 1/2 at the MinMax stage means ClickHouse read the per-part minmax_event_time.idx and rejected one of the two surviving parts entirely — its event_time range is from before "now − 60s". The query never opens that part's .bin files.
  • Granules: 81/610 at the MinMax stage means within the surviving part, the partition-level filter narrowed 610 granules down to 81 — the granules whose min/max overlaps the time window.
  • Granules: 8/81 at the Primary stage means the sparse primary index further narrowed to 8 granules. Eight granules × 8192 rows = roughly 65,000 rows. Out of 50 lakh ingested, the query scans 65 thousand. Why this two-step pruning matters: MinMax is per-part metadata (already in RAM), Primary is per-part sparse index (already in RAM). Neither step has touched a .bin file yet, and they have together rejected 99% of the data. Only at step 4 do we actually read columnar bytes — and we read 8 granules' worth, which is roughly 200 KB compressed for the columns the query touches.
  • LowCardinality(String) on state and status is doing real work. ClickHouse stores these as a per-part dictionary (one entry per distinct string) plus a column of small integer codes. For a column with 12 distinct states across 50 lakh rows, the dictionary is 12 entries and the column data is 50 lakh × 1 byte = 5 MB instead of 50 lakh × 8 bytes pointer + string bytes = 80 MB+. The compression matters; the GROUP BY also matters (the engine groups by integer code, then translates back to string at output time only).
  • PARTITION BY toYYYYMM(event_time) is the partition-level pruning axis. A query for "April 2026" only touches the 202604 parts and never opens 202603 or earlier. Use partition keys that match your typical query range: month for monthly dashboards, day for hourly dashboards. Don't partition by user_id or any high-cardinality column — you'll get one part per user and the merge daemon will collapse.
  • ORDER BY (event_time, user_id) is the sort key, which is also (by default) the primary key — the columns the sparse index covers. Pick the columns that filter your hot queries hardest. event_time first means range queries on time prune most aggressively; user_id second means within a time window, queries about a specific user can also use the index.

Merges, mark caches, and the production knobs

ClickHouse's merge policy is what holds the operational signature together. The relevant knobs:

setting default what it controls
parts_to_throw_insert 300 If active parts per partition exceeds this, INSERT is rejected with Too many parts
parts_to_delay_insert 150 Above this, INSERTs are throttled (sleep proportional to how far over)
max_bytes_to_merge_at_max_space_in_pool 161 GB Largest single merge — caps the merge work unit
background_pool_size 16 Concurrent merge threads
mark_cache_size 5 GB RAM held for .mrk2 files across all parts
uncompressed_cache_size 8 GB RAM held for decompressed column blocks

The mark cache is a quiet hero. Every query that touches a part needs the part's mark files to translate granule index → byte offset; without the cache, every query pays a small disk read per column-part pair. With the cache, the marks for hot parts are always in RAM and the query goes straight from primary index to .bin byte range. DigiPaisa's 2024 capacity-planning doc states the rule of thumb: keep mark_cache_size at roughly 0.5% of total table data — so a 1 TB table tier wants 5 GB of mark cache.

Active part count over a 6-minute window with a healthy merge daemonA timeline chart showing active part count fluctuating between roughly 8 and 24 over a 6-minute window. INSERTs land every 5 seconds (small upticks in the count) and merges fire approximately every 30 seconds (larger drops in the count). A red dashed line at 150 marks the parts_to_delay_insert threshold, well above the working range. A second red dashed line at 300 marks the parts_to_throw_insert threshold, even further above. The line never crosses either threshold, indicating a healthy steady state. Active parts over 6 minutes — INSERTs add, merges drop 0 10 20 30 40 0s 60s 120s 180s 240s 300s delay (150) healthy band: 8–24 active parts throw threshold = 300, off-chart above
The shape you want to see in `system.parts` — a sawtooth between INSERTs (small upticks) and merges (downticks), staying well below the throttle line. When the upticks outpace the downticks, the line drifts up and you have minutes to act before the throw threshold rejects writes.

What the chart is telling on-call: every INSERT adds parts, every merge subtracts. If the merge daemon falls behind — disk IO saturated, background_pool_size too small, sort key too high-cardinality — the sawtooth's baseline rises. KreditClub's playbook: page on system.parts.active > 200 per partition. Their P1 incident in October 2024 was a misconfigured merge daemon that let the count climb to 287 over 35 minutes; if they had not paged at 200, the cluster would have hit 300 and started rejecting writes — and the auto-debit-fraud detector would have lost 4 minutes of events.

Common confusions

  • "ClickHouse ORDER BY is the same as Postgres ORDER BY." Not at all. In Postgres, ORDER BY in a query sorts the result. In ClickHouse, ORDER BY in a CREATE TABLE defines the physical sort order on disk and the primary index columns. Get it wrong and queries that should use the index do full scans instead.
  • "The primary key is unique, like Postgres." It is not. Two rows can have identical (event_time, user_id) and both will be stored. If you want dedup, use ReplacingMergeTree (eventually deduplicates on merge) or a separate INSERT ... DEDUPLICATE strategy. Do not assume uniqueness.
  • "More partitions = more parallelism = faster." Wrong direction. Each partition has its own merge state machine and each query has to touch every partition that overlaps its time range. Hundreds of small partitions (e.g., partition by day for a 10-year table) creates merge-daemon contention. The DigiPaisa rule of thumb: <= 1000 partitions per table, target ~50–200.
  • "LowCardinality(String) is just a compression flag." It is a type. The column is stored as a dictionary plus integer codes; this changes how GROUP BY works (groups on integer, translates at output), how filters work (compares integers), and how memory is consumed (dictionary lives in RAM). It is the right choice when distinct count is < 10,000; for unbounded cardinality columns it makes things worse.
  • "ClickHouse can't do JOINs." It can — but it materialises the right side into memory before joining, so a JOIN against a 10 GB right-side table needs 10 GB of RAM per query. Use JOIN-ed lookups for small dimension tables; for large fact-to-fact joins, denormalise at write time or use the joinGet/dictionary mechanisms.
  • "ClickHouse replaces Kafka." It does not. ClickHouse's Kafka engine consumes from Kafka into a MergeTree table; it does not store the source-of-truth log. Kafka is the durable log; ClickHouse is the queryable index over that log. Lose ClickHouse, replay from Kafka. Lose Kafka, you have lost the source of truth.

Going deeper

Skip indices — when the primary key is not enough

The primary index covers the sort-key columns. Queries that filter on other columns get no help from it. ClickHouse's answer is skip indices (also called secondary indices, though they are not indices in the Postgres sense). A skip index is a small per-granule summary of a non-sort-key column: minmax (per-granule min and max), set (the distinct values in the granule, capped at N), bloom_filter (probabilistic membership), tokenbf_v1 (bloom over tokenised text). When a query filters on the indexed column, the skip index is consulted granule-by-granule and granules whose summary excludes the value are skipped. The trick: skip indices are advisory — false positives mean extra granules are read, never wrong answers. KreditClub uses bloom_filter on device_hash because their fraud queries filter "transactions from this specific device fingerprint" and the hash is high-cardinality but each query targets one value. PaisaBridge uses tokenbf_v1 on the merchant note field because their fraud team greps for token strings.

ReplacingMergeTree and the dedup-on-merge pattern

The vanilla MergeTree lets duplicates exist. ReplacingMergeTree(version_column) deduplicates on merge — when two parts merge and both contain the same primary-key value, only the row with the highest version_column survives. This is eventual dedup: until a merge fires, both rows are visible. Queries that need point-in-time dedup write SELECT ... FINAL which forces dedup at query time (slow). The pattern is right when (a) duplicates are rare, (b) eventual consistency on dedup is acceptable, (c) you have a monotonic version column (event_time, ingest_time, sequence number). DigiPaisa uses ReplacingMergeTree for their UPI-tx aggregation tier because the upstream pipeline can replay events on retry — duplicates happen, are rare, and dedup-eventually is fine for the dashboard.

Distributed tables and the cluster topology

A single-node ClickHouse handles up to a few TB; past that you shard. The pattern is: each node owns a shard of the data (rows hashed on a sharding key), and a Distributed table is a metadata-only layer that fans queries out to the shards and merges the results. Replication is per-shard via ReplicatedMergeTree engine — every part written on the leader is replicated to followers via Zookeeper-coordinated metadata exchange. The interesting failure mode: if your sharding key is poorly chosen (e.g., user_id modulo N when 70% of your traffic is from 5 power users), one shard becomes hot and the cluster's effective throughput collapses to that one node. The fix is cityHash64(user_id) or similar — uniformly distribute the hash. KreditClub's payments tier shards on cityHash64(user_id) % 8 across 8 nodes; their dashboards on cityHash64(merchant_id) % 4 across 4. Different shard keys per use case, same cluster.

When ClickHouse is the wrong tool

ClickHouse loses to other tools when (a) the workload is point-key OLTP (use Postgres or DynamoDB — ClickHouse's granule-level granularity wastes IO on single-row reads), (b) the workload needs heavy fact-to-fact joins (use Snowflake or Trino — ClickHouse's hash-join materialisation is RAM-hungry), (c) the data is small and queries are infrequent (use DuckDB — running a ClickHouse cluster for 50 GB of data is overkill), (d) the workload needs strong consistency on updates (use Postgres — ClickHouse mutations are async, eventually-consistent). Knowing where it loses is part of using it well.

Where this leads next

ClickHouse is one specific working-out of the four-axis split from chapter 104 — small parts at the ingest axis, sparse-resident primary index at the index axis, MergeTree at the storage axis, granule-level pruning at the query axis. The next two chapters take Pinot and Druid through the same axes and show where they pick differently.

References