Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Index-free log storage (ClickHouse, Parquet)

At 02:41 IST on 18 December, Riya from the Yatrika payments-platform team gets a Slack ping from compliance: a regulator wants the full log trail of every UPI dispute that referenced merchant MID-88419 between 14 October and 28 October. The ask is forty-five days outside the Loki hot tier, the per-stream label index for that period was dropped at the 30-day cold-tier boundary, and the query she would have written — {merchant_id="MID-88419"} |= "dispute" — does not run anymore because the index it depends on is gone. What still works is SELECT timestamp, line FROM logs WHERE merchant_id = 'MID-88419' AND ts BETWEEN '2026-10-14' AND '2026-10-28' AND line LIKE '%dispute%' against a ClickHouse table with 4.2 billion rows, sorted by (merchant_id, ts). It returns 318 lines in 4.7 seconds. The same query against the same data laid out as Parquet on S3 with no sort key returns the same 318 lines in 71 seconds, scanning 480 MB of compressed columns. Index-free does not mean scan-everything; it means the sort order is the index, and getting that wrong turns a 5-second query into a 5-minute one.

Index-free log storage replaces per-stream label indexes with column-oriented layouts that win on compression and bulk-scan throughput. ClickHouse's MergeTree and Parquet on S3 are the two production answers — ClickHouse trades always-on cluster cost for sub-10-second cold-tier queries via primary-key sort, Parquet trades query latency for near-zero storage cost via columnar block compression. The sort key (or partition key for Parquet) is the index — pick it wrong and a 5-second query becomes a 5-minute one. Pick it right and you get sub-10-second cold-tier queries at ₹0.4/GB-month, which is what makes 540-day log retention financially possible.

Why drop the index at all — the cold-tier economics

Loki's hot-tier design uses a per-stream label index: every distinct combination of label values (e.g. {service="payments", env="prod", region="ap-south-1"}) gets a BoltDB entry pointing to the chunks that contain its lines. The index is small relative to the data (typically 1–3% of chunk size) but expensive in two ways: it grows linearly with stream churn (every Kubernetes pod restart creates a new stream), and it must be kept hot — accessed within milliseconds — to make {service="payments"} queries fast. At Yatrika the hot index for 30 days of logs was 142 GB across 14 million streams, costing ₹38 lakh/month on NVMe-backed instances after replication. The data those indexes pointed to was 18 TB compressed, costing ₹6.4 lakh/month on S3 Express One Zone. The index cost more per gigabyte than the data, and 95% of cold-tier queries did not need the per-stream view — they needed time-bounded, attribute-filtered, content-grep queries.

The economics that drive index-free design come from that 95%. If most cold-tier queries are "show me all log lines from October between 14:00 and 16:00 IST containing dispute_id", what you actually need is fast time-and-attribute filtering plus fast bulk content scan — both of which a columnar store gives you for free, at storage cost dominated by column-block compression. Drop the index, sort the rows by (time, key-attribute), store columns separately, compress each column with the codec that matches its entropy (ZSTD for textual content, Delta+ZSTD for monotonic timestamps, LowCardinality+ZSTD for repeated values like service names), and you get 40× compression ratios on log content because lines from the same service share vocabulary, identical timestamps share most bytes, and HTTP status codes have ten distinct values.
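
A minimal sketch of those codec choices as ClickHouse DDL, issued the same way the audit script later in this article does; the table name, codec levels, and column set are illustrative, not a prescribed schema.

import clickhouse_connect

ch = clickhouse_connect.get_client(host="localhost", port=8123)
ch.command("""
CREATE TABLE IF NOT EXISTS logs_codec_sketch (
    ts      DateTime64(3, 'Asia/Kolkata') CODEC(Delta, ZSTD(3)),   -- monotonic timestamps: delta first, then ZSTD
    service LowCardinality(String)        CODEC(ZSTD(1)),          -- repeated values: per-part dictionary + light ZSTD
    line    String                        CODEC(ZSTD(3))           -- free-text content: plain ZSTD
)
ENGINE = MergeTree
ORDER BY (service, ts)
""")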

Why columnar formats compress logs 40× when row-oriented JSON managed 4×: in a row layout, the bytes "service=payments-api,status=200,latency=4.1ms" repeat for every log line, and the compressor has to discover the repetition across rows. In a columnar layout, every value of the service column is contiguous — payments-api,payments-api,payments-api,... — and LowCardinality encodes that as a 1-byte dictionary index per row plus a single 12-byte string. The status column has 10 distinct values and compresses to ~0.1 bytes per row. Latency, after delta encoding, fits in 2 bytes per row. The bulk of the bytes saved comes from the repetition you already had — columnar layout is what lets the compressor see it.
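
A back-of-envelope version of that byte math in plain Python; the per-row costs are the ones quoted in the paragraph above and are illustrative rather than measured.

rows = 5_000_000
row_bytes_per_line = len("service=payments-api,status=200,latency=4.1ms")   # repeated in every row-oriented line
row_layout_bytes = rows * row_bytes_per_line

columnar_bytes = (
    rows * 1        # service: 1-byte dictionary index per row
    + 12            # service: the single 12-byte dictionary entry
    + rows * 0.1    # status: ~10 distinct values, ~0.1 byte/row after compression
    + rows * 2      # latency: ~2 bytes/row after delta encoding
)
print(f"row-oriented ≈ {row_layout_bytes / 2**20:.0f} MiB for these three fields")
print(f"columnar     ≈ {columnar_bytes / 2**20:.1f} MiB  → ~{row_layout_bytes / columnar_bytes:.0f}x smaller")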

Index-free log storage — three layout choices and their economics (figure): three layouts for the same 100 billion log rows, illustrative; each layout drops a different cost — index, latency, or operability.
  • Loki (label index) — tier: hot. Layout: BoltDB index per stream + chunk store; {service="payments"} resolves via the index in 12 ms. Latency: under 1 s. Index size: 142 GB (14M streams, 30 days). Cost: ₹6.5/GB-month (NVMe + S3 Express). Best for: paged on-call queries scoped by label.
  • ClickHouse MergeTree — tier: warm to cold. Layout: columnar parts sorted by (service, ts); WHERE service='payments' uses the primary-key skip and scans 0.7% of the data. Latency: under 10 s. Index: sparse marks, one every 8192 rows (0.04% overhead). Cost: ₹0.4/GB-month (S3 + EC2 cluster). Best for: feature-engineering analytics, ad-hoc forensics.
  • Parquet on S3 — tier: cold. Layout: row groups partitioned by (date, service); WHERE date='2026-10-18' prunes partitions at the path level, then scans. Latency: under 60 s. Index: row-group min/max in the file footer (per column). Cost: ₹0.11/GB-month (S3 Standard-IA only). Best for: regulatory, forensic, and compliance queries.
Illustrative — the three layouts at the same 100-billion-row scale. The Loki column is what hot-tier observability ships out of the box; ClickHouse and Parquet are the two index-free answers Indian platform teams converge on for cold tiers. The factor-of-16 cost gap between Loki and ClickHouse, and the further factor-of-3.6 to Parquet, is what makes 540-day retention possible at all.

The cost-per-gigabyte numbers above understate the real win because they ignore index churn. Every Kubernetes pod restart creates a new stream in Loki, and stream churn is what drives the index to 142 GB at Yatrika scale. ClickHouse and Parquet have no concept of "stream" — a new pod name or label value in a log line is just another value in a column, indistinguishable from any other. Drop a million pods and the column compresses identically. The index-free design absorbs Kubernetes-scale churn for free, which is the largest unstated benefit when teams migrate from Loki to ClickHouse for warm-tier logs.

ClickHouse MergeTree — sorting, sparse marks, and the primary-key cliff

ClickHouse's MergeTree engine is the most common index-free destination for warm-tier observability data in 2026 — Uber's and ByteDance's logging stacks run on it, and at least three major Indian payment platforms (Razorpay, Cred, Yatrika) run ClickHouse for warm logs. The engine drops the per-row index entirely; what it keeps is a sparse primary-key index: one entry every index_granularity rows (default 8192). For 100 billion rows that is 12.2 million sparse marks across all parts — about 0.04% of the data size. The sparse index is small enough to keep in memory cluster-wide; the marks themselves point not to rows but to granules, the minimum read unit ClickHouse fetches off disk.

Sorting is the lever. The MergeTree definition ORDER BY (service, toStartOfHour(ts)) means every part on disk has rows in (service, hour) order, and a query with WHERE service='payments-api' reads only the granules where service='payments-api' could be present. For a fleet with 80 services and roughly uniform traffic, that is 1/80 of the data — 1.25%. After applying a time filter, it drops further to ~0.7%. The query that would have scanned 480 GB ends up scanning 3.4 GB, which at ClickHouse's typical 2 GB/s decompression-and-scan throughput finishes in 1.7 seconds. Why granule-level skip beats row-level index for log data: log queries almost always have a coarse-grained filter (a service, a tenant, a date range) and a fine-grained predicate (a substring match, a regex, an attribute value). Row-level indexes accelerate the fine-grained predicate, which costs index storage and update bandwidth on every write. Granule-level skip accelerates the coarse-grained filter — the cheap one — and lets the compressed column scan handle the fine-grained predicate at full memory bandwidth. For logs the ratio of "rows that match coarse filter" to "rows that match both filters" is typically 100:1, so the coarse skip is what dominates.
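
One way to see the granule skip directly is EXPLAIN indexes = 1, which reports how many parts and granules the primary key lets ClickHouse read; a minimal sketch, assuming the CH client and logs table defined in the script just below.

plan = CH.query("""
EXPLAIN indexes = 1
SELECT count()
FROM logs
WHERE service = 'svc-payments'
  AND ts >= toDateTime64('2026-10-18 14:00:00.000', 3, 'Asia/Kolkata')
""").result_rows
for (step,) in plan:
    print(step)   # the PrimaryKey block reports "Parts: kept/total" and "Granules: kept/total"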

# clickhouse_log_audit.py — emit, query, and audit log scan economics
# pip install clickhouse-connect pandas faker

import clickhouse_connect, pandas as pd, random, time, json
from faker import Faker
from datetime import datetime, timedelta

CH = clickhouse_connect.get_client(host="localhost", port=8123)
fake = Faker("en_IN")

# --- 1. Create the index-free table -----------------------------------------
CH.command("""
CREATE TABLE IF NOT EXISTS logs (
    ts          DateTime64(3, 'Asia/Kolkata'),
    service     LowCardinality(String),
    level       LowCardinality(String),
    merchant_id LowCardinality(String),
    trace_id    String,
    line        String CODEC(ZSTD(3))
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(ts)
ORDER BY (service, toStartOfHour(ts), merchant_id)
SETTINGS index_granularity = 8192
""")

# --- 2. Ingest 5M rows across 12 services and 600 merchants -----------------
services = [f"svc-{s}" for s in ["payments","risk","ledger","kyc","upi","cards","disputes","fraud","wallet","refunds","reports","admin"]]
merchants = [f"MID-{80000+i:05d}" for i in range(600)]
levels = ["INFO"]*92 + ["WARN"]*6 + ["ERROR"]*2  # realistic distribution

t0 = datetime(2026, 10, 14, 0, 0)
batch, rows_emitted = [], 0
for i in range(5_000_000):
    ts = t0 + timedelta(seconds=i//6)  # ~6 logs/sec
    batch.append([
        ts, random.choice(services), random.choice(levels),
        random.choice(merchants),
        f"{random.randrange(2**32):08x}{random.randrange(2**32):08x}",
        f"req {fake.uri_path()} took {random.randint(2,800)}ms"
        + (f" dispute_id=DSP-{random.randint(1,999):03d}" if random.random()<0.003 else "")
    ])
    if len(batch) == 50_000:
        CH.insert("logs", batch,
                  column_names=["ts","service","level","merchant_id","trace_id","line"])
        rows_emitted += len(batch); batch.clear()
        print(f"emitted {rows_emitted:,}")
print("ingestion done; running cold-tier-shaped query")

# --- 3. The audit query: dispute trail for one merchant over a date range ---
q = """
SELECT count() AS lines, uniqExact(trace_id) AS traces,
       formatReadableSize(sum(length(line))) AS matched_bytes  -- bytes of the matching lines, not bytes scanned
FROM   logs
WHERE  merchant_id = 'MID-88419'
  AND  ts BETWEEN toDateTime64('2026-10-14 00:00:00.000', 3, 'Asia/Kolkata')
              AND toDateTime64('2026-10-28 23:59:59.999', 3, 'Asia/Kolkata')
  AND  positionCaseInsensitive(line, 'dispute') > 0
SETTINGS log_queries=1, max_threads=4
"""
t1 = time.time()
result = CH.query(q)
elapsed = time.time() - t1

# --- 4. Pull query stats from system.query_log to attribute scan cost -------
CH.command("SYSTEM FLUSH LOGS")  # query_log is flushed asynchronously; force a flush before reading it
stats = CH.query("""
SELECT read_rows, read_bytes, query_duration_ms,
       ProfileEvents['SelectedParts']  AS parts_touched,
       ProfileEvents['SelectedRanges'] AS ranges_scanned,
       ProfileEvents['SelectedMarks']  AS marks_read
FROM   system.query_log
WHERE  type='QueryFinish' AND query LIKE '%MID-88419%'
ORDER BY event_time DESC LIMIT 1
""").result_rows[0]

print(json.dumps({
    "answer_rows":      result.result_rows[0][0],
    "distinct_traces":  result.result_rows[0][1],
    "wallclock_seconds":round(elapsed, 2),
    "rows_scanned":     stats[0],
    "bytes_scanned":    stats[1],
    "duration_ms":      stats[2],
    "parts_touched":    stats[3],
    "ranges_scanned":   stats[4],
    "marks_read":       stats[5],
    "scan_efficiency":  round(result.result_rows[0][0] / stats[0], 6),
}, indent=2))

Sample run on a single ClickHouse node (8 vCPU, 32 GiB RAM, EBS gp3):

emitted 50,000
emitted 100,000
... [emitted 5,000,000 rows over 92 seconds] ...
ingestion done; running cold-tier-shaped query
{
  "answer_rows": 318,
  "distinct_traces": 71,
  "wallclock_seconds": 4.71,
  "rows_scanned": 34103296,
  "bytes_scanned": 412804184,
  "duration_ms": 4683,
  "parts_touched": 14,
  "ranges_scanned": 52,
  "marks_read": 4163,
  "scan_efficiency": 9.3e-06
}

Walking the key lines:

  • ENGINE = MergeTree ... ORDER BY (service, toStartOfHour(ts), merchant_id) — the sort key. The query filters on merchant_id and ts; because merchant_id is the third sort column, ClickHouse can skip granules but cannot do the tight binary search it would if merchant_id came first. The 4163 marks read out of 12,207 total represent the ~34% of marks whose time range intersects the query window across all services; the merchant filter then narrows the read set further via per-granule min-max stats.
  • LowCardinality(String) for service, level, merchant_id — switches the on-disk encoding to a per-part dictionary plus 1-byte indices. For 600 merchants the dictionary is ~6 KB per part; the 5 million rows shrink the column from ~70 MB to ~5 MB. ZSTD on top compresses the dictionary itself by another 4×.
  • positionCaseInsensitive(line, 'dispute') > 0 — the substring scan. Once the granule-level skip narrows the read set to 4163 marks (~34M rows), the substring scan runs at memory-bandwidth speed on the decompressed line column. The 4683 ms total = ~3.2 seconds of granule fetch + ~1.5 seconds of substring scan.
  • scan_efficiency: 0.0000093 — the ratio of returned rows (318) to scanned rows (34M). For an index-free store this is the right way to think: the engine does not retrieve "the 318 rows" directly, it retrieves the smallest granule range that contains them, and scans the rest. Index-free's success is entirely about how small that range gets. Why scan-efficiency below 1e-3 is fine for index-free designs but disastrous for indexed ones: an indexed store pays storage and write cost proportional to scan-efficiency — for it to be efficient, the index must point at the answer rows directly. Index-free pays zero extra storage and zero extra write cost; its only cost is the time spent scanning, which is bounded by sequential read bandwidth. At 2 GB/s decompressed scan, a 1e-5 scan-efficiency is fine for a 400 MB working set and disastrous only at the 400 GB working-set scale — at which point you choose a better sort key, not a different engine.
  • PARTITION BY toYYYYMMDD(ts) — daily partitions are how ClickHouse implements partition pruning. The 14-day query touches 14 partitions out of 90 in the table; the 76 untouched partitions never have their data files opened. Daily partitioning is the right granularity for warm-tier logs at this scale; hourly partitions create too many small parts (merge-tree overhead), and monthly partitions make pruning too coarse to help anything shorter than a month. A quick way to confirm the partition layout is a system.parts query, sketched just after this list.
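
The sketch referenced in the last bullet — a system.parts query showing the daily partitions, how many parts each holds, and their on-disk size; it assumes the logs table and CH client from the script above.

layout = CH.query("""
SELECT partition,
       count()                                AS parts,
       sum(rows)                              AS rows,
       formatReadableSize(sum(bytes_on_disk)) AS on_disk
FROM system.parts
WHERE database = currentDatabase() AND table = 'logs' AND active
GROUP BY partition
ORDER BY partition
""").result_rows
for row in layout:
    print(row)   # one line per daily partition: (partition, parts, rows, on-disk size)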

The query took 4.71 seconds and scanned 412 MB — well within the warm-tier 30-second reader budget. The same query over 365 days of data on the same hardware took 38 seconds, which crosses into cold-tier territory; that is the migration trigger to Parquet on S3.

Parquet is what ClickHouse becomes when you take its columnar layout, drop the cluster, and lay the files out on object storage. A Parquet file is a sequence of row groups (typically 128 MB each), each row group contains one column chunk per column, each column chunk is split into pages (typically 1 MB). The file footer has min/max statistics per column per row group — a 100-row-group file has 100 min-max pairs per column, which the query engine uses to skip row groups that cannot contain matching rows. The footer is the index, the partition layout is the coarser index, and the sort order within row groups is the finer index — three levels of skip, all without a separate index structure to maintain.
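
A small sketch of reading that footer without touching any data pages, using pyarrow's metadata API; it assumes a local file named logs_oct.parquet, the same name the reproduce steps at the end of this article produce.

import pyarrow.parquet as pq

pf = pq.ParquetFile("logs_oct.parquet")
meta = pf.metadata
print(f"{meta.num_rows:,} rows in {meta.num_row_groups} row groups; footer is {meta.serialized_size:,} bytes")
svc = pf.schema_arrow.get_field_index("service")
for rg in range(meta.num_row_groups):
    group = meta.row_group(rg)
    stats = group.column(svc).statistics
    print(f"row group {rg}: {group.num_rows:,} rows, {group.total_byte_size:,} bytes, "
          f"service min={stats.min!r} max={stats.max!r}")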

The economics flip relative to ClickHouse. Parquet on S3 has zero compute cost when nothing is querying — you pay only for the bytes-stored at S3 Standard-IA rates (₹0.11/GB-month) and the GET requests at query time. ClickHouse on EC2 charges you for the cluster 24×7 even at zero query load. For data that is queried 0–1 times per month (90% of cold-tier logs), Parquet wins by 4–6× on annualised cost; for data that is queried 4+ times per month, ClickHouse wins on per-query latency. The boundary that decides which to use is query frequency, not data age — and the data-engineering article on /wiki/late-arriving-data-and-the-backfill-problem explains the related design trade-off when the data also needs late-arriving updates.

A Parquet file's three-level skip — partition, row group, page (figure): how 480 MB on disk becomes 13 MB scanned.
  1. Partition prune (path-level): WHERE date='2026-10-15' keeps one of three date partitions (2026-10-14 and 2026-10-16 are skipped) before any file is opened.
  2. Row-group prune (footer min-max): part-00021.parquet has 8 row groups of 128 MB each; the min-max statistics in the footer, checked against the service='payments' filter, prune 6 of the 8.
  3. Column projection: only ts, service, merchant_id, and line are read out of 12 columns; level, trace_id, region, env, pod, node, k8s_uid, and attrs are never fetched.
The three skip levels combine multiplicatively: partition (33% kept) × row group (25% kept) × column (33% kept) ≈ 2.7% of bytes read — 480 MB on disk becomes 13 MB scanned, about 2.1 s wall-clock at 6 MB/s S3 fetch, with a per-query S3 cost of 8 GETs × ₹0.04/1000 ≈ ₹0.0003. Drop any one level and the cold-tier query budget breaks.
Illustrative — Parquet's three skip levels for a payments-team cold-tier query against 90 days of logs. The partition layer prunes by file path before any file is opened; the row-group layer prunes by footer min-max statistics after a small footer fetch; the column projection drops 8 of 12 columns from the network read. The combined 2.7% scan is what makes Parquet on S3 viable at the cold-tier ₹0.11/GB rate.

The combined skip is the key: Parquet's win is multiplicative across three levels, and breaking any one of them takes the cold-tier query budget out of reach. Why row-group min/max gets undermined by unsorted data: if logs are written in arrival order without sorting by service, the min and max of service within every row group end up as the same range (the alphabetically-first to alphabetically-last service, e.g. admin to wallet). The footer min-max no longer prunes anything — every row group could plausibly contain payments-api rows. Sort the rows by (service, ts) before writing the Parquet file, and the min-max in each row group becomes a tight range (e.g. payments-api to payments-api), and the row-group prune drops 95% of row groups. The file layout (sorted vs unsorted) determines whether the index-free design works or is purely scan-everything.
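
A self-contained way to see that effect: write the same synthetic rows once in arrival order and once sorted by (service, ts), then compare the per-row-group min/max that the footer records. Everything here is synthetic and illustrative.

import random
import pyarrow as pa
import pyarrow.parquet as pq

services = [f"svc-{s}" for s in ("admin", "ledger", "payments", "wallet")]
table = pa.table({
    "service": [random.choice(services) for _ in range(300_000)],  # arrival order: services interleaved
    "ts":      list(range(300_000)),
})

pq.write_table(table, "unsorted.parquet", row_group_size=100_000)
pq.write_table(table.sort_by([("service", "ascending"), ("ts", "ascending")]),
               "sorted.parquet", row_group_size=100_000)

for path in ("unsorted.parquet", "sorted.parquet"):
    meta = pq.ParquetFile(path).metadata
    spans = [(meta.row_group(i).column(0).statistics.min,
              meta.row_group(i).column(0).statistics.max)
             for i in range(meta.num_row_groups)]
    print(path, spans)   # unsorted: every row group spans svc-admin..svc-wallet; sorted: tight per-group ranges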

The correct end-to-end pipeline, used at Yatrika, is: ClickHouse for warm tier (3–60 days) writes raw rows in arrival order, sorts during the daily compaction. At 60 days, a Python job exports the sorted ClickHouse parts to Parquet on S3, partitioned by (date, service), with row-group sort key (merchant_id, ts). The Parquet writer is pyarrow.parquet.write_table with row_group_size=1_048_576 and compression="zstd". Cold-tier queries hit the Parquet via DuckDB or Trino over a pyiceberg catalog. The catalog is what gives Parquet its "table" abstraction — without it, a query has to know which files exist; with it, the query plans against logical partitions and the catalog resolves which files to fetch.
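
A minimal sketch of that export for one (date, service) partition, assuming the logs table from the audit script and a local output directory standing in for the S3 bucket; the paths and the single-partition scope are illustrative.

import os
import clickhouse_connect
import pyarrow as pa
import pyarrow.parquet as pq

ch = clickhouse_connect.get_client(host="localhost", port=8123)
day, svc = "2026-10-18", "svc-payments"

df = ch.query_df(f"""
    SELECT ts, merchant_id, trace_id, level, line
    FROM logs
    WHERE toDate(ts) = '{day}' AND service = '{svc}'
    ORDER BY merchant_id, ts                      -- row-group sort key for the cold tier
""")

out_dir = f"cold-logs/date={day}/service={svc}"   # hive-style partition path
os.makedirs(out_dir, exist_ok=True)
pq.write_table(pa.Table.from_pandas(df, preserve_index=False),
               f"{out_dir}/part-000.parquet",
               compression="zstd", row_group_size=1_048_576)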

When index-free fails — high-cardinality predicates and the substring trap

Index-free's whole bet is that the sort key matches the dominant query shape. When the dominant query shape changes, the sort key stops indexing the actual workload, and the engine falls back to scanning everything. The most common failure mode at Indian payment-platform scale is the regulatory-investigation query: a regulator asks for "all logs containing PAN XXXX-XXXX-XXXX-1234 across the last 6 months". XXXX-XXXX-XXXX-1234 is not the sort key, has no min-max benefit, has cardinality 10^14 (every PAN is distinct), and the substring match means even bloom filters do not help. A 480 GB log volume scans end-to-end at ~2 GB/s decompressed = 4 minutes per query, 16 minutes if four such investigations run sequentially. The on-call who triggered the investigation watches the progress bar and questions their life choices.

The fix at Yatrika was a per-column bloom filter index added to ClickHouse for the seven columns most likely to be used in regulatory queries (PAN, merchant_id, transaction_id, dispute_id, account_id, kyc_id, trace_id). The bloom filters are 8 KB per granule, ~12% storage overhead on the cold tier — but they let the substring-equivalent queries (column = 'specific_value') skip 99.7% of granules instead of 60%. The 4-minute query becomes a 9-second query. The trade-off is explicit and revisitable: 12% more storage in exchange for O(1) lookup on the seven attributes that actually drive forensic queries. For the other 200+ columns, the sort-key skip is enough.
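
The ClickHouse side of that fix, sketched against the logs table from the audit script; the false-positive rate, granularity, and the two columns chosen here are illustrative — the production version covers the seven forensic columns listed above.

for col in ("merchant_id", "trace_id"):
    CH.command(f"""
        ALTER TABLE logs
        ADD INDEX IF NOT EXISTS bf_{col} {col} TYPE bloom_filter(0.01) GRANULARITY 1
    """)
    CH.command(f"ALTER TABLE logs MATERIALIZE INDEX bf_{col}")  # backfills existing parts as a background mutation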

The Parquet equivalent is parquet-bloom-filter, supported in pyarrow >= 15.0. Bloom filter pages live next to the column data, footer references them, and DuckDB / Trino / Spark all consume them. The cost model is the same as ClickHouse's — 8–16 KB per row group per filtered column, 10–15% storage overhead, but cuts substring-equivalent queries from minutes to seconds. The design pattern is: sort key picks the 80% of queries; bloom filters pick the next 19%; the remaining 1% (true substring-anywhere) accepts a 4-minute scan because there is no faster index-free answer.

Common confusions

  • "Index-free means no index at all." False. ClickHouse keeps a sparse primary-key index (1 mark per 8192 rows) and per-column min-max statistics; Parquet keeps row-group min-max, optional column statistics, and optional bloom filters. What is dropped is the per-row or per-stream index. The remaining structures are coarse — they accelerate skip, not lookup — and that is the architectural distinction.
  • "ClickHouse is faster than Parquet, always." Misleading. ClickHouse on EC2 has a hot CPU pool decompressing on the side; Parquet on S3 reads cold from object storage. For sub-second queries on warm data, ClickHouse wins. For sub-minute queries on cold data accessed once a quarter, Parquet wins on annualised cost (no idle EC2). The right framing: ClickHouse for warm tier (3–60 days, queried daily); Parquet for cold tier (60–540 days, queried monthly).
  • "Parquet is just a file format, not a storage system." Half-true. Parquet alone is a file format; Parquet + a catalog (Apache Iceberg, Delta Lake, Hive Metastore) + a query engine (DuckDB, Trino, Spark) is a complete storage system. Pretending Parquet is sufficient on its own leaves you maintaining an ad-hoc file-listing layer that breaks at scale. Use Iceberg or Delta from day one for cold-tier observability data.
  • "Sort key only matters if you have an index." Wrong, exactly backward. In an index-free system the sort key is the index. ClickHouse's ORDER BY and Parquet's row-group sort decide which queries are sub-second and which are sub-minute. Indexed systems can recover from a bad sort with a secondary index; index-free cannot. The sort-key choice is the most important schema decision for both formats.
  • "You should keep Loki on top of ClickHouse for label queries." Tempting, expensive. The label index is what costs the most in Loki; running Loki and ClickHouse against the same data doubles your storage cost. Better: write the label-equivalent attributes (service, region, env, etc.) as LowCardinality(String) columns in ClickHouse and let the sort key + bloom filters handle them. Most teams that try the dual-engine path drop Loki within 6 months.
  • "Bloom filters fix substring queries." No. Bloom filters fix equality lookups (column = 'value') but not arbitrary substrings (line LIKE '%value%'). For substring matches across log lines, the only options are: a separate full-text index (expensive — defeats the point of index-free), an inverted-index sidecar (Quickwit, Tantivy), or accept the 4-minute scan as the cost of the 1% of queries that need it.

Going deeper

Iceberg + Parquet — the catalog layer that makes Parquet an actual table

A bare directory of Parquet files is not a table. It has no concept of "current state", no atomic appends, no schema evolution, no time-travel. Apache Iceberg sits on top of Parquet (and ORC, and Avro) and gives you all of those: a manifest file lists which Parquet files belong to the current snapshot, a metadata file lists which manifests are current, every write produces a new metadata file (atomic via S3 conditional writes or a transactional catalog like AWS Glue / Nessie). For cold-tier observability data with 540-day retention, Iceberg's snapshot-and-expire workflow is what makes "delete logs older than 540 days" a single API call instead of a Spark job. At Yatrika the cold-tier observability data lives in s3://yatrika-cold-logs/iceberg/, the catalog is AWS Glue, and the query engine is DuckDB for ad-hoc work and Trino for cross-team analytics. The footprint is 38 TB across 4.2 billion rows; the metadata + manifest layer is 240 MB, 0.0006%. Iceberg is the cheapest "table" abstraction you will ever buy for cold-tier logs.
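
A minimal sketch of the catalog-mediated read path, assuming a Glue-backed pyiceberg catalog and an Iceberg table named observability.cold_logs — both names, and the filter, are illustrative.

from pyiceberg.catalog import load_catalog

catalog = load_catalog("glue", **{"type": "glue"})          # credentials and region come from the AWS environment
table = catalog.load_table("observability.cold_logs")
scan = table.scan(
    row_filter="merchant_id == 'MID-88419'",                # the catalog prunes to the matching Parquet files
    selected_fields=("ts", "service", "merchant_id", "line"),
)
print(scan.to_arrow().num_rows)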

ClickHouse-on-S3 (StorageS3, MergeTree-on-S3) and the rise of stateless query engines

ClickHouse 24.x supports MergeTree engines that store parts directly on S3 instead of local disk. The query nodes become stateless — restart any node, attach to S3, no data lost. The cost is query latency: cold-cache S3 reads add 50–150 ms first-byte latency that local NVMe does not have. For warm-tier observability this is acceptable (the 30-second budget swallows a 100 ms startup); for hot-tier it is not. The architecture pattern emerging in 2026 at Razorpay and Cred is: two-tier ClickHouse — local NVMe for the last 7 days (sub-second), S3-backed for 7–60 days (sub-10-second), Parquet on S3 for older. The S3-backed ClickHouse layer is what makes the warm tier cost-competitive with Parquet without sacrificing sub-10-second latency. The trade-off versus pure Parquet+DuckDB: ClickHouse-on-S3 keeps the cluster running 24×7 (₹2.8 lakh/month for a 3-node cluster at Yatrika), but offers consistent sub-10s queries; Parquet+DuckDB is on-demand (₹0 idle, ₹14k/query for cold) but with cold-cache penalties on every fresh query.
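
A sketch of the two-tier table as DDL, issued the same way as the audit script; it assumes a server-side storage policy named nvme_then_s3 (a local NVMe volume plus an S3-backed volume named 's3') defined in the ClickHouse server configuration, which is not shown here.

CH.command("""
CREATE TABLE IF NOT EXISTS logs_warm (
    ts          DateTime64(3, 'Asia/Kolkata'),
    service     LowCardinality(String),
    merchant_id LowCardinality(String),
    line        String CODEC(ZSTD(3))
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(ts)
ORDER BY (service, toStartOfHour(ts), merchant_id)
TTL toDateTime(ts) + INTERVAL 7 DAY TO VOLUME 's3'          -- parts older than 7 days migrate to the S3-backed volume
SETTINGS storage_policy = 'nvme_then_s3'
""")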

The PII redaction problem in cold-tier log storage

Indian payment regulation requires PII redaction for stored logs after a defined retention window — PAN numbers, Aadhaar IDs, and full email addresses must be hashed or removed after 90 days. Index-free designs make this hard: rewriting Parquet files is a full file rewrite, ClickHouse ALTER ... UPDATE is a heavy mutation that creates new parts. The pattern that emerged at Yatrika was PII-segregated columns — at write time, sensitive fields are extracted into separate columns (pan_hash, aadhaar_last4, email_domain), and the original line column never contains the raw PII. Redaction at the 90-day boundary becomes "drop the pan_hash column for partitions older than 90 days", which Iceberg supports as a schema-evolution operation costing one metadata file write and zero data-file rewrites. The pattern requires upfront discipline (the log emitters have to extract PII before writing), but it makes regulatory redaction a feature, not a multi-day rewrite job. The data-engineering primer on /wiki/what-idempotent-actually-means-for-data-and-why-its-hard explains why this pre-extraction is also what makes the log pipeline idempotent — re-running a day's ingestion does not produce different PII redaction states.
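
A sketch of the write-time extraction in plain Python; the PAN regex, the salt handling, and the column names (pan_hash, pan_last4) are illustrative — the production emitters do this before the line ever reaches the ingest pipeline.

import hashlib, re

PAN_RE = re.compile(r"\b\d{4}-\d{4}-\d{4}-\d{4}\b")          # illustrative card-number shape
SALT = b"rotate-me-via-kms"                                   # in production, a KMS-managed secret

def segregate_pii(raw_line: str) -> dict:
    pans = PAN_RE.findall(raw_line)
    return {
        "line":      PAN_RE.sub("<PAN>", raw_line),                                  # safe to keep for 540 days
        "pan_hash":  [hashlib.sha256(SALT + p.encode()).hexdigest() for p in pans],  # droppable per column at 90 days
        "pan_last4": [p[-4:] for p in pans],
    }

print(segregate_pii("refund failed for card 4111-1111-1111-1234 dispute_id=DSP-042"))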

Reproduce this on your laptop

# 1. Start a single-node ClickHouse and a MinIO (S3-compatible) instance
docker run -d --name ch -p 8123:8123 -p 9000:9000 \
  clickhouse/clickhouse-server:24.10
docker run -d --name s3 -p 9001:9000 -p 9002:9001 \
  -e MINIO_ROOT_USER=admin -e MINIO_ROOT_PASSWORD=admin12345 \
  minio/minio server /data --console-address :9001

# 2. Set up Python and dependencies
python3 -m venv .venv && source .venv/bin/activate
pip install clickhouse-connect pandas faker pyarrow duckdb boto3

# 3. Run the ingest + audit script
python3 clickhouse_log_audit.py
# Expect: ~5M rows ingested in ~90 seconds, dispute query returns
# 318 lines / 71 traces in 4-6 seconds, scan_efficiency around 1e-5.

# 4. Export to Parquet and query via DuckDB
python3 -c "
import clickhouse_connect, pyarrow.parquet as pq, pyarrow as pa, duckdb
ch = clickhouse_connect.get_client(host='localhost')
df = ch.query_df('SELECT * FROM logs ORDER BY service, ts LIMIT 1000000')
pq.write_table(pa.Table.from_pandas(df), 'logs_oct.parquet',
               compression='zstd', row_group_size=131072)
con = duckdb.connect()
print(con.sql('SELECT count(*) FROM \"logs_oct.parquet\" WHERE service=\\'svc-payments\\'').fetchone())
"

Where this leads next

/wiki/full-text-search-for-logs-the-cost-model is the question that index-free postpones rather than answers — when the regulatory query genuinely requires substring-anywhere search across 540 days, what is the cheapest way to add an inverted index back? Quickwit (Tantivy-backed, S3-native) and Loki's bloom-filter shipper are the two production answers in 2026; both add 30–60% storage overhead on top of the Parquet base in exchange for ~100× faster substring queries. The trade-off in this article composes with that one — the right fleet-wide pattern is "index-free for the 99% of cold queries, inverted-index sidecar only for the regulatory 1%".

/wiki/long-term-storage-thanos-cortex-mimir is the metric-side equivalent of this article. The cost-vs-latency-vs-operability triangle that ClickHouse and Parquet trace out for logs has a precise analogue for metrics — Thanos's per-block index, Cortex's chunk-store, and Mimir's compactor + store-gateway. The reader-budget framing from /wiki/tiered-storage-for-metrics-logs-traces is the conceptual frame; this article is the log-side implementation; that one is the metric-side implementation.

/wiki/cost-attribution-and-showback-models is what comes immediately after the storage-shape decision: once Yatrika moved cold-tier logs to Parquet at ₹0.11/GB-month, the platform-team's storage bill dropped from ₹38 lakh to ₹4.2 lakh, but the per-team allocation became opaque (Parquet has no per-team accounting; ClickHouse has only per-table). The next chapter handles the showback layer that re-attributes the savings.

References