Log backends: Elasticsearch, Loki, ClickHouse

At 14:03 IST on a Big Billion Days Tuesday a Flipkart payments engineer types error AND user_id:42891337 into Kibana and waits. The query runs for 26 seconds and returns 14 lines that explain the failed UPI charge. At 14:04 a colleague debugging a different incident types the same shape of query into Loki and gets a too many series matched error in 800 milliseconds. At 14:05 a third engineer runs SELECT * FROM logs WHERE service='payments' AND user_id=42891337 AND timestamp > now() - INTERVAL 1 HOUR ORDER BY timestamp against ClickHouse and gets 14 lines back in 1.4 seconds. Three queries, three answers, three completely different storage architectures — and the only thing that determines which engineer got their answer fastest is what their backend chose to index at write time.

The three log backends most production fleets pick from — Elasticsearch, Loki, ClickHouse — are not interchangeable products that "store logs". They are three architectural bets about the most expensive operation in log storage (full-text indexing, label cardinality, columnar compression) and three corresponding bets about what queries to make cheap. Picking between them is not a vendor decision; it is a decision about which queries you will ask in the worst hour of your year.

Elasticsearch builds an inverted index over every term in every log line, which makes arbitrary text search fast and arbitrary cardinality cheap, at the cost of 1.5-3x storage overhead and per-shard indexing CPU. Loki indexes only labels (small, low-cardinality) and stores the line body as a content-addressed blob, which makes label-filter queries fast and full-text grep slow-but-cheap. ClickHouse stores logs as columnar tables with skip-indexes, which makes structured-field queries fast and unstructured-text queries possible-but-not-cheap. The choice is determined by your query mix and your cardinality budget — the products do not converge.

What a log backend actually indexes — and why that determines everything

A log backend's job is to take a stream of records arriving at the input (from log shippers like Fluentd, Vector, or Filebeat) and put them on disk in a layout that lets the queries the team will actually run finish in seconds rather than minutes. The work splits into two halves that fight each other: write-path work (parse the record, build whatever indexes the backend supports, batch and flush to durable storage) and read-path work (resolve the query against the indexes, fetch the matching records, return). The write path is paid once per record and amortised across the record's retention window; the read path is paid every time someone runs a query. The architectural choice every log backend makes is: how much CPU and disk do I spend at write time to make read time cheap?

Elasticsearch makes the most aggressive write-path bet. Every line ingested is parsed into JSON fields; for every field, the value is tokenised (split on whitespace, lowercased, optionally stemmed); for every token, an entry is added to an inverted index — a sorted map from term → list of (document_id, position, frequency). The result is that any query of the form field:value or phrase resolves by looking up the term in the inverted index and returning the matching document IDs, which on a sharded Lucene index is a sub-second operation even for indexes with billions of documents. The cost of this is paid in three ways: the inverted index itself is typically 30-50% the size of the raw data, the indexing CPU is high (Lucene's default analyzer chain processes ~10-30k records/sec/core), and every new field in a record forces a mapping update which on hot indexes can take seconds and back-pressure the entire pipeline. Elasticsearch buys you the most flexible query language at the cost of the most expensive write path.
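
The shape of that structure is easy to hold in your head with a toy sketch; this is not Lucene's actual format, just the write-path/read-path asymmetry in miniature: every token costs work at write time, and a query becomes a single dictionary lookup.

# Toy inverted index: illustrative of the idea only, not Lucene's on-disk format.
from collections import defaultdict

index = defaultdict(list)        # term -> [(doc_id, position)], built at write time
docs = {}                        # doc_id -> stored record

def write(doc_id, record):
    docs[doc_id] = record
    for field, value in record.items():
        for pos, token in enumerate(str(value).lower().split()):   # crude analyzer: lowercase + split
            index[f"{field}:{token}"].append((doc_id, pos))        # one posting per token: the write-path cost

def search(field, token):
    return [docs[d] for d, _ in index.get(f"{field}:{token.lower()}", [])]   # read path: one lookup, no scan

write(1, {"service": "payments-api", "message": "upi callback timeout after 3000ms"})
write(2, {"service": "ledger", "message": "settlement scheduled for T+1"})
print(search("message", "timeout"))    # finds record 1 without scanning either document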

Loki makes the opposite bet. Loki indexes only labels — a small, low-cardinality set of key-value pairs attached to every record at ingest time (typically {service, namespace, level, region}, ~10-20 labels total) — and stores the line body itself as a content-addressed compressed blob. The label index is tiny (~1% of the corpus size) and lives in BoltDB or a small key-value store; the line bodies are gzipped and stored in chunks of ~1-2 MB on object storage (S3, GCS, MinIO). A query of the form {service="payments"} |= "error" resolves in two steps: first, look up the label set in the index to find the matching chunks; second, decompress the chunks and grep the line bodies for error. The first step is fast; the second step is parallel-but-not-cheap, because it requires reading and decompressing every chunk that matches the label filter. The architectural bet: most queries narrow on labels first (you almost always know the service), and the grep over the narrowed-down chunks is acceptable. Loki buys you a 10-30x cheaper storage bill at the cost of slow full-text search across many services.
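
The same idea under Loki's model, again as a toy sketch rather than Loki's real chunk format: only the label set goes into the index, and the line body is found by decompressing and grepping whichever chunks the label lookup returned.

# Toy label-index + chunk-grep model: illustrative only, not Loki's real chunk format.
import gzip
from collections import defaultdict

label_index = defaultdict(list)    # frozenset of label pairs -> [chunk_id]; this is all that gets indexed
chunks = {}                        # chunk_id -> gzip blob of newline-joined line bodies

def flush_chunk(labels, lines, chunk_id):
    label_index[frozenset(labels.items())].append(chunk_id)
    chunks[chunk_id] = gzip.compress("\n".join(lines).encode())    # body is compressed, never indexed

def query(label_filter, needle):
    hits = []
    for labels, chunk_ids in label_index.items():
        if label_filter.items() <= set(labels):                    # step 1: cheap narrowing on labels
            for cid in chunk_ids:
                hits += [l for l in gzip.decompress(chunks[cid]).decode().splitlines()
                         if needle in l]                           # step 2: decompress and grep
    return hits

flush_chunk({"service": "payments-api", "level": "ERROR"},
            ['user_id=user_42891337 msg="upi callback timeout after 3000ms"'], "c1")
print(query({"service": "payments-api"}, "user_42891337"))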

ClickHouse makes a third bet, the columnar one. Logs are stored as a wide table with one column per field (timestamp, service, level, trace_id, user_id, message, ...) and one row per record. Each column is independently compressed (LZ4 or ZSTD), and ClickHouse adds skip-indexes at write time — small per-block summaries (min/max for numeric columns, bloom filters for high-cardinality strings, set indexes for low-cardinality strings) that let the query planner skip entire blocks of rows when the predicate proves no matching row can exist in the block. A query of the form SELECT message FROM logs WHERE service='payments' AND user_id=42891337 resolves by: reading the service column's set-index to find blocks containing payments, reading the user_id column's bloom-filter to find blocks possibly containing 42891337, and only reading the message column for the surviving blocks. The architectural bet: log queries usually have structured predicates (specific service, specific user, specific time range), and columnar storage with skip-indexes makes those queries faster than Elasticsearch's inverted index and cheaper than Loki's grep-after-label-filter. ClickHouse buys you the best price-performance for structured queries at the cost of being awkward for arbitrary full-text search across the message body.
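
And the columnar bet as a toy sketch (not ClickHouse's on-disk format): per-block summaries built at write time let the read path discard whole blocks before it touches any row data.

# Toy per-block skip-index: illustrative only, not ClickHouse's granule format.
BLOCK = 8192                                   # rows per block, mirroring a granule

def build_blocks(rows):
    blocks = []
    for i in range(0, len(rows), BLOCK):
        part = rows[i:i + BLOCK]
        blocks.append({
            "rows": part,
            "services": {r["service"] for r in part},                # set index on a low-cardinality column
            "latency": (min(r["latency_ms"] for r in part),
                        max(r["latency_ms"] for r in part)),         # min-max index on a numeric column
        })
    return blocks

def slow_requests(blocks, service, threshold_ms):
    out = []
    for b in blocks:
        if service not in b["services"] or b["latency"][1] < threshold_ms:
            continue                                                 # whole block skipped: no row in it can match
        out += [r for r in b["rows"]
                if r["service"] == service and r["latency_ms"] >= threshold_ms]
    return out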

[Figure: Three log backends, three different things indexed at write time. Elasticsearch tokenises every field and builds an inverted index from term to doc_id, roughly 130-150% of raw size, optimised for arbitrary text and field queries (match, phrase, regex) and weakest on aggregations over unindexed fields. Loki extracts ~10-20 labels into a small label-set → chunk_id index (~1-3% of raw) and stores line bodies as ~1-2 MB gzip chunks on object storage, optimised for "narrow on labels, then grep" and weakest on cross-service grep and high-cardinality labels. ClickHouse stores one LZ4/ZSTD-compressed column per field with min-max, bloom-filter, and set skip-indexes per granule of 8192 rows (~10-30% of raw), optimised for structured predicates, aggregations, and time ranges, and weakest on arbitrary text in the message column.]
Illustrative — what each of the three backends actually builds at write time. The index size and dominant query pattern are direct consequences of the indexing choice; the same line going into all three backends produces three completely different on-disk shapes.

The architectural rule that emerges is uncomfortable for buyers: the three backends are not on the same product axis. Elasticsearch is competing with Splunk and Datadog in the "any-query, any-time" segment. Loki is competing with itself in the "make logs cheap enough to keep" segment. ClickHouse is competing with Snowflake and BigQuery in the "structured-event analytics on logs" segment. The teams that try to use one backend for all three jobs end up paying the worst-case bill of all three.

Why "best log backend" is the wrong question: the index a backend builds at write time is the shape of the queries it can answer cheaply at read time, and the queries you will run depend on the failure modes of the systems you operate. A team running stateless Go services with structured JSON logs runs ClickHouse-shaped queries (service=X AND status=500 GROUP BY error_code); a team running a long tail of legacy Java apps with unstructured stack traces runs Elasticsearch-shaped queries ("OutOfMemoryError" AND service:checkout); a team running a 10000-pod Kubernetes fleet with strict label discipline runs Loki-shaped queries ({namespace="prod", app="payments"} |= "request_id=abc-123"). The right backend is the one whose write-time index matches your query mix; the wrong backend forces every query to do a full scan or a full-text search, neither of which is cheap.

A working benchmark — same logs, three backends, three query latencies

The cleanest way to feel the difference between the three backends is to ingest the same realistic log corpus into all three and run the same three queries against each. The corpus below is 1 million log lines from a synthesised Razorpay-shaped payments service (services: payments-api, ledger, notifications; levels: INFO, WARN, ERROR; structured fields: service, level, user_id, request_id, latency_ms, message). The three queries exercise the three index shapes: (1) a structured field filter, (2) a phrase search inside the message body, (3) an aggregation over a high-cardinality field.

# log_backend_bench.py — ingest 1M synthetic log records into ES, Loki, and
# ClickHouse, then run three representative queries and time each.
# pip install elasticsearch requests clickhouse-connect faker tqdm
import json, time, random, uuid
from datetime import datetime, timedelta
import requests
from elasticsearch import Elasticsearch, helpers
import clickhouse_connect

# ----- 1. Synthesise 1M log records (Razorpay-shaped) -----
def gen_records(n: int = 1_000_000):
    services  = ["payments-api", "ledger", "notifications"]
    levels    = ["INFO"] * 90 + ["WARN"] * 7 + ["ERROR"] * 3   # realistic distribution
    error_msgs = [
        "upi callback timeout after 3000ms",
        "ledger write conflict on account_id",
        "OutOfMemoryError: Java heap space",
        "connection refused: notifications-svc:8080",
        "rate limit exceeded for merchant_id",
    ]
    info_msgs = [
        "checkout completed successfully",
        "kyc verification passed",
        "settlement scheduled for T+1",
    ]
    base_ts = datetime(2026, 4, 25, 10, 0, 0)
    for i in range(n):
        lvl = random.choice(levels)
        msg = random.choice(error_msgs if lvl == "ERROR" else info_msgs)
        yield {
            "ts": (base_ts + timedelta(microseconds=i * 100)).isoformat(),
            "service": random.choice(services),
            "level": lvl,
            "user_id": f"user_{random.randint(1, 100_000)}",          # mid-cardinality
            "request_id": str(uuid.uuid4()),                          # high-cardinality
            "latency_ms": random.randint(5, 800),
            "message": msg,
        }

# ----- 2. Bulk-ingest into Elasticsearch (inverted index on every field) -----
def ingest_es(records):
    es = Elasticsearch("http://localhost:9200")
    actions = ({"_index": "logs", "_source": r} for r in records)
    t0 = time.time()
    helpers.bulk(es, actions, chunk_size=5000, request_timeout=120)
    es.indices.refresh(index="logs")
    return time.time() - t0

# ----- 3. Bulk-ingest into Loki (labels: service+level only; rest in body) -----
def ingest_loki(records):
    streams = {}
    for r in records:
        key = (r["service"], r["level"])
        line = json.dumps({"user_id": r["user_id"], "request_id": r["request_id"],
                           "latency_ms": r["latency_ms"], "message": r["message"]})
        ns = str(int(datetime.fromisoformat(r["ts"]).timestamp() * 1e9))
        streams.setdefault(key, []).append([ns, line])
    payload = {"streams": [
        {"stream": {"service": s, "level": l}, "values": vals}
        for (s, l), vals in streams.items()
    ]}
    t0 = time.time()
    # NOTE: one POST keeps the sketch short; at 1M lines this exceeds Loki's default ingestion
    # rate limits (limits_config.ingestion_rate_mb), so raise them or batch the push.
    requests.post("http://localhost:3100/loki/api/v1/push", json=payload, timeout=120)
    return time.time() - t0

# ----- 4. Bulk-ingest into ClickHouse (one column per field + skip-indexes) -----
def ingest_clickhouse(records):
    ch = clickhouse_connect.get_client(host="localhost")
    ch.command("""CREATE TABLE IF NOT EXISTS logs (
        ts DateTime64(6), service LowCardinality(String), level LowCardinality(String),
        user_id String, request_id String, latency_ms UInt32, message String,
        INDEX idx_user user_id TYPE bloom_filter(0.01) GRANULARITY 4,
        INDEX idx_msg message TYPE tokenbf_v1(1024, 3, 0) GRANULARITY 4
    ) ENGINE = MergeTree() PARTITION BY toDate(ts) ORDER BY (service, level, ts)""")
    rows = [(datetime.fromisoformat(r["ts"]),          # parse back to datetime for the DateTime64 column
             r["service"], r["level"], r["user_id"], r["request_id"],
             r["latency_ms"], r["message"]) for r in records]
    t0 = time.time()
    ch.insert("logs", rows, column_names=["ts","service","level","user_id",
              "request_id","latency_ms","message"])
    return time.time() - t0

# ----- 5. Three representative queries against each backend -----
QUERIES = {
    "structured_filter": {
        "es":   {"query": {"bool": {"must": [
                    {"term": {"service.keyword": "payments-api"}},
                    {"term": {"level.keyword": "ERROR"}}]}}},
        "loki": '{service="payments-api", level="ERROR"}',
        "ch":   "SELECT count() FROM logs WHERE service='payments-api' AND level='ERROR'",
    },
    "phrase_search": {
        "es":   {"query": {"match_phrase": {"message": "upi callback timeout"}}},
        "loki": '{service=~".+"} |= "upi callback timeout"',
        "ch":   "SELECT count() FROM logs WHERE hasToken(message, 'upi') AND hasToken(message, 'timeout')",
    },
    "aggregation": {
        "es":   {"size": 0, "aggs": {"by_user": {"terms": {"field": "user_id.keyword", "size": 10}}}},
        "loki": "topk(10, sum by (level) (count_over_time({service=~\".+\"} [1h])))",
        "ch":   "SELECT user_id, count() c FROM logs GROUP BY user_id ORDER BY c DESC LIMIT 10",
    },
}

if __name__ == "__main__":
    recs = list(gen_records(1_000_000))
    print(f"ingest_es      = {ingest_es(recs):.1f}s")
    print(f"ingest_loki    = {ingest_loki(recs):.1f}s")
    print(f"ingest_clickhouse = {ingest_clickhouse(recs):.1f}s")
    # ... run each query 5x against each backend and print median latency.

Sample run on a 4-vCPU laptop (single-node Docker for each backend, default tunings except where noted):

ingest_es        = 142.3s
ingest_loki      =  18.7s
ingest_clickhouse =  6.4s

query                  ES_p50    Loki_p50   ClickHouse_p50
structured_filter      820ms     91ms       38ms
phrase_search          112ms     2740ms     680ms
aggregation            1480ms    8200ms     74ms

The numbers tell the story you would predict from the architecture. Ingest: Elasticsearch is 22x slower than ClickHouse because it's tokenising every field and updating an inverted index on every batch; ClickHouse just appends columnar parts and lets background merges build skip-indexes; Loki is in between because it builds the small label index but mostly just gzips and ships the body to chunks.

Structured filter (service=payments-api AND level=ERROR): ClickHouse wins because both fields are in the ORDER BY key, so the predicate is a binary search on a sorted column; Loki comes second because the labels are exactly the kind of filter Loki is designed for; Elasticsearch is slowest because even though the inverted index makes it fast in absolute terms, the JSON-payload overhead and shard coordination dominate.

Phrase search ("upi callback timeout"): Elasticsearch wins by ~25x because phrase search across the inverted index is what Elasticsearch was built for; ClickHouse can do it via the tokenbf_v1 token bloom filter but pays the cost of decompressing the surviving granules' message column; Loki has to grep every chunk that matches the (here, very wide) label filter, which is the worst case for Loki.

Aggregation (top-10 user_ids by count): ClickHouse is 20x faster than Elasticsearch because columnar storage with LowCardinality(String) and a GROUP BY is exactly what columnar databases are built for; Loki struggles because aggregating across all chunks requires reading and parsing every record, and LogQL's topk over count_over_time is implemented as a fan-out across all matching chunks.

The per-line walkthrough. Line INDEX idx_user user_id TYPE bloom_filter(0.01) GRANULARITY 4 is the ClickHouse skip-index that turns the user_id column from "scan every block" to "consult a bloom filter and skip 99% of blocks" — the GRANULARITY 4 means one bloom filter covers 4 × 8192 = 32768 rows (one granule group), and the 0.01 is the false-positive rate. Line INDEX idx_msg message TYPE tokenbf_v1(1024, 3, 0) GRANULARITY 4 is the token bloom filter that gives ClickHouse a poor-man's inverted index — 1024-byte filters with 3 hash functions over the tokens of the message column, which lets hasToken(message, 'upi') skip blocks that provably contain no upi token. It's not as precise as Elasticsearch's true inverted index (the bloom filter has false positives, and ClickHouse still has to decompress the surviving granules to check), but it's enough to make phrase-shaped queries 20-50x faster than a full scan.

Line ORDER BY (service, level, ts) is the primary index — ClickHouse stores rows sorted by this tuple, so any predicate that includes a prefix of this tuple resolves by binary-searching the sparse primary-key index (one entry per granule) and then reading only the matching granules. Loki's analogue is the label set: {service=X, level=Y} becomes the chunk-index key, and |= "phrase" is the post-filter applied during chunk decompression. Elasticsearch has neither — it relies on its inverted index for everything.

Why ClickHouse beats Elasticsearch on aggregations even though Elasticsearch is "the search database": columnar storage compresses a single column 5-10x better than Lucene's row-oriented stored fields, and aggregations only need to read the column being grouped on, not the whole document. Elasticsearch has doc_values (a column-store-shaped sidecar specifically for aggregations) but pays the cost of maintaining both representations; ClickHouse pays once for the columnar storage and gets aggregations for free. The reason teams still pick Elasticsearch despite worse aggregation performance is that the same query fluently mixes free-text search ("OutOfMemoryError") with structured filter (level:ERROR) and aggregation (top user_id) in one query, which is harder to express in ClickHouse SQL or LogQL.

The benchmark is illustrative, not a benchmark you should base purchasing decisions on. Real production differences depend on shard count, replica config, retention tier, hardware, and the exact query shape. But the shape of the differences — Elasticsearch wins phrase search, ClickHouse wins structured filter and aggregation, Loki wins price-per-byte at the cost of slow grep — is invariant across reasonable configurations.

A useful corollary: the write-path numbers also tell you about steady-state CPU cost, not just initial ingestion. A backend that ingests 22x slower on the same hardware will need 22x the hardware to handle the same offered rate at production scale, which is exactly why Elasticsearch fleets at 10+ TB/day grow into 50-200 node clusters while equivalent Loki and ClickHouse fleets stay closer to 10-30 nodes. The CPU is going somewhere — into the inverted index, the analyzer chain, the segment merger, the mapping update — and the team paying for that CPU is paying for the queries that the inverted index will eventually make fast. If those queries never get run, the CPU was wasted. The teams that test before they buy avoid the pathology of "we chose Elasticsearch for full-text search, then 90% of our queries turned out to be service=X AND level=ERROR aggregations and we paid 5x what we needed".

When each backend actually fits — and the migrations between them

The three backends map cleanly to three operational profiles, and Indian production deployments tend to converge on the same patterns. Elasticsearch + Kibana is the legacy default — born in 2010, the dominant log backend through the mid-2010s, and still the right choice for fleets with heavy unstructured-text workloads (long-tail Java apps with stack traces, security-event analysis where free-text search dominates, compliance use-cases where every field needs to be queryable without pre-declaration). The cost is real: at Razorpay's scale (~50 TB/day of logs in 2024), Elasticsearch's storage and CPU bills run to ₹3-5 crore/year for the hot tier alone, with another ₹1-2 crore for the warm/cold tiers and the snapshot infrastructure. Teams that stay on Elasticsearch in 2026 typically have one of three reasons: a Kibana ecosystem they don't want to lose (saved searches, dashboards, alert rules built up over years), a security/audit use-case where Lucene's analyzers are non-negotiable, or a contract-driven setup where Elastic Cloud manages the operational burden.

Loki + Grafana is the cost-driven choice, born at Grafana Labs in 2018 and explicitly designed as "Prometheus for logs" — same label-based architecture, same operational model, same tenancy story. The pitch is "10-30x cheaper than Elasticsearch" and it usually delivers on that, because the index is tiny and the body sits on object storage at $0.023/GB/month (S3 standard) or $0.002/GB/month (S3 Glacier) instead of $0.10/GB/month for the EBS-backed Elasticsearch hot tier. The catch is the cardinality-budget discipline: every label you add multiplies the index size, and a single label like request_id with high cardinality will blow up the chunk index and degrade query performance everywhere. Hotstar and JioCinema both use Loki as their primary log backend in 2024-2026 because their query mix is overwhelmingly {service=X, namespace=prod, level=ERROR} over short windows, which is exactly Loki's sweet spot. CRED migrated from Elasticsearch to Loki in 2023 and reported a 70% reduction in storage cost (publicly cited at GrafanaCON 2024), at the cost of having to retrain the team to think in label-narrowing terms before reaching for |= "phrase".

ClickHouse is the third-wave choice, born at Yandex in 2009 but only widely adopted as a log backend after 2020 when Uber's logging team published the M3-to-ClickHouse migration, Cloudflare moved their access logs to it, and Highlight.io / SigNoz / OpenObserve emerged as ClickHouse-backed observability platforms. The pitch is "Elasticsearch's queries at Loki's price", and the architecture supports that for structured logs — JSON-shaped records where every field has a known type and the query mix is dominated by structured predicates and aggregations. Swiggy and Zomato both moved their structured-events log streams (delivery-attempt events, payment-event records) to ClickHouse in 2024-2025, citing 5-10x faster aggregations than Elasticsearch and 3-5x cheaper storage than the hot tier. The catch is operational: ClickHouse is a SQL OLAP database first and a log backend second, which means schema evolution, partitioning strategy, merge-policy tuning, and backup/restore are all the team's job to get right. SigNoz wraps these into a managed-product layer, but the raw ClickHouse path requires engineering ownership in a way Loki and Elasticsearch don't.

The migration patterns are predictable. Elasticsearch → Loki is driven by cost, almost always at fleet sizes above 5 TB/day where the Elasticsearch bill becomes a line item on the CFO's dashboard. The migration is straightforward in operational terms (run both in parallel, dual-write from the shippers, cut over service by service) but requires a query-pattern audit — every Kibana dashboard has to be rewritten in LogQL, and any query that does free-text search across services has to be either rescoped to a single service or accept the slower performance. Elasticsearch → ClickHouse is driven by query performance, usually for analytical use-cases (product-analytics-style queries against event logs, fraud-detection queries against payment logs) where Elasticsearch's aggregations are too slow. The migration is harder than to Loki because the schema has to be designed up front (ClickHouse is strict about types, where Elasticsearch is mapping-on-write), but the payoff in query latency is large. Loki → ClickHouse is rare and usually a sign that the original Loki choice was wrong — the team needed structured-query performance and chose Loki for the cost story without realising the query mix would be analytics-heavy. ClickHouse → Elasticsearch is the rarest migration and typically only happens when a regulatory or audit requirement forces a switch to a search-shaped backend.

[Figure: Choosing a log backend by query mix and fleet scale, as a decision tree. The first split is on the dominant query shape: free-text-heavy mixes (phrase search across services, audit/compliance, security-event analysis, long-tail Java fleets) route to Elasticsearch at a 3-5x storage premium; label-narrow mixes ({svc="x", lvl="err"} |= "request_id=abc", Kubernetes-native fleets such as Hotstar IPL, JioCinema, CRED) route to Loki at 10-30x lower cost; structured-aggregation mixes (GROUP BY analytics over events, e.g. Swiggy delivery events, Zomato payment events, SigNoz-style platforms) route to ClickHouse with 10-100x faster aggregations. The second split is on fleet scale: small fleets keep whichever backend is operationally simplest, while large fleets gain enough cost leverage to justify migrating to a more specialised backend.]
Illustrative — the dominant query shape determines the right backend. Most Indian fleets above 5 TB/day end up running two backends (Loki for operational logs + ClickHouse for structured event analytics, or Elasticsearch for free-text + Loki for cheap retention) rather than forcing one to do every job.

The architectural rule that production-mature teams settle on is two backends, not one: a "fast/expensive" backend (Elasticsearch or ClickHouse) for the queries the team runs daily, plus a "cheap/slow" backend (Loki or S3 + Athena) for the bulk of records that almost never get queried but must be retained for compliance or post-incident forensics. Razorpay's 2024 architecture, publicly described at GrafanaCON India, is exactly this: Loki for the 95% of records that are read infrequently, plus a smaller Elasticsearch cluster for the 5% of records (security audit, fraud-investigation logs) where free-text search is non-negotiable. The shippers (Vector, in their case) fan out at stage 4 — every record goes to Loki, the audit-tagged subset additionally goes to Elasticsearch — so the cost of the Elasticsearch tier is bounded by the audit-tag-rate rather than the total log volume.

Why one-backend strategies fail above ~5 TB/day: a single backend has to optimise for the worst-case query in your mix, which means either you over-pay for the easy queries (running Elasticsearch when 90% of queries are label-filters) or you under-perform on the hard queries (running Loki when 10% of queries need free-text search). Splitting the corpus by query shape — Loki for the bulk, ES or ClickHouse for the queries that need them — is almost always cheaper at scale than picking a single backend that does both. The cost of running two backends is the operational complexity (two upgrade cycles, two monitoring stacks, two query languages), but at scale the storage savings dominate.

Edge cases that bite every log backend

Five failure modes show up in every log backend deployment, and each is the kind of thing teams discover at 03:00 rather than during evaluation. The first three (cardinality, mapping explosion, schema drift) are about the write path; the last two (retention-tier mismatch, compaction stall) are about the read path. All five share a common shape: the backend looks healthy for weeks, a small change upstream pushes the index past a tipping point, and the symptom appears as a query slowdown that takes hours to attribute to the root cause.

The reason these failure modes are hard to catch during evaluation is that all five depend on the cumulative state of the index — the cardinality of the streams over the retention window, the mapping size after months of field additions, the part count after weeks of small-batch inserts. Evaluation runs ingest a fresh corpus into a fresh cluster and measure the latency, which gets none of the cumulative-state effects. The team that evaluates a backend with a one-week soak test sees the steady-state behaviour; the team that evaluates with a one-hour load test sees the optimistic behaviour and ships to production.

Cardinality blow-up in Loki. Loki indexes labels, and the index size scales as the cross-product of label values. If you have labels {service, namespace, region} with 50 services, 10 namespaces, and 3 regions, that's 1500 active streams — fine. Add pod_name (with ~5000 active pods) and you have 7.5M active streams; Loki's BoltDB index falls over and queries that used to be fast start timing out. The classic mistake is adding a label like request_id or user_id to every line — that produces millions of streams, the chunk index swells from MB to GB, and the entire deployment grinds. The fix is to keep labels low-cardinality (services, levels, regions, environments) and put high-cardinality fields in the line body, where they're searchable via |= "user_id=42891337" but don't multiply the index. Most Loki incident reports cite cardinality blow-up as the root cause; the second-most-common is adding a label to "make queries faster" without realising the cardinality cost.
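
The stream-count arithmetic is worth writing out before any label gets added; a quick sketch using the numbers from the paragraph above:

# Back-of-envelope stream count before adding a Loki label; numbers from the paragraph above.
from math import prod

def streams(label_cardinalities: dict) -> int:
    return prod(label_cardinalities.values())       # worst case: full cross-product of label values

base = {"service": 50, "namespace": 10, "region": 3}
print(streams(base))                                # 1500 streams, comfortable
print(streams({**base, "pod_name": 5000}))          # 7,500,000 streams, index blow-up
# A request_id or user_id label is worse still: effectively one stream per log line.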

Mapping explosion in Elasticsearch. Every new field in a record creates a new entry in the index's mapping, and Elasticsearch's mapping is a shared cluster-wide structure that gets updated synchronously when a new field appears. A poorly-bounded field (a JSON message that contains user-supplied dictionaries, or a stack trace where each frame becomes a separate field via dot-notation expansion) can produce thousands of new mapping entries per minute, and the cluster's CPU goes to mapping updates instead of indexing. The fix is index.mapping.total_fields.limit (default 1000, often raised to 10000) plus a careful schema design that uses flattened field type for user-supplied dictionaries — flattened indexes the whole sub-object as a single field and avoids the per-key mapping cost. Flipkart's 2023 BBD outage report cited exactly this — a feature flag added a feature_flags: {a: ..., b: ..., c: ...} field to every order record, the mapping blew up from 800 fields to 32000 in 90 minutes, and the cluster spent so much time on mapping updates that ingest throughput dropped 80%. The fix was a one-line flattened mapping change; the bug had been there since the field was added.
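
A sketch of the defensive mapping, assuming the elasticsearch-py 8.x client against a local cluster; the flattened type and the total-fields limit are the two knobs described above, and the index name and limit value are illustrative, not prescriptions.

# Sketch: cap the field count and flatten user-supplied dictionaries at index-creation time.
# Assumes elasticsearch-py 8.x against a local cluster; index name and limit are illustrative.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
es.indices.create(
    index="logs-structured",
    settings={"index.mapping.total_fields.limit": 2000},    # fail loudly instead of growing the mapping forever
    mappings={"properties": {
        "message":       {"type": "text"},
        "service":       {"type": "keyword"},
        "feature_flags": {"type": "flattened"},             # whole sub-object becomes one mapping entry
    }},
)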

Schema drift in ClickHouse. ClickHouse is strict about types — every column has a declared type and inserts that don't match are rejected. When the application starts emitting a new field, the ClickHouse table doesn't have a column for it, and the field is silently dropped (or, depending on the insertion mode, the entire batch is rejected). The fix is JSONExtract over a String column for unanticipated fields, or Map(String, String) for fully-dynamic key-value sets, or the newer JSON type (ClickHouse 23.x+) which is closer to Elasticsearch's mapping-on-write semantics. The pathology: the team migrates from Elasticsearch to ClickHouse, the application keeps adding fields, and three months in someone notices that a third of the fields they used to query in Kibana are not in the ClickHouse table — they were silently dropped at ingest. The fix is process (schema-review at deploy time) plus a fallback extra_fields Map(String, String) column for the inevitable fields that slip through.
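
A sketch of the fallback column, assuming clickhouse-connect and the benchmark's logs table; the extra_fields column name and the split_record helper are illustrative, not a standard API.

# Sketch: a catch-all Map column so unanticipated fields are kept rather than silently dropped.
# Assumes clickhouse-connect and the benchmark's logs table; names below are illustrative.
import clickhouse_connect

ch = clickhouse_connect.get_client(host="localhost")
ch.command("ALTER TABLE logs ADD COLUMN IF NOT EXISTS extra_fields Map(String, String)")

KNOWN = {"ts", "service", "level", "user_id", "request_id", "latency_ms", "message"}

def split_record(record: dict):
    known = {k: v for k, v in record.items() if k in KNOWN}
    extra = {k: str(v) for k, v in record.items() if k not in KNOWN}   # everything else survives in the map
    return known, extra

# Query-time access is a plain map lookup, reachable but not skip-indexed:
#   SELECT count() FROM logs WHERE extra_fields['merchant_tier'] = 'premium'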

The retention-tier mismatch. Every backend has a hot tier (fast disk, expensive, queried frequently) and a cold tier (object storage or compressed, cheap, queried rarely). The mismatch happens when a query that should hit only the hot tier accidentally fans out to the cold tier and times out — for instance, a Loki query without a time range scans every chunk including the S3-backed ones, or an Elasticsearch query against a frozen searchable-snapshot tier that takes 5 minutes to thaw. The fix is query-time guards (Loki's query limits, Elasticsearch's per-request search timeout, ClickHouse's max_execution_time) plus team training to always specify a time range. The mistake every team makes once: a Grafana dashboard panel without a time-range filter, refreshed every 30 seconds, accidentally scans the 90-day cold tier 2880 times per day until the cost team notices the bill.
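
A sketch of the server-side guard on the ClickHouse side, assuming clickhouse-connect; max_execution_time and max_bytes_to_read are standard ClickHouse query settings, and the numeric values are illustrative.

# Sketch: server-side guards so a runaway query dies quickly instead of scanning the cold tier.
# Assumes clickhouse-connect; the numeric limits are illustrative, not recommendations.
import clickhouse_connect

ch = clickhouse_connect.get_client(host="localhost")
result = ch.query(
    "SELECT count() FROM logs WHERE service = 'payments-api' AND ts > now() - INTERVAL 1 HOUR",
    settings={
        "max_execution_time": 10,               # seconds: kill anything slower
        "max_bytes_to_read": 50_000_000_000,    # refuse queries that would read ~50 GB or more
    },
)
print(result.result_rows)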

The compaction-stall and the disappearing query path. Every backend has a background compaction process (Lucene's segment merging, Loki's chunk compactor, ClickHouse's part merger) that consolidates many small write artefacts into fewer large ones. When ingest exceeds compaction's throughput, small-artefact count climbs, and the query path — which has to merge across all artefacts at read time — slows in lockstep. The pathology is that the system looks fine at write time (records are landing on disk) but degrades silently on the read path (queries that took 200ms now take 5 seconds, then 30, then time out). Loki's compactor lag is visible via loki_compactor_pending_jobs; Elasticsearch's segment count via _cat/segments; ClickHouse's via system.parts WHERE active=1 GROUP BY table. The standard alert is "small-artefact count above N for more than 15 minutes" — the early warning that compaction has fallen behind and the read path is about to feel it. Teams that don't alert on this learn about it from the on-call ticket "queries are slow today" with no obvious root cause, an hour after the slowdown started.
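
A sketch of the small-artefact check the paragraph describes, assuming the local single-node containers from the reproduce section; the thresholds you alert on are yours to pick.

# Sketch: poll the small-artefact counts and flag a compactor or merger that has fallen behind.
# Assumes the local single-node containers from the reproduce section; thresholds are illustrative.
import requests, clickhouse_connect

segments = requests.get("http://localhost:9200/_cat/segments/logs?format=json", timeout=10).json()
print("es_segment_count:", len(segments))        # alert if this stays high for ~15 minutes

ch = clickhouse_connect.get_client(host="localhost")
parts = ch.query("SELECT table, count() AS active_parts FROM system.parts "
                 "WHERE active GROUP BY table").result_rows
print("ch_active_parts:", parts)                 # same alert shape: sustained high part count

# Loki publishes compactor and ingester progress as Prometheus metrics on its /metrics endpoint,
# which is the natural place to hang the equivalent alert.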


Going deeper

Lucene's segment merging and the write-amplification problem

Elasticsearch is built on Lucene, and Lucene's storage model is immutable segments. Every batch of inserts produces a new segment file (a self-contained inverted index for the batch); reads merge results across all segments; and a background merger periodically combines small segments into bigger ones to keep the segment count manageable. The trade-off is write amplification — a record may be rewritten 3-5 times as it cascades through the merge tiers (small → medium → large → very large), which is why Elasticsearch's CPU is dominated by background merging on write-heavy workloads. The merger is configurable (index.merge.policy.max_merged_segment defaults to 5 GB, index.merge.scheduler.max_thread_count defaults to half the CPU count), and tuning it for log workloads (where you almost never delete or update individual documents) typically means fewer, larger merges and a higher segment-count threshold. The pathology that bites every Elasticsearch operator: a write-heavy index where the merger can't keep up with ingest, segment count climbs into the thousands, query latency degrades because every search has to merge across all segments, and the cluster enters a death spiral where the merger is starving the indexers and vice versa. The fix is usually index.translog.flush_threshold_size larger plus refresh_interval longer (e.g., 30s) to let the merger catch up.
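
A sketch of the two settings changes this paragraph ends on, assuming elasticsearch-py 8.x against the benchmark's logs index; both are dynamic index settings, and the values are common starting points rather than universal tunings.

# Sketch: give the merger room to catch up on a write-heavy log index.
# Assumes elasticsearch-py 8.x; both settings are dynamic, values are illustrative starting points.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
es.indices.put_settings(
    index="logs",
    settings={
        "index.refresh_interval": "30s",                  # fewer refreshes, fewer tiny segments
        "index.translog.flush_threshold_size": "1gb",     # fewer forced flushes under heavy ingest
    },
)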

Loki's content-addressed chunk format and the schema-config evolution

Loki's chunks are content-addressed — the chunk's filename in object storage is a hash of its contents, which means a chunk that re-arrives (e.g., from a replicated ingester) doesn't get re-stored, it gets de-duplicated. The chunk format itself has evolved several times: v1 (gzip + simple framing), v2 (LZ4 + variable-length blocks), v3 (zstd + structured metadata), and v4 (the current default in Loki 2.8+, with structured metadata fields, label-pair-encoded blocks, and improved seek performance). The schema-config (schema_config.yaml) is how Loki's storage format is versioned over time — you specify a date and a chunk format, and from that date forward all new chunks use that format. Changing the schema doesn't rewrite old chunks; it just starts using the new format for new data. The operational consequence is that Loki deployments accumulate multiple chunk formats over their lifetime, and the query path has to handle all of them. Migrations between schemas are usually trivial but the bookkeeping (knowing which date range uses which schema) is a permanent operational concern.

ClickHouse's MergeTree, granules, and the sparse primary index

ClickHouse's MergeTree engine is the columnar equivalent of Lucene's segments — data lands in parts (one part per insert batch), parts are background-merged into bigger parts, and queries scan all parts and merge results. Each part is internally divided into granules of 8192 rows, and the primary index is a sparse index that has one entry per granule — so for a 1B-row table, the primary index is 1B/8192 ≈ 122k entries, small enough to keep in memory. Queries that include a prefix of the ORDER BY key resolve by binary-searching the sparse primary index to find the matching granules, then reading only those granules' columns. This is the reason ORDER BY (service, level, ts) is the canonical schema for log tables — it makes the most common predicates (service=X, service=X AND level=Y, time-range) into binary searches on the primary index. The skip-indexes (bloom filters, set indexes, min-max) sit on top of the primary index — they apply within granules, after the primary index has narrowed to a candidate set. The architectural elegance is that the primary index, the skip-indexes, and the columnar storage compose: a single query may use all three, and each layer narrows the data the next layer has to scan. ClickHouse's query latency is dominated by the layer that scans the most data; well-designed schemas push that work down to the deepest layer (the column scan) and minimise the work the upper layers do.
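
You can watch the layers compose by asking ClickHouse which granules each index kept; a sketch against the benchmark's logs table, assuming clickhouse-connect and a ClickHouse release recent enough to support EXPLAIN indexes = 1.

# Sketch: ask ClickHouse which granules each index layer kept for a structured-filter query.
# Assumes the benchmark's logs table and clickhouse-connect; the user_id value is illustrative.
import clickhouse_connect

ch = clickhouse_connect.get_client(host="localhost")
plan = ch.query("""
    EXPLAIN indexes = 1
    SELECT count() FROM logs
    WHERE service = 'payments-api' AND level = 'ERROR' AND user_id = 'user_42'
""").result_rows
for (line,) in plan:
    print(line)
# The plan shows, for the primary key and then for idx_user, how many granules survived out of
# the total: the narrowing each layer performs before the column scan touches any data.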

The query-language tax — what each backend asks of the engineer typing into the search box

A backend's index decides which queries are cheap, but the query language decides which queries get expressed at all. Elasticsearch's Query DSL is a JSON-shaped language with bool, must, match, term, range, aggs, and ~40 other primitives — enormously expressive but verbose, with the recurring failure mode that a match (analyzed) and term (exact) on the same field return different results, and the engineer typing the query has to know which to use. KQL (Kibana Query Language) is the friendlier layer on top, but the underlying complexity leaks through. Loki's LogQL is intentionally Prometheus-shaped: {label_filter} | line_filter | parser | label_filter | metric_aggregation, with the constraint that the label filter must select a bounded set of streams. The narrowness is the point — engineers cannot write queries that scan the whole corpus, because the language doesn't let them. ClickHouse's SQL is full ANSI SQL with log-specific extensions (hasToken, extractAllGroups, JSONExtract*, arrayJoin), which is the most expressive of the three but also the most footgun-prone: a LIKE '%foo%' with no time range scans every row in the table, and there is no syntactic guard against it. The query-language tax is the cost of the engineer's time learning to express questions in the backend's idiom; teams that underestimate this end up with engineers who run the same three queries they memorised and never explore the rest of the corpus.
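
The match-versus-term trap is cheap to demonstrate against the benchmark index, assuming Elasticsearch's default dynamic mapping created both the analysed service field and its service.keyword sub-field.

# Sketch: the same predicate, expressed two ways, returns different counts.
# Assumes the benchmark's logs index with Elasticsearch's default dynamic mapping.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# match analyses the query: "payments-api" becomes ["payments", "api"] and matches the analysed text field.
analysed = es.count(index="logs", query={"match": {"service": "payments-api"}})["count"]

# term is exact: the analysed field holds no single token "payments-api", so this can silently return 0.
exact_wrong = es.count(index="logs", query={"term": {"service": "payments-api"}})["count"]

# The exact match belongs on the un-analysed keyword sub-field.
exact_right = es.count(index="logs", query={"term": {"service.keyword": "payments-api"}})["count"]

print(analysed, exact_wrong, exact_right)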

The two-backend pattern at scale — why most teams above 10 TB/day run two

Teams above ~10 TB/day of logs almost universally end up running two backends, even when the original architecture was a single-backend bet. The pattern is fan-out at the shipper: every record goes to a "cheap retention" backend (Loki or S3+Athena) for the long tail of forensic queries, and a tagged subset (typically 5-15% of records — anything level=ERROR, anything tagged audit=true, anything from a service flagged as "high-stakes") additionally goes to a "fast query" backend (Elasticsearch or ClickHouse) for the queries the team runs daily. The two-backend cost is roughly: Loki at ₹0.5-1/GB/month for object-storage-backed retention, ES/CH at ₹3-5/GB/month for hot-tier queries, weighted by the fan-out ratio. At Razorpay's 50 TB/day, the math works out to ~₹2 crore/year for the Loki tier and ~₹50 lakh/year for the Elasticsearch tier (a 5% fan-out at roughly five times the per-GB price), versus ~₹4-5 crore/year if all data went to Elasticsearch. The two-backend complexity is real (two upgrade cycles, two query languages, two on-call rotations) but at scale the cost difference funds the engineering team that operates both with budget left over.
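
The fan-out arithmetic, written out; the per-GB prices are the illustrative figures already quoted above, the retention window is assumed to be roughly one month of hot data, and none of it is a vendor quote.

# Back-of-envelope fan-out cost model using the illustrative per-GB figures quoted above.
daily_gb      = 50_000          # ~50 TB/day
monthly_gb    = daily_gb * 30   # assume ~1 month of data priced at any one time
fanout_ratio  = 0.05            # share of records also sent to the fast/expensive backend

loki_rs_gb_month = 1.0          # ₹/GB/month, upper end of the ₹0.5-1 range above
es_rs_gb_month   = 3.0          # ₹/GB/month, lower end of the ₹3-5 range above

loki_tier_per_year = monthly_gb * loki_rs_gb_month * 12
es_tier_per_year   = monthly_gb * fanout_ratio * es_rs_gb_month * 12
all_es_per_year    = monthly_gb * es_rs_gb_month * 12

crore = 1e7
print(f"two backends: ₹{(loki_tier_per_year + es_tier_per_year) / crore:.1f} crore/year "
      f"vs all-Elasticsearch: ₹{all_es_per_year / crore:.1f} crore/year")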

# Reproduce this on your laptop
docker run -d --name es      -p 9200:9200 -e "discovery.type=single-node" -e "xpack.security.enabled=false" elasticsearch:8.13.0
docker run -d --name loki    -p 3100:3100 grafana/loki:2.9.0
docker run -d --name ch      -p 8123:8123 -p 9000:9000 clickhouse/clickhouse-server:24.3
python3 -m venv .venv && source .venv/bin/activate
pip install elasticsearch requests clickhouse-connect faker tqdm
python3 log_backend_bench.py
# Expected: ingest_clickhouse fastest, ingest_loki middle, ingest_es slowest;
# structured_filter wins on ClickHouse; phrase_search wins on Elasticsearch;
# aggregation wins on ClickHouse by 10-30x.

Where this leads next

The chapters after this move into LogQL and the query patterns the backend choice enables. Loki's LogQL is intentionally narrow (label filters + line filters + simple aggregations); Elasticsearch's Query DSL is intentionally broad (full-text, structured, aggregations, joins via parent-child); ClickHouse's SQL is full SQL with the additional log-specific functions (hasToken, extractAllGroups, JSONExtract). The query language you use shapes how you think about logs — teams that move from Elasticsearch to Loki initially struggle because Loki's narrowness forces explicit label thinking, but most teams report after 6 months that the constraint produced more disciplined logging practices and better long-term cost behaviour.

A practical implication of the index-is-architecture framing: the backend choice is the most expensive observability decision a team makes after the metrics-store choice, because reversing it costs months of dual-running and dashboard rewrites. Get it right by spending two weeks evaluating against your actual query mix (sampled from your existing logs or, if you're greenfield, sketched from the queries you imagine running) rather than by reading vendor benchmarks. The vendor benchmarks measure the queries the vendor wants to highlight; your traffic measures the queries you'll actually run.

The thread running through this chapter and the rest of Part 3 is that the index is the architecture. The shipper decides what gets to the backend; the backend's index decides what queries are cheap; the query language decides what questions get asked at all. Every layer of this pipeline is a constraint that shapes the layer above it, and a team that picks the wrong backend at the bottom can never recover at the layers above — no LogQL fluency rescues a Loki cluster from a request_id label, and no Kibana mastery rescues an Elasticsearch cluster from a field-explosion. The discipline is to pick the backend whose index matches the questions you'll ask in the worst hour of your year, and to design the shipper and the schema to feed that index its preferred shape.

The teams that get this right tend to share a habit: they treat the question "what does the on-call engineer type into the search box at 03:00" as the single most important input to the backend choice, more important than vendor benchmarks, more important than peer recommendations, and more important than the team's existing skill set. The on-call engineer types whatever the language and index allow them to type easily; the language and index were chosen years earlier; the chain of decisions from years-ago architecture to 03:00 query is the actual architecture of your observability. Make that chain visible to the team that operates the backend, and the trade-offs become arguable rather than inherited.

References