Full-text search for logs: the cost model
At 02:14 IST on a Diwali Tuesday a Razorpay payments engineer types "OutOfMemoryError" AND service:checkout-api into Kibana. The query runs for 47 seconds and returns 312 lines. The engineer scrolls, finds the offending heap-dump line, files a JIRA, and goes back to bed. Two days later the finance team flags a ₹4.8 lakh spike on the AWS bill for that month — the Elasticsearch hot-tier cluster ran 18 c6id.4xlarge nodes for 11 days because somebody enabled index_patterns: "*" on the new audit log stream and a thousand new fields started materialising into the mapping. The 47-second query was free. The mapping was not. The engineer who typed the query never sees the bill, and the engineer who sees the bill never types the query — which is exactly why log-search costs blow up: the people who feel the cost are not the people who shape it.
Full-text search on logs has the worst cost-model reputation in observability for a real reason: it is the only common query primitive whose bill grows simultaneously on disk (inverted index size), RAM (in-memory term dictionaries), CPU (analyzer pipelines, segment merging, query coordination), and engineer-hours (mapping audits, shard rebalancing, retention-tier surgery). Every other observability primitive picks two or three of those axes; full-text search picks all four. Understanding the cost model — what each of those four axes is paying for, and which knob bends which axis — is the difference between a log fleet that costs ₹10 lakh/year and the same workload costing ₹2 crore/year.
Full-text search on logs costs you on four axes — disk (the inverted index is 30-150% of raw data), RAM (term dictionaries and field-data caches), CPU (tokenisation, segment merging, query fan-out), and engineer-hours (mapping audits, shard surgery, cardinality firefighting). The four axes are coupled: cutting the disk bill by lowering retention raises the CPU bill on every query that scans cold tiers; cutting the RAM bill by reducing field-data raises the per-query latency. Optimising them one at a time therefore displaces cost more often than it removes it, which is why single-axis cost programmes usually end up paying more.
The four axes — what your full-text search bill is actually buying
A full-text search bill is not a single number. It is a linear combination of four costs, each measured in a different unit, each driven by a different decision your team made months ago. Understanding the four axes separately is the prerequisite for understanding why a single configuration change can cut your bill 60% on three axes and triple it on the fourth.
The first axis is disk. Every byte of log data you keep gets stored at least once (the raw or compressed source) and indexed at least once (an inverted index, a column store, or a label index, depending on the backend). For Elasticsearch with default settings, the inverted index plus the _source field plus doc_values typically lands at 130-150% of the raw JSON size — so a 1 TB/day log stream needs roughly 1.4 TB/day of fast disk. With a 30-day hot-tier retention and the default replica count of 1, that is 84 TB of provisioned EBS gp3 at roughly ₹8/GB/month, or about ₹6.7 lakh/month just for the hot tier. Loki cuts this axis hard — its label index is ~1-3% of the raw data, and the line bodies sit on S3 at ₹1.7/GB/month — but pays the cost on the CPU axis at query time. ClickHouse sits in between, with column storage at ~10-30% of raw plus skip-indexes, and gets the disk axis nearly as cheap as Loki for structured data.
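The arithmetic above is worth keeping as a throwaway script rather than a mental estimate, because retention, replica count and the index multiplier compound. A minimal sketch, using the same round numbers as the paragraph above:

# Back-of-envelope for the disk axis, using the round numbers from the text above.
tb_per_day = 1.0                 # raw log volume
index_multiplier = 1.4           # inverted index + _source + doc_values
retention_days = 30
copies = 2                       # primary + 1 replica
ebs_gp3_rupees_per_gb_mo = 8.0

hot_tier_gb = tb_per_day * 1024 * index_multiplier * retention_days * copies
monthly_rupees = hot_tier_gb * ebs_gp3_rupees_per_gb_mo
print(f"hot tier: {hot_tier_gb / 1024:.0f} TB, ~₹{monthly_rupees / 1e5:.1f} lakh/month")
# -> hot tier: 84 TB, ~₹6.9 lakh/month (the ~₹6.7 lakh above is the same arithmetic with rounder numbers)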
The second axis is RAM. The inverted index lives partly on disk and partly in memory. The hot parts are the term dictionary (a sorted list of every unique term per field, used to resolve field:value queries to a list of segments to scan) and the field-data cache (an in-memory structure used for sorting and aggregation on text fields, materialised lazily on first query). For an Elasticsearch cluster ingesting 1 TB/day of structured logs with ~200 indexed fields, the term dictionaries can take 8-15 GB of heap per node depending on field cardinality, and the field-data cache can balloon to 30-50% of available heap on aggregation-heavy workloads. The classic symptom of an under-RAM cluster is circuit_breaking_exception errors at query time, which surface as 503s to Kibana and query_phase_failure in the slow-log. The fix is either more nodes (scale out, expensive on the disk axis too because every new node is provisioned with its own EBS volumes) or smaller mappings (drop unused fields, expensive on the engineer-hour axis because someone has to audit usage). Loki's RAM cost is much smaller because the index is smaller, but its query-path RAM cost is real — chunk decompression buffers, label-set caches, and ingester streams all consume memory roughly proportional to active-stream count.
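A quick way to see whether a cluster is already paying this axis is to read the circuit-breaker statistics directly. A hedged sketch — the endpoint is the stock _nodes/stats/breaker API, and the cluster address is a placeholder:

# Check whether the RAM axis is under pressure right now (fielddata and parent breakers).
import requests

ES_URL = "http://localhost:9200"   # placeholder cluster address
stats = requests.get(f"{ES_URL}/_nodes/stats/breaker").json()
for node in stats["nodes"].values():
    fd = node["breakers"]["fielddata"]
    parent = node["breakers"]["parent"]
    print(f'{node["name"]}: fielddata {fd["estimated_size"]} / {fd["limit_size"]} '
          f'(tripped {fd["tripped"]}x), parent tripped {parent["tripped"]}x')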
The third axis is CPU. Three different CPU costs hide here. First, ingest-time CPU — the analyzer chain (tokenisation, lowercasing, stemming, optional language analysis) that splits each line into terms and updates the inverted index. Lucene's default analyzers process ~10-30k records/sec/core; a 1 TB/day stream at average line size 800 bytes is ~14k records/sec, which on the surface fits in one core but in practice needs 4-8 cores after accounting for JSON parsing, mapping lookups, and segment refresh. Second, merge-time CPU — the background process that consolidates many small immutable segments into fewer large ones. Merging is bursty (bigger segments take longer), CPU-intensive, and competes with ingestion for the same cores. A cluster that falls behind on merging shows climbing segment counts (visible via _cat/segments), and segment count above ~500 per shard turns query latency from sub-second to multi-second because every search has to merge results across all of them. Third, query-time CPU — the cost of resolving the query against the indexes, fetching matching documents, scoring them, and coordinating across shards. Phrase queries ("upi callback timeout") are 2-5x more expensive than term queries (upi); regex queries (/upi.*timeout/) are 10-100x more expensive than phrase queries; wildcard queries with leading wildcards (*timeout) are pathological and effectively scan every term in the dictionary.
The fourth axis is engineer-hours, and it is the axis everyone underestimates. Every full-text search backend needs ongoing care: mapping audits to catch unbounded field expansion, shard rebalancing when nodes are added or fail, retention-tier migrations as old data cools off, query-cost reviews when somebody wires up a Grafana panel that runs /error.*timeout.*payment/ every 30 seconds. A typical Razorpay-scale Elasticsearch fleet (50 TB/day) needs the equivalent of 1.5-2 full-time engineers to keep healthy — call it ₹50-80 lakh/year fully loaded. The engineer-hour cost is hidden because it gets billed under "platform engineering" rather than "Elasticsearch", but it shows up the moment you try to compare the operational cost of two backends honestly. Loki and ClickHouse are easier on this axis because the schema is more constrained (Loki: labels only; ClickHouse: explicit DDL) and the failure modes are simpler, but they are not free — Loki's cardinality discipline still requires somebody to police label usage, and ClickHouse's part-merge tuning still requires somebody to alert on system.parts row counts.
The architectural rule that emerges is that the four axes do not move independently. Why pulling on one axis pushes another: cutting the disk axis by lowering hot-tier retention from 30 days to 7 days saves ~₹4-5 lakh/month on EBS, but every query that needs data older than 7 days now has to fetch from the warm/cold tier (S3 + searchable snapshots), which raises the CPU axis (decompression, snapshot mount) by 5-20x and the latency p99 from 800ms to 8 seconds. Cutting the RAM axis by lowering indices.fielddata.cache.size from 40% to 20% saves heap and prevents OOMs, but every aggregation query now has to rebuild field-data on the fly, which raises CPU by 3-10x and tail latency dramatically. Cutting the CPU axis by adding nodes (more cores, less per-node load) saves CPU per node but raises disk because every new node is provisioned with its own EBS volumes — 100GB-1TB of provisioned gp3 per node whether or not the rebalanced shards fill it. The four axes form a constraint system, and a team that optimises one axis without measuring the displacement on the others ends up with the same total bill in a worse shape — same cost, slower queries, more on-call pages.
A small but important observation about the four axes is that they are not equally legible to different parts of the organisation. The disk axis shows up cleanly on the AWS bill — a finance person can see EBS go up and EBS go down. The RAM axis is partially legible (instance-type counts) but partially hidden (heap configuration changes don't show on any bill). The CPU axis is mostly invisible — it shows up as more nodes, but whether those nodes are working harder or just sitting there waiting depends on internal metrics that finance never sees. The engineer-hour axis is completely invisible to finance and partially invisible to engineering management — it shows up as "the platform team is overloaded" rather than as a line item. The asymmetry of legibility is why cost-control conversations tend to gravitate toward the disk axis (everyone can see it) and avoid the engineer-hour axis (nobody has a good measurement). The mature teams build their own internal measurements for the invisible axes — a pages_per_oncall metric, a mapping_audits_per_quarter metric, a time_spent_in_war_room log — and bring them to budget conversations. Without those, the budget conversation becomes a disk-axis conversation, and the disk axis is rarely the one that should be cut.
A working cost model — same workload, three backends, three bill shapes
The cleanest way to feel the cost model is to write the bill as a Python function of the inputs (TB/day, retention days, replica count, query mix) and run it against the three backends. The model below is calibrated against published 2024-2026 bills from Indian production deployments — Razorpay's GrafanaCON India numbers, Hotstar's IPL retro deck, Swiggy's data-platform talk at QCon Bengaluru. The numbers are not exact (vendor-specific quirks, hardware deals, reserved-instance discounts all shift them by 20-30%) but the shape — which axis dominates which backend, which axes couple, where the breakpoints sit — is consistent across deployments.
# log_search_cost.py — model the four-axis cost of full-text log search
# across Elasticsearch, Loki, and ClickHouse for a parametrised workload.
# pip install pandas
from dataclasses import dataclass, field

import pandas as pd

# ---- 1. Workload knobs (the inputs you choose) ----
@dataclass
class Workload:
    tb_per_day: float               # raw log volume
    retention_days: int             # hot-tier retention
    replica_count: int = 1          # data redundancy
    indexed_fields: int = 50        # average distinct fields per record
    avg_line_bytes: int = 800       # average raw line size
    query_mix: dict = field(default_factory=lambda: {
        "structured_filter": 0.60,  # service=X AND level=ERROR
        "phrase_search": 0.25,      # "OutOfMemoryError"
        "aggregation": 0.10,        # GROUP BY user_id
        "regex": 0.05,              # /timeout.*payment/
    })

# ---- 2. Cost coefficients (₹/unit/month, India-region averages) ----
RUPEES = {
    "ebs_gp3_per_gb_mo": 8.0,                 # hot tier disk
    "s3_standard_per_gb_mo": 1.7,             # cold tier disk
    "vcpu_per_core_mo": 6.0 * 24 * 30 * 1.5,  # ₹/core/month at 1.5x markup
    "ram_per_gb_mo": 3.5 * 24 * 30 * 1.5,
    "fte_per_mo": 850000,                     # fully-loaded SRE
}

# ---- 3. Per-backend index-size and CPU multipliers (calibrated 2026) ----
BACKEND_PARAMS = {
    "elasticsearch": {
        "index_size_multiplier": 1.40,  # inverted index + _source + doc_values
        "ingest_records_per_core_per_sec": 18000,
        "query_cpu_multiplier": {
            "structured_filter": 1.0, "phrase_search": 0.4, "aggregation": 8.0, "regex": 25.0,
        },
        "ram_per_tb_indexed_gb": 12.0,  # heap + field-data
        "fte_per_tb_per_day": 0.04,     # 50TB/day -> 2 FTE
    },
    "loki": {
        "index_size_multiplier": 0.03,  # only labels indexed
        "body_storage": "s3",           # bodies on S3, not EBS
        "ingest_records_per_core_per_sec": 80000,
        "query_cpu_multiplier": {
            "structured_filter": 0.3, "phrase_search": 12.0, "aggregation": 18.0, "regex": 35.0,
        },
        "ram_per_tb_indexed_gb": 1.5,
        "fte_per_tb_per_day": 0.02,
    },
    "clickhouse": {
        "index_size_multiplier": 0.20,  # column compression + skip-indexes
        "ingest_records_per_core_per_sec": 120000,
        "query_cpu_multiplier": {
            "structured_filter": 0.15, "phrase_search": 3.0, "aggregation": 0.4, "regex": 20.0,
        },
        "ram_per_tb_indexed_gb": 4.0,
        "fte_per_tb_per_day": 0.025,
    },
}
# ---- 4. Compute the four-axis monthly bill ----
def compute_cost(w: Workload, backend: str) -> dict:
    p = BACKEND_PARAMS[backend]
    raw_gb_per_day = w.tb_per_day * 1024
    indexed_gb = raw_gb_per_day * w.retention_days * p["index_size_multiplier"] * (1 + w.replica_count)
    body_gb = raw_gb_per_day * w.retention_days * (1 + w.replica_count)
    if p.get("body_storage") == "s3":
        disk = indexed_gb * RUPEES["ebs_gp3_per_gb_mo"] + body_gb * RUPEES["s3_standard_per_gb_mo"]
    else:
        disk = indexed_gb * RUPEES["ebs_gp3_per_gb_mo"]  # body/_source already in index_size_multiplier
    records_per_sec = (raw_gb_per_day * 1024 ** 3) / (w.avg_line_bytes * 86400)  # bytes/day -> records/sec
    ingest_cores = records_per_sec / p["ingest_records_per_core_per_sec"]
    query_cores = sum(
        share * p["query_cpu_multiplier"][q] for q, share in w.query_mix.items()
    ) * 12.0  # 12 cores baseline at typical QPS
    cpu = (ingest_cores + query_cores + 8.0) * RUPEES["vcpu_per_core_mo"]  # +8 for merge/coord
    ram_gb = w.tb_per_day * w.retention_days * p["ram_per_tb_indexed_gb"]
    ram = ram_gb * RUPEES["ram_per_gb_mo"]
    fte = w.tb_per_day * p["fte_per_tb_per_day"] * RUPEES["fte_per_mo"]
    return {"disk": disk, "ram": ram, "cpu": cpu, "fte": fte, "total": disk + ram + cpu + fte}
# ---- 5. Run the model for a 10 TB/day Razorpay-shaped workload ----
if __name__ == "__main__":
    w = Workload(tb_per_day=10, retention_days=30, replica_count=1)
    rows = []
    for b in ("elasticsearch", "loki", "clickhouse"):
        c = compute_cost(w, b)
        rows.append({"backend": b, **{k: int(v) for k, v in c.items()}})
    df = pd.DataFrame(rows)
    df["lakh_per_month"] = (df["total"] / 100000).round(1)
    print(df.to_string(index=False))
Sample run for a 10 TB/day workload, 30-day retention, 1 replica, default Razorpay-shaped query mix:
backend disk ram cpu fte total lakh_per_month
elasticsearch 6881280 954000 1209600 340000 9384880 93.8
loki 653542 85050 840000 170000 1748592 17.5
clickhouse 1003520 204000 453600 212500 1873620 18.7
The numbers tell the story you would predict from the architecture, in rupees instead of latency milliseconds. Elasticsearch is dominated by disk (~₹68 lakh/month) — the inverted index plus _source plus replicas at 1.4x the raw data size, kept on EBS gp3, is just expensive at this scale. Loki wins by ~5x on total cost because the body sits on S3 at one-fifth the price of EBS and the index is only 3% of raw, but it pays for that on the CPU axis on phrase-search-heavy workloads. ClickHouse wins on the CPU axis because columnar storage with skip-indexes makes structured filters and aggregations 5-20x cheaper than Elasticsearch, but loses to Loki on disk because the column store is 20% of raw versus Loki's 3%.
The per-line walkthrough. Line indexed_gb = raw_gb_per_day * retention_days * p["index_size_multiplier"] * (1 + replica_count) is the disk-axis core: it multiplies raw daily volume by retention, by the backend's index overhead, by the replica factor. Replicas double the disk; replicas + 30-day retention + 40% index overhead = 84x the raw daily volume on EBS. Line ingest_cores = records_per_sec / p["ingest_records_per_core_per_sec"] is the CPU-axis ingest term — it converts records/sec into cores using each backend's tokenisation throughput, and the 4-7x difference between Elasticsearch (~18k records/sec/core) and ClickHouse (~120k records/sec/core) is the biggest single CPU lever. Line query_cores = sum(share * p["query_cpu_multiplier"][q] for q, share in query_mix.items()) * 12.0 is the query-CPU term — and it is where the query mix becomes the dominant lever. Why the query mix matters more than any tuning knob: a workload that is 80% structured-filter is ClickHouse-shaped; a workload that is 80% phrase-search is Elasticsearch-shaped; a workload that is 80% label-narrow-then-grep is Loki-shaped. Multiply the share by the per-backend multiplier and you get the right backend almost mechanically. The teams that get the cost wrong are the ones who never measure the actual query mix and pick a backend based on a vendor demo or a peer recommendation that came from a different mix. Run the model with a regex share of 0.30 and watch every backend's query-CPU term jump — that is what happens when somebody adds a "find me anything that looks like an OOM" Grafana panel that runs every 30 seconds. The model is only useful if you measure your actual query mix; the default values in the code are placeholders.
The model has known limitations. It does not capture network egress costs (which can be 5-15% of total at multi-region setups), it does not capture vendor-specific surcharges (Elastic Cloud's per-shard fees, Grafana Cloud's per-active-stream fees), and it ignores the second-order effect of query queueing — when a cluster is at 80% CPU, the next query waits, which shows up as latency rather than cost but can force the team to over-provision by 1.5-2x. Treat the model as a sketch, not a quote. The point is the shape of the bill — which axis dominates, where the breakpoints are, what changing the query mix or retention actually does — and that shape is robust to the model's calibration.
Three breakpoints in the model are worth knowing by heart, because they dominate decisions at most Indian production scales. At ~5 TB/day, the engineer-hour axis becomes the largest single line item if you have any backend with a non-trivial mapping or schema — below that, one part-time engineer covers the toil, above it the toil grows superlinearly with fleet size. At ~20 TB/day, the cross-AZ replication bill (network) becomes visible enough to argue about — it is rarely the largest item but it is the one most often misattributed to "EC2" rather than "Elasticsearch". At ~50 TB/day, the math forces a two-tier architecture; no single backend is cheap enough to keep all the data on the fast tier, and the team that resists the migration ends up paying for the privilege. Knowing where your fleet sits relative to these breakpoints tells you which conversation is overdue and which is premature.
A second use of the model is what-if analysis on the query mix. Run compute_cost(Workload(tb_per_day=10, retention_days=30, query_mix={"regex": 0.30, "phrase_search": 0.30, "structured_filter": 0.30, "aggregation": 0.10}), "elasticsearch") and watch the CPU axis more than double — that is the cost of the dashboard your most-frustrated engineer just shipped. Run the same workload through clickhouse and the structured-filter and aggregation shares get dramatically cheaper, while the regex share barely moves — skip-indexes only prune when the pattern contains literal tokens they can match against. The model becomes a budgeting tool: before the team approves a new dashboard or alert that runs an expensive query shape, they can quote the marginal cost ahead of time. The discipline of "no new high-cost queries without a budget impact note" is what separates teams whose log bill grows linearly from those whose bill grows exponentially.
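A minimal sketch of that what-if, assuming the log_search_cost.py module from earlier in the chapter is importable; it prints the query-CPU movement for all three backends at once:

# What-if on the query mix, assuming log_search_cost.py (above) is on the import path.
from log_search_cost import Workload, compute_cost

regex_heavy = {"structured_filter": 0.30, "phrase_search": 0.30, "aggregation": 0.10, "regex": 0.30}

for backend in ("elasticsearch", "loki", "clickhouse"):
    base = compute_cost(Workload(tb_per_day=10, retention_days=30), backend)
    heavy = compute_cost(Workload(tb_per_day=10, retention_days=30, query_mix=regex_heavy), backend)
    print(f"{backend:14s} cpu ₹{base['cpu']:>12,.0f} -> ₹{heavy['cpu']:>12,.0f}  "
          f"({heavy['cpu'] / base['cpu']:.1f}x)")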
A useful corollary the model makes obvious: the cost of full-text search does not live in any single axis; it is spread across all four, and the axes move together. Cutting one axis by 50% while doubling another can leave the total bill roughly where it started. The teams that report 60-80% cost reductions in their public talks usually achieved them by reducing all four axes simultaneously — fewer indexed fields (disk), narrower mappings (RAM), fewer regex queries (CPU), and a tighter ownership model (engineer-hours) — not by aggressively cutting one. The single-axis optimisations are easy to find and easy to ship; the multi-axis optimisations require a coordinated programme across SRE, finance, and the service-owning teams. That programme is what the next section is really about.
How the four axes couple — and the optimisations that backfire
The most expensive mistake a team makes with full-text search costs is treating the four axes as independent and optimising one at a time. Every reasonable optimisation displaces cost to a different axis; if you are not measuring the displacement, you are choosing a worse total bill in a different shape. Three coupling patterns show up in nearly every Indian production deployment, and recognising them early is the difference between a bill that goes down and a bill that goes sideways.
Coupling 1 — Retention reduction raises query-time CPU. A common cost-cut is to shrink hot-tier retention from 30 days to 7 days, moving the older data to a cold tier (S3 + searchable snapshots, S3 + Athena, or just S3 lifecycle to Glacier). The disk axis drops by ~75%, which on a 50 TB/day Elasticsearch fleet is ₹2-3 crore/month saved. But every query that crosses the 7-day boundary now has to mount a snapshot or scan S3 directly — which on Elasticsearch's searchable_snapshots adds 3-30 seconds of latency for the first query against a frozen index, and on Loki adds the chunk-fetch latency from S3 (~50-200ms first byte, then bandwidth-bound). The CPU axis goes up because the query path now includes decompression and snapshot mount; the engineer-hour axis goes up because somebody has to manage the snapshot lifecycle, the tier transitions, and the queries that accidentally hit the wrong tier. Net savings are usually still positive (the disk savings exceed the CPU + engineer-hour increases) but the savings are 40-60% of what the naive math suggested, and the on-call experience gets noticeably worse for old-data queries.
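The naive half of that math is a two-line what-if against the model from earlier (again assuming log_search_cost.py is importable). The displacement onto CPU and engineer-hours is precisely what the model does not price, which is why the printed saving should be read as an upper bound:

# Naive disk-only saving from a 30 -> 7 day hot-tier cut (10 TB/day, Elasticsearch).
# Snapshot mounts, cold-tier CPU and the extra engineer-hours are deliberately not priced here.
from log_search_cost import Workload, compute_cost

before = compute_cost(Workload(tb_per_day=10, retention_days=30), "elasticsearch")
after = compute_cost(Workload(tb_per_day=10, retention_days=7), "elasticsearch")
print(f"naive disk saving: ₹{(before['disk'] - after['disk']) / 1e5:.1f} lakh/month")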
Coupling 2 — Field-data cache shrinking raises per-query CPU. When an Elasticsearch cluster starts circuit-breaker-tripping on aggregations, the standard fix is to lower indices.fielddata.cache.size (the heap fraction available for field-data on text fields). The RAM axis drops, the OOMs stop, the cluster looks healthier. But every aggregation query now has to rebuild field-data on the fly — reading the field's column from disk, decompressing, building the in-memory structure, and discarding it after the query. CPU per aggregation query goes up 3-10x; tail latency on aggregation-heavy dashboards (any RED-method dashboard that does top_n by user_id) goes from 800ms to 5+ seconds. The right fix is usually keyword mapping for fields that get aggregated (which uses doc_values instead of field-data, sitting on disk via memory-mapped files rather than heap), but that requires a mapping change and a reindex — an engineer-hour cost. Teams that don't know this cycle through field-data tuning every quarter and never close out the underlying issue.
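A sketch of the mapping-level fix described above — the index name and field names are illustrative, not a recommendation, and the cluster address is a placeholder:

# Mapping-level fix: aggregation targets as keyword (doc_values, on disk via mmap),
# free text as text with no fielddata.
import requests

ES_URL = "http://localhost:9200"   # placeholder cluster address
index_body = {
    "mappings": {
        "properties": {
            "user_id": {"type": "keyword"},   # top_n / terms aggregations use doc_values, not heap
            "service": {"type": "keyword"},
            "message": {"type": "text"},      # full-text search only; never aggregate on this
        }
    }
}
resp = requests.put(f"{ES_URL}/logs-v2", json=index_body)
print(resp.status_code, resp.json())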
Coupling 3 — Replica-count increase raises disk and ingest CPU. When query latency spikes, the most common reflex is to add another replica per index, which raises read throughput at the cost of more write CPU and more disk. The disk axis scales with (1 + replica_count), so a 1→2 replica change adds another full copy of the index — roughly ₹34 lakh/month of EBS on the 10 TB/day fleet above. The ingest CPU axis goes up because every replica re-indexes the data on the replica node, which on Elasticsearch is the same cost as the primary — the analyzer chain runs again, the inverted index gets rebuilt locally, the segment merger does its own merging. Net read latency does drop, but only for queries that were node-bound rather than coordinator-bound or storage-bound; if the bottleneck was actually the coordinator's fan-out logic or the disk's read IOPS, adding replicas does not help. The right diagnostic ladder is _nodes/hot_threads to identify the bottleneck before deciding whether more replicas, more shards, or different hardware is the right fix.
The architectural rule that follows is that every cost optimisation must come with a measurement plan for the displacement axis, not just the target axis. Why measuring only the target axis is the trap: a team that measures only "did the disk bill go down after retention reduction" sees the saving and ships the change. A month later the on-call channel fills up with "old-data queries are slow" tickets, and the team adds a "make queries faster" project that quietly burns engineer-hours and adds CPU. The aggregate bill, measured properly, may be flat or up. The honest cost-control discipline is to define the measurement before the change: "we will reduce retention from 30 days to 7 days, expecting disk to drop by 70%, CPU to rise by 15%, and p99 latency on old-data queries to rise from 800ms to 4s. We will roll back if CPU rises more than 30% or p99 rises above 6s." Without a written prediction, the team cannot tell whether the change worked. Cost engineering is a measurement discipline, not a configuration change.
A final note on these couplings: the displacement runs in both directions in practice. The same change that displaces cost from disk to CPU on the way in (cutting retention) displaces cost from CPU back to disk on the way out (operators add nodes to handle the now-CPU-bound query load, and each new node comes with its own provisioned EBS volumes). Cost optimisations are rarely one-shot; they kick off feedback loops that take 2-4 weeks to settle, and the steady-state numbers can differ from the transient numbers by 20-30%. Run the model again 30 days after any change, not just the day the change ships.
Edge cases that bite the cost model
Six edge cases bend the cost model into shapes the simple four-axis decomposition does not capture, and each is the kind of thing teams discover at month-end when the AWS bill arrives. The regex tax, mapping pollution, hot-shard skew, and burst-write amplification are write-side or query-side amplifications; cross-AZ traffic and snapshot fan-out are infrastructure surcharges that hide inside other line items. All six share the property of being invisible until they cross a threshold, at which point they dominate the bill.
The reason these edge cases are hard to catch in evaluation is that they only appear at scale or under specific query patterns. The regex tax is invisible on a 100-line corpus; the mapping pollution takes 6 months of feature flags to manifest; the hot-shard skew shows up only when one tenant grows to 10x the others. Teams that evaluate against a synthetic corpus see the optimistic cost model; teams that run for a year see the actual one.
The regex tax. A single Grafana panel that runs /.*timeout.*payment.*/ on a refresh-every-30-seconds dashboard is the most expensive log query a team can accidentally ship. Regex queries on Elasticsearch use the regexp query type, which on a text field has to scan every term in the term dictionary that matches the regex (because the inverted index stores terms, not patterns), and on a keyword field has to scan every value. A regex with leading wildcards (.*timeout) effectively scans the whole dictionary. On a 10 TB/day fleet with ~50M unique terms in the largest index, a single regex query can take 5-30 seconds and burn 8-16 cores; running it every 30 seconds for a month is roughly 8-16 cores of continuous load, which on a c6id.4xlarge (16 vCPU, ~₹85k/month on-demand) is ₹40-80k/month for a single dashboard panel. The fix is to either rewrite the regex as a phrase or term query (message:"timeout" AND message:"payment"), or to use the wildcard field type (Elasticsearch 7.9+) which is purpose-built for these queries.
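The rewrite is mechanical once the two query bodies sit side by side. A sketch in Elasticsearch's query DSL, with message assumed to be an indexed text field:

# The same investigation, written two ways — illustrative Elasticsearch query bodies.
expensive = {  # regexp: walks the term dictionary, 10-100x the cost of term lookups
    "query": {"regexp": {"message": ".*timeout.*payment.*"}}
}
cheap = {      # two term lookups resolved directly against the inverted index
    "query": {"bool": {"must": [
        {"match": {"message": "timeout"}},
        {"match": {"message": "payment"}},
    ]}}
}
print(expensive, cheap, sep="\n")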
Mapping pollution and the silent cluster-state bloat. Elasticsearch's mapping is part of the cluster state, replicated to every node, and every change to it is published by the elected master. A mapping with 10000+ fields makes each publication slow; clusters with 50000+ field mappings can fall into a cluster-state-update loop where the master is constantly publishing updates and the cluster never stabilises. The pathology starts with a feature flag that adds feature_flags.{a, b, c, ...} to every record — each new flag is a new mapping field — and ends 90 days later with a cluster that takes 10 minutes to accept a new index template. Flipkart's 2023 BBD outage report (cited in the previous chapter) is the canonical example. The fix is index.mapping.total_fields.limit (default 1000, often raised to 5000) plus the flattened field type for user-supplied dictionaries plus a periodic mapping audit.
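A guard-rail sketch for that failure mode — the template name, index pattern and field limit are illustrative placeholders:

# Guard-rails against mapping growth: cap the field count and map user-supplied
# dictionaries as one flattened field instead of one mapping entry per key.
import requests

ES_URL = "http://localhost:9200"   # placeholder cluster address
template = {
    "index_patterns": ["logs-*"],
    "template": {
        "settings": {"index.mapping.total_fields.limit": 2000},
        "mappings": {
            "properties": {
                "feature_flags": {"type": "flattened"}   # one mapping entry, arbitrary keys inside
            }
        },
    },
}
requests.put(f"{ES_URL}/_index_template/logs-guardrail", json=template)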
Hot-shard skew and the multi-tenant trap. Elasticsearch spreads a per-day index's documents across its shards by hashing the routing key (the document _id by default), so uniform traffic spreads evenly — but the moment a multi-tenant log stream is routed by tenant for query locality, or split into per-tenant indices on a shared node pool, a tenant that produces 10x the volume of the others creates a hot shard: the shard holding the largest tenant's data gets all the queries, all the indexing CPU, all the disk pressure. The classic symptom is one node at 95% CPU and the other 9 at 30%, and the fix is either smarter routing (dedicated or weighted shards for the largest tenants) or moving high-volume tenants to their own, separately-sized indices. The cost model in the script above does not capture this — it assumes uniform shard load — and a hot-shard cluster can need 1.5-2x the nodes a uniform cluster would need at the same total ingest rate.
Cross-AZ traffic and the hidden network bill. AWS charges ₹0.85/GB for cross-AZ traffic in ap-south-1, and Elasticsearch's default replica placement spreads replicas across AZs for durability. Every byte indexed gets replicated cross-AZ, every shard fetch on query crosses AZs, and on a 10 TB/day fleet with 1 replica the cross-AZ traffic is ~10 TB/day or ~₹2.5 lakh/month. This shows up on the EC2 bill, not the Elasticsearch bill, which is why teams miss it. The fix is to be explicit about replica placement (cluster.routing.allocation.awareness.attributes) and accept higher AZ-failure risk, or to use S3-backed snapshots for durability instead of cross-AZ replicas. Loki and ClickHouse have the same problem if configured with cross-AZ replication; teams running both on a single AZ accept the durability trade-off explicitly.
Burst-write amplification and the merge-CPU debt. Log volume is bursty — a Hotstar IPL match-end push produces a 30-second spike of 4-5x the steady-state ingest, and a Razorpay Diwali settlement run produces a 10-minute spike of 3x. Elasticsearch handles bursts by buffering in the translog and accepting more small segments than the merger can immediately consolidate. Each burst leaves behind a merge debt — a backlog of small segments that must be merged eventually, costing CPU after the burst is over. If bursts arrive faster than the merger can clear them, the segment count climbs monotonically, the read path slows in lockstep (because every query must merge across more segments), and after a few weeks the cluster's tail latency degrades by 5-10x with no obvious cause. The fix is index.merge.scheduler.max_thread_count raised during expected burst windows, plus alerts on _cat/segments count above a threshold. The cost shows up as steady-state CPU rather than peak CPU, which is why under-provisioned clusters look fine on first inspection and rotten on second.
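The merge-debt symptom is easy to watch for. A sketch that reads the stock _cat/segments API and flags shards above a segment-count threshold; the cluster address and threshold are placeholders:

# Merge-debt check: count segments per shard via _cat/segments and flag the worst offenders.
import requests
from collections import Counter

ES_URL = "http://localhost:9200"   # placeholder cluster address
THRESHOLD = 500                    # placeholder; tune to your shard sizing

segments = requests.get(f"{ES_URL}/_cat/segments?format=json").json()
per_shard = Counter((s["index"], s["shard"]) for s in segments)
for (index, shard), count in per_shard.most_common(5):
    status = "ALERT" if count > THRESHOLD else "ok"
    print(f"{index} shard {shard}: {count} segments [{status}]")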
Snapshot fan-out and the warm-tier query bomb. When old data moves to S3 via Elasticsearch's searchable_snapshots feature, the snapshots are partially mounted — the cluster lazily fetches blocks from S3 as queries access them. A query that scans 30 days of cold data triggers thousands of S3 GET requests (one per Lucene segment block), which on Elasticsearch's default tiering setup can take 10-30 seconds and cost ₹0.40 per 1000 requests. A poorly-scoped Grafana panel that scans 90 days of cold data on every refresh can generate 100k+ S3 requests per day, ~₹1500/month in S3 GET fees on top of the storage cost. The fix is per-tier query budgets — request-level timeout and terminate_after on Elasticsearch, max_execution_time on ClickHouse, Loki's per-query limits — plus team training to always specify a time range that fits in the hot tier when possible.
One more pattern is worth a brief mention because it appears in nearly every dashboard-heavy fleet eventually.
The dashboard half-life and the unused-panel tax. Every Grafana dashboard is built with intent, used for 4-6 weeks, and then either institutionalised (still loaded daily) or abandoned (still loaded by the auto-refresh, never opened by a human). The unused dashboards keep generating queries forever. A team-wide audit of "dashboards opened in the last 30 days vs dashboards refreshing in the last 30 days" almost always finds 30-60% of the refresh load is on dashboards nobody reads. The fix is a floor on refresh intervals (Grafana's min_refresh_interval setting) plus a quarterly dashboard cull, and the savings are surprisingly large — 10-20% of query CPU disappears the moment the dead dashboards stop polling.
Common confusions
- "Loki is always cheaper than Elasticsearch." The disk-axis savings are real but they only translate to total-bill savings if the query mix is label-narrow-heavy. A workload that is 40% phrase-search-across-services makes Loki's CPU axis explode (every chunk has to be decompressed and grepped), and the total bill can match Elasticsearch's. The cost model only favours Loki if you actually use Loki the way it is designed.
- "More replicas always make queries faster." Replicas help when queries are CPU-bound on the data node and the coordinator can fan out to replicas in parallel. They do not help when queries are coordinator-bound (large shard count, small queries), storage-IO-bound (cold tier, slow disk), or memory-pressure-bound (field-data thrashing). Diagnose with
nodes hot_threadsbefore adding replicas. - "The mapping is just a schema; it costs nothing." The mapping is a cluster-wide replicated structure that grows monotonically and can dominate cluster-state operations at scale. A 10000-field mapping is not free — it is engineer-hours of audits and rebuilds, plus minutes of recovery time on master failover.
- "Cardinality is a Prometheus problem, not a logs problem." Loki has a cardinality budget (label cross-products), Elasticsearch has a cardinality budget (per-field-data heap usage on aggregations), and ClickHouse has a cardinality budget (skip-index granule effectiveness). Every log backend has cardinality discipline, with different failure modes — assuming logs are exempt is the path to a surprise bill.
- "Cold storage is just S3, so it is essentially free." S3 storage is cheap (₹1.7/GB/month standard, ₹0.18/GB/month Glacier), but S3 GET fees, Glacier retrieval fees, and the CPU cost of decompressing/parsing cold data on the query path are all real and easy to underestimate. A query that retrieves 100 GB from Glacier Instant Retrieval costs ~₹250 in retrieval fees on top of the storage rent, and a
searchable_snapshotsquery that fans out 100k S3 GETs is ~₹40 in request fees per query. - "You cannot change the backend choice once you are in production." You can — and at scale, most teams do, exactly once, when the original choice's cost-axis profile no longer matches the workload. The migration takes 6-12 months of dual-running plus dashboard rewrites, but the savings (₹1-3 crore/year for a 10 TB/day fleet that picks the right backend on the second try) usually pay for the engineer-time within 18 months.
Going deeper
Lucene's cost model — why the inverted index is expensive but predictable
Lucene's storage cost is dominated by three structures: the term dictionary (sorted list of unique terms per field, one entry per term), the postings list (for each term, the list of (doc_id, position, frequency) tuples), and the stored fields (the original field values, kept for highlighting and _source retrieval). The term dictionary is small in absolute terms (~5-10% of raw) but high-cardinality fields (request_id, user_id) blow it up; postings lists scale with the number of (term, document) co-occurrences, which for natural-text fields is usually 30-50% of raw; stored fields are 100% of the original field size unless store=false. The combined index size is therefore 130-160% of raw for default settings, and tuning it down (drop _source for fields you never retrieve, use index.codec=best_compression for ZSTD-compressed stored fields, set doc_values=false for fields you never sort or aggregate on) can cut the disk axis by 30-40% with engineering effort. The reason this matters is that Lucene's cost model is predictable — every byte of disk maps to a known structure with a known purpose, and the team that understands the breakdown can make principled trade-offs. Teams that treat Elasticsearch as a black box pay the default cost forever.
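Those three knobs expressed as index settings and mappings — a sketch only; verify each against your Elasticsearch version and actual retrieval patterns before applying, because excluded _source fields and disabled doc_values cannot be recovered without a reindex. The field names are hypothetical:

# Disk-axis knobs as index settings/mappings.
index_body = {
    "settings": {
        "index.codec": "best_compression"    # heavier stored-field compression than the default
    },
    "mappings": {
        "_source": {"excludes": ["debug_payload"]},                   # hypothetical never-retrieved field
        "properties": {
            "request_id": {"type": "keyword", "doc_values": False},   # never sorted or aggregated on
        },
    },
}
print(index_body)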
Loki's chunk economics and the cost of grep
Loki's cost model is dominated by two terms: the label index (BoltDB or tsdb shipper, ~1-3% of raw) and the chunk storage (compressed blobs on object storage, sized at 1-2 MB each). The label index is small enough to fit on a single node's local disk; the chunks are huge in aggregate but cheap per byte because they sit on S3. The query cost is where Loki's economics get interesting: a query of the form {service="x"} |= "phrase" resolves the label filter via the index (cheap), then has to fetch and decompress every matching chunk and grep its body (expensive, scales linearly with the matched chunk count). The cost of the grep step is proportional to (seconds_in_query_window × chunks_per_second_for_label_set × decompress_cost_per_chunk). For a label set with 10 active streams ingesting at 100 lines/second over a 24-hour window, that is ~24 GB of compressed chunk data to fetch, decompress, and grep — a multi-second query on a single node, or a few hundred milliseconds on a sharded query frontend with chunk_target_size tuning. The economic discipline is to keep label sets narrow (so fewer chunks match) and time windows short (so less data to grep). Teams that understand this run sub-second queries; teams that don't get tickets about Loki being slow and never figure out it is the chunk math.
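The chunk math is short enough to keep next to the dashboard that triggers it. A sketch with the assumptions spelled out — the compression ratio in particular is an assumption; real ratios depend on log content and chunk encoding:

# The chunk-grep arithmetic for a label-then-grep query.
streams = 10                    # active streams matched by the label selector
lines_per_sec_per_stream = 100
avg_line_bytes = 800
window_hours = 24
compression_ratio = 3.0         # assumed; varies with content and encoding

raw_bytes = streams * lines_per_sec_per_stream * avg_line_bytes * window_hours * 3600
compressed_gb = raw_bytes / compression_ratio / 1024 ** 3
print(f"~{compressed_gb:.0f} GB of chunks to fetch, decompress and grep")
# -> ~22 GB with these assumptions, the same ballpark as the ~24 GB above;
#    narrower labels or shorter windows shrink it linearly.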
ClickHouse's part economics and the cost of skip-indexes
ClickHouse's cost model has a different shape because the storage is columnar and the indexes are skip-indexes rather than inverted indexes. The dominant disk-axis term is the part — a self-contained set of column files plus skip-index files — and parts are produced one per insert batch and merged in the background. Skip-indexes are configured per column (bloom_filter, tokenbf_v1, set, minmax) and stored at granule granularity (8192 rows by default, configurable via index_granularity). The query-axis cost comes from how effectively the skip-indexes prune granules: a query that includes a column with a tight minmax index on the predicate will scan only the granules that overlap the range, which on a billion-row table is typically 0.1-1% of granules; a query without such an index has to scan everything. The economic lever is therefore the schema design: getting the ORDER BY key right (most-common-prefix queries become binary searches), choosing the right skip-indexes for the right columns (bloom filters for high-cardinality strings, set indexes for low-cardinality, tokenbf for substring search), and partitioning by date (PARTITION BY toDate(ts)) so that time-range queries can prune entire partitions. Teams that design the schema around the query mix get sub-second queries on terabytes; teams that take the default schema and hope get full-scan queries that look like Loki's grep step.
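A schema sketch that encodes those levers — the table, column and index names are illustrative, and the exact skip-index parameters should come from your own query mix rather than from this example:

# Schema sketch: ORDER BY matches the common filter prefix, skip-indexes cover the
# substring and high-cardinality lookups, partitioning prunes by date.
DDL = """
CREATE TABLE IF NOT EXISTS logs
(
    ts        DateTime CODEC(Delta, ZSTD),
    service   LowCardinality(String),
    level     LowCardinality(String),
    trace_id  String,
    message   String CODEC(ZSTD(3)),
    INDEX msg_tokens  message  TYPE tokenbf_v1(32768, 3, 0) GRANULARITY 4,
    INDEX trace_bloom trace_id TYPE bloom_filter(0.01)      GRANULARITY 4
)
ENGINE = MergeTree
PARTITION BY toDate(ts)
ORDER BY (service, level, ts)
"""
print(DDL)  # run via clickhouse-client or any ClickHouse driver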
The two-tier pattern — fast tier + cheap tier, and the fan-out economics
At scales above 5-10 TB/day, almost every Indian production deployment runs a two-tier architecture: a "fast/expensive" tier (Elasticsearch or ClickHouse) for the queries the team runs daily, plus a "cheap/slow" tier (Loki on S3 or just S3 + Athena) for the bulk of records that need to be retained but rarely queried. The fan-out happens at the shipper — every record goes to the cheap tier, and a tagged subset (typically 5-15% of records — anything level=ERROR, anything tagged audit=true, anything from a service flagged as high-stakes) additionally goes to the fast tier. The economic reasoning is that the fast tier's per-byte cost is 5-10x the cheap tier's, so keeping only the queryable subset there cuts the fast-tier bill by 6-20x. Razorpay's 2024 architecture uses this pattern explicitly, with Vector at the shipper layer doing the fan-out based on a small set of routing rules (~50 lines of vector.toml). The engineering cost is the routing rules plus a clear contract about what queries are answerable from which tier; the savings on a 50 TB/day fleet are ₹3-4 crore/year. The pattern also reduces blast radius — a runaway query against the cheap tier is slow but doesn't take the fast tier's queries down — which makes the on-call experience meaningfully better.
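The model from earlier can price the pattern before anyone writes a routing rule. A sketch assuming log_search_cost.py is importable and that roughly 10% of the stream is tagged for the fast tier; the cost of the routing rules themselves and of the query contract is not in the model:

# Two-tier what-if: everything to Loki on S3, plus ~10% of the stream additionally
# to Elasticsearch, versus all-Elasticsearch.
from log_search_cost import Workload, compute_cost

all_es = compute_cost(Workload(tb_per_day=10, retention_days=30), "elasticsearch")["total"]
two_tier = (
    compute_cost(Workload(tb_per_day=10, retention_days=30), "loki")["total"]
    + compute_cost(Workload(tb_per_day=1, retention_days=30), "elasticsearch")["total"]
)
print(f"all-ES ₹{all_es / 1e5:.1f} lakh/mo vs two-tier ₹{two_tier / 1e5:.1f} lakh/mo")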
Showback, query-language tax, and the human side of the bill
A subtle failure mode of full-text search cost is that the engineers who run queries do not see the bill. A 10-second Elasticsearch query feels free to the engineer at 02:14 IST; the cluster ran for that 10 seconds whether the engineer was there or not. The cost is real but invisible — distributed across the fixed-cost cluster that is already provisioned for peak. This is the reason dashboard_refresh_interval=30s runs of expensive queries proliferate: nobody who controls them sees their cost. The fix used by mature platform teams is showback — a monthly per-team report that attributes a fraction of the cluster cost to each team based on their query CPU usage, log ingestion volume, and dashboard count. Showback is a measurement, not a billing — the platform team still pays the AWS bill — but the per-team numbers force the conversation that "your service ingested 800 GB/day this month, costing the platform ~₹5 lakh; here are the three streams that account for 70% of it; here are the three dashboards that accounted for 40% of the query CPU". The teams that build showback into their observability platform see ingestion drop 20-40% in the first quarter as service owners start to care about their own logging discipline; the teams that don't see the bill grow linearly with headcount forever. Showback also surfaces the engineer-hour axis honestly — when the report says "platform engineering spent 30 hours this month on cardinality firefighting caused by team X's request_id label", the conversation about ownership becomes much easier than the abstract one.
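Showback does not need a billing system to start; a spreadsheet-grade attribution is enough to change behaviour. A sketch with placeholder team numbers — pull the real ingest and query-CPU figures from your own metrics pipeline:

# Spreadsheet-grade showback: split the month's cluster bill across teams in
# proportion to ingested bytes and query-CPU seconds. All numbers are placeholders.
cluster_bill_rupees = 9_400_000
usage = {
    "payments": {"ingest_gb": 12_000, "query_cpu_s": 410_000},
    "checkout": {"ingest_gb": 7_000, "query_cpu_s": 650_000},
    "platform": {"ingest_gb": 2_000, "query_cpu_s": 90_000},
}
total_gb = sum(u["ingest_gb"] for u in usage.values())
total_cpu = sum(u["query_cpu_s"] for u in usage.values())
for team, u in usage.items():
    share = 0.5 * u["ingest_gb"] / total_gb + 0.5 * u["query_cpu_s"] / total_cpu
    print(f"{team:10s} ₹{share * cluster_bill_rupees / 1e5:5.1f} lakh  ({share:.0%} of the bill)")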
The other lever on the human side is the query language itself. Every backend's query language is a kind of cost-control mechanism. Elasticsearch's Query DSL is the most expressive — phrase, regex, nested aggregations, parent-child joins — and the cost is that engineers can express queries that nobody can predict the cost of. LogQL is intentionally narrow — label filter, line filter, parser, metric aggregation — and the constraint is that engineers can only express queries whose cost is bounded by the label scope. ClickHouse SQL is full SQL and is the most expressive of all, with the corresponding footgun risk of LIKE '%foo%' queries that scan everything. The cost-control discipline differs by backend: Elasticsearch needs query-time guards (timeouts, circuit breakers, and role-based restrictions via the _security APIs); LogQL self-limits via the language design; ClickHouse needs max_execution_time and max_rows_to_read settings plus query-pattern review at the team level. The combination — showback to make costs visible, plus a language that is either narrow by design or guarded by configuration — is what separates a healthy log-search culture from one where the bill is everyone's problem and nobody's responsibility.
# Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install pandas
python3 log_search_cost.py
# Expected: elasticsearch ~93 lakh/mo total, loki ~17 lakh/mo, clickhouse ~18 lakh/mo;
# disk dominates Elasticsearch; CPU dominates Loki on phrase-heavy mixes.
# Edit the Workload(...) call to model your own TB/day, retention, and query mix.
Where this leads next
- Log backends: Elasticsearch, Loki, ClickHouse — the architectural counterpart to this cost model. The index a backend builds at write time is the shape of the queries it can answer cheaply at read time, and the cost model is the rupee-denominated form of the same trade-off.
- Log shippers: Fluentd, Vector, Filebeat — the upstream half of the pipeline. The shipper decides what reaches the backend, which is the single largest lever on the disk-axis cost. Drop, sample, or fan-out at the shipper and the backend bill follows.
- Log sampling: head-based, tail-based — the explicit cost-control discipline. Sampling reduces volume by a factor (5-50x typical), which propagates through every term in the cost model. The trade-off is the sampling-induced loss of forensic detail on rare events.
- Cardinality: the master variable — the cost driver that crosses metrics, logs, and traces. Every log backend has a cardinality budget, with different failure modes; understanding the budget early is the difference between a stable cost curve and a step-function bill increase.
The chapters after this move into the query-language layer — LogQL, Elasticsearch DSL, ClickHouse log-functions — and the cost model from this chapter is the lens through which to read those query languages. A query that is cheap to write is not cheap to run; the cost model tells you which patterns to encourage and which to ban via review or query-time guards.
There is also a discipline question hiding inside the technology choice. The team that picks Elasticsearch is implicitly committing to a mapping-audit culture — somebody has to read the deploy diff, notice the new field, decide whether it should be indexed, mapped as flattened, or excluded entirely. The team that picks Loki is committing to a label-budget culture — somebody has to push back when a service owner adds request_id as a label "to make queries faster". The team that picks ClickHouse is committing to a schema-evolution culture — somebody has to design ALTER TABLE migrations and decide which fields go into the ORDER BY key. None of these cultures are free; they are different shapes of the same engineer-hour axis cost, and a team that adopts a backend without adopting the corresponding culture pays for the backend's failure modes instead of its strengths.
A practical implication of the four-axis framing: measure all four axes before changing any one of them. The team that runs pip install pandas && python3 log_search_cost.py against their own workload and gets a four-row table with their actual numbers has done more cost engineering than 90% of teams running observability platforms in 2026. The model above is a starting point; calibrate the coefficients to your cloud, your contract pricing, and your query mix, and you have a tool that turns "we should reduce our log bill" into a measurable, accountable engineering project.
The thread running through this chapter is that full-text search is expensive for structural reasons, not vendor reasons. The inverted index, the term dictionary, the segment merger, the cross-AZ replication — all four are paying for query expressivity. A team that needs full-text search has to pay for it; a team that does not needs to know that and pick the cheaper architecture. The mistake is the middle path — paying Elasticsearch's bill for queries that ClickHouse or Loki could answer in milliseconds, or paying Loki's CPU bill for phrase searches that Elasticsearch would handle in 100ms. The cost model is how you tell the difference, and the four axes are the discipline that keeps the conversation honest.
Teams that internalise this end up running a recurring "cost-axis review" — quarterly, with the SRE leads and the finance partner in the same room — that walks the four axes and identifies which has grown faster than the workload. The discipline is not glamorous, but it is the only thing that prevents the slow drift from ₹10 lakh/year to ₹2 crore/year. The engineer who types the query never sees the bill; the cost-axis review is how the team makes the bill visible to the engineers who shape it.
The pattern that mature teams converge on is to treat the cost model as a living document — checked into the same repo as the alert rules, updated when AWS pricing changes, when a new backend tier is introduced, when the workload mix shifts. The model becomes a small Python module that any engineer can import and run a what-if analysis on before shipping a change. A proposed dashboard ships with an "estimated cost: ₹X/month, dominant axis: CPU" annotation. A new log stream gets evaluated with compute_cost(workload + new_stream) - compute_cost(workload) before approval. The cost model stops being something the platform team owns alone and becomes a shared vocabulary across the engineering organisation. That shift — from "cost is the platform team's problem" to "cost is in the same code review as the feature" — is the cultural change that makes long-term log-search affordability possible. Without it, the bill grows with headcount; with it, the bill grows with traffic, which is the only growth rate any business can sustain.
References
- Pelkonen et al., Gorilla: A Fast, Scalable, In-Memory Time Series Database (VLDB 2015) — the foundational paper for time-series compression that established the cost-model thinking later applied to logs (bytes-per-record as the master variable).
- Elasticsearch — Tune for indexing speed and Tune for disk usage — the canonical references for the disk and CPU axes; reading both is the prerequisite for any serious Elasticsearch cost work.
- Grafana Labs — Loki Cardinality and the Cost of Labels (2023) — the cardinality discipline that determines whether Loki's cost model favours you or destroys you.
- ClickHouse — Index design and skip-indexes — the schema-design reference for getting ClickHouse's query-axis cost low; skipping the schema work means paying full-scan costs forever.
- Cindy Sridharan, Distributed Systems Observability (O'Reilly, 2018), Ch. 4 — the foundational chapter on log architecture that frames the storage-index trade-off as the central cost variable.
- AWS — EBS gp3 pricing and S3 standard pricing (ap-south-1) — the rupee-denominated coefficients used in the cost model; recalibrate when AWS changes the prices, which they do every 12-18 months.
- Charity Majors, Liz Fong-Jones, George Miranda, Observability Engineering (O'Reilly, 2022), Ch. 11 — the modern-era treatment of the cardinality-cost relationship; required reading for anyone responsible for an observability budget.
- Log backends: Elasticsearch, Loki, ClickHouse — internal chapter on the three backends' architectural choices; this cost model is the rupee-denominated form of the same trade-off.