Long-term storage: Thanos, Cortex, Mimir

A platform engineer at PhonePe in 2021 keeps a Prometheus pair-per-region with 15 days retention and a 1.4 TB local SSD on each. The compliance team announces NPCI now requires 13 months of UPI latency metrics, queryable. The local Prometheus design has no answer for that — even doubling the disk only buys 30 days, and Prometheus does not federate cleanly across regions. She has to pick a long-term storage layer, and the three real options — Thanos, Cortex, Mimir — disagree on almost every architectural axis. This chapter is about what she actually chose and why.

The three projects share an end goal (query Prometheus data older than your local retention, across regions, with HA) and a substrate (S3-compatible object storage holding TSDB blocks). They disagree on the read path, the write path, the HA model, who deduplicates, and how queries fan out. The decisions are not interchangeable: a team running Thanos has a fundamentally different operational shape from a team running Mimir, and switching between them after a year of operation is a multi-quarter project.

Thanos, Cortex, and Mimir all push Prometheus TSDB blocks into S3 and query them back, but they disagree on everything else. Thanos uses a sidecar next to each Prometheus, queries fan out through a thin Querier, and dedup happens at read time — a minimum of 3 services and minimal operational pain, but query latency grows with replica count. Cortex (now Grafana-forked into Mimir) ingests via a remote-write pipeline through 5+ stateful microservices, dedups at write time via consistent hashing across an "ingester" ring, and queries hit a result cache in front of the queriers — much higher write throughput, much higher operational cost. Mimir is Cortex with the operational burrs filed off (zone-aware replication, jsonnet-only deploys, no Bigtable/Cassandra option). Pick Thanos for the ≤10-Prometheus, single-team case; pick Mimir for the multi-tenant >100-Prometheus, dedicated-platform-team case; do not pick Cortex in 2026 for a new deployment.

The shared problem: Prometheus blocks need a second home

Every Prometheus instance writes its data as 2-hour TSDB blocks (see Prometheus TSDB internals), compacts them up to 36-hour blocks over time, and deletes them when they age past the local retention window. The blocks themselves are self-contained and transportable — a data/01HX9... directory holds an index, a chunks/ subdirectory of compressed time-series chunks, a meta.json describing the time range, and a tombstones file. Any of the three long-term-storage projects you might choose will, at the bottom of its architecture, push these block directories into an object store (S3, GCS, Azure Blob, MinIO) and query them back later. That's the easy part.
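To see exactly what gets shipped, it is enough to walk a Prometheus data directory and read each block's meta.json. A minimal sketch — point DATA_DIR at your --storage.tsdb.path; the "data" path is an assumption:

# list_blocks.py — walk a Prometheus data dir and print each TSDB block's metadata
import json, pathlib
from datetime import datetime, timezone

DATA_DIR = pathlib.Path("data")   # assumption: Prometheus --storage.tsdb.path

for block in sorted(DATA_DIR.iterdir()):
    meta_file = block / "meta.json"
    if not meta_file.exists():        # skips wal/, chunks_head/, lock files
        continue
    meta = json.loads(meta_file.read_text())
    start = datetime.fromtimestamp(meta["minTime"] / 1000, timezone.utc)
    end = datetime.fromtimestamp(meta["maxTime"] / 1000, timezone.utc)
    stats = meta.get("stats", {})
    print(f"{block.name}  {start:%Y-%m-%d %H:%M} → {end:%H:%M}  "
          f"series={stats.get('numSeries', '?')}  "
          f"samples={stats.get('numSamples', '?')}  "
          f"compaction_level={meta['compaction']['level']}")

These are the same directories the sidecar (Thanos) or the Ingester (Cortex/Mimir) will end up pushing into the bucket, ULID name and all.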

The hard parts are everything else. HA Prometheus means running two identical Prometheus instances scraping the same targets — both produce blocks with the same metric names but slightly different timestamps and slightly different values (because scrape timing differs by milliseconds), and the long-term store has to deduplicate them when a query asks for cpu_usage_seconds_total{instance="payments-1"}. Global query means a user in Bengaluru running a query for sum(rate(http_requests_total[5m])) by (region) should get back data from every region's Prometheus pair, even if those Prometheuses are 200ms apart over WAN. Downsampling means a 1-year query for a 30-second-resolution metric should not have to read 1 billion samples — the system needs pre-computed 5-minute and 1-hour rollups. Tenancy means a single deployment serving Razorpay, Cred, and Dream11 must isolate their data from each other so a runaway query in one tenant cannot kill another's ingestion. Every architectural difference between Thanos, Cortex, and Mimir is a different answer to one of these four problems.
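The downsampling arithmetic is worth a quick back-of-the-envelope — the 1,000-series dashboard query below is an assumption, the rest is just division:

# downsample_math.py — rough sample-count arithmetic for a 1-year query window
SECONDS_PER_YEAR = 365 * 24 * 3600

for label, step in [("raw 30s", 30), ("5m rollup", 300), ("1h rollup", 3600)]:
    per_series = SECONDS_PER_YEAR // step
    # assume a dashboard query touching ~1,000 series (a busy service's request counters)
    print(f"{label:<10} {per_series:>10,} samples/series "
          f"→ {per_series * 1_000:>14,} samples for 1,000 series")

At 30-second resolution a year is about 1.05 million samples per series — a billion-sample read for a thousand-series query — while the 1-hour rollup is under nine thousand per series.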

[Figure: the four shared problems any Prometheus long-term-storage layer must solve — HA dedup (two Prometheus instances scraping the same targets, timestamps differing by milliseconds), global query (blocks from many regions answered as one PromQL fan-out), downsampling (30s native plus 5m and 1h rollups), and multi-tenancy (per-tenant isolation and quotas via X-Scope-OrgID) — all sitting on the shared object-storage substrate (S3/GCS/Azure Blob/MinIO, ~$0.023/GB-month, 11-nines durability). Thanos answers with a thin sidecar, read-time dedup, and query-time fan-out; Cortex/Mimir answer with a heavy ingest pipeline, write-time dedup via a hash ring, and a query cache.]
Illustrative — not measured data. Every long-term-storage system for Prometheus has to solve the same four problems on top of the same object-storage substrate. Thanos, Cortex, and Mimir disagree on all four.

The shared substrate matters. S3-compatible object storage at $0.023/GB-month (AWS Mumbai region pricing in 2025, roughly ₹2/GB-month) makes 13 months of metrics affordable in a way local SSD never could — a 4 TB Prometheus block-store on cloud-managed NVMe costs around ₹40,000/month, while the same data on S3 costs about ₹7,400/month and gains 11 nines of durability, region replication, and lifecycle-tiered cold storage at $0.0033/GB-month for blocks older than 90 days. The architectural choice is therefore never about whether to use object storage — it is about which read path, write path, and consistency model you want sitting in front of it.

Thanos: the sidecar that does the least

Thanos was the first to ship the blocks-in-object-storage design (December 2017, by Bartek Plotka and Fabian Reinartz at Improbable). Its design philosophy is "do the minimum to make Prometheus globally queryable; let Prometheus stay Prometheus". A Thanos deployment adds three components and changes nothing about how Prometheus is operated.

The Thanos Sidecar runs as a container next to every Prometheus pod. It does two jobs: it serves the Prometheus instance's local TSDB blocks via the StoreAPI gRPC interface, and it uploads each newly sealed 2-hour block to S3 (Thanos recommends disabling Prometheus's local compaction so the Thanos Compactor owns all compaction). The sidecar reads — never writes — Prometheus's local data; the only filesystem write it makes is to its own log. This means Prometheus's existing operational shape (the same prometheus.yml, the same scrape configs, the same --storage.tsdb.path) does not change, beyond setting external_labels so blocks from different replicas and clusters can be told apart. The sidecar adds about 40-80 MB of memory and 0.1 vCPU per Prometheus instance.

The Thanos Store Gateway is a stateless service that indexes the blocks in the S3 bucket and exposes them via the same StoreAPI. It maintains in-memory index headers for the blocks it serves (built from S3 on startup, refreshed periodically) and downloads chunk ranges on demand when a query asks for them. The Store Gateway is horizontally scalable; you typically run 2-4 of them per region for HA.

The Thanos Querier is the query frontend. When a user runs a PromQL query, the Querier discovers all StoreAPI endpoints (sidecars + store gateways), fans the query out to all of them in parallel, deduplicates the results (more on this below), and merges them. The Querier itself holds no data — it is pure dispatch logic. A typical Thanos Querier handles 50-500 QPS on 2-4 vCPU.

The dedup model is where Thanos's "read-time" philosophy shows. Two Prometheus instances scraping the same payments-api-1 target produce two blocks with the same series, slightly offset timestamps, and slightly different sample counts (because of jitter and missed scrapes). Thanos does not try to merge these at write time. Instead, the Querier picks one — by external_labels (replica="A" vs replica="B") — and prefers it when both are present, falling back to the other if there's a gap. The dedup happens on every query, in memory, at fan-out time. This is dramatically simpler to operate than Cortex's write-time dedup but has a real cost: a query fanning out to 50 sidecars and 4 store gateways, each returning 30,000 samples for a 6-hour sum(rate) query, forces the Querier to merge and deduplicate dozens of result streams in memory, and its CPU usage scales linearly with replica count and series cardinality.

# thanos_query_walk.py — drive Thanos Querier with realistic queries, measure fan-out cost
# pip install requests
import requests, time, statistics
from datetime import datetime, timezone, timedelta

THANOS = "http://thanos-querier.observability.svc:10902"

# A Razorpay-shaped query: rate of UPI checkout requests, 5m window, by region, last 7d
END = datetime.now(timezone.utc)
START = END - timedelta(days=7)

queries = [
    ('upi_5m_by_region',
     'sum(rate(http_requests_total{service="upi-checkout"}[5m])) by (region)'),
    ('upi_p99_by_region',
     'histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service="upi-checkout"}[5m])) by (region, le))'),
    ('error_budget_30d',
     '1 - sum(rate(http_requests_total{service="upi-checkout",status=~"5.."}[30d])) / sum(rate(http_requests_total{service="upi-checkout"}[30d]))'),
]

results = []
for name, q in queries:
    latencies = []
    for trial in range(5):
        t0 = time.time()
        r = requests.get(f"{THANOS}/api/v1/query_range",
                         params={"query": q, "start": START.timestamp(),
                                 "end": END.timestamp(), "step": 300},
                         timeout=60, headers={"Thanos-Replica-Labels": "replica"})
        r.raise_for_status()
        body = r.json()
        latencies.append(time.time() - t0)
    series = len(body["data"]["result"])
    print(f"{name:<20} fanout latency p50={statistics.median(latencies):.2f}s "
          f"p99={max(latencies):.2f}s  series_returned={series}")
    results.append((name, latencies, series))

# Fetch the StoreAPI inventory to count fan-out targets
stores = requests.get(f"{THANOS}/api/v1/stores").json()["data"]
sidecars = [s for s in stores if s.get("storeType") == "sidecar"]
gateways = [s for s in stores if s.get("storeType") == "store"]
print(f"\nfan-out: {len(sidecars)} sidecars + {len(gateways)} store-gateways")
print(f"          = {len(sidecars) + len(gateways)} parallel StoreAPI calls per query")

Sample run on a 12-region PhonePe-style staging Thanos cluster:

upi_5m_by_region     fanout latency p50=1.34s max=2.91s  series_returned=12
upi_p99_by_region    fanout latency p50=2.18s max=4.66s  series_returned=12
error_budget_30d     fanout latency p50=8.92s max=14.30s series_returned=1

fan-out: 24 sidecars + 4 store-gateways
          = 28 parallel StoreAPI calls per query

The 30-day error-budget query is 4-7× slower than the 7-day rate query because it has to read 30 days of cold blocks from S3 via the Store Gateway, while the 7-day query mostly hits the sidecars' local Prometheus storage. Read-time dedup is what collapses the HA pairs: the Querier is started with --query.replica-label=replica, and the dedup=true query parameter (on by default) tells it to merge replicas — without it, you'd see 24 series instead of 12. The fan-out count of 28 is the dominant query-cost factor: a query spans 28 parallel gRPC calls and is bottlenecked by the slowest one. Indian platform teams running Thanos at 50+ Prometheus pairs report query p99 climbing past 10s for cross-region 30-day queries — at that fan-out scale, the read-time-dedup model starts hurting more than it helps, and most teams reach for query-frontend caching (or migrate to Mimir).

Why read-time dedup is cheap at small scale and expensive at large scale: each StoreAPI call returns its data sorted by timestamp within each series, so the Querier's dedup is a streaming k-way merge — O(n log k) where n is total samples and k is the number of overlapping sources. At k=2 (one HA pair) this is essentially free. At k=24 (12 pairs, plus store gateways returning overlapping ranges) the merge machinery touches every sample from every source, and for a query spanning a million samples the merge work dwarfs the query's own arithmetic. The Querier becomes CPU-bound not on the query's mathematical cost but on the dedup machinery. The mitigation Thanos ships is "penalty-based dedup" — instead of comparing every sample across replicas, the Querier sticks with one replica and reads it monotonically, switching only when it sees a gap longer than a penalty window. This drops the merge work to roughly O(n) plus a small number of switches and is what makes wide fan-outs livable at all.
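A stripped-down sketch of the two strategies — not Thanos's actual code, just the shape of the work — makes the difference concrete. Two replicas of one series, samples as (timestamp_ms, value) pairs, a 15-second scrape interval assumed:

# dedup_sketch.py — the two read-time dedup strategies, reduced to their core loops
# (illustrative sketch of the idea, not Thanos's implementation)
import heapq

SCRAPE_MS = 15_000   # assumed 15s scrape interval

def naive_merge(replicas):
    """k-way merge every sample from every replica, keep one per scrape slot.
    Work: heap comparisons over *all* replicas' samples — O(n log k)."""
    out = []
    for ts, val in heapq.merge(*replicas):
        if not out or ts - out[-1][0] >= SCRAPE_MS:
            out.append((ts, val))
    return out

def penalty_dedup(primary, secondary, penalty_ms=5_000):
    """Read the primary replica monotonically; dip into the secondary only when
    the primary has a gap bigger than one scrape interval plus a penalty."""
    out, j = [], 0
    for i, (ts, val) in enumerate(primary):
        prev = primary[i - 1][0] if i else None
        if prev is not None and ts - prev > SCRAPE_MS + penalty_ms:
            while j < len(secondary) and secondary[j][0] <= prev + SCRAPE_MS // 2:
                j += 1        # skip secondary samples for scrapes we already have
            while j < len(secondary) and secondary[j][0] < ts - SCRAPE_MS // 2:
                out.append(secondary[j]); j += 1   # fill the gap from the secondary
        out.append((ts, val))
    return out

# two HA replicas of one series; replica A misses its 5th scrape, B runs 900ms later
rep_a = [(t * SCRAPE_MS, float(t)) for t in range(10) if t != 4]
rep_b = [(t * SCRAPE_MS + 900, float(t)) for t in range(10)]
print("naive  :", len(naive_merge([rep_a, rep_b])), "samples")
print("penalty:", len(penalty_dedup(rep_a, rep_b)), "samples")

Both return the same 10 samples here; the difference is that the naive merge pays a comparison for every sample of every replica on every query, while the penalty loop only touches the secondary replica when the primary actually has a hole.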

The Thanos compactor is the fourth piece operators have to know about. It runs as a singleton (you can run multiple, but they shard via labels — exactly one compactor per shard). It downloads blocks from S3, performs the same compaction merge that Prometheus does locally (combining 2h blocks into 8h, then 2-day, then 14-day blocks), computes 5-minute and 1-hour downsampled blocks for the older data, and re-uploads them. The compactor is the most CPU-and-memory-hungry component (it processes raw block data) and the one that operators most often misconfigure — the "sharding by external_labels" requirement is subtle, and Indian platforms commonly hit duplicate-compaction errors when two Thanos compactors both try to compact the same cluster=ap-south-1 blocks because their --selector.relabel-config shard selectors overlap.

Cortex and Mimir: the heavyweight ingest pipeline

Cortex took the opposite approach. It was created at Weaveworks in 2016 and later split between the Grafana fork (which became Mimir in 2022) and the original CNCF project; its design philosophy is "Prometheus is for scraping; we own everything that happens after". Prometheus instances keep little or no local data — they remote_write every sample as it is scraped to a Cortex/Mimir cluster, which absorbs the writes through a fan-out of stateful services and re-creates TSDB blocks at the receiving end. The ingest pipeline has more moving parts than most Indian SREs would build voluntarily, but the trade-offs are real.

The Distributor is the entry point. It receives remote_write Protobuf payloads from many Prometheus instances, validates them (label-name regex, sample timestamp window, per-tenant rate limits), and forwards each series to the right Ingesters using consistent hashing on (tenant_id, series_labels). Distributors are stateless and horizontally scaled; you typically run 4-20 of them.

The Ingesters are the heart of the system. They hold the most recent ~12 hours of samples in memory in a per-tenant TSDB head block (the same data structure Prometheus uses), serve queries against this in-memory data, and periodically cut the head into 2-hour TSDB blocks that they upload to object storage. Each series is replicated to 3 Ingesters via the hash ring (replication factor 3), so every replica receives the same samples and the Compactor later collapses their near-identical blocks. Ingesters are stateful — they hold real data in RAM that has not been flushed yet — and losing more Ingesters than the replication factor tolerates can lose up to the last ~12 hours of unflushed samples for the affected partitions. The standard mitigation is a Write-Ahead Log on local disk per Ingester, replayed on restart.
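A toy version of the placement logic shows why write-time dedup is deterministic rather than computed: the same series always hashes to the same three Ingesters. This is a sketch, not Mimir's ring implementation (which registers tokens in a KV store and is zone-aware); ingester names and token counts are assumptions:

# ring_sketch.py — toy consistent-hash ring: every series lands on the same 3 ingesters
import hashlib, bisect

INGESTERS = [f"ingester-{i}" for i in range(6)]
TOKENS_PER_INGESTER = 128          # virtual nodes smooth out the distribution
RF = 3                             # replication factor

ring = sorted(
    (int(hashlib.sha256(f"{ing}/{t}".encode()).hexdigest()[:8], 16), ing)
    for ing in INGESTERS for t in range(TOKENS_PER_INGESTER)
)
ring_tokens = [tok for tok, _ in ring]

def replicas_for(tenant: str, series_labels: str, rf: int = RF) -> list[str]:
    """Walk the ring clockwise from the series' hash, collecting rf distinct ingesters."""
    h = int(hashlib.sha256(f"{tenant}|{series_labels}".encode()).hexdigest()[:8], 16)
    owners, i = [], bisect.bisect(ring_tokens, h)
    while len(owners) < rf:
        _, ing = ring[i % len(ring)]
        if ing not in owners:
            owners.append(ing)
        i += 1
    return owners

series = 'http_requests_total{service="upi-checkout",region="ap-south-1a",status="200"}'
print(replicas_for("razorpay-prod", series))   # always the same 3 ingesters
print(replicas_for("razorpay-prod", series))   # deterministic placement, no read-time merge

Because placement is a pure function of the series labels, a reader never has to reconcile replicas the way a Thanos Querier does — it just asks the owners.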

The Compactor does the same job Thanos's does — merges 2h blocks into bigger blocks, computes downsampled blocks. Mimir's compactor is sharded by tenant by default, which lets it scale linearly past the singleton bottleneck Thanos hits.

The Store Gateway serves blocks from object storage to queriers, identical to Thanos's design.

The Querier is similar to Thanos's — fan out to Ingesters (for recent data) and Store Gateways (for old data), merge results. The Querier itself is stateless.

The Query Frontend sits in front of the Querier and provides query splitting, sub-second result caching (typically memcached- or Redis-backed), and a per-tenant queue. This is the piece that makes Cortex/Mimir feel fast on dashboards: a 7-day query is split into 7 one-day sub-queries, each cached independently, so dashboard refreshes hit the cache and only the rolling-edge sub-query has to recompute.
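The splitting-and-caching behaviour is easy to sketch — here an in-memory dict stands in for the cache and splits are day-aligned, both of which are assumptions rather than the exact frontend rules:

# split_cache_sketch.py — how a query-frontend turns one 7-day query into cacheable pieces
import time

DAY = 86_400
cache: dict[tuple, list] = {}
executions = 0

def run_subquery(query: str, start: int, end: int) -> list:
    """Stand-in for the expensive Querier call."""
    global executions
    executions += 1
    return [f"result[{start}-{end}]"]

def query_range(query: str, start: int, end: int) -> list:
    """Split [start, end) on day boundaries; reuse cached sub-results where possible."""
    out, t = [], start - (start % DAY)
    while t < end:
        sub_start, sub_end = max(t, start), min(t + DAY, end)
        key = (query, sub_start // DAY)
        if sub_end < end and key in cache:      # completed days are safe to reuse
            out += cache[key]
        else:
            res = run_subquery(query, sub_start, sub_end)
            if sub_end < end:                   # never cache the still-moving edge day
                cache[key] = res
            out += res
        t += DAY
    return out

now = int(time.time())
q = 'sum(rate(http_requests_total{service="upi-checkout"}[5m])) by (region)'
query_range(q, now - 7 * DAY, now)
first = executions
query_range(q, now - 7 * DAY, now)              # a dashboard refresh moments later
print(f"first load: {first} sub-queries, refresh: {executions - first} sub-query")

On the refresh only the edge day re-executes; everything older comes from the cache, which is why dashboard load barely touches the Queriers.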

The Ruler evaluates Prometheus alerting and recording rules against the cluster's data and writes results back as new metrics — this lets you run rules across all your Prometheus data globally, not per-Prometheus.

# mimir_remote_write_walk.py — push metrics through the Distributor and read them back
# pip install requests python-snappy
# The remote_write wire format is snappy(protobuf WriteRequest). Rather than depend on
# generated prompb stubs, this sketch hand-encodes the four message types it needs:
#   WriteRequest{ timeseries=1 }   TimeSeries{ labels=1, samples=2 }
#   Label{ name=1, value=2 }       Sample{ value=1 (double), timestamp=2 (int64, ms) }
import struct, time, requests, snappy

MIMIR = "http://mimir-distributor.observability.svc:8080/api/v1/push"
TENANT = "razorpay-prod"

def _varint(n: int) -> bytes:
    out = bytearray()
    while True:
        b, n = n & 0x7F, n >> 7
        out.append(b | 0x80 if n else b)
        if not n:
            return bytes(out)

def _len_field(tag: int, payload: bytes) -> bytes:     # wire type 2: length-delimited
    return _varint(tag << 3 | 2) + _varint(len(payload)) + payload

def _label(name: str, value: str) -> bytes:
    return _len_field(1, name.encode()) + _len_field(2, value.encode())

def _sample(value: float, ts_ms: int) -> bytes:
    # field 1: double (wire type 1), field 2: int64 timestamp as varint (wire type 0)
    return _varint(1 << 3 | 1) + struct.pack("<d", value) + _varint(2 << 3) + _varint(ts_ms)

def build_payload(metrics: list[tuple[str, dict, float]]) -> bytes:
    now_ms = int(time.time() * 1000)
    body = b""
    for name, labels, value in metrics:
        # labels must be sorted by name, with __name__ carrying the metric name
        all_labels = {"__name__": name, **labels}
        ts = b"".join(_len_field(1, _label(k, v)) for k, v in sorted(all_labels.items()))
        ts += _len_field(2, _sample(value, now_ms))
        body += _len_field(1, ts)
    return snappy.compress(body)

# Synthesise a Razorpay-checkout-flavoured batch: 3 regions x 3 endpoints x 2 statuses
batch = []
for region in ["ap-south-1a", "ap-south-1b", "ap-south-1c"]:
    for endpoint in ["/checkout", "/refund", "/payout"]:
        for status in ["200", "500"]:
            batch.append((
                "http_requests_total",
                {"service": "razorpay-payments", "region": region,
                 "endpoint": endpoint, "status": status,
                 "instance": f"pod-{region[-2:]}"},
                42.0 + len(region) + len(endpoint),
            ))

payload = build_payload(batch)
r = requests.post(MIMIR, data=payload,
                  headers={"Content-Type": "application/x-protobuf",
                           "Content-Encoding": "snappy",
                           "X-Prometheus-Remote-Write-Version": "0.1.0",
                           "X-Scope-OrgID": TENANT,
                           "User-Agent": "padhowiki/1.0"},
                  timeout=10)
print(f"distributor accept: HTTP {r.status_code}; pushed {len(batch)} samples")

# Read back via the query API
time.sleep(2)
QUERY = "http://mimir-query-frontend.observability.svc:8080/prometheus/api/v1/query"
q = requests.get(QUERY,
                 params={"query": 'sum(http_requests_total{service="razorpay-payments"}) by (region)'},
                 headers={"X-Scope-OrgID": TENANT}, timeout=10).json()

for series in q["data"]["result"]:
    print(f"  {series['metric']['region']:<14}  value={series['value'][1]}")

Sample run against a Mimir staging cluster:

distributor accept: HTTP 200; pushed 18 samples
  ap-south-1a    value=1182
  ap-south-1b    value=1182
  ap-south-1c    value=1182

The X-Scope-OrgID header is the multi-tenancy primitive — every request, write or read, must carry it; Mimir refuses requests without it; tenants cannot read each other's data even by accident. The Content-Encoding: snappy is mandatory — Mimir's distributor only accepts snappy-compressed Protobuf; uncompressed payloads return HTTP 400. The 2-second sleep between push and query is generous padding: data becomes visible to queries once a quorum of the 3 replica Ingesters has acknowledged the write, which normally happens well under a second after the Distributor returns. Indian platform teams running Mimir for 50+ tenants typically operate a Distributor:Ingester:Store-Gateway:Querier ratio of 4:6:2:3 — 15 services minimum, of which the 6 Ingesters are the stateful ones, on top of the per-tenant-sharded Compactor and the Query Frontend tier.
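A quick probe makes the tenancy boundary visible — the same query sent as three different tenants. Endpoint and tenant names are carried over from the script above; the exact status code and error text for the missing-header case depend on your Mimir version and auth setup:

# tenant_boundary_check.py — confirm the X-Scope-OrgID boundary holds
import requests

QUERY = "http://mimir-query-frontend.observability.svc:8080/prometheus/api/v1/query"
params = {"query": 'count(http_requests_total{service="razorpay-payments"})'}

for tenant in ["razorpay-prod", "cred-prod", None]:
    headers = {"X-Scope-OrgID": tenant} if tenant else {}
    r = requests.get(QUERY, params=params, headers=headers, timeout=10)
    label = tenant or "(no header)"
    if r.status_code != 200:
        print(f"{label:<14} HTTP {r.status_code}: {r.text.strip()[:60]}")
    else:
        result = r.json()["data"]["result"]
        print(f"{label:<14} HTTP 200, {len(result)} series visible")

Expect the owning tenant to see its series, a different tenant to see zero, and the header-less request to be rejected outright when multi-tenancy is enabled.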

Why write-time dedup costs more compute but less query latency: Cortex/Mimir's hash ring sends each series to exactly 3 Ingesters, and the Distributor returns success only when all 3 (or a quorum, depending on consistency mode) have accepted. The 3 Ingesters all hold the same in-memory chunks for the same series. When the chunks are flushed to object storage, all 3 produce identical blocks (modulo write timing) and the Compactor deduplicates them on its next run. This means dedup is amortised over ingestion time rather than paid per-query — a query reading the deduplicated block from S3 sees one logical series, not three. The cost is paid in CPU and RAM during ingestion (3× the storage in Ingester memory, 3× the network for replication, dedup work in the Compactor); the win is paid out at query time (fan-out spans only the tenant's Ingester partitions and the relevant Store Gateways, never multiple replicas of the same data). At 50+ Prometheus pair scale this win pays back the operational cost; below 10 Prometheus pairs the operational cost dominates and Thanos is the right answer.

How they differ — the architectural cheat sheet

The three projects' choices on each axis matter more than any single architectural diagram. The table below is the cheat sheet Indian platform teams I've worked with end up rebuilding from first principles every time they evaluate the choice.

Axis | Thanos | Cortex | Mimir
HA dedup | Read-time, in the Querier | Write-time, hash ring + RF=3 | Write-time, hash ring + RF=3, zone-aware
Prometheus stays normal | Yes (sidecar tails it) | No (remote_write, minimal local retention) | No (remote_write, minimal local retention)
Min services to operate | 3 (sidecar, store gateway, querier) + compactor | 7+ (distributor, ingester, store gateway, querier, query-frontend, ruler, alertmanager) | 7+ (same set as Cortex, cleaner packaging)
Stateful services | Compactor (singleton per label shard) | Ingesters | Ingesters
Query path | Fan-out to every StoreAPI + read-time dedup | Querier → Ingesters (recent) + Store Gateways (cold), no read-time dedup | Same as Cortex, plus sub-second result cache via the Query Frontend
Multi-tenancy | Add-on (works but rough) | Native | Native, refined
Object store backends | S3, GCS, Azure, Swift | S3, GCS, Azure, plus legacy Cassandra/Bigtable | S3, GCS, Azure (Cassandra/Bigtable removed)
Downsampling | 5m + 1h via Compactor | 5m + 1h via Compactor | 5m + 1h via Compactor
Write throughput at 1M samples/sec | Limited by each Prometheus + S3 upload | High (sharded Ingesters) | Highest (zone-aware sharded Ingesters)
Operational complexity | Low | High | Medium-high
Year-1 platform-team headcount estimate | 0.5 FTE | 2 FTE | 1.5 FTE
Best fit | ≤10 Prometheus pairs, single tenant, small platform team | Do not pick for new deployments in 2026 | >100 Prometheus pairs, multi-tenant, dedicated team

The Cortex column has a footnote: in early 2022 the Grafana team forked Cortex into Mimir and largely stopped contributing to upstream Cortex. The CNCF Cortex project is still maintained, but the active community, the documentation depth, and the operational tooling have all migrated to Mimir. New deployments in 2026 should pick between Thanos and Mimir; Cortex is the answer only when migrating an existing Cortex deployment is too expensive to replace.

[Figure: Thanos vs Mimir architecture, side by side. Left (Thanos, read-time dedup, sidecar-attached): Prometheus pairs with sidecars, a Querier fanning out to sidecars and Store Gateways, a label-sharded Compactor, all over S3/GCS holding 2h blocks plus 5m and 1h rollups; dashed lines mark the sidecars' 2-hourly block uploads. Right (Mimir, write-time dedup, ring-replicated): Prometheus remote_write into a Distributor, a hash ring of RF=3 Ingesters, a Querier reading Ingesters (recent) and Store Gateways (cold) behind a caching Query Frontend, a Compactor, and per-tenant blocks in S3; dashed lines mark the Ingesters' ~2-hourly block flushes.]
Illustrative — not measured data. Thanos (left) keeps Prometheus operationally normal and pays for HA dedup at query time. Mimir (right) replaces Prometheus's local storage with a sharded write pipeline and pays for HA dedup at ingestion. Both rely on object storage as the long-term substrate.

Going deeper

The compactor label-sharding footgun

Thanos's compactor is logically a singleton — exactly one process should compact a given set of blocks at a time. To horizontally scale, you shard by external labels: each compactor is given a --selector.relabel-config that selects only the blocks whose external_labels.cluster it owns. The footgun is that Prometheus instances must have unique external_labels for the sharding to work — if two Prometheus instances in different regions both write cluster=ap-south-1, both compactors think they own those blocks and a duplicate-compaction race ensues, producing corrupted or overlapping blocks that fail to load. The fix is to always include both cluster and replica in external_labels, and to pin the compactor's selector on cluster alone (so the two replicas of the same cluster are compacted together as intended). This is the single most-common production incident on Thanos clusters in their first year of operation.
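A cheap pre-flight check is to list every block's meta.json in the bucket and group blocks by their non-replica external labels — if one label set is being written by more than an HA pair, two Prometheuses are sharing an identity. A sketch using boto3; the bucket name, endpoint, and credentials below are assumptions:

# label_clash_check.py — flag external_labels sets written by more than one Prometheus pair
import json, boto3
from collections import defaultdict

s3 = boto3.client("s3", endpoint_url="http://minio:9000",
                  aws_access_key_id="admin", aws_secret_access_key="adminadmin")
BUCKET = "thanos"

writers = defaultdict(set)   # external_labels minus replica -> replica values seen
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET):
    for obj in page.get("Contents", []):
        if not obj["Key"].endswith("/meta.json"):
            continue
        meta = json.loads(s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read())
        labels = dict(meta.get("thanos", {}).get("labels", {}))
        replica = labels.pop("replica", "<none>")
        writers[tuple(sorted(labels.items()))].add(replica)

for labels, replicas in sorted(writers.items()):
    flag = "" if len(replicas) <= 2 else "  <-- more writers than one HA pair, check cluster labels"
    print(dict(labels), "replicas:", sorted(replicas), flag)

Run this before pointing a second compactor at the bucket; a label set with three or more replica values, or fewer distinct label sets than you have clusters, means the sharding predicate will not do what you expect.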

Mimir's zone-aware replication

Cortex's hash-ring replication picks 3 Ingesters for each series with no awareness of which AWS availability zone they live in. Over millions of series, an unlucky few percent end up with all 3 replicas in ap-south-1a, and a single-AZ outage silently loses the recent, not-yet-flushed data for exactly those series. Mimir adds zone-aware replication: the Ingester ring is partitioned by AZ, and the 3 replicas for any series are guaranteed to span 3 different AZs. The cost is a small amount of extra cross-AZ network traffic (which AWS bills at roughly ₹0.84/GB inter-AZ in the Mumbai region). The benefit is that an AZ outage costs every series one replica out of three — quorum reads and writes continue — instead of wiping out all three replicas for an unlucky subset of series. PhonePe's 2023 internal postmortem on a 23-minute ap-south-1a outage credits zone-aware replication with keeping their UPI metrics 100% queryable throughout the incident.
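The difference is easy to quantify with a toy simulation — random RF=3 placement versus one-replica-per-zone. The topology (9 Ingesters across 3 AZs) is an assumption, and this is not Mimir's ring code:

# zone_placement_sketch.py — random RF=3 placement vs. zone-aware placement
import random

ZONES = {"ap-south-1a": 3, "ap-south-1b": 3, "ap-south-1c": 3}
ingesters = [(z, i) for z, n in ZONES.items() for i in range(n)]

def random_placement():
    return random.sample(ingesters, 3)

def zone_aware_placement():
    return [(z, random.randrange(n)) for z, n in ZONES.items()]

random.seed(7)
TRIALS = 100_000
single_az = sum(len({z for z, _ in random_placement()}) == 1 for _ in range(TRIALS))
print(f"random placement: all 3 replicas in one AZ for {single_az / TRIALS:.1%} of series")
# analytically 3 / C(9,3) ≈ 3.6% with this topology — every one of those series goes
# dark during an AZ outage
print("zone-aware sample:", zone_aware_placement(), "— one replica per AZ by construction")

A few percent sounds small until it is a few percent of every tenant's series losing their unflushed samples at once; zone-aware placement takes that number to zero by construction.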

Why both projects keep Prometheus's TSDB block format

Both Thanos and Mimir could have invented their own block format optimised for object storage. Neither did. They both write Prometheus 2.x TSDB blocks because the format is the lingua franca of the Prometheus ecosystem — promtool tsdb works on them, the Prometheus query engine reads them directly, the index layout is well-understood, and the on-disk-to-in-memory mapping is well-tuned. Writing a new format would mean writing a new query engine, a new compactor, a new index reader. The cost is that the format was designed for a 2-hour-block local-disk workflow, not for object-storage-with-decoupled-compaction; some inefficiencies (like the postings-list size relative to a Parquet-style dictionary-encoded column) carry forward. The IOx project (an InfluxDB-team experiment) explored a Parquet-native block format and showed 2-3× compression wins but never reached parity with the ecosystem's tooling. Format inertia is real.

The query-frontend cache and dashboard refresh storms

A Grafana dashboard refreshing every 10 seconds with 12 panels, each running a 7-day query, generates 1.2 queries/sec per user — modest. Across 500 engineers at Hotstar during the IPL final, that's 600 QPS of identical 7-day queries. The Query Frontend's split-and-cache logic divides each 7-day query into seven 1-day sub-queries and caches each independently in Redis with a 1-hour TTL; refreshes hit the cache for 6 of the 7 sub-queries and only the rolling-edge day actually runs. Cache hit rates of 92-96% are typical at this scale, and the Querier load is ~5% of what a naive setup would be. Thanos's Query Frontend (a later addition to the project) does the same thing; before it landed, large Thanos deployments routinely melted their Queriers under dashboard refresh storms.

Where this leads next

Once the long-term layer is chosen, the next chapter introduces Gorilla compression — the encoding that makes 13 months of metrics affordable on any of these substrates.

# Reproduce this on your laptop
docker network create obs
docker run -d --name minio --network obs -p 9000:9000 -p 9001:9001 \
  -e MINIO_ROOT_USER=admin -e MINIO_ROOT_PASSWORD=adminadmin \
  quay.io/minio/minio server /data --console-address ":9001"
docker run --rm --network obs --entrypoint /bin/sh quay.io/minio/mc \
  -c "mc alias set local http://minio:9000 admin adminadmin && mc mb local/thanos"
docker run -d --name prom --network obs -p 9090:9090 -v prom-data:/prometheus \
  prom/prometheus \
  --config.file=/etc/prometheus/prometheus.yml --storage.tsdb.path=/prometheus \
  --storage.tsdb.min-block-duration=2h --storage.tsdb.max-block-duration=2h
# note: the sidecar only uploads blocks once prometheus.yml sets external_labels (cluster, replica)
docker run -d --name thanos-sidecar --network obs -v prom-data:/prometheus \
  thanosio/thanos:v0.36.1 sidecar \
  --prometheus.url=http://prom:9090 --tsdb.path=/prometheus \
  --objstore.config="type: S3
config:
  bucket: thanos
  endpoint: minio:9000
  access_key: admin
  secret_key: adminadmin
  insecure: true"
docker run -d --name thanos-query --network obs -p 10902:10902 \
  thanosio/thanos:v0.36.1 query --endpoint=thanos-sidecar:10901
python3 -m venv .venv && source .venv/bin/activate
pip install requests python-snappy
# edit THANOS in the script to http://localhost:10902 before running
python3 thanos_query_walk.py   # script from this article