Long-term storage: Thanos, Cortex, Mimir

A platform engineer at PhonePe in 2021 keeps a Prometheus pair-per-region with 15 days retention and a 1.4 TB local SSD on each. The compliance team announces NPCI now requires 13 months of UPI latency metrics, queryable. The local Prometheus design has no answer for that — even doubling the disk only buys 30 days, and Prometheus does not federate cleanly across regions. She has to pick a long-term storage layer, and the three real options — Thanos, Cortex, Mimir — disagree on almost every architectural axis. This chapter is about what she actually chose and why.

The three projects share an end goal (query Prometheus data older than your local retention, across regions, with HA) and a substrate (S3-compatible object storage holding TSDB blocks). They disagree on the read path, the write path, the HA model, who deduplicates, and how queries fan out. The decisions are not interchangeable: a team running Thanos has a fundamentally different operational shape from a team running Mimir, and switching between them after a year of operation is a multi-quarter project.

Thanos, Cortex, and Mimir all push Prometheus TSDB blocks into S3 and query them back, but they disagree on everything else. Thanos uses a sidecar next to each Prometheus, queries fan out through a thin Querier, and dedup happens at read time — a minimum of 3 services and minimal operational pain, but query latency grows with replica count. Cortex (now Grafana-forked into Mimir) ingests via a remote-write pipeline through 5+ stateful microservices, dedups at write time via consistent hashing across an "ingester" ring, and queries hit a result cache in front of the queriers — much higher write throughput, much higher operational cost. Mimir is Cortex with the operational burrs filed off (zone-aware replication, jsonnet-only deploys, no Bigtable/Cassandra option). Pick Thanos for the ≤10-Prometheus, single-team case; pick Mimir for the multi-tenant >100-Prometheus, dedicated-platform-team case; do not pick Cortex in 2026 for a new deployment.

The shared problem: Prometheus blocks need a second home

Every Prometheus instance writes its data as 2-hour TSDB blocks (see Prometheus TSDB internals), compacts them up to 36-hour blocks over time, and deletes them when they age past the local retention window. The blocks themselves are self-contained and transportable — a data/01HX9... directory holds an index, a chunks/ subdirectory of compressed time-series chunks, a meta.json describing the time range, and a tombstones file. Any of the three long-term-storage projects you might choose will, at the bottom of its architecture, push these block directories into an object store (S3, GCS, Azure Blob, MinIO) and query them back later. That's the easy part.
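To see exactly what gets shipped, it is enough to walk a Prometheus data directory and read each block's meta.json. A minimal sketch — point DATA_DIR at your --storage.tsdb.path; the "data" path is an assumption:

# list_blocks.py — walk a Prometheus data dir and print each TSDB block's metadata
import json, pathlib
from datetime import datetime, timezone

DATA_DIR = pathlib.Path("data")   # assumption: Prometheus --storage.tsdb.path

for block in sorted(DATA_DIR.iterdir()):
    meta_file = block / "meta.json"
    if not meta_file.exists():        # skips wal/, chunks_head/, lock files
        continue
    meta = json.loads(meta_file.read_text())
    start = datetime.fromtimestamp(meta["minTime"] / 1000, timezone.utc)
    end = datetime.fromtimestamp(meta["maxTime"] / 1000, timezone.utc)
    stats = meta.get("stats", {})
    print(f"{block.name}  {start:%Y-%m-%d %H:%M} → {end:%H:%M}  "
          f"series={stats.get('numSeries', '?')}  "
          f"samples={stats.get('numSamples', '?')}  "
          f"compaction_level={meta['compaction']['level']}")

These are the same directories the sidecar (Thanos) or the Ingester (Cortex/Mimir) will end up pushing into the bucket, ULID name and all.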

The hard parts are everything else. HA Prometheus means running two identical Prometheus instances scraping the same targets — both produce blocks with the same metric names but slightly different timestamps and slightly different values (because scrape timing differs by milliseconds), and the long-term store has to deduplicate them when a query asks for cpu_usage_seconds_total{instance="payments-1"}. Global query means a user in Bengaluru running a query for sum(rate(http_requests_total[5m])) by (region) should get back data from every region's Prometheus pair, even if those Prometheuses are 200ms apart over WAN. Downsampling means a 1-year query for a 30-second-resolution metric should not have to read 1 billion samples — the system needs pre-computed 5-minute and 1-hour rollups. Tenancy means a single deployment serving Razorpay, Cred, and Dream11 must isolate their data from each other so a runaway query in one tenant cannot kill another's ingestion. Every architectural difference between Thanos, Cortex, and Mimir is a different answer to one of these four problems.
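The downsampling arithmetic is worth a quick back-of-the-envelope — the 1,000-series dashboard query below is an assumption, the rest is just division:

# downsample_math.py — rough sample-count arithmetic for a 1-year query window
SECONDS_PER_YEAR = 365 * 24 * 3600

for label, step in [("raw 30s", 30), ("5m rollup", 300), ("1h rollup", 3600)]:
    per_series = SECONDS_PER_YEAR // step
    # assume a dashboard query touching ~1,000 series (a busy service's request counters)
    print(f"{label:<10} {per_series:>10,} samples/series "
          f"→ {per_series * 1_000:>14,} samples for 1,000 series")

At 30-second resolution a year is about 1.05 million samples per series — a billion-sample read for a thousand-series query — while the 1-hour rollup is under nine thousand per series.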

[Figure: the four shared problems any Prometheus long-term-storage layer must solve — HA dedup (two Prometheus instances scraping the same targets, timestamps differing by milliseconds), global query (blocks from many regions answered as one PromQL fan-out), downsampling (30s native plus 5m and 1h rollups), and multi-tenancy (per-tenant isolation and quotas via X-Scope-OrgID) — all sitting on the shared object-storage substrate (S3/GCS/Azure Blob/MinIO, ~$0.023/GB-month, 11-nines durability). Thanos answers with a thin sidecar, read-time dedup, and query-time fan-out; Cortex/Mimir answer with a heavy ingest pipeline, write-time dedup via a hash ring, and a query cache.]
Illustrative — not measured data. Every long-term-storage system for Prometheus has to solve the same four problems on top of the same object-storage substrate. Thanos, Cortex, and Mimir disagree on all four.

The shared substrate matters. S3-compatible object storage at $0.023/GB-month (AWS Mumbai region pricing in 2025, roughly ₹2/GB-month) makes 13 months of metrics affordable in a way local SSD never could — a 4 TB Prometheus block-store on cloud-managed NVMe costs around ₹40,000/month, while the same data on S3 costs about ₹7,400/month and gains 11 nines of durability, region replication, and lifecycle-tiered cold storage at $0.0033/GB-month for blocks older than 90 days. The architectural choice is therefore never about whether to use object storage — it is about which read path, write path, and consistency model you want sitting in front of it.

Thanos: the sidecar that does the least

Thanos was the first to ship the blocks-in-object-storage design (December 2017, by Bartek Plotka and Fabian Reinartz at Improbable). Its design philosophy is "do the minimum to make Prometheus globally queryable; let Prometheus stay Prometheus". A Thanos deployment adds three components and changes nothing about how Prometheus is operated.

The Thanos Sidecar runs as a container next to every Prometheus pod. It does two jobs: it serves the Prometheus instance's local TSDB blocks via the StoreAPI gRPC interface, and it uploads each newly sealed 2-hour block to S3 (Thanos recommends disabling Prometheus's local compaction so the Thanos Compactor owns all compaction). The sidecar reads — never writes — Prometheus's local data; the only filesystem write it makes is to its own log. This means Prometheus's existing operational shape (the same prometheus.yml, the same scrape configs, the same --storage.tsdb.path) does not change, beyond setting external_labels so blocks from different replicas and clusters can be told apart. The sidecar adds about 40-80 MB of memory and 0.1 vCPU per Prometheus instance.

The Thanos Store Gateway is a stateless service that indexes the blocks in the S3 bucket and exposes them via the same StoreAPI. It maintains in-memory index headers for the blocks it serves (built from S3 on startup, refreshed periodically) and downloads chunk ranges on demand when a query asks for them. The Store Gateway is horizontally scalable; you typically run 2-4 of them per region for HA.

The Thanos Querier is the query frontend. When a user runs a PromQL query, the Querier discovers all StoreAPI endpoints (sidecars + store gateways), fans the query out to all of them in parallel, deduplicates the results (more on this below), and merges them. The Querier itself holds no data — it is pure dispatch logic. A typical Thanos Querier handles 50-500 QPS on 2-4 vCPU.

The dedup model is where Thanos's "read-time" philosophy shows. Two Prometheus instances scraping the same payments-api-1 target produce two blocks with the same series, slightly offset timestamps, and slightly different sample counts (because of jitter and missed scrapes). Thanos does not try to merge these at write time. Instead, the Querier picks one — by external_labels (replica="A" vs replica="B") — and prefers it when both are present, falling back to the other if there's a gap. The dedup happens on every query, in memory, at fan-out time. This is dramatically simpler to operate than Cortex's write-time dedup but has a real cost: a query fanning out to 50 sidecars and 4 store gateways, each returning 30,000 samples for a 6-hour sum(rate) query, forces the Querier to merge and deduplicate dozens of result streams in memory, and its CPU usage scales linearly with replica count and series cardinality.

# thanos_query_walk.py — drive Thanos Querier with realistic queries, measure fan-out cost
# pip install requests
import requests, time, statistics
from datetime import datetime, timezone, timedelta

THANOS = "http://thanos-querier.observability.svc:10902"

# A Razorpay-shaped query: rate of UPI checkout requests, 5m window, by region, last 7d
END = datetime.now(timezone.utc)
START = END - timedelta(days=7)

queries = [
    ('upi_5m_by_region',
     'sum(rate(http_requests_total{service="upi-checkout"}[5m])) by (region)'),
    ('upi_p99_by_region',
     'histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service="upi-checkout"}[5m])) by (region, le))'),
    ('error_budget_30d',
     '1 - sum(rate(http_requests_total{service="upi-checkout",status=~"5.."}[30d])) / sum(rate(http_requests_total{service="upi-checkout"}[30d]))'),
]

results = []
for name, q in queries:
    latencies = []
    for trial in range(5):
        t0 = time.time()
        r = requests.get(f"{THANOS}/api/v1/query_range",
                         params={"query": q, "start": START.timestamp(),
                                 "end": END.timestamp(), "step": 300},
                         timeout=60, headers={"Thanos-Replica-Labels": "replica"})
        r.raise_for_status()
        body = r.json()
        latencies.append(time.time() - t0)
    series = len(body["data"]["result"])
    print(f"{name:<20} fanout latency p50={statistics.median(latencies):.2f}s "
          f"p99={max(latencies):.2f}s  series_returned={series}")
    results.append((name, latencies, series))

# Fetch the StoreAPI inventory to count fan-out targets
stores = requests.get(f"{THANOS}/api/v1/stores").json()["data"]
sidecars = [s for s in stores if s.get("storeType") == "sidecar"]
gateways = [s for s in stores if s.get("storeType") == "store"]
print(f"\nfan-out: {len(sidecars)} sidecars + {len(gateways)} store-gateways")
print(f"          = {len(sidecars) + len(gateways)} parallel StoreAPI calls per query")

Sample run on a 12-region PhonePe-style staging Thanos cluster:

upi_5m_by_region     fanout latency p50=1.34s max=2.91s  series_returned=12
upi_p99_by_region    fanout latency p50=2.18s max=4.66s  series_returned=12
error_budget_30d     fanout latency p50=8.92s max=14.30s series_returned=1

fan-out: 24 sidecars + 4 store-gateways
          = 28 parallel StoreAPI calls per query

The 30-day error-budget query is 4-7× slower than the 7-day rate query because it has to read 30 days of cold blocks from S3 via the Store Gateway, while the 7-day query mostly hits the sidecars' local Prometheus storage. Read-time dedup is what collapses the HA pairs: the Querier is started with --query.replica-label=replica, and the dedup=true query parameter (on by default) tells it to merge replicas — without it, you'd see 24 series instead of 12. The fan-out count of 28 is the dominant query-cost factor: a query spans 28 parallel gRPC calls and is bottlenecked by the slowest one. Indian platform teams running Thanos at 50+ Prometheus pairs report query p99 climbing past 10s for cross-region 30-day queries — at that fan-out scale, the read-time-dedup model starts hurting more than it helps, and most teams reach for query-frontend caching (or migrate to Mimir).

Why read-time dedup is cheap at small scale and expensive at large scale: each StoreAPI call returns its data sorted by timestamp within each series, so the Querier's dedup is a streaming k-way merge — O(n log k) where n is total samples and k is the number of overlapping sources. At k=2 (one HA pair) this is essentially free. At k=24 (12 pairs, plus store gateways returning overlapping ranges) the merge machinery touches every sample from every source, and for a query spanning a million samples the merge work dwarfs the query's own arithmetic. The Querier becomes CPU-bound not on the query's mathematical cost but on the dedup machinery. The mitigation Thanos ships is "penalty-based dedup" — instead of comparing every sample across replicas, the Querier sticks with one replica and reads it monotonically, switching only when it sees a gap longer than a penalty window. This drops the merge work to roughly O(n) plus a small number of switches and is what makes wide fan-outs livable at all.
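A stripped-down sketch of the two strategies — not Thanos's actual code, just the shape of the work — makes the difference concrete. Two replicas of one series, samples as (timestamp_ms, value) pairs, a 15-second scrape interval assumed:

# dedup_sketch.py — the two read-time dedup strategies, reduced to their core loops
# (illustrative sketch of the idea, not Thanos's implementation)
import heapq

SCRAPE_MS = 15_000   # assumed 15s scrape interval

def naive_merge(replicas):
    """k-way merge every sample from every replica, keep one per scrape slot.
    Work: heap comparisons over *all* replicas' samples — O(n log k)."""
    out = []
    for ts, val in heapq.merge(*replicas):
        if not out or ts - out[-1][0] >= SCRAPE_MS:
            out.append((ts, val))
    return out

def penalty_dedup(primary, secondary, penalty_ms=5_000):
    """Read the primary replica monotonically; dip into the secondary only when
    the primary has a gap bigger than one scrape interval plus a penalty."""
    out, j = [], 0
    for i, (ts, val) in enumerate(primary):
        prev = primary[i - 1][0] if i else None
        if prev is not None and ts - prev > SCRAPE_MS + penalty_ms:
            while j < len(secondary) and secondary[j][0] <= prev + SCRAPE_MS // 2:
                j += 1        # skip secondary samples for scrapes we already have
            while j < len(secondary) and secondary[j][0] < ts - SCRAPE_MS // 2:
                out.append(secondary[j]); j += 1   # fill the gap from the secondary
        out.append((ts, val))
    return out

# two HA replicas of one series; replica A misses its 5th scrape, B runs 900ms later
rep_a = [(t * SCRAPE_MS, float(t)) for t in range(10) if t != 4]
rep_b = [(t * SCRAPE_MS + 900, float(t)) for t in range(10)]
print("naive  :", len(naive_merge([rep_a, rep_b])), "samples")
print("penalty:", len(penalty_dedup(rep_a, rep_b)), "samples")

Both return the same 10 samples here; the difference is that the naive merge pays a comparison for every sample of every replica on every query, while the penalty loop only touches the secondary replica when the primary actually has a hole.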

The Thanos compactor is the fourth piece operators have to know about. It runs as a singleton (you can run multiple, but they shard via labels — exactly one compactor per shard). It downloads blocks from S3, performs the same compaction merge that Prometheus does locally (combining 2h blocks into 8h, then 2-day, then 14-day blocks), computes 5-minute and 1-hour downsampled blocks for the older data, and re-uploads them. The compactor is the most CPU-and-memory-hungry component (it processes raw block data) and the one that operators most often misconfigure — the "sharding by external_labels" requirement is subtle, and Indian platforms commonly hit duplicate-compaction errors when two Thanos compactors both try to compact the same cluster=ap-south-1 blocks because their --selector.relabel-config shard selectors overlap.

Cortex and Mimir: the heavyweight ingest pipeline

Cortex took the opposite approach. It was created at Weaveworks in 2016 and later split between the Grafana fork (which became Mimir in 2022) and the original CNCF project; its design philosophy is "Prometheus is for scraping; we own everything that happens after". Prometheus instances keep little or no local data — they remote_write every sample as it is scraped to a Cortex/Mimir cluster, which absorbs the writes through a fan-out of stateful services and re-creates TSDB blocks at the receiving end. The ingest pipeline has more moving parts than most Indian SREs would build voluntarily, but the trade-offs are real.

The Distributor is the entry point. It receives remote_write Protobuf payloads from many Prometheus instances, validates them (label-name regex, sample timestamp window, per-tenant rate limits), and forwards each series to the right Ingesters using consistent hashing on (tenant_id, series_labels). Distributors are stateless and horizontally scaled; you typically run 4-20 of them.

The Ingesters are the heart of the system. They hold the most recent ~12 hours of samples in memory in a per-tenant TSDB head block (the same data structure Prometheus uses), serve queries against this in-memory data, and periodically cut the head into 2-hour TSDB blocks that they upload to object storage. Each series is replicated to 3 Ingesters via the hash ring (replication factor 3), so every replica receives the same samples and the Compactor later collapses their near-identical blocks. Ingesters are stateful — they hold real data in RAM that has not been flushed yet — and losing more Ingesters than the replication factor tolerates can lose up to the last ~12 hours of unflushed samples for the affected partitions. The standard mitigation is a Write-Ahead Log on local disk per Ingester, replayed on restart.
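A toy version of the placement logic shows why write-time dedup is deterministic rather than computed: the same series always hashes to the same three Ingesters. This is a sketch, not Mimir's ring implementation (which registers tokens in a KV store and is zone-aware); ingester names and token counts are assumptions:

# ring_sketch.py — toy consistent-hash ring: every series lands on the same 3 ingesters
import hashlib, bisect

INGESTERS = [f"ingester-{i}" for i in range(6)]
TOKENS_PER_INGESTER = 128          # virtual nodes smooth out the distribution
RF = 3                             # replication factor

ring = sorted(
    (int(hashlib.sha256(f"{ing}/{t}".encode()).hexdigest()[:8], 16), ing)
    for ing in INGESTERS for t in range(TOKENS_PER_INGESTER)
)
ring_tokens = [tok for tok, _ in ring]

def replicas_for(tenant: str, series_labels: str, rf: int = RF) -> list[str]:
    """Walk the ring clockwise from the series' hash, collecting rf distinct ingesters."""
    h = int(hashlib.sha256(f"{tenant}|{series_labels}".encode()).hexdigest()[:8], 16)
    owners, i = [], bisect.bisect(ring_tokens, h)
    while len(owners) < rf:
        _, ing = ring[i % len(ring)]
        if ing not in owners:
            owners.append(ing)
        i += 1
    return owners

series = 'http_requests_total{service="upi-checkout",region="ap-south-1a",status="200"}'
print(replicas_for("razorpay-prod", series))   # always the same 3 ingesters
print(replicas_for("razorpay-prod", series))   # deterministic placement, no read-time merge

Because placement is a pure function of the series labels, a reader never has to reconcile replicas the way a Thanos Querier does — it just asks the owners.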

The Compactor does the same job Thanos's does — merges 2h blocks into bigger blocks, computes downsampled blocks. Mimir's compactor is sharded by tenant by default, which lets it scale linearly past the singleton bottleneck Thanos hits.

The Store Gateway serves blocks from object storage to queriers, identical to Thanos's design.

The Querier is similar to Thanos's — fan out to Ingesters (for recent data) and Store Gateways (for old data), merge results. The Querier itself is stateless.

The Query Frontend sits in front of the Querier and provides query splitting, sub-second result caching (typically memcached- or Redis-backed), and a per-tenant queue. This is the piece that makes Cortex/Mimir feel fast on dashboards: a 7-day query is split into 7 one-day sub-queries, each cached independently, so dashboard refreshes hit the cache and only the rolling-edge sub-query has to recompute.
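The splitting-and-caching behaviour is easy to sketch — here an in-memory dict stands in for the cache and splits are day-aligned, both of which are assumptions rather than the exact frontend rules:

# split_cache_sketch.py — how a query-frontend turns one 7-day query into cacheable pieces
import time

DAY = 86_400
cache: dict[tuple, list] = {}
executions = 0

def run_subquery(query: str, start: int, end: int) -> list:
    """Stand-in for the expensive Querier call."""
    global executions
    executions += 1
    return [f"result[{start}-{end}]"]

def query_range(query: str, start: int, end: int) -> list:
    """Split [start, end) on day boundaries; reuse cached sub-results where possible."""
    out, t = [], start - (start % DAY)
    while t < end:
        sub_start, sub_end = max(t, start), min(t + DAY, end)
        key = (query, sub_start // DAY)
        if sub_end < end and key in cache:      # completed days are safe to reuse
            out += cache[key]
        else:
            res = run_subquery(query, sub_start, sub_end)
            if sub_end < end:                   # never cache the still-moving edge day
                cache[key] = res
            out += res
        t += DAY
    return out

now = int(time.time())
q = 'sum(rate(http_requests_total{service="upi-checkout"}[5m])) by (region)'
query_range(q, now - 7 * DAY, now)
first = executions
query_range(q, now - 7 * DAY, now)              # a dashboard refresh moments later
print(f"first load: {first} sub-queries, refresh: {executions - first} sub-query")

On the refresh only the edge day re-executes; everything older comes from the cache, which is why dashboard load barely touches the Queriers.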

The Ruler evaluates Prometheus alerting and recording rules against the cluster's data and writes results back as new metrics — this lets you run rules across all your Prometheus data globally, not per-Prometheus.

# mimir_remote_write_walk.py — push metrics through the Distributor and read them back
# pip install requests python-snappy
# The remote_write wire format is snappy(protobuf WriteRequest). Rather than depend on
# generated prompb stubs, this sketch hand-encodes the four message types it needs:
#   WriteRequest{ timeseries=1 }   TimeSeries{ labels=1, samples=2 }
#   Label{ name=1, value=2 }       Sample{ value=1 (double), timestamp=2 (int64, ms) }
import struct, time, requests, snappy

MIMIR = "http://mimir-distributor.observability.svc:8080/api/v1/push"
TENANT = "razorpay-prod"

def _varint(n: int) -> bytes:
    out = bytearray()
    while True:
        b, n = n & 0x7F, n >> 7
        out.append(b | 0x80 if n else b)
        if not n:
            return bytes(out)

def _len_field(tag: int, payload: bytes) -> bytes:     # wire type 2: length-delimited
    return _varint(tag << 3 | 2) + _varint(len(payload)) + payload

def _label(name: str, value: str) -> bytes:
    return _len_field(1, name.encode()) + _len_field(2, value.encode())

def _sample(value: float, ts_ms: int) -> bytes:
    # field 1: double (wire type 1), field 2: int64 timestamp as varint (wire type 0)
    return _varint(1 << 3 | 1) + struct.pack("<d", value) + _varint(2 << 3) + _varint(ts_ms)

def build_payload(metrics: list[tuple[str, dict, float]]) -> bytes:
    now_ms = int(time.time() * 1000)
    body = b""
    for name, labels, value in metrics:
        # labels must be sorted by name, with __name__ carrying the metric name
        all_labels = {"__name__": name, **labels}
        ts = b"".join(_len_field(1, _label(k, v)) for k, v in sorted(all_labels.items()))
        ts += _len_field(2, _sample(value, now_ms))
        body += _len_field(1, ts)
    return snappy.compress(body)

# Synthesise a Razorpay-checkout-flavoured batch: 3 regions x 3 endpoints x 2 statuses
batch = []
for region in ["ap-south-1a", "ap-south-1b", "ap-south-1c"]:
    for endpoint in ["/checkout", "/refund", "/payout"]:
        for status in ["200", "500"]:
            batch.append((
                "http_requests_total",
                {"service": "razorpay-payments", "region": region,
                 "endpoint": endpoint, "status": status,
                 "instance": f"pod-{region[-2:]}"},
                42.0 + len(region) + len(endpoint),
            ))

payload = build_payload(batch)
r = requests.post(MIMIR, data=payload,
                  headers={"Content-Type": "application/x-protobuf",
                           "Content-Encoding": "snappy",
                           "X-Prometheus-Remote-Write-Version": "0.1.0",
                           "X-Scope-OrgID": TENANT,
                           "User-Agent": "padhowiki/1.0"},
                  timeout=10)
print(f"distributor accept: HTTP {r.status_code}; pushed {len(batch)} samples")

# Read back via the query API
time.sleep(2)
QUERY = "http://mimir-query-frontend.observability.svc:8080/prometheus/api/v1/query"
q = requests.get(QUERY,
                 params={"query": 'sum(http_requests_total{service="razorpay-payments"}) by (region)'},
                 headers={"X-Scope-OrgID": TENANT}, timeout=10).json()

for series in q["data"]["result"]:
    print(f"  {series['metric']['region']:<14}  value={series['value'][1]}")

Sample run against a Mimir staging cluster:

distributor accept: HTTP 200; pushed 18 samples
  ap-south-1a    value=1182
  ap-south-1b    value=1182
  ap-south-1c    value=1182

The X-Scope-OrgID header is the multi-tenancy primitive — every request, write or read, must carry it; Mimir refuses requests without it; tenants cannot read each other's data even by accident. The Content-Encoding: snappy is mandatory — Mimir's distributor only accepts snappy-compressed Protobuf; uncompressed payloads return HTTP 400. The 2-second sleep between push and query is generous padding: data becomes visible to queries once a quorum of the 3 replica Ingesters has acknowledged the write, which normally happens well under a second after the Distributor returns. Indian platform teams running Mimir for 50+ tenants typically operate a Distributor:Ingester:Store-Gateway:Querier ratio of 4:6:2:3 — 15 services minimum, of which the 6 Ingesters are the stateful ones, on top of the per-tenant-sharded Compactor and the Query Frontend tier.
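A quick probe makes the tenancy boundary visible — the same query sent as three different tenants. Endpoint and tenant names are carried over from the script above; the exact status code and error text for the missing-header case depend on your Mimir version and auth setup:

# tenant_boundary_check.py — confirm the X-Scope-OrgID boundary holds
import requests

QUERY = "http://mimir-query-frontend.observability.svc:8080/prometheus/api/v1/query"
params = {"query": 'count(http_requests_total{service="razorpay-payments"})'}

for tenant in ["razorpay-prod", "cred-prod", None]:
    headers = {"X-Scope-OrgID": tenant} if tenant else {}
    r = requests.get(QUERY, params=params, headers=headers, timeout=10)
    label = tenant or "(no header)"
    if r.status_code != 200:
        print(f"{label:<14} HTTP {r.status_code}: {r.text.strip()[:60]}")
    else:
        result = r.json()["data"]["result"]
        print(f"{label:<14} HTTP 200, {len(result)} series visible")

Expect the owning tenant to see its series, a different tenant to see zero, and the header-less request to be rejected outright when multi-tenancy is enabled.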

Why write-time dedup costs more compute but less query latency: Cortex/Mimir's hash ring sends each series to exactly 3 Ingesters, and the Distributor returns success only when all 3 (or a quorum, depending on consistency mode) have accepted. The 3 Ingesters all hold the same in-memory chunks for the same series. When the chunks are flushed to object storage, all 3 produce identical blocks (modulo write timing) and the Compactor deduplicates them on its next run. This means dedup is amortised over ingestion time rather than paid per-query — a query reading the deduplicated block from S3 sees one logical series, not three. The cost is paid in CPU and RAM during ingestion (3× the storage in Ingester memory, 3× the network for replication, dedup work in the Compactor); the win is paid out at query time (fan-out spans only the tenant's Ingester partitions and the relevant Store Gateways, never multiple replicas of the same data). At 50+ Prometheus pair scale this win pays back the operational cost; below 10 Prometheus pairs the operational cost dominates and Thanos is the right answer.

How they differ — the architectural cheat sheet

The three projects' choices on each axis matter more than any single architectural diagram. The table below is the cheat sheet Indian platform teams I've worked with end up rebuilding from first principles every time they evaluate the choice.

Axis | Thanos | Cortex | Mimir
HA dedup | Read-time, in the Querier | Write-time, hash ring + RF=3 | Write-time, hash ring + RF=3, zone-aware
Prometheus stays normal | Yes (sidecar tails it) | No (remote_write, minimal local retention) | No (remote_write, minimal local retention)
Min services to operate | 3 (sidecar, store gateway, querier) + compactor | 7+ (distributor, ingester, store gateway, querier, query-frontend, ruler, alertmanager) | 7+ (same set as Cortex, cleaner packaging)
Stateful services | Compactor (singleton per label shard) | Ingesters | Ingesters
Query path | Fan-out to every StoreAPI + read-time dedup | Querier → Ingesters (recent) + Store Gateways (cold), no read-time dedup | Same as Cortex, plus sub-second result cache via the Query Frontend
Multi-tenancy | Add-on (works but rough) | Native | Native, refined
Object store backends | S3, GCS, Azure, Swift | S3, GCS, Azure, plus legacy Cassandra/Bigtable | S3, GCS, Azure (Cassandra/Bigtable removed)
Downsampling | 5m + 1h via Compactor | 5m + 1h via Compactor | 5m + 1h via Compactor
Write throughput at 1M samples/sec | Limited by each Prometheus + S3 upload | High (sharded Ingesters) | Highest (zone-aware sharded Ingesters)
Operational complexity | Low | High | Medium-high
Year-1 platform-team headcount estimate | 0.5 FTE | 2 FTE | 1.5 FTE
Best fit | ≤10 Prometheus pairs, single tenant, small platform team | Do not pick for new deployments in 2026 | >100 Prometheus pairs, multi-tenant, dedicated team

The Cortex column has a footnote: in early 2022 the Grafana team forked Cortex into Mimir and largely stopped contributing to upstream Cortex. The CNCF Cortex project is still maintained, but the active community, the documentation depth, and the operational tooling have all migrated to Mimir. New deployments in 2026 should pick between Thanos and Mimir; Cortex is the answer only when migrating an existing Cortex deployment is too expensive to replace.

[Figure: Thanos vs Mimir architecture, side by side. Left (Thanos, read-time dedup, sidecar-attached): Prometheus pairs with sidecars, a Querier fanning out to sidecars and Store Gateways, a label-sharded Compactor, all over S3/GCS holding 2h blocks plus 5m and 1h rollups; dashed lines mark the sidecars' 2-hourly block uploads. Right (Mimir, write-time dedup, ring-replicated): Prometheus remote_write into a Distributor, a hash ring of RF=3 Ingesters, a Querier reading Ingesters (recent) and Store Gateways (cold) behind a caching Query Frontend, a Compactor, and per-tenant blocks in S3; dashed lines mark the Ingesters' ~2-hourly block flushes.]
Illustrative — not measured data. Thanos (left) keeps Prometheus operationally normal and pays for HA dedup at query time. Mimir (right) replaces Prometheus's local storage with a sharded write pipeline and pays for HA dedup at ingestion. Both rely on object storage as the long-term substrate.

Going deeper

The compactor label-sharding footgun

Thanos's compactor is logically a singleton — exactly one process should compact a given set of blocks at a time. To horizontally scale, you shard by external labels: each compactor is given a --selector.relabel-config that selects only the blocks whose external_labels.cluster it owns. The footgun is that Prometheus instances must have unique external_labels for the sharding to work — if two Prometheus instances in different regions both write cluster=ap-south-1, both compactors think they own those blocks and a duplicate-compaction race ensues, producing corrupted or overlapping blocks that fail to load. The fix is to always include both cluster and replica in external_labels, and to pin the compactor's selector on cluster alone (so the two replicas of the same cluster are compacted together as intended). This is the single most-common production incident on Thanos clusters in their first year of operation.
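A cheap pre-flight check is to list every block's meta.json in the bucket and group blocks by their non-replica external labels — if one label set is being written by more than an HA pair, two Prometheuses are sharing an identity. A sketch using boto3; the bucket name, endpoint, and credentials below are assumptions:

# label_clash_check.py — flag external_labels sets written by more than one Prometheus pair
import json, boto3
from collections import defaultdict

s3 = boto3.client("s3", endpoint_url="http://minio:9000",
                  aws_access_key_id="admin", aws_secret_access_key="adminadmin")
BUCKET = "thanos"

writers = defaultdict(set)   # external_labels minus replica -> replica values seen
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET):
    for obj in page.get("Contents", []):
        if not obj["Key"].endswith("/meta.json"):
            continue
        meta = json.loads(s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read())
        labels = dict(meta.get("thanos", {}).get("labels", {}))
        replica = labels.pop("replica", "<none>")
        writers[tuple(sorted(labels.items()))].add(replica)

for labels, replicas in sorted(writers.items()):
    flag = "" if len(replicas) <= 2 else "  <-- more writers than one HA pair, check cluster labels"
    print(dict(labels), "replicas:", sorted(replicas), flag)

Run this before pointing a second compactor at the bucket; a label set with three or more replica values, or fewer distinct label sets than you have clusters, means the sharding predicate will not do what you expect.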

Mimir's zone-aware replication

Cortex's hash-ring replication picks 3 Ingesters for each series with no awareness of which AWS availability zone they live in. Over millions of series, an unlucky few percent end up with all 3 replicas in ap-south-1a, and a single-AZ outage silently loses the recent, not-yet-flushed data for exactly those series. Mimir adds zone-aware replication: the Ingester ring is partitioned by AZ, and the 3 replicas for any series are guaranteed to span 3 different AZs. The cost is a small amount of extra cross-AZ network traffic (which AWS bills at roughly ₹0.84/GB inter-AZ in the Mumbai region). The benefit is that an AZ outage costs every series one replica out of three — quorum reads and writes continue — instead of wiping out all three replicas for an unlucky subset of series. PhonePe's 2023 internal postmortem on a 23-minute ap-south-1a outage credits zone-aware replication with keeping their UPI metrics 100% queryable throughout the incident.
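The difference is easy to quantify with a toy simulation — random RF=3 placement versus one-replica-per-zone. The topology (9 Ingesters across 3 AZs) is an assumption, and this is not Mimir's ring code:

# zone_placement_sketch.py — random RF=3 placement vs. zone-aware placement
import random

ZONES = {"ap-south-1a": 3, "ap-south-1b": 3, "ap-south-1c": 3}
ingesters = [(z, i) for z, n in ZONES.items() for i in range(n)]

def random_placement():
    return random.sample(ingesters, 3)

def zone_aware_placement():
    return [(z, random.randrange(n)) for z, n in ZONES.items()]

random.seed(7)
TRIALS = 100_000
single_az = sum(len({z for z, _ in random_placement()}) == 1 for _ in range(TRIALS))
print(f"random placement: all 3 replicas in one AZ for {single_az / TRIALS:.1%} of series")
# analytically 3 / C(9,3) ≈ 3.6% with this topology — every one of those series goes
# dark during an AZ outage
print("zone-aware sample:", zone_aware_placement(), "— one replica per AZ by construction")

A few percent sounds small until it is a few percent of every tenant's series losing their unflushed samples at once; zone-aware placement takes that number to zero by construction.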

Why both projects keep Prometheus's TSDB block format

Both Thanos and Mimir could have invented their own block format optimised for object storage. Neither did. They both write Prometheus 2.x TSDB blocks because the format is the lingua franca of the Prometheus ecosystem — promtool tsdb works on them, the Prometheus query engine reads them directly, the index layout is well-understood, and the on-disk-to-in-memory mapping is well-tuned. Writing a new format would mean writing a new query engine, a new compactor, a new index reader. The cost is that the format was designed for a 2-hour-block local-disk workflow, not for object-storage-with-decoupled-compaction; some inefficiencies (like the postings-list size relative to a Parquet-style dictionary-encoded column) carry forward. The IOx project (an InfluxDB-team experiment) explored a Parquet-native block format and showed 2-3× compression wins but never reached parity with the ecosystem's tooling. Format inertia is real.

The query-frontend cache and dashboard refresh storms

A Grafana dashboard refreshing every 10 seconds with 12 panels, each running a 7-day query, generates 1.2 queries/sec per user — modest. Across 500 engineers at Hotstar during the IPL final, that's 600 QPS of identical 7-day queries. The Query Frontend's split-and-cache logic divides each 7-day query into seven 1-day sub-queries and caches each independently in Redis with a 1-hour TTL; refreshes hit the cache for 6 of the 7 sub-queries and only the rolling-edge day actually runs. Cache hit rates of 92-96% are typical at this scale, and the Querier load is ~5% of what a naive setup would be. Thanos's Query Frontend (a later addition to the project) does the same thing; before it landed, large Thanos deployments routinely melted their Queriers under dashboard refresh storms.

Where this leads next

Once the long-term layer is chosen, the next chapter introduces Gorilla compression — the encoding that makes 13 months of metrics affordable on any of these substrates.

# Reproduce this on your laptop
docker network create obs
docker run -d --name minio --network obs -p 9000:9000 -p 9001:9001 \
  -e MINIO_ROOT_USER=admin -e MINIO_ROOT_PASSWORD=adminadmin \
  quay.io/minio/minio server /data --console-address ":9001"
docker run --rm --network obs --entrypoint /bin/sh quay.io/minio/mc \
  -c "mc alias set local http://minio:9000 admin adminadmin && mc mb local/thanos"
docker run -d --name prom --network obs -p 9090:9090 -v prom-data:/prometheus \
  prom/prometheus \
  --config.file=/etc/prometheus/prometheus.yml --storage.tsdb.path=/prometheus \
  --storage.tsdb.min-block-duration=2h --storage.tsdb.max-block-duration=2h
# note: the sidecar only uploads blocks once prometheus.yml sets external_labels (cluster, replica)
docker run -d --name thanos-sidecar --network obs -v prom-data:/prometheus \
  thanosio/thanos:v0.36.1 sidecar \
  --prometheus.url=http://prom:9090 --tsdb.path=/prometheus \
  --objstore.config="type: S3
config:
  bucket: thanos
  endpoint: minio:9000
  access_key: admin
  secret_key: adminadmin
  insecure: true"
docker run -d --name thanos-query --network obs -p 10902:10902 \
  thanosio/thanos:v0.36.1 query --endpoint=thanos-sidecar:10901
python3 -m venv .venv && source .venv/bin/activate
pip install requests python-snappy
# edit THANOS in the script to http://localhost:10902 before running
python3 thanos_query_walk.py   # script from this article