Downsampling for long retention

It is a Tuesday in April at Hotstar and Karan, on the platform team, has been asked the same question for the third quarter in a row: "what was the p99 of /playback/start during last year's IPL final?" The Prometheus that scraped that data was retired six months ago. The Thanos cluster that took its place keeps 13 months of metrics. Karan opens Grafana, types the query, hits a 1-year time range, and gets back a graph that looks reasonable — until he zooms in and notices that the p99 from a year ago is suspiciously smooth, the spikes that everyone on the call remembers are gone, and the auto-resolved alert that triggered the entire 02:30 IST war room has somehow vanished from history. The data isn't missing. It has been downsampled, and the aggregate that survived was the average, not the quantile.

This chapter is about what gets thrown away when retention windows stretch past compression's economic limit, and what survives. Compression — Gorilla, columnar, hypertable chunks — squeezes each sample into 1.3 to 2 bytes. Past about 30 days that is still too expensive to keep at full resolution; the answer the industry converged on is to keep a summary at a coarser cadence and delete the raw points. Done well, you can answer "what did our error rate look like 11 months ago" at one-hour resolution for a few percent of the storage. Done badly, you save a year of averages, and the p99 question Karan was asked is unanswerable.

Downsampling rolls a high-frequency time series (15 s scrapes) into coarser buckets (5 min, 1 h) by aggregating each bucket. The functions you keep — count, sum, min, max, and pre-computed quantile sketches — determine which queries still work on the rolled-up data. Average is the trap: it survives every reasonable design but answers almost no real question. The right design keeps a multi-aggregate envelope per bucket plus mergeable quantile sketches (t-digest, DDSketch, HdrHistogram) and tiers retention so 7 days are raw, 90 days are 5-minute rollups, 13 months are 1-hour rollups.

What downsampling actually is

A high-resolution time series is a function t -> v sampled every scrape interval. For a 15-second Prometheus scrape over a year, that is 2,102,400 samples per series; multiplied by even 10,000 active series it is 21 billion samples. Compression brings it from roughly 340 GB raw (16 bytes per sample) to roughly 28 GB; downsampling cuts another one to two orders of magnitude on top of that, depending on the tier. The math is unforgiving: storage costs scale linearly in time, queries on long windows must stream every byte, and dashboards that span a year want sub-second responsiveness. The only lever left after compression is throwing samples away.
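A back-of-envelope check of those numbers, in plain Python. The 16-byte raw sample and the 1.3-byte Gorilla average come from the compression discussion above; the filename is just a label:

# retention_math.py — sizing one year of metrics for a 10,000-series fleet
SCRAPE_S = 15
SERIES   = 10_000
YEAR_S   = 365 * 24 * 3600

per_series = YEAR_S // SCRAPE_S                  # 2,102,400 samples/series/year
total      = per_series * SERIES                 # ~21 billion samples

print(f"samples/series/year: {per_series:,}")
print(f"total samples:       {total / 1e9:.1f} billion")
print(f"raw (16 B/sample):   {total * 16 / 1e9:.0f} GB")
print(f"Gorilla (~1.3 B):    {total * 1.3 / 1e9:.0f} GB")
# Compression is now out of levers; everything further comes from deleting
# samples and keeping summaries in their place.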

Downsampling does this by partitioning the timeline into buckets — 5 minutes, 1 hour, 24 hours — and replacing the samples inside each bucket with a small set of summary statistics. The original samples are deleted; the summaries replace them. Queries that ran against raw samples must be rewritten to consume the summaries.
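A minimal sketch of that mechanism, with the aggregate set cut down to count/sum/min/max for brevity (the quantile sketch joins the envelope later in the chapter):

# downsample_sketch.py — partition raw samples into buckets, keep summaries
def downsample(samples: list, bucket_s: int) -> dict:
    """samples: (unix_ts, value) pairs at scrape resolution. Returns
    {bucket_start: {count, sum, min, max}}; once the caller deletes the
    raw points, these summaries are all that remains."""
    buckets: dict = {}
    for ts, v in samples:
        start = ts - (ts % bucket_s)             # align to the bucket boundary
        b = buckets.setdefault(start, {"count": 0, "sum": 0.0,
                                       "min": float("inf"), "max": float("-inf")})
        b["count"] += 1
        b["sum"]   += v
        b["min"]    = min(b["min"], v)
        b["max"]    = max(b["max"], v)
    return buckets

# one hour of 15 s scrapes -> twelve 5-minute summaries
raw = [(t, 80.0 + (t % 300)) for t in range(0, 3600, 15)]
rollup = downsample(raw, 300)
print(len(raw), "raw samples ->", len(rollup), "buckets")    # 240 -> 12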

The mechanism is identical across systems (Thanos, Mimir, M3, VictoriaMetrics, InfluxDB, Datadog), and the tradeoffs are identical. What differs is which aggregates each system keeps and how it handles the queries that the kept aggregates can answer versus the ones they can't.

[Figure: Downsampling rolls 15-second raw samples into 5-minute and 1-hour buckets. Three lanes show the same series over one hour: raw 15 s samples (retained 0–7 days, ~312 B/hr at 1.3 B/sample), twelve 5-minute buckets each reduced to a count/sum/min/max/t-digest envelope (retained 7–90 days, ~80 B/hr), and a single 1-hour envelope (retained 90 days–13 months, ~12 B/hr, roughly 26× smaller than raw per hour). td = t-digest sketch (≈60 B for 4 quantile targets); count, sum, min, and max are 8 B each. Source: Thanos compactor downsample.go (DefaultLevels = 5m, 1h); the same envelope shape is used by Mimir, M3, and VictoriaMetrics.]
Illustrative — exact byte counts depend on the sketch implementation and the metric's value distribution; the order of magnitude is what matters. The three-tier shape (raw / 5-minute / 1-hour) is the default in Thanos and Mimir, and is the design every long-retention TSDB converges on once it admits that storing raw samples for a year is not worth the cost of disk.

Why the bucket boundaries are 5 minutes and 1 hour, specifically: 5 minutes is the smallest bucket coarser than typical alert evaluation windows (30 s to 4 min) — so alerts always fire on raw data, never on rollups, which keeps alert latency honest. 1 hour is the smallest bucket coarser than business-rhythm queries ("show me the daily 09:15 IST market open spike for the last quarter"). Picking a bucket size that does not divide 24 hours evenly (say, 7 minutes) would make daily aggregates split a bucket and force a re-aggregation per query, as the check below shows. Thanos hard-codes 5m and 1h for this reason; Mimir made them configurable but ships the same defaults.
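The divisor constraint, checked in three lines:

# bucket_divisor.py — bucket sizes that tile a day vs one that does not
DAY_MIN = 24 * 60
for bucket_min in (5, 60, 7):
    if DAY_MIN % bucket_min == 0:
        print(f"{bucket_min:2d} min buckets tile a day evenly: {DAY_MIN // bucket_min} per day")
    else:
        print(f"{bucket_min:2d} min buckets split at midnight: daily queries must re-aggregate")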

Why average is the lie

Every downsampling system keeps count and sum per bucket, because the average is sum / count and the average is what every junior engineer reaches for first. The problem is that the average tells you almost nothing about the distribution — and observability is mostly about distributions.

Consider the IPL final at Hotstar. Toss happens at 19:00 IST. For 15 seconds, the /playback/start endpoint sees 25 million concurrent connection attempts; p99 latency goes from 80 ms to 4.2 seconds. The 4-second p99 is the entire story of that minute — it is what the on-call engineer needs to see when they look back later. But a 5-minute average over the spike spreads the 15-second outlier across 300 seconds of mostly-fine traffic, and the average ends up around 230 ms. The spike disappears.

# downsample_lie.py — show how averaging destroys the tail signal
# pip install numpy
import numpy as np

np.random.seed(42)
# Simulate 5 minutes (300 seconds) at 1 sample/sec of the /playback/start latency
# during the IPL toss spike. 15 seconds of pain, 285 seconds of fine.
fine = np.random.gamma(shape=2.0, scale=40.0, size=285)        # mean ~80 ms, median ~60 ms
spike = np.random.gamma(shape=2.0, scale=1500.0, size=15)      # mean ~3 s, tail past 5 s
window = np.concatenate([fine[:142], spike, fine[142:]])

# What downsampling will store
count_       = len(window)
sum_         = float(window.sum())
min_         = float(window.min())
max_         = float(window.max())
mean_        = sum_ / count_
p99_truth    = float(np.percentile(window, 99))
p99_9_truth  = float(np.percentile(window, 99.9))

# What the dashboard will show if it queries the average from the rollup
print(f"5-minute window — what actually happened:")
print(f"  count:                  {count_}")
print(f"  raw p50:                {np.percentile(window, 50):7.1f} ms")
print(f"  raw p99:                {p99_truth:7.1f} ms")
print(f"  raw p99.9:              {p99_9_truth:7.1f} ms")
print(f"  raw max:                {max_:7.1f} ms")
print(f"")
print(f"What 'avg(rate(...))' would tell you:")
print(f"  mean (sum / count):     {mean_:7.1f} ms   <-- the alert that auto-resolved")
print(f"")
print(f"Tail signal preserved by min/max envelope:")
print(f"  max in bucket:          {max_:7.1f} ms   <-- spike still visible")
print(f"  min in bucket:          {min_:7.1f} ms")
5-minute window — what actually happened:
  count:                  300
  raw p50:                   58.2 ms
  raw p99:                 4187.3 ms
  raw p99.9:               5421.8 ms
  raw max:                 5933.4 ms

What 'avg(rate(...))' would tell you:
  mean (sum / count):       234.6 ms   <-- the alert that auto-resolved

Tail signal preserved by min/max envelope:
  max in bucket:           5933.4 ms   <-- spike still visible
  min in bucket:              4.7 ms

Per-line walkthrough. The line fine = np.random.gamma(shape=2.0, scale=40.0, size=285) generates 285 seconds of healthy traffic — gamma-distributed because real latency has a long right tail; the median lands near 60 ms, which matches Hotstar's /playback/start baseline. The line spike = np.random.gamma(shape=2.0, scale=1500.0, size=15) is the 15-second pain window — same shape, 37× the scale, so the median jumps to about 2.5 seconds and the tail goes well past 5 seconds. Why simulate with a gamma instead of a normal? Real latency distributions are heavy-tailed, never symmetric around a mean, and the p99 is what kills you, not the mean. Using np.random.normal would give a symmetric bell that looks fine on a histogram and produces wrong-shaped intuition. The gamma family is about the minimum-fidelity model for latency simulation — the point Gil Tene's "How NOT to Measure Latency" talks hammer on is that mean-shaped reasoning fails on tail-shaped data.

The line mean_ = sum_ / count_ computes what Prometheus's avg_over_time — the most common dashboard query in production — would return on the 5-minute rolled-up bucket. The result is 234 ms. Why this is the failure mode that erodes trust in your observability: a 5-second p99 is paging-worthy; a 234 ms p99 is not. If you only stored count and sum (the minimum aggregates needed to compute the average), and a year later you tried to ask "did we have a spike during last year's toss?", the answer the data gives you is "no, things looked fine, the average was 234 ms". The data is wrong about the past. The mean is, at best, useless for tail diagnosis; at worst, it is actively misleading.

The line max_ = float(window.max()) is the single cheapest fix. Storing max per bucket — 8 bytes — preserves the absolute worst observed sample in each window. It is not the p99 (you need the t-digest for that), but it lights up on the dashboard the moment the spike happens. Thanos stores max per 5-minute bucket for exactly this reason. The min/max envelope is enough to catch the spike, even if you cannot reconstruct the exact p99.

What aggregates to keep

The choice of aggregates is the design. Below is the minimum envelope that survives queries you actually run, ranked by how much each one buys you.

Aggregate              Bytes   Answers the question
count                      8   How many samples were in this bucket? Required for averaging anything else.
sum                        8   What was the total? Required for rate() over rollups.
min, max                  16   Was there a spike? The cheapest tail-detection signal.
t-digest (4 targets)     ~60   What was the p50/p95/p99/p99.9? The only way to answer quantile questions on rollups.

That is the envelope: 92 bytes per bucket per series. At 5-minute buckets, that is about 26 KB/day/series; at 1-hour buckets, about 800 KB/year/series. Across 10,000 active series, the year-long tail costs about 8 GB before the envelope columns are themselves compressed — and they compress like any other time series, which is where the much smaller per-hour figures in the diagram above come from. That is what a year of metrics looks like at 1-hour rollup with full quantile fidelity.
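A quick check of that arithmetic (uncompressed envelope bytes; a sketch, not a storage-engine measurement):

# envelope_math.py — per-tier cost of the 92-byte envelope
ENVELOPE_B = 8 + 8 + 16 + 60                     # count + sum + min/max + t-digest

per_day_5m  = 24 * 12 * ENVELOPE_B               # 288 five-minute buckets/day
per_year_1h = 24 * 365 * ENVELOPE_B              # 8,760 one-hour buckets/year

print(f"5-min tier:  {per_day_5m / 1024:5.1f} KB/day/series")     # ~25.9
print(f"1-hour tier: {per_year_1h / 1024:5.1f} KB/year/series")   # ~787
print(f"10,000 series, 13-month tail: {per_year_1h * 10_000 / 1e9:.1f} GB")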

The interesting structural property is that count, sum, min, max, and the t-digest are all mergeable — given the envelope of bucket A and the envelope of bucket B, you can compute the envelope of A ∪ B without touching the original samples. Count and sum merge by addition; min and max by min/max; t-digest by sketch-merge (the operation t-digest exists for). This is what lets a query for a 1-day window over 1-hour rollups stitch 24 buckets together into a single answer; the math works because every aggregate is mergeable.

# rollup_merge.py — show that the envelope is mergeable across buckets
# pip install tdigest
from tdigest import TDigest
import numpy as np

np.random.seed(7)

class Envelope:
    """The 5-aggregate per-bucket summary that Thanos / Mimir stores."""
    def __init__(self):
        self.count = 0
        self.sum = 0.0
        self.min = float("inf")
        self.max = float("-inf")
        self.td = TDigest()  # mergeable quantile sketch

    def observe(self, x: float):
        self.count += 1
        self.sum += x
        self.min = min(self.min, x)
        self.max = max(self.max, x)
        self.td.update(x)

    def merge(self, other: "Envelope") -> "Envelope":
        out = Envelope()
        out.count = self.count + other.count
        out.sum = self.sum + other.sum
        out.min = min(self.min, other.min)
        out.max = max(self.max, other.max)
        out.td = self.td + other.td  # tdigest's __add__ is sketch-merge
        return out

    def quantile(self, q: float) -> float:
        return float(self.td.percentile(q * 100))

# Three 5-minute buckets representing a Razorpay UPI checkout latency stream
# during a 15-min window straddling a Tatkal-hour spike at 10:00 IST.
samples_a = np.random.gamma(2.0, 40, 600).tolist()              # 09:55–10:00
samples_b = np.random.gamma(2.0, 140, 600).tolist()             # 10:00–10:05 (spike, mean ~280 ms)
samples_c = np.random.gamma(2.0, 50, 600).tolist()              # 10:05–10:10

env_a, env_b, env_c = Envelope(), Envelope(), Envelope()
for x in samples_a: env_a.observe(x)
for x in samples_b: env_b.observe(x)
for x in samples_c: env_c.observe(x)

# Merge across buckets without re-reading samples
merged = env_a.merge(env_b).merge(env_c)
truth = samples_a + samples_b + samples_c

print(f"truth   p99 = {np.percentile(truth, 99):7.1f} ms   max = {max(truth):7.1f} ms")
print(f"merged  p99 = {merged.quantile(0.99):7.1f} ms   max = {merged.max:7.1f} ms")
print(f"error in p99 from sketch merge: {abs(np.percentile(truth, 99) - merged.quantile(0.99)):.2f} ms")
truth   p99 =   780.4 ms   max =  1242.1 ms
merged  p99 =   783.6 ms   max =  1242.1 ms
error in p99 from sketch merge: 3.21 ms

Why the 3 ms error matters and why it is acceptable: t-digest is a sketch — it stores a compressed approximate representation of the distribution, not the raw samples. The error in the p99 estimate over the merged sketch is 3.21 ms on a 780 ms p99, about 0.4% relative error. For the question "was the p99 over the past hour above 200 ms?" this error is invisible. For the question "was the p99 exactly 783.6 ms or 780.4 ms?" — that question is not what histograms answer in the first place. The relative error of t-digest stays near 0.5% across the supported quantile range; HdrHistogram, an alternative sketch with stronger error guarantees but a larger footprint (~3 KB vs ~60 bytes), gets you to 0.1% if you need it. DDSketch is the third member of the family, with explicit relative-error guarantees and a slightly different merge cost. All three are mergeable. The choice is a footprint-vs-precision negotiation, and 0.5% is fine for almost every observability use case.

The line out.td = self.td + other.td is the load-bearing operation of the entire downsampling architecture. Without a mergeable quantile sketch, downsampled rollups cannot answer percentile questions across windows wider than one bucket. Storing the raw histogram per bucket (Prometheus's classical histogram) is one workable answer if you accept the bucket boundaries, but the bucket-boundary choice affects the answer (see latency histograms and quantile interpolation). T-digest, DDSketch, and HdrHistogram all sidestep the bucket-boundary issue by being adaptive sketches.
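A small demonstration of that sensitivity, as a sketch: the same samples, two classical-histogram bucket layouts, and the linear within-bucket interpolation rule that histogram_quantile also uses. The bucket bounds are made up for illustration:

# histogram_bounds.py — same data, two bucket layouts, two different p99s
# pip install numpy
import numpy as np

np.random.seed(3)
samples = np.random.gamma(2.0, 40.0, 10_000)     # the usual latency-shaped data

def hist_quantile(q: float, bounds: list, values: np.ndarray) -> float:
    """Classical-histogram quantile: cumulative bucket counts, then linear
    interpolation inside the bucket containing the target rank."""
    edges = [0.0] + list(bounds)
    counts, _ = np.histogram(values, bins=edges)
    cum = np.cumsum(counts)
    rank = q * len(values)
    i = int(np.searchsorted(cum, rank))          # bucket holding the target rank
    lo, hi = edges[i], edges[i + 1]
    below = cum[i - 1] if i > 0 else 0
    return lo + (hi - lo) * (rank - below) / counts[i]

coarse = [50, 100, 250, 500, 1000]
fine_b = [50, 100, 150, 200, 250, 300, 400, 500, 1000]
print(f"true p99:           {np.percentile(samples, 99):6.1f} ms")
print(f"p99, coarse bounds: {hist_quantile(0.99, coarse, samples):6.1f} ms")
print(f"p99, fine bounds:   {hist_quantile(0.99, fine_b, samples):6.1f} ms")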

How retention tiers fit together

The full retention design uses three or four tiers, each with its own bucket size and aggregates, each transitioning at a configurable boundary. This is the Thanos compactor design and the Mimir long-term-store design, and it is what everyone has converged on.

┌────────────────────────────────────────────────────────────────────────────┐
│ Tier 0 — RAW                                                               │
│ Bucket size: scrape interval (15 s)                                        │
│ Aggregates: full samples                                                   │
│ Retention: 7 days (Prometheus head + first 7 days of TSDB blocks)          │
│ Purpose: alert evaluation, real-time dashboards, recent debugging          │
├────────────────────────────────────────────────────────────────────────────┤
│ Tier 1 — 5-MINUTE ROLLUP                                                   │
│ Bucket size: 5 minutes                                                     │
│ Aggregates: count, sum, min, max, t-digest                                 │
│ Retention: 90 days                                                         │
│ Purpose: weekly / monthly trend dashboards, last-quarter postmortems       │
├────────────────────────────────────────────────────────────────────────────┤
│ Tier 2 — 1-HOUR ROLLUP                                                     │
│ Bucket size: 1 hour                                                        │
│ Aggregates: count, sum, min, max, t-digest                                 │
│ Retention: 13 months                                                       │
│ Purpose: year-over-year comparisons, capacity planning, IPL-final lookback │
└────────────────────────────────────────────────────────────────────────────┘

The transition between tiers is the job of a compactor. Thanos's compactor is a separate process that runs on a schedule; for each TSDB block older than the tier-1 boundary, it reads the raw samples, computes the per-5-minute envelope per series, writes a new block of envelopes, and deletes the raw block. The same process runs again for blocks crossing the tier-2 boundary, this time consuming the tier-1 block. Mimir's long-term-store has the same compactor structure under different module names (compactor, store-gateway).
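In outline, one tier-transition pass looks like the sketch below. It reuses the Envelope class from rollup_merge.py and is a shape sketch, not a transcription of the Thanos compactor's Go code:

# compactor_sketch.py — one downsampling pass over an aging block
# Assumes the Envelope class from rollup_merge.py above.
def compact_block(raw_block: dict, bucket_s: int) -> dict:
    """raw_block: series_id -> [(ts, value), ...] for one aging TSDB block.
    Returns series_id -> {bucket_start: Envelope}. The caller writes the
    result as a new block and, after a safety window, deletes the source."""
    out: dict = {}
    for series_id, samples in raw_block.items():
        buckets: dict = {}
        for ts, v in samples:
            start = ts - (ts % bucket_s)         # align to the bucket boundary
            buckets.setdefault(start, Envelope()).observe(v)
        out[series_id] = buckets
    return out

# The tier-1 -> tier-2 pass has the same shape, except each observe() becomes
# an Envelope.merge() over twelve 5-minute envelopes; mergeability is what
# makes the second pass possible without ever re-reading raw samples.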

The PromQL evaluator on top of the storage knows which tier a query needs to read by looking at the time range and the requested resolution. A query for the last 24 hours hits the raw tier or the 5-minute tier (depending on step); a query for the last 6 months hits the 1-hour tier. The PromQL step parameter is the dial: a step=1m query over a 6-month range is a category error — the storage cannot serve sub-hourly resolution beyond 90 days, so the evaluator either errors or silently widens the step to 1 hour.

Why this matters for dashboard correctness: Grafana's default behaviour is to set step based on the panel's pixel width and the time range. A year-wide panel that is 800 pixels wide will request a step around 11 hours — already coarser than the 1-hour rollup, so the rollup is upsampled by repetition. A 7-day panel at 800 pixels requests a step around 13 minutes, which sits between the raw tier (which only covers 7 days) and the 5-minute tier — Grafana's PromQL transport will pick whichever tier the storage exposes, and the answer differs by which one wins. This is why dashboard panels mysteriously change shape when you scroll the time range: the underlying tier flipped.
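The step arithmetic behind those numbers, first order only (Grafana also applies a minimum interval and rounds the step to a "nice" value):

# grafana_step.py — first-order approximation of Grafana's step selection
def step_s(range_s: int, panel_px: int) -> float:
    return range_s / panel_px                    # roughly one point per pixel

YEAR, WEEK = 365 * 24 * 3600, 7 * 24 * 3600
print(f"1-year panel @ 800 px: step ≈ {step_s(YEAR, 800) / 3600:.1f} h")    # ~11.0 h
print(f"7-day panel  @ 800 px: step ≈ {step_s(WEEK, 800) / 60:.1f} min")    # ~12.6 min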

The transition timings — 7 days, 90 days, 13 months — are not magic. The 7-day boundary is roughly the longest period an alert author needs raw fidelity for (a for: 7d is rare; for: 1h is common). The 90-day boundary aligns with quarterly business reviews and most compliance regimes for "operational" data. 13 months is one full year plus the comparison-tail: "show me last year's IPL final" requires keeping data slightly longer than 12 months to account for week-vs-week, year-over-year alignment. Hotstar uses exactly these boundaries; Razorpay uses 14 days / 90 days / 13 months because their alert windows are longer.

[Figure: The Thanos compactor moves blocks across retention tiers. Raw 2-hour blocks (Tier 0, 0–7 d, ~50 MB/series) age out through a compactor into 5-minute envelope blocks (Tier 1, 7–90 d, ~5 MB/series); a second compactor pass produces 1-hour envelope blocks (Tier 2, 90 d–13 mo, ~800 KB/series). On the query path, a PromQL query with range = 6 months and step = 1 hour fans out through the query frontend and is served entirely from Tier 2.]
Illustrative — derived from the Thanos compactor architecture (`thanos compact`, which downsamples by default; `--downsampling.disable` turns it off); per-tier byte counts are order-of-magnitude approximations on a 10K-series workload. The compactor is a separate process that wakes on a schedule, reads aging blocks, computes the envelope per series per bucket, writes a new block, and (after a configurable safety window) deletes the source. Mimir's compactor module has the same shape with multi-tenant isolation added.

Going deeper

The choice of sketch — t-digest, DDSketch, HdrHistogram

T-digest, DDSketch, and HdrHistogram are the three mergeable quantile sketches that observability systems use. They differ in three dimensions: footprint (the bytes per sketch), error model (the worst-case relative error), and merge cost (the work needed to combine two sketches). T-digest stores a centroid tree where each centroid is a (mean, count) pair, with more centroids near the tails — its footprint is around 60 bytes for a default compression of 100, and its error shrinks toward the extremes (better at p99 than at p50, deliberately). DDSketch stores log-bucketed counters with a fixed relative-error guarantee across the entire quantile range — its footprint grows with the value range, typically 200–400 bytes for common latency distributions, with provably bounded relative error. HdrHistogram stores fixed-precision linear-then-exponential buckets — a fixed ~3 KB footprint for typical latency ranges with better-than-0.1% error. Datadog uses DDSketch internally because the relative-error guarantee composes cleanly across services and tenants; Honeycomb and Lightstep use t-digest for footprint reasons; the Java-world performance community uses HdrHistogram because of its extensive tooling and clean coordinated-omission story.

For downsampling specifically the right pick is t-digest — its small footprint is what makes 1-hour rollups over 13 months affordable. The 0.5% relative error at p99 is not the limiting factor in observability accuracy (the real noise comes from sampling, scrape jitter, and clock skew); the 60-byte-per-bucket footprint is.

Counter resets, gauges, and the rate() trap

Counters in Prometheus are monotonically increasing until they reset (process restart, container reschedule, integer wraparound). The rate() function detects resets and corrects for them: any decrease is treated as a reset, and the last pre-reset value is carried forward so the increase keeps accumulating. When downsampling, a counter that resets within a 5-minute bucket has to be detected — a naive aggregator that takes last-minus-first over the bucket misses the reset and reports a within-bucket increase that is too small (or negative).

Thanos's compactor solves this by tracking, per bucket, both the sum (the corrected within-bucket increment) and a counter_reset_count (number of resets observed in the bucket). The rate() evaluator on the rolled-up data uses both. Gauges have the opposite problem: their semantics are "value at this instant", and the right rollup aggregate is the last value seen in the bucket (not the sum). Mixing up counter and gauge aggregation rules at the rollup layer is a class of bug that has bitten every TSDB at least once; Mimir's block-builder ships with explicit per-metric-type aggregator selection for this reason.

What dashboards see when they cross a tier boundary

A dashboard with a 14-day time range queries a mix of raw (last 7 days) and 5-minute rollup (days 7–14). The PromQL evaluator stitches them together — the raw side serves at full resolution, the rollup side serves at 5-minute resolution, and the resulting series has a visible resolution shift at the boundary. Engineers who do not know this is happening look at the graph and conclude that "the metric got smoother" eight days ago; the metric did not, the storage tier did. Grafana renders the boundary invisibly because PromQL does not expose the tier the data came from in the response.
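The smoothing is measurable. A synthetic sketch (the same underlying process on both sides of the boundary; only the stored resolution changes):

# tier_boundary.py — why the graph "gets smoother" at the rollup boundary
# pip install numpy
import numpy as np

np.random.seed(1)
per_scrape = np.random.gamma(2.0, 40.0, 4 * 60 * 24)    # one day of 15 s samples
rollup_avg = per_scrape.reshape(-1, 20).mean(axis=1)    # the same day as 5-min averages

print(f"raw side    point-to-point stddev: {np.diff(per_scrape).std():5.1f} ms")  # ~80
print(f"rollup side point-to-point stddev: {np.diff(rollup_avg).std():5.1f} ms")  # ~18
# Same service, same day; the rollup side is ~sqrt(20) quieter, so the panel
# looks calmer past the boundary even though nothing about the service changed.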

The fix is a panel annotation that says "rollup boundary at -7d", or a custom datasource that returns the resolution per point. Mimir's mimir-query-frontend exposes a Mimir-Resolution HTTP header on every response that lists the resolution per shard; teams that need to be honest about the rollup transition route this header into their Grafana frontend. Most teams do not, and the dashboards lie quietly.

Reproducibility footer

# Reproduce on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install numpy tdigest
python3 downsample_lie.py
python3 rollup_merge.py
# Optional: spin up a Thanos sidecar + compactor locally to see real downsampling.
# The sidecar needs Prometheus's TSDB volume; the compactor needs an object-store
# config file. Downsampling is on by default (disable with --downsampling.disable).
docker run -d --name prom -p 9090:9090 prom/prometheus
docker run -d --name thanos-sidecar --network=container:prom --volumes-from prom \
  thanosio/thanos:latest sidecar --tsdb.path=/prometheus --prometheus.url=http://localhost:9090
docker run -d --name thanos-compactor -v thanos-data:/data thanosio/thanos:latest \
  compact --data-dir=/data --objstore.config-file=/data/objstore.yml --wait

Where this leads next

Downsampling is the throwing-things-away lever; it sits one floor below compression and one floor above sampling (which throws away whole events at ingest time, not aggregates at retention time). The triplet — sample at ingest, compress at storage, downsample at retention — is the full economic playbook of long-retention observability.

The single insight worth taking away: downsampling is a choice about which questions the future will be allowed to ask of the past. Storing only count and sum says "we will only ever care about averages and totals". Storing the t-digest envelope says "we will still care about p99". The mistake is not making the choice — every system makes it implicitly when its disk fills up and rotation kicks in. The mistake is making it without a t-digest, and then six months later being unable to answer Karan's question about the IPL final because the only fingerprint of that 4-second p99 spike that remains is a 234 ms average that does not match anyone's memory.
