Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

In short

Dashboards almost never need raw 1-second data from six months ago — they need the shape (avg, p99, max) at minute or hour granularity. Downsampling pre-computes those bucketed aggregates and continuous aggregates keep them current incrementally, so a 30-day p99 query reads 720 hourly rows instead of scanning 250 billion raw points. The retention hierarchy (raw 7d, minute 30d, hour 1y, day forever) compounds the win — each tier 10-100x cheaper than the one above, and the dashboard never knows which tier it hit.

The previous chapter ended with a single-node TimescaleDB hypertable answering a "p99 latency last hour" dashboard query in 200 milliseconds — chunk pruning, columnar storage, and per-column compression combining to a 2500x I/O reduction. It is one of the cleanest engineering wins in the database world. And it is about to look quaint.

But that win was for a query covering one hour. What happens when the SRE team wants to look at the same dashboard, but for the last month — to compare today's p99 against the same week last quarter, or to verify that a deployment four weeks ago really did reduce tail latency? The columnar layout helps, but the math gets uncomfortable. 30 days at 200K writes/second is 518 billion rows; even at the columnar layout's compression ratios, that is around 100 GB of compressed latency_ms data to scan. A query that streams 100 GB of decompressed doubles through a percentile estimator, even with vectorised execution, takes 30-90 seconds. That is not dashboard-acceptable. It is barely human-acceptable.

The fix is not to make the scan faster. The fix is to not scan the raw data at all. Almost no dashboard ever needs to know what the latency was at 14:23:07 IST on the third of March. Dashboards need the shape of latency at minute or hour granularity — the trend, the outliers, the diurnal pattern. If the database has already computed (hour, p99_latency_ms) for every hour in the last month and stored it as 720 rows, the dashboard query becomes "select 720 rows from a tiny table" — sub-50ms, regardless of how many billions of raw events are sitting underneath.

That is the entire idea behind downsampling. The rest of this chapter is about how to do it without making your life miserable: how to keep the rollups in sync with new data without re-aggregating from scratch (continuous aggregates), how to organise multiple resolutions in a tiered hierarchy (raw → minute → hour → day), and what the SQL actually looks like in TimescaleDB, the rule syntax in Prometheus, and the Flux task in InfluxDB.

What downsampling actually is

Downsampling is the operation of taking a high-resolution time-series and summarising it at a coarser resolution. Concretely: given raw data points (t_1, v_1), (t_2, v_2), ..., (t_N, v_N) arriving at one-second cadence, you choose a bucket size (say, one minute), partition the timeline into buckets [t_0, t_0 + 60s), [t_0 + 60s, t_0 + 120s), ..., and for each bucket compute one or more aggregate statistics — average, sum, min, max, count, p99 — over the raw points that fell inside that bucket. The output is a new, much smaller time-series at one-minute resolution.

Raw vs downsampledDownsampling — raw 1-second points reduced to 1-minute aggregate bucketsRaw stream (1 sample/second over ~16 days):|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|.|1,000,000,000 points (1B), every column compressed but still ~12 GB of latency_ms doublesBucketing — group every 60 raw points into a 1-minute window:60 pts60 pts60 pts...one bucket per minute...60 ptsAggregated output (1 row per minute):(t,avg,p99,min,max,n) (t,avg,p99,min,max,n) (t,avg,p99,min,max,n) ... 1,000,000 rows ...~80 bytes per bucket -> 80 MB total. 150x smaller than raw, queries 1000x faster.The win — dashboard query "p99 by minute over the last 7 days":- Against RAW: scan ~600M latency values, run percentile -> 30 to 60 seconds (even on columnar)- Against AGGREGATE: read ~10,000 pre-computed rows, plot directly -> 20 millisecondsCost: 80 MB extra storage. Benefit: ~1500x query speedup, every dashboard refresh.The aggregate is always queried, never the raw — dashboards rarely need second-level resolution.

Two design choices fall out immediately. The first is what to aggregate — for a single source metric like latency_ms, you typically want at minimum count, sum, min, max, and one or more percentiles (p50, p95, p99, p999). For a counter like requests_total, you typically want last_value (since the running total is itself an aggregate). For a gauge like cpu_percent, you want avg and max. The aggregates you pick determine which dashboard queries you can serve from the rollup vs falling back to raw. A dashboard that only needs averages can be answered from (t, avg) alone; one that needs p99 needs the percentile pre-computed.

The second design choice is whether percentiles can be combined across buckets. Sums are easy: the sum over an hour is the sum of the sums over the 60 minutes in that hour. Averages are slightly harder: you need to keep (sum, count) per bucket, then re-divide. Mins and maxes are easy: min-of-mins, max-of-maxes. Percentiles are hard. You cannot compute the p99 over an hour from the 60 per-minute p99s — that gives the wrong answer in general. You need to either keep the full distribution per bucket (expensive) or use a sketch like t-digest or HyperLogLog that supports merging across buckets. Why this matters for downsampling: TimescaleDB's percentile_agg and InfluxDB's percentile functions both build on t-digest exactly so that the per-minute aggregates can be merged into per-hour aggregates without re-scanning the raw data. Without a mergeable sketch, you would have to keep the raw data — which defeats the entire point.

The reason mergeable aggregates are the foundation of the whole approach is that they enable a hierarchy of rollups: minute aggregates merge into hourly aggregates merge into daily aggregates, each level computed from the level above without ever touching raw data. This is what lets you keep daily aggregates "forever" cheaply — they are themselves aggregates of hourly aggregates, which are themselves aggregates of minute aggregates.

The naive way and why it fails

The first time an engineer encounters the dashboard-is-too-slow problem, the instinct is to write a cron job:

-- Every minute, append one new row to the rollup.
INSERT INTO latency_per_minute (bucket, p99)
SELECT date_trunc('minute', ts) AS bucket,
       percentile_cont(0.99) WITHIN GROUP (ORDER BY latency_ms) AS p99
FROM events
WHERE ts >= date_trunc('minute', now()) - INTERVAL '1 minute'
  AND ts <  date_trunc('minute', now())
GROUP BY bucket;

This works. For a while. Three failure modes appear in production. First, late-arriving data: events whose timestamp is from three minutes ago but that hit the database now (because of buffering, retries, network delays) never make it into the rollup, because the cron only re-aggregates the most recent complete minute. Second, backfill: if you ingest historical data — a one-off load of last month's events — the rollup has no idea those rows exist and never recomputes the affected buckets. Third, schema or formula changes: if the SRE team decides they also want p999, you have to re-run the aggregation for the entire history, manually, in batches, hoping you do not double-insert.

The continuous-aggregate machinery solves all three by tracking which buckets have been touched by writes and recomputing only those. It is not a cron job; it is a structural feature of the storage engine. The next two sections describe how TimescaleDB and Prometheus implement it.

TimescaleDB continuous aggregates

A TimescaleDB continuous aggregate is a materialised view declared over a hypertable, with two pieces of magic the engine adds on top of plain Postgres materialised views. First, the view is itself stored as a hypertable — partitioned by bucket time — so queries against it benefit from chunk pruning just like queries against raw data. Second, the engine maintains an invalidation log that records which raw chunks have been written to since the view was last refreshed; the refresh job consults the log and recomputes only the affected buckets, not the whole view.

The setup is two SQL statements. First, declare the view:

CREATE MATERIALIZED VIEW orders_per_minute
    WITH (timescaledb.continuous) AS
SELECT time_bucket('1 minute', ts) AS bucket,
       merchant_id,
       count(*) AS order_count,
       avg(amount_inr) AS avg_amount_inr,
       percentile_agg(latency_ms) AS latency_sketch
FROM orders
GROUP BY bucket, merchant_id;

The time_bucket function is TimescaleDB's bucketing primitive — it floors a timestamp to the nearest minute (or any other interval). The percentile_agg function returns a t-digest sketch from which any percentile can later be extracted with approx_percentile(0.99, latency_sketch). Crucially, t-digests are mergeable: two sketches over disjoint windows can be combined without going back to the raw data. Why store the sketch and not just p99: because if you store only p99, you can never compute hourly p99 from the minute aggregates. Storing the t-digest sketch lets you derive p50, p95, p999 — anything — at any coarser resolution by merging.

The second statement adds a refresh policy:

SELECT add_continuous_aggregate_policy('orders_per_minute',
    start_offset => INTERVAL '1 hour',
    end_offset   => INTERVAL '1 minute',
    schedule_interval => INTERVAL '1 minute');

This says: every minute, refresh the view for the time range [now() - 1 hour, now() - 1 minute], but only for buckets whose underlying raw data has actually changed. The end_offset of one minute is intentional — it leaves a small "settling" window during which late-arriving data can still slot into the right bucket before that bucket is materialised. The start_offset of one hour is the maximum lateness the system will tolerate; data arriving more than an hour late will not retroactively update the view. These two parameters are the operationally important knobs, balancing freshness against resilience to late data.

Continuous aggregate refreshContinuous aggregate refresh — only the affected buckets are recomputedRaw hypertable, daily chunks. New writes land in the most recent chunk:day -6day -5day -4day -3day -2day -1today<-- new writes (last 5 min)Invalidation log (engine-maintained):{ chunk_id: today, ts_range: [11:55, 12:00] } { chunk_id: today, ts_range: [11:50, 11:55] }All other chunks have empty invalidation entries -> nothing to recompute there.Continuous aggregate (1-minute buckets, also a hypertable):cachedcachedcachedcachedcachedcachedREBUILDRefresh job runs every minute. Reads the invalidation log. For each entry, deletes the affectedaggregate rows and re-inserts them by re-aggregating just those buckets from the raw data.Cost analysis per refresh cycle:- Without invalidation log: re-scan all of "today" -> ~17B raw rows -> minutes of CPU per refresh- With invalidation log: re-aggregate only ~5 buckets x 60 rows/sec x 60 sec = 18,000 raw rows -> ~5 msThe invalidation log is the structural feature that makes "incremental" actually mean incremental.

Querying the view is a regular SELECT:

SELECT bucket, merchant_id, approx_percentile(0.99, latency_sketch) AS p99
FROM orders_per_minute
WHERE bucket > now() - INTERVAL '7 days'
  AND merchant_id = 12345
ORDER BY bucket;

Because the view is itself a hypertable partitioned by bucket, the same chunk-pruning machinery from the previous chapter applies. The query reads only the chunks of the view covering the last 7 days — about 10,000 rows (7 days x 1440 minutes/day) for one merchant. The dashboard query that took 30 seconds against raw data takes 15 milliseconds against the continuous aggregate.

A second policy makes it concrete how the rollup stays correct under late-arriving data. Suppose at time t = 12:30 an event from t = 12:25 arrives (network retry, queue lag, whatever). The insert lands in today's raw chunk. The engine writes an invalidation entry marking [12:25, 12:26] as dirty. The next refresh tick (one minute later) sees the invalidation entry, re-aggregates that single one-minute bucket from raw, and overwrites the corresponding row in the view. Total recomputation cost: 60 raw rows. The other 1439 buckets of the day are not touched. That is what "continuous" really means — incremental, invalidation-driven, never a from-scratch rebuild.

A second continuous aggregate, defined over the first one rather than over raw, gives you the hourly tier essentially for free:

CREATE MATERIALIZED VIEW orders_per_hour
    WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', bucket) AS hour_bucket,
       merchant_id,
       sum(order_count) AS order_count,
       average(rolling_agg(avg_amount_inr)) AS avg_amount_inr,
       rollup(latency_sketch) AS latency_sketch
FROM orders_per_minute
GROUP BY hour_bucket, merchant_id;

The rollup function merges t-digest sketches; the result is a sketch over the hour from which any percentile can be derived. The hourly aggregate is built from the minute aggregate, not from raw. Each tier's refresh cost is proportional to the tier above it, not to the raw stream. This is the structural property that makes the hierarchy economically sensible.

Prometheus recording rules

Prometheus achieves the same outcome through a different mechanism: recording rules. A rule file declares a query that the rule engine evaluates on a fixed schedule, writes the result back into the time-series database as a new metric, and serves dashboard queries against the rule output.

groups:
  - name: latency_rollups
    interval: 1m
    rules:
      - record: api:request_latency_seconds:p99_5m
        expr: histogram_quantile(0.99,
                sum(rate(api_request_latency_seconds_bucket[5m])) by (endpoint, le))
      - record: api:request_count:rate1m
        expr: sum(rate(api_request_count_total[1m])) by (endpoint, status)

Every minute, the Prometheus rule engine evaluates each expr and stores the result as a new metric named by the record field. Dashboards query the recorded metric directlyapi:request_latency_seconds:p99_5m — never running the underlying histogram_quantile over the raw histogram buckets at dashboard time. The naming convention (level:metric:operation) makes the layer explicit. Why this works for percentiles in Prometheus: Prometheus's underlying counter type for latency is a histogram with predefined buckets, and histogram_quantile over a sum of bucket rates is itself a mergeable computation. Recording rules essentially pre-sum-and-rate the buckets so dashboard queries are O(rules) rather than O(raw samples).

Recording rules are not as automatically incremental as TimescaleDB continuous aggregates — they re-run the full query each evaluation interval rather than tracking invalidation. But because Prometheus only retains a fixed window of raw data (typically 15 days) and evaluates over short look-back windows (5m, 1h), the cost is bounded. The conceptual model is identical: pre-compute the aggregate, store it, query it instead of raw.

InfluxDB tasks

InfluxDB 2.x calls the same idea tasks: scheduled Flux scripts that read from a raw bucket, aggregate, and write to a downsampled bucket.

option task = {name: "downsample_5m", every: 5m}

from(bucket: "telemetry_raw")
  |> range(start: -task.every)
  |> filter(fn: (r) => r._measurement == "api_latency")
  |> aggregateWindow(every: 5m, fn: mean)
  |> set(key: "_measurement", value: "api_latency_5m")
  |> to(bucket: "telemetry_5m_30d")

Every five minutes, the task aggregates the most recent five-minute window into a separate bucket (telemetry_5m_30d) which has its own retention policy. A second task can downsample the 5-minute bucket into an hourly bucket; a third into daily. The retention on each bucket is set independently — raw might be 24 hours, 5-minute might be 30 days, hourly might be a year. Dashboards select from the bucket whose resolution and retention match the query window, with Grafana templating handling the routing.

The InfluxDB task model is more flexible than continuous aggregates (Flux is a full programming language) but less automatically incremental — like Prometheus, you re-run the aggregation each schedule rather than tracking dirty buckets. For most monitoring workloads the difference is invisible; the tasks run in seconds either way.

The retention hierarchy

Once you have downsampling working, the next operational decision is how long to keep each tier. The standard pattern is a hierarchy where each tier has its own retention policy, and each tier is roughly 10x smaller than the one above it.

Retention hierarchyRetention hierarchy — each tier coarser, longer-lived, cheaper than the one aboveTier 1 - RAW (1 sec resolution, kept 7 days)- Storage: hot NVMe, columnar compressed (~4 TB for 100K events/sec)- Used for: incident debugging "what exactly happened at 14:23:07?", trace correlation, training dataTier 2 - MINUTE (1 min resolution, kept 30 days, ~60x smaller)- Storage: warm NVMe or local SSD, continuous aggregate (~70 GB)- Used for: real-time dashboards "p99 over the last day", alerting, week-over-week comparisonsTier 3 - HOUR (1 hour resolution, kept 1 year, ~3600x smaller than raw)- Storage: cheap SSD or warm S3, continuous aggregate of minute aggregate (~1.2 GB)- Used for: capacity planning, monthly trend dashboards, SLO compliance reportsTier 4 - DAY (1 day resolution, kept forever, ~86400x smaller than raw)- Storage: object storage (S3/Glacier), continuous aggregate of hourly aggregate (< 100 MB/year)- Used for: year-over-year analytics, board metrics, "compare 2024 Diwali traffic to 2026 Diwali"The query router (Grafana, Chronograf, custom): pick the coarsest tier whose resolution >= 1 pixelDashboards never know which tier they hit. Last-hour query -> minute. Last-month -> hour. Last-decade -> day.

The numbers are characteristic. At 100K events/sec for 7 days of raw data on a columnar TSDB, you store ~4 TB. The minute aggregate over 30 days is ~70 GB. Hourly over 1 year is ~1.2 GB. Daily forever is under 100 MB per year of additional storage. The total storage footprint of the analytics — across all tiers — is dominated by the raw tier; everything else is rounding error. Once raw rolls off the 7-day window, the lifetime cost of keeping the rolled-up data is essentially free.

The query router lives in the dashboard layer. Grafana with the TimescaleDB or Prometheus data source will (with templating or with explicit per-panel queries) select from the right tier based on the dashboard's time range. A common rule of thumb: pick the coarsest tier whose bucket size is no larger than one pixel-width of the chart at the current zoom level. A dashboard 1000 pixels wide showing a 30-day window has 30 days x 24 hours = 720 hours, well over one hour per pixel — hourly aggregate is the right tier. The same dashboard zoomed to 6 hours has 6 minutes per pixel, so minute aggregate. Zooming further into 30 minutes might warrant raw. Why match resolution to pixel width: rendering more data points than the chart has pixels is wasted work — the renderer aggregates anyway, just less efficiently. Matching upstream cuts the bytes transferred, the bytes processed, and the bytes scanned in storage all simultaneously.

A worked example: the Indian fintech dashboard

The PaisaBridge-scale fintech monitoring system

A fictional Indian fintech, Saral Pay, runs a UPI gateway used by 8,000 merchants across India. At peak — Diwali night, India-Pakistan cricket finals, salary day, GST filing deadline — the gateway processes 100,000 transactions per second. Every transaction emits a structured event: (ts, merchant_id, region, status, latency_ms, amount_inr, payment_method). The on-call SRE wants a dashboard that shows p99 latency per region, per hour, over the last 30 days, refreshing every 60 seconds. The product team wants a separate dashboard showing transaction volume per merchant, per day, year-to-date.

The raw data sizing. 100,000 events/sec x 86,400 sec/day x 30 days = 259 billion events. Even on the columnar TSDB layout from the previous chapter — with 30 daily chunks, columnar compression at 10x — the latency_ms column alone for 30 days is around 100 GB compressed. Scanning that for a single dashboard render is 30 to 90 seconds; doing it every 60 seconds for ten on-call engineers means the database CPU is 100% occupied serving one dashboard. Without downsampling, the dashboard is unusable.

The continuous aggregate setup. The platform team adds three views on top of the existing transactions hypertable:

-- Tier 2: minute aggregate, kept 30 days
CREATE MATERIALIZED VIEW txn_per_minute
    WITH (timescaledb.continuous) AS
SELECT time_bucket('1 minute', ts) AS bucket,
       region,
       merchant_id,
       count(*) AS txn_count,
       sum(amount_inr) AS total_inr,
       percentile_agg(latency_ms) AS latency_sketch
FROM transactions
GROUP BY bucket, region, merchant_id;

SELECT add_continuous_aggregate_policy('txn_per_minute',
    start_offset => INTERVAL '2 hours',
    end_offset   => INTERVAL '1 minute',
    schedule_interval => INTERVAL '1 minute');

SELECT add_retention_policy('txn_per_minute', INTERVAL '30 days');

-- Tier 3: hourly aggregate, kept 1 year, built from the minute aggregate
CREATE MATERIALIZED VIEW txn_per_hour
    WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', bucket) AS bucket,
       region,
       merchant_id,
       sum(txn_count) AS txn_count,
       sum(total_inr) AS total_inr,
       rollup(latency_sketch) AS latency_sketch
FROM txn_per_minute
GROUP BY bucket, region, merchant_id;

SELECT add_continuous_aggregate_policy('txn_per_hour',
    start_offset => INTERVAL '3 hours',
    end_offset   => INTERVAL '1 hour',
    schedule_interval => INTERVAL '1 hour');

SELECT add_retention_policy('txn_per_hour', INTERVAL '1 year');

-- Tier 4: daily aggregate, kept forever, built from hourly
CREATE MATERIALIZED VIEW txn_per_day
    WITH (timescaledb.continuous) AS
SELECT time_bucket('1 day', bucket) AS bucket,
       region,
       merchant_id,
       sum(txn_count) AS txn_count,
       sum(total_inr) AS total_inr,
       rollup(latency_sketch) AS latency_sketch
FROM txn_per_hour
GROUP BY bucket, region, merchant_id;

-- Raw retention: 7 days
SELECT add_retention_policy('transactions', INTERVAL '7 days');

The dashboard query. Grafana asks for "p99 latency per region, hourly, last 30 days":

SELECT bucket, region, approx_percentile(0.99, latency_sketch) AS p99_ms
FROM txn_per_hour
WHERE bucket > now() - INTERVAL '30 days'
GROUP BY bucket, region
ORDER BY bucket;

That returns 30 days x 24 hours x ~6 regions = 4,320 rows. Runtime: 35-50 milliseconds on a single TimescaleDB node, even on the second-tier aggregate. The query touches under 500 KB of data. The chart renders before the user releases the mouse from the time-range picker.

The contrast. The same query against raw data — the dashboard the SRE team would have written without downsampling — touches the entire 30-day raw latency_ms column (100 GB compressed, 1 TB uncompressed) and pushes 250 billion values through a percentile estimator. Best-case runtime on a beefy node: 30 minutes. Worst-case: hours. With a 60-second dashboard refresh, this is a non-starter even before considering that the t-digest sketch in the rollup actually gives a more accurate p99 than a sampling-based percentile over 250 billion raw points anyway.

The cost summary.

Tier Resolution Retention Storage Used for
Raw 1 event 7 days ~3 TB Incident debugging, trace lookup
Minute 1 minute 30 days ~50 GB Real-time dashboards, alerting
Hourly 1 hour 1 year ~1 GB Trends, capacity planning, SLOs
Daily 1 day Forever ~30 MB/year Year-over-year, executive reports

The total storage cost across all tiers is dominated 60-fold by the raw tier; the rolled-up tiers are essentially free. The query speedup for dashboard workloads is roughly 36,000x for the 30-day p99 query (50 ms vs 30 minutes); for the year-to-date merchant volume dashboard it is over a million-fold (raw scan would be infeasible; daily aggregate query is 8 ms).

The operational point. Saral Pay's monitoring stack runs on a single TimescaleDB node — no sharding, no separate analytics warehouse, no elastic compute spinning up for ad-hoc queries. The rollups + retention hierarchy do all the work. **The total monthly infrastructure cost is around Rs 3,00,000 (3,500), serving real-time dashboards for hundreds of engineers and product managers.** A naive "store raw forever, scan it every dashboard refresh" architecture for the same workload would cost upward of Rs 30,00,000 (35,000) per month even before considering that it would not actually meet the latency SLOs.

What you should take away

  1. Dashboards do not want raw data; they want shape. Pre-compute the shape (bucket-level aggregates) once, store it, and query it instead of raw. The win is structural, not a mere optimisation — it changes query complexity from O(raw points) to O(buckets).

  2. Continuous aggregates are materialised views with an invalidation log. TimescaleDB tracks which raw chunks have been written to since the last refresh and recomputes only the affected buckets. The result: refresh cost is proportional to write rate, not to total stored data.

  3. Mergeable aggregates (sums, mins, maxes, t-digest sketches) enable the rollup hierarchy. Per-minute aggregates merge into per-hour merge into per-day, no raw data needed at any level above the bottom. Non-mergeable aggregates (raw percentiles) break the hierarchy and force re-scanning.

  4. Prometheus recording rules and InfluxDB tasks are the same idea, slightly differently packaged. Both schedule a query, write the result back, and serve dashboards from the rollup. Prometheus's mechanism is less incremental but bounded by short look-back windows; InfluxDB's is more flexible (Flux scripts) but similarly schedule-driven.

  5. The retention hierarchy compounds the win. Raw 7 days, minute 30 days, hour 1 year, day forever. Each tier 10-100x smaller and cheaper than the one above. The total storage is dominated by the raw tier; everything above it is rounding error.

  6. The query router belongs in the dashboard layer. Grafana (or whatever) picks the tier whose bucket size matches the chart's pixel width. Coarser tiers for wider time ranges. The dashboard never knows which tier it hit; the user sees only "fast".

  7. Together with the columnar layout from the previous chapter, downsampling is what makes a single-node TSDB serve a workload that would otherwise need a multi-node analytics cluster. The 36,000x speedup on the 30-day p99 dashboard is not a benchmark; it is what falls out of replacing 250 billion-row scans with 720-row scans.

In the next chapter you will see how the retention hierarchy extends one step further — into tiered storage where old chunks physically migrate from NVMe to S3, and the compression regime gets more aggressive as data ages, taking the cost-per-stored-byte down by another two orders of magnitude.

Common confusions

  • "A continuous aggregate is just a Postgres materialised view." It is not. A vanilla Postgres MATERIALIZED VIEW has to be REFRESH MATERIALIZED VIEW-ed in full each time — every refresh re-scans the entire underlying table. A TimescaleDB continuous aggregate is itself a hypertable plus an invalidation log; the refresh job consults the log and recomputes only the buckets whose underlying chunks changed since the last refresh. The WITH (timescaledb.continuous) clause is the difference between an O(n) refresh and an O(dirty buckets) refresh.

  • "I can compute the hourly p99 from the per-minute p99s." No. The 99th percentile of the union of 60 one-minute windows is not the average of the 60 per-minute p99s, nor the max, nor any other simple combination of them — there is no formula that recovers it from the per-minute p99 numbers alone. This is exactly why TimescaleDB's percentile_agg and InfluxDB's t-digest aggregations store a sketch per bucket (a few hundred bytes of distribution summary), not the percentile itself. Sketches merge; percentiles do not.

  • "Downsampling and compression are the same thing." They are not. Compression keeps every raw value but encodes it in fewer bytes (delta-of-delta on timestamps, RLE on repeats, bit-packing on small ints — the techniques from chapter 165). Downsampling throws values away and replaces a window with a summary statistic. The two stack: a TSDB usually compresses raw chunks and maintains downsampled aggregates. Compression cuts storage 10x; downsampling cuts it 60x more. They are independent levers on the cost-per-query equation.

  • "The minute aggregate is correct as soon as the minute ends." No — there is a configurable settling window (end_offset in TimescaleDB, the lookback in Prometheus rules) during which the engine deliberately does not materialise a bucket, because late-arriving events from that minute might still land. Set end_offset => INTERVAL '1 minute' and the bucket for 12:34 is materialised at 12:36, not 12:35. Querying a too-recent bucket returns either nothing or partial data. This is intentional — the alternative is constantly recomputing the most recent bucket, or worse, returning answers that change underneath the dashboard.

  • "Recording rules in Prometheus are incremental." They are scheduled, not incremental. Each evaluation interval (typically one minute) the rule engine re-runs the full expr over the configured look-back window from scratch. The reason this is acceptable is that the look-back windows are short (5m, 1h) and the underlying TSDB can answer them in milliseconds. If you set a recording rule with [30d] look-back, you will discover the difference quickly — Prometheus will choke. TimescaleDB's invalidation-log model is genuinely incremental; Prometheus's is "fast enough because the windows are small."

  • "More tiers is always better." The opposite, usually. Each tier adds operational surface area: a refresh policy, a retention policy, a query-router rule. Three tiers (raw + minute + hour) covers most monitoring workloads; four (add daily) is needed if you want multi-year retention. Going beyond four is rarely justified — the storage savings of a 10-minute tier between 1-minute and 1-hour are tiny, and the routing complexity is real. Most real deployments at PaisaBridge-scale ingest run with three to four tiers, never more.

Going deeper

If you only needed to ship a downsampling pipeline, the recipe above — TimescaleDB continuous aggregates with a 7d/30d/1y/forever retention hierarchy — is enough. The rest of this section connects the mechanism to the academic and production lineage you will meet in real systems.

t-digest, HDR-histogram, and the mergeability question

The whole rollup hierarchy depends on aggregates being mergeable across buckets. For sums, counts, mins and maxes this is trivial — they are associative operations, so f(A ∪ B) = combine(f(A), f(B)) for free. For percentiles it is fundamentally not trivial: percentiles are not an associative reduction. The way out is a probabilistic data structure that approximates the distribution well enough that merging two structures, each summarising a window, gives a structure that summarises the union. The two dominant choices in production are:

  • t-digest (Dunning & Ertl, 2019, ref-4). A clustering of (centroid, weight) pairs that adaptively keeps more resolution at the tails (where percentiles like p99, p999 live) and less in the middle. Roughly 100 centroids — about 1 KB per sketch — gives p99 accuracy under 0.1% in practice. TimescaleDB's percentile_agg returns a t-digest; approx_percentile extracts a quantile from one. T-digests merge by concatenating centroid arrays and re-clustering.

  • HDR-histogram (Gil Tene's high dynamic range histogram). A pre-allocated array of fixed-width buckets on a log scale. Faster to update than t-digest, slightly less space-efficient, equally mergeable. Used by Prometheus's histogram_quantile and by the JVM ecosystem (Cassandra, Kafka client metrics).

The choice between them rarely matters at the user's level — both give p99 accuracy good enough for monitoring. What matters is that one of them is in use; without a sketch, the rollup hierarchy collapses back to "store the raw data forever," which defeats the entire idea.

Gorilla, the paper that made this approach mainstream

The retention-hierarchy idea was around for decades (RRDtool, the round-robin database from the late 1990s, did exactly this, with fixed-size circular buffers per resolution). What changed the engineering culture was Sociogram's Gorilla paper (Pelkonen et al., VLDB 2015, ref-5). Gorilla showed that a TSDB serving a billion writes per day at Sociogram could keep 26 hours of raw data in RAM by combining XOR-based delta encoding for floats with delta-of-delta for timestamps. The paper made two points the rest of the industry absorbed: (1) raw data does not need to be on disk if you compress it well enough, and (2) anything older than the in-memory window goes through downsampling, not bigger disks. Gorilla ran with a single in-memory tier and a coarser disk tier; modern TSDBs (TimescaleDB, InfluxDB, VictoriaMetrics, QuestDB) generalised that to the four-tier retention hierarchy you saw above. Reading the Gorilla paper after this chapter is the single highest-leverage three hours of time-series database reading you can spend.

Hierarchical continuous aggregates — the cascade

In the Saral Pay example you saw three views: minute → hour → day, each defined over the previous. This is called hierarchical continuous aggregates, supported in TimescaleDB since version 2.9 (ref-6). The structural property is that each tier's refresh cost is bounded by the write rate of the tier above it, not by raw write rate. The hourly aggregate, refreshed once per hour, has to merge ~60 minute-buckets per group key — milliseconds, regardless of how many billions of raw rows underlie those 60 minutes. The daily aggregate, refreshed once per day, merges 24 hourly buckets — microseconds. The cascade is what lets a single-node TSDB serve "year-to-date" queries that a flat single-tier rollup could never handle.

Why Prometheus chose recording rules over continuous aggregates

Prometheus is deliberately simple — there is no built-in invalidation log, no schema migrations, no continuous-view feature. The design call was that for monitoring data, late-arriving samples are rare (the scrape interval is fixed, the agents run on a clock), retention is short (15 days default), and look-back windows are bounded (5m, 15m, 1h). Under those constraints, "schedule the query, write the result, query the recording" gives you 95% of the value of continuous aggregates with 5% of the engine complexity. Where Prometheus pays the price is in long-range queries: a 30-day p99 dashboard against Prometheus alone is slow, which is why production setups federate Prometheus into a long-term TSDB (Thanos, Mimir, VictoriaMetrics) for queries beyond a week.

What goes wrong in production — the post-mortem patterns

Three failure modes recur in real PaisaBridge/BharatBazaar-scale deployments:

  • Late data dropped silently. An event arrives more than start_offset after the bucket it belongs to. Default TimescaleDB behaviour: ignore it; the aggregate row stays as it was. The fix is add_continuous_aggregate_policy with a generous start_offset (24 hours for batch-uploaded data; 5 minutes for live UPI events) and an alert on the count of dropped invalidations.

  • Aggregate definition mismatch with raw schema. You added a column to the transactions table but did not update the continuous aggregate definition. New writes go in fine; the aggregate keeps computing from the old projection; nobody notices for two months until the SRE runs SELECT amount_inr_with_gst FROM txn_per_minute and sees a column that does not exist. The fix is treating the aggregate definition as part of the schema migration, alongside the raw table change.

  • Refresh policy starvation. The minute-aggregate refresh and the hourly-aggregate refresh and the daily-aggregate refresh all kick off at the top of the hour, contending for the same workers. The minute aggregate, the most latency-sensitive, gets queued behind the daily refresh on the busiest day of the year. The fix is staggering schedules (schedule_interval offsets) and giving the minute aggregate a dedicated worker pool.

These are the kind of gotchas you only learn from running a production system through Diwali or Mega Bargain Days. The good news is that the underlying primitive — invalidation log + bucket merge — is sound; the failures are operational and learnable.

References

  1. TimescaleDB — Continuous aggregates (official docs)
  2. InfluxData — Tasks and downsampling with Flux
  3. Prometheus — Recording rules
  4. Dunning & Ertl — Computing Extremely Accurate Quantiles Using t-Digests (arXiv 2019)
  5. Pelkonen et al. (Sociogram) — Gorilla: A Fast, Scalable, In-Memory Time Series Database (VLDB 2015)
  6. TimescaleDB — Hierarchical continuous aggregates (blog)