Downsampling and continuous aggregates

In short

The previous chapter showed how time-partitioned columnar storage scans a year of metrics in milliseconds. This chapter is about the layer that sits on top of that storage and makes dashboards over months of data run as fast as dashboards over hours: downsampling combined with continuous aggregates.

The thesis in one line: dashboards almost never want raw 1-second data from six months ago — they want the shape of the data (average, p99, max, count) at a coarser time bucket — so pre-compute those aggregates once, store them, and never re-scan the raw data again.

Downsampling is the algorithm: divide raw data into time buckets (1-minute, 1-hour, 1-day) and reduce each bucket to a small set of summary statistics. Continuous aggregates are the machinery that keeps those buckets up to date as new raw data arrives — a materialised view that refreshes incrementally rather than from scratch. TimescaleDB ships them as a first-class feature; Prometheus achieves the same effect with recording rules evaluated by the rule engine; InfluxDB does it with scheduled tasks written in Flux. The mechanisms differ; the underlying idea — pre-aggregate, refresh incrementally, query the rollup not the raw — is identical.

The performance arithmetic is brutal in favour of pre-aggregation. A query for "p99 latency by hour over the last 30 days" against raw data at 100,000 events/sec scans 100,000 x 86,400 x 30 = 259 billion raw points. Even on a columnar TSDB doing 1 GB/sec of decompressed throughput, that is half an hour of CPU. The same query against a pre-computed hourly aggregate scans 30 x 24 = 720 rows. The aggregate runs in 50 milliseconds. That is a 36,000x speedup that costs almost nothing in storage — the hourly aggregate for 30 days is a few hundred kilobytes, against tens of GB of raw data.

The retention hierarchy compounds the win: keep raw data for 7 days (hot, expensive, queried rarely for incident debugging), minute aggregates for 30 days (warm, fast dashboards), hourly aggregates for a year (cold, capacity planning), daily aggregates forever (long-term trends). Each tier is 10-100x cheaper than the one above it. The same dashboard query routes to the appropriate tier transparently — the dashboard never knows.

This chapter walks through downsampling math, TimescaleDB's continuous aggregate machinery (with the SQL to set one up), Prometheus recording rules, InfluxDB tasks, and a worked example of an Indian fintech monitoring 100K transactions/sec.

The previous chapter ended with a single-node TimescaleDB hypertable answering a "p99 latency last hour" dashboard query in 200 milliseconds — fast enough for a 30-second dashboard refresh. The chapter showed why: chunk pruning narrowed 30 days to 1 chunk, columnar storage narrowed 14 columns to 1, per-column compression squeezed another 10x out of what was left. The combination produced a 2500x I/O reduction. It is one of the cleanest engineering wins in the database world.

But that win was for a query covering one hour. What happens when the SRE team wants to look at the same dashboard, but for the last month — to compare today's p99 against the same week last quarter, or to verify that a deployment four weeks ago really did reduce tail latency? The columnar layout helps, but the math gets uncomfortable. 30 days at 200K writes/second is 518 billion rows; even at the columnar layout's compression ratios, that is around 100 GB of compressed latency_ms data to scan. A query that streams 100 GB of decompressed doubles through a percentile estimator, even with vectorised execution, takes 30-90 seconds. That is not dashboard-acceptable. It is barely human-acceptable.

The fix is not to make the scan faster. The fix is to not scan the raw data at all. Almost no dashboard ever needs to know what the latency was at 14:23:07 IST on the third of March. Dashboards need the shape of latency at minute or hour granularity — the trend, the outliers, the diurnal pattern. If the database has already computed (hour, p99_latency_ms) for every hour in the last month and stored it as 720 rows, the dashboard query becomes "select 720 rows from a tiny table" — sub-50ms, regardless of how many billions of raw events are sitting underneath.

That is the entire idea behind downsampling. The rest of this chapter is about how to do it without making your life miserable: how to keep the rollups in sync with new data without re-aggregating from scratch (continuous aggregates), how to organise multiple resolutions in a tiered hierarchy (raw → minute → hour → day), and what the SQL actually looks like in TimescaleDB, the rule syntax in Prometheus, and the Flux task in InfluxDB.

What downsampling actually is

Downsampling is the operation of taking a high-resolution time-series and summarising it at a coarser resolution. Concretely: given raw data points (t_1, v_1), (t_2, v_2), ..., (t_N, v_N) arriving at one-second cadence, you choose a bucket size (say, one minute), partition the timeline into buckets [t_0, t_0 + 60s), [t_0 + 60s, t_0 + 120s), ..., and for each bucket compute one or more aggregate statistics — average, sum, min, max, count, p99 — over the raw points that fell inside that bucket. The output is a new, much smaller time-series at one-minute resolution.

Two design choices fall out immediately. The first is what to aggregate — for a single source metric like latency_ms, you typically want at minimum count, sum, min, max, and one or more percentiles (p50, p95, p99, p999). For a counter like requests_total, you typically want last_value (since the running total is itself an aggregate). For a gauge like cpu_percent, you want avg and max. The aggregates you pick determine which dashboard queries you can serve from the rollup vs falling back to raw. A dashboard that only needs averages can be answered from (t, avg) alone; one that needs p99 needs the percentile pre-computed.

The second design choice is whether percentiles can be combined across buckets. Sums are easy: the sum over an hour is the sum of the sums over the 60 minutes in that hour. Averages are slightly harder: you need to keep (sum, count) per bucket, then re-divide. Mins and maxes are easy: min-of-mins, max-of-maxes. Percentiles are hard. You cannot compute the p99 over an hour from the 60 per-minute p99s — that gives the wrong answer in general. You need to either keep the full distribution per bucket (expensive) or use a sketch like t-digest or HyperLogLog that supports merging across buckets. Why this matters for downsampling: TimescaleDB's percentile_agg and InfluxDB's percentile functions both build on t-digest exactly so that the per-minute aggregates can be merged into per-hour aggregates without re-scanning the raw data. Without a mergeable sketch, you would have to keep the raw data — which defeats the entire point.

The reason mergeable aggregates are the foundation of the whole approach is that they enable a hierarchy of rollups: minute aggregates merge into hourly aggregates merge into daily aggregates, each level computed from the level above without ever touching raw data. This is what lets you keep daily aggregates "forever" cheaply — they are themselves aggregates of hourly aggregates, which are themselves aggregates of minute aggregates.

The naive way and why it fails

The first time an engineer encounters the dashboard-is-too-slow problem, the instinct is to write a cron job:

-- Every minute, append one new row to the rollup.
INSERT INTO latency_per_minute (bucket, p99)
SELECT date_trunc('minute', ts) AS bucket,
       percentile_cont(0.99) WITHIN GROUP (ORDER BY latency_ms) AS p99
FROM events
WHERE ts >= date_trunc('minute', now()) - INTERVAL '1 minute'
  AND ts <  date_trunc('minute', now())
GROUP BY bucket;

This works. For a while. Three failure modes appear in production. First, late-arriving data: events whose timestamp is from three minutes ago but that hit the database now (because of buffering, retries, network delays) never make it into the rollup, because the cron only re-aggregates the most recent complete minute. Second, backfill: if you ingest historical data — a one-off load of last month's events — the rollup has no idea those rows exist and never recomputes the affected buckets. Third, schema or formula changes: if the SRE team decides they also want p999, you have to re-run the aggregation for the entire history, manually, in batches, hoping you do not double-insert.

The continuous-aggregate machinery solves all three by tracking which buckets have been touched by writes and recomputing only those. It is not a cron job; it is a structural feature of the storage engine. The next two sections describe how TimescaleDB and Prometheus implement it.

TimescaleDB continuous aggregates

A TimescaleDB continuous aggregate is a materialised view declared over a hypertable, with two pieces of magic the engine adds on top of plain Postgres materialised views. First, the view is itself stored as a hypertable — partitioned by bucket time — so queries against it benefit from chunk pruning just like queries against raw data. Second, the engine maintains an invalidation log that records which raw chunks have been written to since the view was last refreshed; the refresh job consults the log and recomputes only the affected buckets, not the whole view.

The setup is two SQL statements. First, declare the view:

CREATE MATERIALIZED VIEW orders_per_minute
    WITH (timescaledb.continuous) AS
SELECT time_bucket('1 minute', ts) AS bucket,
       merchant_id,
       count(*) AS order_count,
       avg(amount_inr) AS avg_amount_inr,
       percentile_agg(latency_ms) AS latency_sketch
FROM orders
GROUP BY bucket, merchant_id;

The time_bucket function is TimescaleDB's bucketing primitive — it floors a timestamp to the nearest minute (or any other interval). The percentile_agg function returns a t-digest sketch from which any percentile can later be extracted with approx_percentile(0.99, latency_sketch). Crucially, t-digests are mergeable: two sketches over disjoint windows can be combined without going back to the raw data. Why store the sketch and not just p99: because if you store only p99, you can never compute hourly p99 from the minute aggregates. Storing the t-digest sketch lets you derive p50, p95, p999 — anything — at any coarser resolution by merging.

The second statement adds a refresh policy:

SELECT add_continuous_aggregate_policy('orders_per_minute',
    start_offset => INTERVAL '1 hour',
    end_offset   => INTERVAL '1 minute',
    schedule_interval => INTERVAL '1 minute');

This says: every minute, refresh the view for the time range [now() - 1 hour, now() - 1 minute], but only for buckets whose underlying raw data has actually changed. The end_offset of one minute is intentional — it leaves a small "settling" window during which late-arriving data can still slot into the right bucket before that bucket is materialised. The start_offset of one hour is the maximum lateness the system will tolerate; data arriving more than an hour late will not retroactively update the view. These two parameters are the operationally important knobs, balancing freshness against resilience to late data.

Querying the view is a regular SELECT:

SELECT bucket, merchant_id, approx_percentile(0.99, latency_sketch) AS p99
FROM orders_per_minute
WHERE bucket > now() - INTERVAL '7 days'
  AND merchant_id = 12345
ORDER BY bucket;

Because the view is itself a hypertable partitioned by bucket, the same chunk-pruning machinery from the previous chapter applies. The query reads only the chunks of the view covering the last 7 days — about 10,000 rows (7 days x 1440 minutes/day) for one merchant. The dashboard query that took 30 seconds against raw data takes 15 milliseconds against the continuous aggregate.

A second policy makes it concrete how the rollup stays correct under late-arriving data. Suppose at time t = 12:30 an event from t = 12:25 arrives (network retry, queue lag, whatever). The insert lands in today's raw chunk. The engine writes an invalidation entry marking [12:25, 12:26] as dirty. The next refresh tick (one minute later) sees the invalidation entry, re-aggregates that single one-minute bucket from raw, and overwrites the corresponding row in the view. Total recomputation cost: 60 raw rows. The other 1439 buckets of the day are not touched. That is what "continuous" really means — incremental, invalidation-driven, never a from-scratch rebuild.

A second continuous aggregate, defined over the first one rather than over raw, gives you the hourly tier essentially for free:

CREATE MATERIALIZED VIEW orders_per_hour
    WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', bucket) AS hour_bucket,
       merchant_id,
       sum(order_count) AS order_count,
       average(rolling_agg(avg_amount_inr)) AS avg_amount_inr,
       rollup(latency_sketch) AS latency_sketch
FROM orders_per_minute
GROUP BY hour_bucket, merchant_id;

The rollup function merges t-digest sketches; the result is a sketch over the hour from which any percentile can be derived. The hourly aggregate is built from the minute aggregate, not from raw. Each tier's refresh cost is proportional to the tier above it, not to the raw stream. This is the structural property that makes the hierarchy economically sensible.

Prometheus recording rules

Prometheus achieves the same outcome through a different mechanism: recording rules. A rule file declares a query that the rule engine evaluates on a fixed schedule, writes the result back into the time-series database as a new metric, and serves dashboard queries against the rule output.

groups:
  - name: latency_rollups
    interval: 1m
    rules:
      - record: api:request_latency_seconds:p99_5m
        expr: histogram_quantile(0.99,
                sum(rate(api_request_latency_seconds_bucket[5m])) by (endpoint, le))
      - record: api:request_count:rate1m
        expr: sum(rate(api_request_count_total[1m])) by (endpoint, status)

Every minute, the Prometheus rule engine evaluates each expr and stores the result as a new metric named by the record field. Dashboards query the recorded metric directly — api:request_latency_seconds:p99_5m — never running the underlying histogram_quantile over the raw histogram buckets at dashboard time. The naming convention (level:metric:operation) makes the layer explicit. Why this works for percentiles in Prometheus: Prometheus's underlying counter type for latency is a histogram with predefined buckets, and histogram_quantile over a sum of bucket rates is itself a mergeable computation. Recording rules essentially pre-sum-and-rate the buckets so dashboard queries are O(rules) rather than O(raw samples).

Recording rules are not as automatically incremental as TimescaleDB continuous aggregates — they re-run the full query each evaluation interval rather than tracking invalidation. But because Prometheus only retains a fixed window of raw data (typically 15 days) and evaluates over short look-back windows (5m, 1h), the cost is bounded. The conceptual model is identical: pre-compute the aggregate, store it, query it instead of raw.

InfluxDB tasks

InfluxDB 2.x calls the same idea tasks: scheduled Flux scripts that read from a raw bucket, aggregate, and write to a downsampled bucket.

option task = {name: "downsample_5m", every: 5m}

from(bucket: "telemetry_raw")
  |> range(start: -task.every)
  |> filter(fn: (r) => r._measurement == "api_latency")
  |> aggregateWindow(every: 5m, fn: mean)
  |> set(key: "_measurement", value: "api_latency_5m")
  |> to(bucket: "telemetry_5m_30d")

Every five minutes, the task aggregates the most recent five-minute window into a separate bucket (telemetry_5m_30d) which has its own retention policy. A second task can downsample the 5-minute bucket into an hourly bucket; a third into daily. The retention on each bucket is set independently — raw might be 24 hours, 5-minute might be 30 days, hourly might be a year. Dashboards select from the bucket whose resolution and retention match the query window, with Grafana templating handling the routing.

The InfluxDB task model is more flexible than continuous aggregates (Flux is a full programming language) but less automatically incremental — like Prometheus, you re-run the aggregation each schedule rather than tracking dirty buckets. For most monitoring workloads the difference is invisible; the tasks run in seconds either way.

The retention hierarchy

Once you have downsampling working, the next operational decision is how long to keep each tier. The standard pattern is a hierarchy where each tier has its own retention policy, and each tier is roughly 10x smaller than the one above it.

The numbers are characteristic. At 100K events/sec for 7 days of raw data on a columnar TSDB, you store ~4 TB. The minute aggregate over 30 days is ~70 GB. Hourly over 1 year is ~1.2 GB. Daily forever is under 100 MB per year of additional storage. The total storage footprint of the analytics — across all tiers — is dominated by the raw tier; everything else is rounding error. Once raw rolls off the 7-day window, the lifetime cost of keeping the rolled-up data is essentially free.

The query router lives in the dashboard layer. Grafana with the TimescaleDB or Prometheus data source will (with templating or with explicit per-panel queries) select from the right tier based on the dashboard's time range. A common rule of thumb: pick the coarsest tier whose bucket size is no larger than one pixel-width of the chart at the current zoom level. A dashboard 1000 pixels wide showing a 30-day window has 30 days x 24 hours = 720 hours, well over one hour per pixel — hourly aggregate is the right tier. The same dashboard zoomed to 6 hours has 6 minutes per pixel, so minute aggregate. Zooming further into 30 minutes might warrant raw. Why match resolution to pixel width: rendering more data points than the chart has pixels is wasted work — the renderer aggregates anyway, just less efficiently. Matching upstream cuts the bytes transferred, the bytes processed, and the bytes scanned in storage all simultaneously.

A worked example: the Indian fintech dashboard

The Razorpay-scale fintech monitoring system

A fictional Indian fintech, Saral Pay, runs a UPI gateway used by 8,000 merchants across India. At peak — Diwali night, India-Pakistan cricket finals, salary day, GST filing deadline — the gateway processes 100,000 transactions per second. Every transaction emits a structured event: (ts, merchant_id, region, status, latency_ms, amount_inr, payment_method). The on-call SRE wants a dashboard that shows p99 latency per region, per hour, over the last 30 days, refreshing every 60 seconds. The product team wants a separate dashboard showing transaction volume per merchant, per day, year-to-date.

The raw data sizing. 100,000 events/sec x 86,400 sec/day x 30 days = 259 billion events. Even on the columnar TSDB layout from the previous chapter — with 30 daily chunks, columnar compression at 10x — the latency_ms column alone for 30 days is around 100 GB compressed. Scanning that for a single dashboard render is 30 to 90 seconds; doing it every 60 seconds for ten on-call engineers means the database CPU is 100% occupied serving one dashboard. Without downsampling, the dashboard is unusable.

The continuous aggregate setup. The platform team adds three views on top of the existing transactions hypertable:

-- Tier 2: minute aggregate, kept 30 days
CREATE MATERIALIZED VIEW txn_per_minute
    WITH (timescaledb.continuous) AS
SELECT time_bucket('1 minute', ts) AS bucket,
       region,
       merchant_id,
       count(*) AS txn_count,
       sum(amount_inr) AS total_inr,
       percentile_agg(latency_ms) AS latency_sketch
FROM transactions
GROUP BY bucket, region, merchant_id;

SELECT add_continuous_aggregate_policy('txn_per_minute',
    start_offset => INTERVAL '2 hours',
    end_offset   => INTERVAL '1 minute',
    schedule_interval => INTERVAL '1 minute');

SELECT add_retention_policy('txn_per_minute', INTERVAL '30 days');

-- Tier 3: hourly aggregate, kept 1 year, built from the minute aggregate
CREATE MATERIALIZED VIEW txn_per_hour
    WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', bucket) AS bucket,
       region,
       merchant_id,
       sum(txn_count) AS txn_count,
       sum(total_inr) AS total_inr,
       rollup(latency_sketch) AS latency_sketch
FROM txn_per_minute
GROUP BY bucket, region, merchant_id;

SELECT add_continuous_aggregate_policy('txn_per_hour',
    start_offset => INTERVAL '3 hours',
    end_offset   => INTERVAL '1 hour',
    schedule_interval => INTERVAL '1 hour');

SELECT add_retention_policy('txn_per_hour', INTERVAL '1 year');

-- Tier 4: daily aggregate, kept forever, built from hourly
CREATE MATERIALIZED VIEW txn_per_day
    WITH (timescaledb.continuous) AS
SELECT time_bucket('1 day', bucket) AS bucket,
       region,
       merchant_id,
       sum(txn_count) AS txn_count,
       sum(total_inr) AS total_inr,
       rollup(latency_sketch) AS latency_sketch
FROM txn_per_hour
GROUP BY bucket, region, merchant_id;

-- Raw retention: 7 days
SELECT add_retention_policy('transactions', INTERVAL '7 days');

The dashboard query. Grafana asks for "p99 latency per region, hourly, last 30 days":

SELECT bucket, region, approx_percentile(0.99, latency_sketch) AS p99_ms
FROM txn_per_hour
WHERE bucket > now() - INTERVAL '30 days'
GROUP BY bucket, region
ORDER BY bucket;

That returns 30 days x 24 hours x ~6 regions = 4,320 rows. Runtime: 35-50 milliseconds on a single TimescaleDB node, even on the second-tier aggregate. The query touches under 500 KB of data. The chart renders before the user releases the mouse from the time-range picker.

The contrast. The same query against raw data — the dashboard the SRE team would have written without downsampling — touches the entire 30-day raw latency_ms column (100 GB compressed, 1 TB uncompressed) and pushes 250 billion values through a percentile estimator. Best-case runtime on a beefy node: 30 minutes. Worst-case: hours. With a 60-second dashboard refresh, this is a non-starter even before considering that the t-digest sketch in the rollup actually gives a more accurate p99 than a sampling-based percentile over 250 billion raw points anyway.

The cost summary.

Tier	Resolution	Retention	Storage	Used for
Raw	1 event	7 days	~3 TB	Incident debugging, trace lookup
Minute	1 minute	30 days	~50 GB	Real-time dashboards, alerting
Hourly	1 hour	1 year	~1 GB	Trends, capacity planning, SLOs
Daily	1 day	Forever	~30 MB/year	Year-over-year, executive reports

The total storage cost across all tiers is dominated 60-fold by the raw tier; the rolled-up tiers are essentially free. The query speedup for dashboard workloads is roughly 36,000x for the 30-day p99 query (50 ms vs 30 minutes); for the year-to-date merchant volume dashboard it is over a million-fold (raw scan would be infeasible; daily aggregate query is 8 ms).

The operational point. Saral Pay's monitoring stack runs on a single TimescaleDB node — no sharding, no separate analytics warehouse, no elastic compute spinning up for ad-hoc queries. The rollups + retention hierarchy do all the work. **The total monthly infrastructure cost is around Rs 3,00,000 (3,500), serving real-time dashboards for hundreds of engineers and product managers.** A naive "store raw forever, scan it every dashboard refresh" architecture for the same workload would cost upward of Rs 30,00,000 (35,000) per month even before considering that it would not actually meet the latency SLOs.

What you should take away

Dashboards do not want raw data; they want shape. Pre-compute the shape (bucket-level aggregates) once, store it, and query it instead of raw. The win is structural, not a mere optimisation — it changes query complexity from O(raw points) to O(buckets).
Continuous aggregates are materialised views with an invalidation log. TimescaleDB tracks which raw chunks have been written to since the last refresh and recomputes only the affected buckets. The result: refresh cost is proportional to write rate, not to total stored data.
Mergeable aggregates (sums, mins, maxes, t-digest sketches) enable the rollup hierarchy. Per-minute aggregates merge into per-hour merge into per-day, no raw data needed at any level above the bottom. Non-mergeable aggregates (raw percentiles) break the hierarchy and force re-scanning.
Prometheus recording rules and InfluxDB tasks are the same idea, slightly differently packaged. Both schedule a query, write the result back, and serve dashboards from the rollup. Prometheus's mechanism is less incremental but bounded by short look-back windows; InfluxDB's is more flexible (Flux scripts) but similarly schedule-driven.
The retention hierarchy compounds the win. Raw 7 days, minute 30 days, hour 1 year, day forever. Each tier 10-100x smaller and cheaper than the one above. The total storage is dominated by the raw tier; everything above it is rounding error.
The query router belongs in the dashboard layer. Grafana (or whatever) picks the tier whose bucket size matches the chart's pixel width. Coarser tiers for wider time ranges. The dashboard never knows which tier it hit; the user sees only "fast".
Together with the columnar layout from the previous chapter, downsampling is what makes a single-node TSDB serve a workload that would otherwise need a multi-node analytics cluster. The 36,000x speedup on the 30-day p99 dashboard is not a benchmark; it is what falls out of replacing 250 billion-row scans with 720-row scans.

In the next chapter you will see how the retention hierarchy extends one step further — into tiered storage where old chunks physically migrate from NVMe to S3, and the compression regime gets more aggressive as data ages, taking the cost-per-stored-byte down by another two orders of magnitude.