# Rollups and continuous aggregates
Aditi at Swiggy is on the 03:00 IST shift and the dashboard that has worked for six months suddenly says the delivery-partner GPS-ping rate dropped to zero between 02:30 and 02:45. The raw `gps_pings` hypertable in TimescaleDB is fine — `SELECT count(*) FROM gps_pings WHERE ts > now() - interval '1 hour'` returns 4.1 million rows, the same as every other quiet hour. The dashboard reads from `gps_pings_5m` — a continuous aggregate. She runs `SELECT * FROM timescaledb_information.job_stats` and the `last_successful_finish` column for the refresh job says 02:29:14. The background job that refreshes the aggregate has been failing for sixteen minutes. The raw data is fine; the rolled-up view is sixteen minutes stale, and the dashboard is faithfully showing the staleness as zero pings.
This chapter is about the layer between raw samples and the rolled-up answer the dashboard reads. Downsampling was a chapter about what aggregates to keep when raw data ages out. This chapter is about the machinery that keeps an aggregate fresh while raw data is still arriving — rollups in the TSDB sense (Prometheus recording rules, Mimir's ruler-evaluated recording rules, VictoriaMetrics' vmalert recording rules) and continuous aggregates in the SQL sense (TimescaleDB, ClickHouse MaterializedView, Materialize, RisingWave). The two families look different, but the failure modes are the same: refresh lag, late-arriving data, watermark choice, and the silent operational debt of "the dashboard's been wrong for sixteen minutes and nobody noticed".
A rollup or continuous aggregate is a pre-computed summary of a high-cardinality time series, refreshed incrementally so dashboards do not re-aggregate billions of samples per query. The hard problem is not the math; it is freshness vs correctness — refresh too fast and you re-aggregate buckets that still have late data arriving, refresh too slowly and the dashboard is stale. Every system in this family — Prometheus recording rules, TimescaleDB continuous aggregates, ClickHouse materialised views, Materialize — exposes the same three knobs: refresh interval, lookback window, and watermark for late data. Get the knobs wrong and you ship wrong dashboards confidently.
## What a rollup actually is, in two flavours
A rollup is a derived time series whose values are aggregates of a base series over fixed-size buckets. The aggregate function is what differs (rate, sum, count, histogram_quantile); the bucket size is what determines the cost-vs-resolution trade-off. The interesting design question is when the bucket gets computed and how often it gets recomputed when late data arrives.
The TSDB-native flavour — Prometheus's recording rules — evaluates a PromQL expression every evaluation_interval (default 60 s) and writes the result as a new metric. The recording rule mhr_5m:request_rate = rate(http_requests_total[5m]) produces a new sample every 60 s, each sample being the 5-minute rate at that timestamp. The rule fires once per evaluation, blindly, without checking whether the underlying series has new samples. If the scrape that should have populated http_requests_total at the rule's evaluation time was 200 ms late, the rule produces a slightly-wrong sample — and never goes back to fix it.
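Concretely, a recording rule is a few lines of rule-file YAML, loaded via `rule_files` in `prometheus.yml`. A minimal sketch using this chapter's rule name (the group name and file path are illustrative):

```yaml
# rules/request_rates.yml — evaluated every `interval`; the result is appended
# as a new series named by `record`. No late-data correction ever happens.
groups:
  - name: request-rates
    interval: 60s            # per-group override of the global evaluation_interval
    rules:
      - record: mhr_5m:request_rate
        expr: rate(http_requests_total[5m])
```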
The SQL-native flavour — TimescaleDB's continuous aggregate — is a MATERIALIZED VIEW over a hypertable, with a refresh policy. The view definition is a SQL GROUP BY time_bucket(...) over the base hypertable. The refresh policy says "refresh every N minutes, looking back M minutes". Each refresh re-aggregates the buckets in the lookback window, so a sample that arrives 10 minutes late will eventually be reflected in the aggregate when the next refresh covers that bucket — if the lookback is wide enough to include it.
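In TimescaleDB DDL, those knobs map onto the arguments of `add_continuous_aggregate_policy`: `schedule_interval` is "refresh every N minutes", `start_offset` is the lookback, and `end_offset` keeps the refresh away from still-open buckets. A sketch against this chapter's `pings_5m` view (the full runnable setup is in the reproducibility footer):

```sql
-- "refresh every 5 minutes, looking back 30 minutes"
SELECT add_continuous_aggregate_policy('pings_5m',
  start_offset      => INTERVAL '30 minutes',  -- lookback: re-aggregate this far back
  end_offset        => INTERVAL '5 minutes',   -- leave the newest bucket alone until it closes
  schedule_interval => INTERVAL '5 minutes');  -- how often the refresh job runs
```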
| Prometheus recording rule | TimescaleDB continuous aggregate |
|---|---|
| every 60 s: eval the expr at now(), append one new sample | every 5 min: refresh the view over a 30-min lookback, touching only changed buckets |
| fires-and-forgets | re-aggregates the lookback window |
| no late-data correction | catches late data within the window |
| 1 sample per eval, never revisited | fixes past buckets if needed |
| cheap (~50 µs / rule) | heavier (depends on data volume) |
| answers: "what's the rate right now (assuming everything was on time)?" | answers: "what's the rate over the last hour, even if some of it landed late?" |
The two designs make different bets about what "freshness" means. Prometheus optimises for rate-of-evaluation — every minute, stamp a new value, move on. TimescaleDB optimises for eventual-consistency-with-the-base-table — the aggregate will agree with SELECT FROM base GROUP BY bucket within the refresh policy's lag budget. Each is right in its own world. Conflating them — running a Prometheus recording rule against a stream that has 5-minute late-arrival jitter — produces dashboards that quietly drift away from reality, and you only find out during the postmortem.
Why this is more than aesthetic preference: a recording rule that produced a wrong sample at minute 7 because a scrape was 200 ms late will still report that wrong sample in a year-long retention. The error is baked in. A continuous aggregate that processed minute 7 before the late data arrived will self-correct on the next refresh that covers minute 7 in its lookback window. The two systems make opposite trade-offs: Prometheus recording rules choose simplicity and constant cost over correctness-under-late-data; TimescaleDB CAs choose correctness over fixed-cost predictability. Picking one without understanding the trade-off — "we use recording rules because everyone does" — is how you ship dashboards that disagree with the source data and never figure out why.
## How a continuous aggregate refreshes — the actual mechanism
The mechanism is concrete and worth seeing once. A TimescaleDB continuous aggregate is two pieces: an internal materialisation hypertable (a regular hypertable that stores the aggregated rows) and a refresh policy (a TimescaleDB background-worker job that runs `CALL refresh_continuous_aggregate(...)` on a schedule). When the refresh runs, it computes which buckets fall inside the lookback window, and for each bucket recomputes the aggregate from the base hypertable, replacing the stored value in the materialisation hypertable.
The refresh is incremental: TimescaleDB tracks an internal invalidation log that records which time ranges in the base hypertable have had inserts, updates, or deletes since the last refresh. The refresh only re-aggregates the intersection of the invalidation log and the lookback window. A bucket that has not been touched is not recomputed. This is what makes it scale to a 13-month-tail hypertable: only the recent buckets are typically dirty, and the refresh cost is bounded by the recent data volume rather than the full history.
```python
# refresh_simulator.py — model the continuous-aggregate refresh loop
# pip install pandas
import datetime as dt
import random

import pandas as pd

# Simulate 30 minutes of Swiggy GPS-ping ingest, with realistic late-arrival
# jitter: 50 delivery partners pinging every 5 s. Some have flaky 4G and
# buffer pings.
random.seed(7)
START = dt.datetime(2026, 4, 25, 14, 0, 0)

events = []
for partner in range(50):
    for tick in range(360):  # 30 min * 60 / 5 = 360 pings
        event_ts = START + dt.timedelta(seconds=tick * 5)
        # 95% land within 2 s, 4% within 60 s, 1% within 600 s (tunnel buffering)
        r = random.random()
        if r < 0.95:
            lag = random.uniform(0.1, 2.0)
        elif r < 0.99:
            lag = random.uniform(2.0, 60.0)
        else:
            lag = random.uniform(60.0, 600.0)
        ingest_ts = event_ts + dt.timedelta(seconds=lag)
        events.append((event_ts, ingest_ts, partner))

# Sort by ingest time — that's the order the database sees them
events.sort(key=lambda e: e[1])
df = pd.DataFrame(events, columns=["event_ts", "ingest_ts", "partner_id"])

def truth_count(t_lo: dt.datetime, t_hi: dt.datetime) -> int:
    """Count of events with event_ts in [t_lo, t_hi) — the eventually-correct answer."""
    return int(((df.event_ts >= t_lo) & (df.event_ts < t_hi)).sum())

def ca_count(t_lo: dt.datetime, t_hi: dt.datetime, refresh_at: dt.datetime,
             lookback_min: int) -> int:
    """What the continuous aggregate would store after a refresh at refresh_at,
    with a `lookback_min`-minute lookback window."""
    lookback_lo = refresh_at - dt.timedelta(minutes=lookback_min)
    # Bucket is refreshed only if [t_lo, t_hi) touches the closed lookback
    # window [lookback_lo, refresh_at]; a bucket whose upper edge lands exactly
    # on the window's lower edge counts as inside it.
    if t_hi < lookback_lo or t_lo > refresh_at:
        return -1  # bucket not refreshed; would still hold old (possibly stale) value
    visible = (df.ingest_ts <= refresh_at) & (df.event_ts >= t_lo) & (df.event_ts < t_hi)
    return int(visible.sum())

# The 14:05–14:10 bucket — what it should be vs what each refresh policy stores
bucket_lo = START + dt.timedelta(minutes=5)
bucket_hi = START + dt.timedelta(minutes=10)
ground_truth = truth_count(bucket_lo, bucket_hi)
print(f"Bucket [{bucket_lo:%H:%M:%S} – {bucket_hi:%H:%M:%S})")
print(f"  ground truth (event_ts): {ground_truth} pings")
print()

# Three refresh policies — same bucket, different staleness
for refresh_min, lookback_min in [(5, 5), (5, 30), (15, 30)]:
    refresh_at = bucket_hi + dt.timedelta(minutes=refresh_min)
    snapshot = ca_count(bucket_lo, bucket_hi, refresh_at, lookback_min)
    drift = ground_truth - snapshot if snapshot >= 0 else "N/A (out of lookback)"
    print(f"  refresh @ {refresh_at:%H:%M:%S}, lookback {lookback_min}m: "
          f"stored={snapshot}, drift_from_truth={drift}")
```
```text
Bucket [14:05:00 – 14:10:00)
  ground truth (event_ts): 3000 pings

  refresh @ 14:15:00, lookback 5m: stored=2987, drift_from_truth=13
  refresh @ 14:15:00, lookback 30m: stored=2987, drift_from_truth=13
  refresh @ 14:25:00, lookback 30m: stored=3000, drift_from_truth=0
```
Per-line walkthrough. The line `for tick in range(360):  # 30 min * 60 / 5 = 360 pings` generates a 30-minute synthetic stream of GPS pings — 50 partners × 360 ticks = 18,000 events. Each event has both an `event_ts` (when the GPS reading was taken) and an `ingest_ts` (when the database sees it). The 1% of events that get a 60–600 second lag are the tunnel-buffered ones.

Why simulate late arrivals and not just ingest-time samples: a continuous aggregate on a base hypertable indexed by `event_ts` has a fundamentally different correctness model than one indexed by `ingest_ts`. If your TimescaleDB schema indexes by `ingest_ts` (the default if you use `now()` in inserts), the continuous aggregate will produce correct counts for [14:05, 14:10) immediately at 14:10, because the bucket boundary is in ingest-time space and all ingest-time-14:05–14:10 events have arrived by 14:10. If your schema indexes by `event_ts` (which is what you want for honest reporting), the bucket can still gain rows after 14:10 as late events arrive, and the refresh has to re-aggregate. Most teams pick `event_ts` and are then surprised when the count drifts; a few pick `ingest_ts` and are then surprised when end-of-day reports do not match their source of truth.
The function `ca_count(t_lo, t_hi, refresh_at, lookback_min)` is the model of how a TimescaleDB CA computes a bucket value at refresh time. It only counts events whose `ingest_ts <= refresh_at` (the database can only see arrived events) and whose `event_ts` falls in the bucket. The `if t_hi < lookback_lo` early-exit captures a real failure mode: buckets older than the lookback are never re-aggregated, so any late event arriving for those buckets is silently lost from the materialised view. This is the trap the lookback-window choice has to avoid.
The output rows show three policies. With 5-minute refresh and 5-minute lookback, the bucket gets refreshed once, at 14:15, and captures 2,987 of the 3,000 events; the 13 stragglers (more than roughly 5 minutes late) are never picked up, because by the next refresh at 14:20 the bucket has already fallen out of the 5-minute lookback. With 5-minute refresh and 30-minute lookback the first refresh sees the same 2,987 events, but subsequent refreshes keep re-covering the bucket until 14:40 and pick up stragglers as they land. With 15-minute refresh and 30-minute lookback, the first refresh at 14:25 catches all 3,000 events: the extra ten minutes of wall-clock delay let even the tunnel-buffered tail (capped at ten minutes in this simulation) arrive. Anything later than the lookback's reach at the final covering refresh would still be lost.
The takeaway is that the lookback window must exceed the 99.9th percentile of your late-arrival distribution, not the 99th. If 1% of pings arrive 5+ minutes late, a 5-minute lookback loses the entire long tail. If 0.1% arrive 30+ minutes late, a 30-minute lookback keeps the visible drift below 0.1%, which is what most operational dashboards care about. If your late-arrival distribution has a 99.99th-percentile that exceeds your lookback, the side-output / quarantine lane that late-arriving data and the backfill problem describes is what catches what the lookback drops.
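A quick way to pick that number is to measure the quantiles directly. The sketch below reuses the simulator's three-tier jitter model; with production data you would compute `ingest_ts - event_ts` over a representative day instead:

```python
# lookback_from_lags.py — derive a lookback from the late-arrival distribution.
# Stand-in lag model; in production, sample real ingest_ts - event_ts pairs.
import random
random.seed(7)

def lag_seconds() -> float:
    r = random.random()
    if r < 0.95:
        return random.uniform(0.1, 2.0)
    if r < 0.99:
        return random.uniform(2.0, 60.0)
    return random.uniform(60.0, 600.0)

lags = sorted(lag_seconds() for _ in range(100_000))
for q in (0.99, 0.999, 0.9999):
    print(f"p{100 * q:g} lag: {lags[int(q * (len(lags) - 1))]:6.1f} s")

# Rule of thumb from the text: lookback >= p99.9 lag, rounded up to a bucket edge
bucket_s = 300  # 5-minute buckets
p999 = lags[int(0.999 * (len(lags) - 1))]
lookback_s = -(-int(p999) // bucket_s) * bucket_s  # ceil to bucket boundary
print(f"suggested lookback: {lookback_s // 60} min")
```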
## When the refresh job stalls — the 03:00 IST failure mode
The lookback math assumes the refresh job is actually running. When the job scheduler stalls — a long-running statement holds a lock, the background worker dies, the job queue backs up — the materialised view freezes at its last-known value. The base hypertable continues to receive writes; the continuous aggregate does not. The dashboard reads from the aggregate and shows a stale picture. The user-perceived effect is that the metric "drops to zero" or "flattens out", but the source data is fine.
Why "alert on the value" is the trap: the natural-feeling alert rule on a GPS-ping-rate dashboard is gps_ping_rate < 100k for 5m — page when the rate falls below the floor. That rule fires on this incident, sixteen minutes too late, and on every legitimate quiet-period (Sunday at 04:00, scheduled maintenance, low-traffic festival days). The signal-to-noise ratio is terrible. The right alert — time() - last_run_finished_at > 600 — fires only when the machinery breaks, never on legitimate quiet periods, and tells you exactly which CA is stalled. The principle generalises: alerts on plumbing health (refresh lag, queue depth, replication lag) catch the failure modes; alerts on output values catch the symptoms after they have already fooled the dashboard.
This is the failure mode Aditi was looking at when she started the chapter. The diagnostic ladder is:
1. `SELECT * FROM timescaledb_information.job_stats;` — shows `last_run_status`, `last_successful_finish`, and `last_run_duration` for every background job. If `last_run_status = 'Failed'`, the aggregate is not refreshing.
2. `SELECT * FROM timescaledb_information.jobs WHERE proc_name = 'policy_refresh_continuous_aggregate';` — confirms the policy is scheduled. A `next_start` (in `job_stats`) that is in the past means the job scheduler is stuck.
3. `SELECT * FROM pg_locks WHERE granted = false;` — shows blocked lock waits. A long-running query holding an `AccessShareLock` on the base hypertable, or an in-flight schema-change DDL holding an `AccessExclusiveLock`, can block the locks the refresh needs.
4. `tail -f $PGLOG | grep -i continuous_aggregate` — the actual error from the failed refresh, often "deadlock detected" or "out of shared memory".
The fix is mechanical (kill the offending session, restart the background job, re-run the refresh manually). The harder problem is noticing that the aggregate is stale at all. The dashboard does not say "stale" — it says "value = 0", which looks identical to "system is quiet". The right alert is on `now() - last_successful_finish > 2 * schedule_interval`, not on the aggregate's value. This is the alert that paged Aditi sixteen minutes after the actual stall — a staleness-seconds PromQL alert with a 5-minute `for:` duration.
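Wired into Prometheus, the alert is a few lines of alerting-rule YAML. A sketch, assuming the probe above is exported as `timescaledb_ca_staleness_seconds` (a metric name chosen here, not a standard one):

```yaml
groups:
  - name: ca-freshness
    rules:
      - alert: ContinuousAggregateStale
        # 2x the 5-minute schedule_interval, held for 5m to ride out slow refreshes
        expr: timescaledb_ca_staleness_seconds > 600
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "CA refresh job {{ $labels.job_id }} stale for {{ $value }}s"
```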
The same failure mode exists in every system in this family. Prometheus recording rules can fall behind if the rule-evaluation queue is saturated (`prometheus_rule_evaluation_failures_total` and `prometheus_rule_group_iterations_missed_total` are the counters to watch). ClickHouse materialised views are populated as part of each INSERT, so they lag only when the ingest pipeline itself is back-pressured. Flink SQL and RisingWave have explicit watermark semantics with a lateness configuration, and Materialize reports progress as a per-source frontier. Every system has the signal; not every team alerts on it.
## Recording rules vs continuous aggregates — when to pick which
The choice is rarely binary. Production observability stacks at Indian-scale companies typically have both. The decision factor is whether the late-arrival distribution matters for the question being asked.
| You want… | Use this |
|---|---|
| A dashboard panel for "p99 latency over the last 1 hour, refreshed every 15 s, near-real-time" | Prometheus recording rule (mhr_5m:p99 = histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))). Late samples don't matter; you'll see the new ones in the next eval. |
| An end-of-day correctness report ("daily request count by service, must match the source") | TimescaleDB continuous aggregate over event_ts, with a 6-hour lookback to catch upstream-pipeline-lag. Late samples must be reflected. |
| A capacity-planning chart for the last 13 months at 1-hour resolution | TimescaleDB continuous aggregate, possibly chained (5-min CA → 1-hour CA), with the long retention being the value. |
| An alert on "error rate > 1% sustained for 5 minutes" | Prometheus recording rule (cheap, fast eval). Backfill correctness doesn't matter for an alert that fires on 5-minute timescales. |
| A financial reconciliation between metrics and the source-of-truth ledger | Continuous aggregate, no question. Recording rules will lose late events and the reconciliation will diverge. |
| A near-real-time RED-method dashboard where a 30s lag is acceptable | Recording rule. Refresh interval 30s, eval cost 50 µs. |
The pattern: recording rules for alert-grade and dashboard-grade where freshness matters more than correctness-under-late-data; continuous aggregates for report-grade and reconciliation-grade where correctness matters more than freshness. Mismatching the two — running a recording rule for a financial report, or running a continuous aggregate for a 30-second alert — is the source of most "the dashboard is wrong but the data is right" tickets.
There is one more dimension to the choice: operational ownership. Recording rules live in prometheus.yml (or in the PrometheusRule CRD on Kubernetes) and are owned by the Prometheus operator team. Continuous aggregates live in PostgreSQL DDL and are owned by whoever owns the TimescaleDB schema — typically a different team. The ownership boundary determines who gets paged when the rollup breaks. A team that has only Prometheus on-call rotation will struggle to debug a stalled CA at 03:00; a team that has both will have already considered which tool fits which workload.
A surprising number of teams reach for a continuous aggregate when a recording rule would have done. The reason is that CAs feel more rigorous — they sit in a SQL database, they have ACID semantics, they show up in EXPLAIN. That intuition is right for reconciliation reports and wrong for dashboards. A near-real-time RED dashboard refreshed every 30 s is a classic recording-rule workload; the moment you put a 5-minute-refresh CA behind it, the dashboard is "30 seconds stale plus up to 5 minutes refresh delay" — most of the time, the dashboard is reporting numbers that are 3+ minutes old. Engineers stare at the panel, refresh the browser, and conclude that the system is "slow to update" — when in fact the CA is doing exactly what it was configured to do.
The mirror trap is using a recording rule for a report. The Razorpay finance team once tried to drive a daily-reconciliation dashboard from a recording rule that computed `sum(increase(payment_total[1d]))`. The number disagreed with the source-of-truth ledger by 0.4% every day. Investigation found that the recording rule, evaluated once per day at 23:59:30, missed the 30 seconds of payments that arrived in the last half-minute plus the small set of payments that came in late from PSP-side webhooks. A continuous aggregate over the same data, with a 12-hour lookback, agreed with the ledger to within 0.001%. The CA was the right tool; the recording rule was an alert tool dressed up as a report tool.
## Common confusions
- "Recording rules and continuous aggregates compute the same thing." They compute the same function (an aggregate over a time window) but on different snapshots. A recording rule computes the aggregate at evaluation time, blindly. A continuous aggregate computes it over whatever data exists in the base table at refresh time, which may include events that arrived after their event-time bucket closed. The first is a stream-evaluation model; the second is an incremental-batch model. They produce identical results only when there is zero late data — which is almost never true at production scale.
- "A continuous aggregate is faster to query than the base table." It is faster to query for the buckets that have been materialised, but a query that hits an unmaterialised time range falls back to scanning the base table — and TimescaleDB's
real_time_aggregationsetting decides whether that fallback is automatic or whether you get an empty result. Confusing the two means dashboard queries silently change shape: fast for cached ranges, slow for fresh data. - "Bigger lookback windows are always better." No. A 24-hour lookback on a 5-minute refresh re-aggregates 24 hours of buckets every 5 minutes — that is 288× the work of a 5-minute lookback. If your late-arrival distribution's 99.9th percentile is 10 minutes, a 30-minute lookback is right; a 24-hour lookback wastes 47× the CPU for zero correctness benefit. The lookback should match the late-arrival distribution, not the retention period.
- "Continuous aggregates and downsampling are the same thing." Adjacent but distinct. Downsampling is retention-tier compaction — it deletes raw samples and replaces them with summaries when data ages past a threshold. Continuous aggregates are query-time pre-computation — they sit alongside the raw data and accelerate queries. You can have both: raw → CA at 5-minute resolution → downsampled CA at 1-hour resolution → archived. The Thanos compactor design and the TimescaleDB CA design converge at the long tail; the difference is whether the raw data sticks around (CA: yes, until retention deletes it; downsampling: no).
- "Materialized views in PostgreSQL are continuous aggregates." Plain PostgreSQL
MATERIALIZED VIEWs are not incremental —REFRESH MATERIALIZED VIEWre-runs the entire query against the entire base table, every time. On a 13-month metrics table that means scanning billions of rows per refresh, which is exactly what TimescaleDB's invalidation-log + bucket-incremental design avoids. Calling a plain MV a continuous aggregate is the source of many "we tried it and it didn't scale" stories that end with the team blaming PostgreSQL when the actual problem was using the wrong tool. - "Streaming systems like Materialize don't need a lookback." They have one — it is just spelled
WATERMARKwith aWITH lateness '5 minutes'clause. A Materialize view that doesGROUP BY tumble(...)is implicitly bound to its source's watermark, and events older than the watermark are dropped. The configuration knob is in a different syntax, but the trade-off — late data correctness vs latency — is identical to TimescaleDB's lookback choice.
## Going deeper

### The invalidation log — why CA refreshes scale
TimescaleDB's invalidation log is the load-bearing piece of the incremental-refresh design. Every INSERT / UPDATE / DELETE on a hypertable that has a continuous aggregate registered against it is tracked via two internal catalog tables: `_timescaledb_catalog.continuous_aggs_invalidation_threshold` (the per-hypertable watermark below which changes must be logged) and `_timescaledb_catalog.continuous_aggs_hypertable_invalidation_log` (the dirty-time-range log). The refresh job consumes the dirty-range log: it computes the union of dirty time ranges within the lookback window, re-runs the CA's GROUP BY only over that union, and updates the materialisation hypertable in a single transaction.
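You can watch the log fill and drain. A sketch against the internal catalog (table and column names as in TimescaleDB 2.x; internal schemas are not a stable API):

```sql
-- Dirty ranges waiting to be consumed by the next refresh. Values are the
-- hypertable's native time representation (microseconds since epoch for
-- timestamptz hypertables).
SELECT hypertable_id, lowest_modified_value, greatest_modified_value
FROM _timescaledb_catalog.continuous_aggs_hypertable_invalidation_log;
```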
The cost of this design is write amplification on the ingest path: every write transaction that touches times below the threshold adds an entry to the invalidation log, and the log is only pruned as refreshes consume it. At 1.5M inserts/sec (Swiggy's GPS-ping rate), a stalled refresh — the same scheduler failure mode as above — means the log grows unboundedly, and an unpruned log can fill the database's disk within a day. This is the second-most-common production CA failure after refresh-stall.
### Hierarchical continuous aggregates — chaining for long retention
A single CA over 13 months of base data, computing 5-minute buckets, materialises 13 × 30 × 24 × 12 = ~112,320 buckets per series. For 10,000 series that is 1.1 billion materialised rows. The query latency on the 13-month dashboard grows with the materialised row count, eventually crossing the threshold where the CA is no longer faster than the base table.
The fix is hierarchical CAs: a 5-minute CA over the base table, a 1-hour CA over the 5-minute CA, a 1-day CA over the 1-hour CA. TimescaleDB supports this directly — the FROM clause of a CA can name another CA. Refresh policies cascade: a refresh of the 1-hour CA pulls from the (already-refreshed) 5-minute CA, which pulled from the base. Late data propagates through the hierarchy on subsequent refreshes, with each level's lookback choosing how much late data to absorb.
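Sketched against the footer's `pings_5m` view, the second level looks like this; counts roll up by summing the 5-minute counts:

```sql
-- 1-hour rollup built on the 5-minute rollup, not on the raw hypertable
CREATE MATERIALIZED VIEW pings_1h WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', b) AS b, sum(n) AS n
FROM pings_5m
GROUP BY 1;

-- Wider lookback: absorb late data as it propagates up from pings_5m
SELECT add_continuous_aggregate_policy('pings_1h',
  start_offset      => INTERVAL '6 hours',
  end_offset        => INTERVAL '1 hour',
  schedule_interval => INTERVAL '1 hour');
```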
The Hotstar metrics platform uses this hierarchy: 30-second base scrape → 5-minute CA (90-day retention, 30-min lookback) → 1-hour CA (13-month retention, 6-hour lookback) → 1-day CA (5-year retention, 24-hour lookback). The 1-day CA is what feeds the year-over-year capacity-planning charts; querying it is two orders of magnitude faster than querying the 1-hour CA over the same range.
### When the answer is "neither" — streaming SQL with explicit watermarks
Some workloads do not fit either Prometheus recording rules (too coarse a refresh model) or TimescaleDB CAs (refresh granularity is at best per-second, in practice per-minute). Real-time fraud scoring at Cred, real-time leaderboard updates at Dream11 during a T20 toss, and intraday position-management at Zerodha all want sub-second materialised aggregates with explicit late-data semantics.
The systems that fit are streaming SQL engines with explicit watermark semantics: Materialize (incremental, dataflow-based), RisingWave (incremental, OLAP-flavoured), Apache Flink SQL, ksqlDB. The user writes a materialised view over one-second tumbling windows, declares a watermark such as `WATERMARK FOR ts AS ts - INTERVAL '1' SECOND` on the source, and the engine emits one row per bucket per second. The watermark choice is the same trade-off as TimescaleDB's lookback or Prometheus's `evaluation_interval` — too tight and late events are dropped, too loose and freshness suffers — but expressed at the granularity of streaming systems (sub-second) instead of batch-incremental ones (minute).
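In Flink SQL (RisingWave's dialect is close), the watermark is declared on the source table and the windowed view reads from it. A sketch with connector details elided:

```sql
CREATE TABLE orders (
    order_id BIGINT,
    ts       TIMESTAMP(3),
    -- The watermark IS the lateness knob: events more than 1 s behind
    -- the observed maximum event time count as late.
    WATERMARK FOR ts AS ts - INTERVAL '1' SECOND
) WITH ('connector' = 'kafka');   -- plus topic/format options, elided here

CREATE VIEW order_rate AS
SELECT window_start, COUNT(*) AS orders_per_second
FROM TABLE(TUMBLE(TABLE orders, DESCRIPTOR(ts), INTERVAL '1' SECOND))
GROUP BY window_start, window_end;
```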
The boundary between the three families is not about scale; all three handle Indian-production scale. It is about the freshness budget. Sub-second freshness needs Materialize / Flink. Minute-grade needs continuous aggregates. Hour-grade can use either, with operational simplicity (Prometheus recording rules) winning for shorter retention.
What "real-time aggregation" silently does to your queries
TimescaleDB CAs have a per-view setting, `timescaledb.materialized_only`; when it is false (the long-standing default, a mode called real-time aggregation), `SELECT * FROM pings_5m WHERE b > now() - interval '10 minutes'` does not return only materialised buckets — it returns the materialised buckets plus a live aggregation over the base hypertable for the unmaterialised tail. The query plan is an Append of the materialisation-hypertable scan and the real-time aggregation subquery. The reader sees fresh-looking numbers and assumes the CA is fast.
The catch: the real-time portion of the query is computed on the fly, scanning the base table. If the unmaterialised tail is hours wide (because the refresh has been failing, or because the lookback / refresh schedule misses the tail), the "fast CA query" silently turns into a near-full base-table scan, and the dashboard P99 jumps from 80 ms to 8 seconds without any code change. Most teams discover this in production when a dashboard suddenly starts timing out. The fix is to set `timescaledb.materialized_only = true` on CAs behind dashboards that need predictable latency, and to alert on refresh lag instead — but the surprise is the default.
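The toggle itself is one statement against the footer's example view:

```sql
-- Serve only materialised buckets: predictable latency, bounded staleness
ALTER MATERIALIZED VIEW pings_5m SET (timescaledb.materialized_only = true);
```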
The Hotstar metrics team learned this during the IPL final: a CA they thought was being queried in 50 ms was actually being computed live for the most-recent 90 minutes (the refresh policy ran every 2 hours), turning every dashboard load into a 4-second hypertable scan. Switching to materialized_only = true plus a 5-minute refresh policy dropped the dashboard latency by 40× and made the load on the database predictable.
### Cost accounting — refresh CPU as a budget line
The CPU cost of a continuous-aggregate refresh is O(buckets-in-lookback × series-changed × per-bucket-aggregate-cost). For a Hotstar-scale workload — 10,000 services × 60 pods = 600,000 series, and a 30-minute lookback at 12 five-minute buckets per hour = 6 buckets — a single refresh recomputes ~3.6 million bucket-rows. At 5 µs per bucket-aggregate (the typical cost for a count + sum aggregate on a compressed columnar chunk) that is 18 seconds of CPU per refresh. Refreshing every 5 minutes means 18/300 = 6% of one CPU core, continuously, per CA. Stack five CAs per database (5-min, 1-hour, 1-day, plus two derived ones) and the CA refresh job is the dominant CPU consumer on the database.
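The arithmetic as a scratchpad; every input below is the paragraph's assumed number, not a measurement:

```python
# refresh_cpu_budget.py — steady-state CPU share of one CA's refresh policy
series = 10_000 * 60                 # services x pods
buckets_in_lookback = 30 // 5        # 30-min lookback / 5-min buckets
cost_per_bucket_s = 5e-6             # ~5 us per count+sum on compressed chunks
schedule_interval_s = 300            # refresh every 5 minutes

cpu_per_refresh_s = series * buckets_in_lookback * cost_per_bucket_s
print(f"CPU per refresh:   {cpu_per_refresh_s:.1f} s")                       # 18.0 s
print(f"share of one core: {cpu_per_refresh_s / schedule_interval_s:.1%}")   # 6.0%
```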
Two mitigations move the dial. First, partial aggregates — TimescaleDB can materialise partial aggregation states rather than finalised values, so a refresh merges cheap partial states instead of recomputing each aggregate from raw rows. Second, scheduling alignment — running multiple CAs on disjoint refresh offsets (CA-A at minute 0, CA-B at minute 1, CA-C at minute 2) avoids piling all five into the same scheduler slot and contending for shared `work_mem`. Mimir's ruler solves the same problem with explicit per-tenant rule-eval scheduling. The pattern is general: incremental-view-maintenance CPU is a budget line, and the budget is allocated by scheduling.
## Reproducibility footer
```bash
# Reproduce on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install pandas
python3 refresh_simulator.py

# To see real continuous-aggregate behaviour:
docker run -d --name ts -p 5432:5432 -e POSTGRES_PASSWORD=ts \
  timescale/timescaledb:latest-pg14

# Feed the DDL via stdin so each statement runs in its own transaction;
# creating a continuous aggregate cannot run inside the single transaction
# block that a multi-statement `psql -c "..."` would use.
docker exec -i ts psql -U postgres <<'SQL'
CREATE EXTENSION IF NOT EXISTS timescaledb;
CREATE TABLE pings (event_ts timestamptz, partner_id int);
SELECT create_hypertable('pings', 'event_ts');
CREATE MATERIALIZED VIEW pings_5m WITH (timescaledb.continuous) AS
SELECT time_bucket('5 minutes', event_ts) AS b, count(*) AS n
FROM pings GROUP BY b;
SELECT add_continuous_aggregate_policy('pings_5m',
  start_offset      => INTERVAL '30 minutes',
  end_offset        => INTERVAL '5 minutes',
  schedule_interval => INTERVAL '5 minutes');
SQL

# Then watch the refresh job:
docker exec -it ts psql -U postgres -c \
  "SELECT job_id, last_run_status, last_successful_finish
   FROM timescaledb_information.job_stats;"
```
## Where this leads next
Rollups and continuous aggregates are the query-time answer to the same problem downsampling solves at retention time. Both throw away resolution to make queries cheap; the difference is when. The full economic playbook of long-retention observability uses both: continuous aggregates for the recent-but-pre-aggregated tier, downsampling for the deep-historical tier, and raw samples for the alert-evaluation tier. The economics in 2026 are roughly: raw samples cost ~₹15/GB/month on AWS gp3, CA-aggregated samples ~₹2/GB/month, downsampled archival ~₹0.05/GB/month on S3 Standard-IA. A platform that uses all three tiers in proportion to query frequency lands at about 6–8% of the cost of an all-raw retention policy.
- Downsampling for long retention — the retention-time sibling; deletes raw data, keeps envelopes.
- Promscale and TimescaleDB hypertables — the SQL-flavoured TSDB on which CAs are built.
- Late-arriving data and the backfill problem — the cross-curriculum framing of the same lookback-vs-watermark trade-off.
- Why high-cardinality labels break TSDBs — the cardinality cost that CAs do not fix.
- Cardinality limits in Prometheus, Datadog, Honeycomb — recording rules as a cardinality-reduction technique, and where they break.
The single insight worth taking away: a rollup or continuous aggregate is a contract between you and the future-you-who-reads-the-dashboard about what "the answer for that time bucket" means. Recording rules sign the contract once, at evaluation time, and never revisit. Continuous aggregates sign it periodically, with a lookback window saying "I will revisit my answer for buckets within this distance from now." The mistake is signing the wrong contract — using a recording rule for a question that needs late-data correctness, or using a continuous aggregate for a question that needs sub-second freshness — and then being surprised when the dashboard does not match the source data three months later.
The corollary worth keeping pinned to the wall: the freshness of a rolled-up dashboard is bounded by `schedule_interval + end_offset + 2σ(refresh_duration)`, not by the dashboard's auto-refresh setting. A Grafana panel that auto-refreshes every 10 seconds is reading from a CA that updates every 5 minutes — the panel is showing 0 to 5 minutes of staleness, not 10 seconds. Communicate that to dashboard consumers explicitly (footer text, panel title prefix, separate "last refresh" panel) or accept that they will misread the dashboard. Aditi's 03:00 IST page would have been five minutes earlier — and her hour shorter — if the dashboard panel had said "last update: 02:29:14" in the corner.
## References
- TimescaleDB continuous aggregates documentation — the canonical reference for refresh policies, lookback offsets, and hierarchical CAs.
- Korhonen, "Continuous aggregates and recording rules side by side" (Timescale blog, 2022) — the side-by-side comparison this chapter draws on.
- Prometheus recording rules — the spec, including the `evaluation_interval` and `for:` semantics.
- McSherry, Murray, Isaacs, Isard, "Differential Dataflow" (CIDR 2013) — the theoretical underpinning of incremental view maintenance, which Materialize is built on.
- ClickHouse `MaterializedView` documentation — the OLAP-flavoured cousin of TimescaleDB CAs, with explicit `POPULATE` and `TO` semantics.
- Charity Majors et al., *Observability Engineering* (O'Reilly, 2022), Ch. 9–10 — the broader retention-economics framing.
- Mimir recording rules and ruler architecture — Cortex/Mimir's evolution of Prometheus recording rules, with the multi-tenant rule-eval scheduler.
- Downsampling for long retention — the retention-tier sibling of this chapter.