In short

Time-series data has a structured life cycle: a metric written one second ago is hot (queried by dashboards, alerts, autoscalers), the same metric one month old is warm (occasional comparison queries), and one year old is cold (forensic only, perhaps an auditor reads it once). Storing all three on the same NVMe SSD is economically absurd. Tiered storage matches storage cost and compression aggressiveness to access frequency.

The hot tier lives on NVMe or in-memory caches with no compression — writes need to land at full speed and reads need to hit microsecond latency. The warm tier lives on cheaper SSD with light, fast compression (Snappy or LZ4 over delta-encoded columns) — about 5x size reduction, millisecond reads, but writes have to pause briefly to recompress affected blocks. The cold tier lives on spinning disk or S3-class object storage with everything turned on — full columnar layout, Gorilla XOR for floats, delta-of-delta for timestamps, dictionary + RLE for strings, all wrapped in ZSTD with a trained dictionary. Cold compression hits 30x to 100x; reads are second-latency; writes are batch-only (you do not insert into cold).

Retention policy then defines how long data lives at each tier and at what aggregation level. Raw 1-second resolution lives 7-30 days. 1-minute downsamples live a few months. 1-hour rollups live a year. 1-day rollups live forever. The data does not vanish at the boundary — it gets summarised, then deleted at the original resolution. The combined effect for a 1 TB/day Indian fintech is a storage bill roughly 60x smaller than flat NVMe retention, with no loss of dashboard responsiveness for the last 30 days.

The implementation in real systems differs in mechanics but agrees on the idea. TimescaleDB uses compression policies (add_compression_policy) plus a tiered_storage extension that moves chunks to S3. InfluxDB uses the TSM compaction levels and retention policies. Prometheus delegates everything beyond ~15 days to remote storage like Thanos, Cortex, or Mimir, which themselves implement the tier hierarchy. VictoriaMetrics has native S3 backing for cold and per-block compression for warm.

This chapter walks the data lifecycle, the per-tier compression regimes, the promotion/demotion policies, and a worked Indian fintech example where tiering turns an impossible flat-NVMe bill into an affordable one.

The previous chapter ended with the chunk as the unit of storage — a self-contained sub-table holding one day or one week of data, internally laid out as a column store. That chapter implicitly assumed every chunk lives on the same medium with the same compression. This chapter relaxes that assumption. Once data is chunked, you can put each chunk on a medium that matches its access pattern — and time-series data has the loveliest access pattern in all of database engineering, because time itself is the access predictor.

The thesis in one sentence: a chunk's age is an almost-perfect predictor of how often it will be read, so storage tier should be a pure function of chunk age. From that one observation flows a hierarchy where the youngest chunks sit on the most expensive medium uncompressed, middle-aged chunks sit on cheaper medium with moderate compression, and oldest chunks sit on the cheapest medium with the most aggressive compression you can afford. Each transition is automated and invisible to the query layer — the planner consults chunk metadata, finds the storage path, and reads from whatever tier the chunk currently lives on.

The data lifecycle

Imagine a single chunk born at 09:14 IST on a Tuesday. At 09:14:00.001, the application is already querying it — the autoscaler computes a 30-second moving average of request latency to decide whether to add pods, and that average reads the chunk you literally just wrote into. A second later, the alerting system reads the same chunk to evaluate the rule "p99 latency > 200ms for 5 minutes." A minute later, the on-call engineer's Grafana dashboard refreshes and reads it. In its first hour of life, this chunk is read perhaps a thousand times.

A day later, the chunk is read maybe ten times an hour — by background reports, by the morning standup dashboard, by anomaly-detection batch jobs. A week later it is read once an hour — only by the "compare last week" widget. A month later it is read perhaps once a day. A year later it is read perhaps once total, when an auditor asks "show me the latency on the day of the IPO." Five years later, the answer is "never," but compliance forbids deleting it.

[Figure: The data lifecycle — chunk age determines tier, tier determines cost and latency. Four boxes flow left to right: HOT (age 0-7 days; NVMe SSD / RAM; no compression; indexed, write-fast; reads in microseconds; full price per GB), WARM (age 7-90 days; commodity SSD; Snappy + delta encoding; indexed, ~5x compression; reads in milliseconds; mid price per GB), COLD (age 90 days - 5 years; HDD or S3 / GCS; ZSTD + dictionary + Gorilla; batch-only, ~30-100x compression; reads in seconds; a fraction of hot price per GB), EXPIRED (age > retention; DROP CHUNK; filesystem rm or object delete; read: 404; cost 0).]

Chunks flow left to right purely by age. The query layer is unaware of which tier a chunk lives on. What moves the chunk:

  • Age threshold: a background job (TimescaleDB compression policy, InfluxDB shard rotation) checks chunk age daily.
  • A chunk older than the hot threshold gets compressed in place and its metadata flips to "warm".
  • A chunk older than the warm threshold gets uploaded to S3 and its metadata flips to "cold" with the new path.
  • A chunk older than the retention horizon: the metadata row is deleted, the file or object is unlinked. Pure rm.

No tuple-level operations. The unit is always a whole chunk. That is what makes tiering cheap.

The lovely property is that read frequency falls by orders of magnitude as a chunk ages, which lets storage cost fall with it: move the chunk to a medium whose price per GB matches its remaining read rate. Why this works: when read frequency falls by 100x as the chunk ages from "today" to "last month", paying 10x less per GB to store it is a clear net win — you capture most of the cost saving, and the slower read latency is acceptable because the read happens 100x less often. Tiered storage is, in essence, the storage analogue of caching: pay for fast where the access frequency justifies it, fall back to slow where it does not.

The retention horizon is the boundary where data simply ceases to exist. For raw resolution, retention is typically 7-30 days. But "raw goes away" does not mean "history goes away." Downsampling (covered earlier in this part) keeps a 1-minute rollup for 90 days, a 1-hour rollup for 1 year, a 1-day rollup forever. So the actual data architecture is a pyramid: the base is wide (raw, short retention), each layer above is narrower (fewer rows, lower fidelity, longer retention), and the apex (daily aggregates) lives indefinitely. Each layer can have its own tier rules.
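A minimal sketch of the rollup step that builds one pyramid layer from the one below — plain Python over (timestamp, value) pairs, with the layer width as an illustrative parameter rather than any particular system's API:

from collections import defaultdict

def downsample(samples, bucket_seconds):
    # Keep min/max/sum/count per bucket (not just avg): the next pyramid
    # layer can then be built exactly from this one, by merging mins,
    # maxes, sums, and counts, without ever re-reading raw data.
    buckets = defaultdict(lambda: {"min": float("inf"), "max": float("-inf"),
                                   "sum": 0.0, "count": 0})
    for ts, value in samples:
        b = buckets[ts - ts % bucket_seconds]      # floor to bucket start
        b["min"] = min(b["min"], value)
        b["max"] = max(b["max"], value)
        b["sum"] += value
        b["count"] += 1
    return dict(sorted(buckets.items()))

# Two hours of fake 1-second latency samples -> 120 one-minute rows.
raw = [(t, 100.0 + (t % 7)) for t in range(7200)]
one_minute = downsample(raw, 60)
print(len(raw), "->", len(one_minute), "rows")     # 7200 -> 120 rows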

The hot tier — fast at any cost

The hot tier exists for one reason: writes must land in microseconds and recent reads must complete in microseconds. Compression is the enemy here. A compressed block has to be decompressed before it can be queried, and modified data has to be recompressed before it can be persisted — both add CPU cost and latency that the hot path cannot afford.

So the hot tier looks like a row-oriented (or lightly columnarised) write-ahead log on NVMe, with hot indexes (a BTree on time, perhaps a hash index on the metric label) kept fully in RAM. Many TSDBs go further and keep the most recent few minutes in a memory table — a pure RAM structure that absorbs the firehose of writes and gets flushed to NVMe every minute or so. InfluxDB's TSM cache, VictoriaMetrics' in-memory part, Prometheus's head block are all variations on the same idea. Why memory-first: a 200K-writes/sec stream is 12 million writes per minute. If each write involves a single fsync, you bottleneck on storage device sync latency (~100 microseconds per fsync on good NVMe = max 10K syncs/sec). Buffering writes in RAM, then flushing batched groups, is the only way to hit 200K writes/sec sustainably. The fsync gets amortised across thousands of rows.
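A sketch of that memory-first write path — the class name and the on-disk record format here are invented for illustration, but the structure (buffer in RAM, one write syscall and one fsync per batch) is exactly the amortisation the paragraph describes:

import os, struct, time

class MemTableWriter:
    """Absorb writes in RAM, flush each batch with a single fsync.

    One fsync per write caps throughput around 10K writes/sec on good
    NVMe; one fsync per 100K-row batch makes the sync cost negligible.
    """
    def __init__(self, path, flush_rows=100_000):
        self.buf = []
        self.flush_rows = flush_rows
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)

    def write(self, ts: int, value: float):
        self.buf.append(struct.pack("<qd", ts, value))   # 8 B ts + 8 B float
        if len(self.buf) >= self.flush_rows:
            self.flush()

    def flush(self):
        if not self.buf:
            return
        os.write(self.fd, b"".join(self.buf))    # one syscall for the batch
        os.fsync(self.fd)                        # one sync for ~100K rows
        self.buf.clear()

w = MemTableWriter("/tmp/hot_chunk.bin")
for i in range(250_000):                         # 250K writes, three fsyncs
    w.write(int(time.time()) + i, 0.5 * i)
w.flush()
os.close(w.fd)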

The hot tier also keeps the full schema — every column, every label, every original sample. Nothing has been thrown away or summarised yet. That fidelity is what lets the on-call engineer answer the question "what was the exact CPU usage on host-237 at 14:23:07.418" — a question that will be unanswerable in a few weeks once the chunk has been downsampled. The hot tier is the only place where you can ask arbitrary questions at original resolution.

The flip side is cost. Premium NVMe costs roughly 50-100 INR per GB per month on cloud providers (translating AWS io2-class provisioned-IOPS prices to rupees). Storing 30 TB on NVMe is therefore ~24 lakh rupees per month before any redundancy or snapshots. You cannot afford to keep five years of data here, and you should not try.

The warm tier — cheap enough, fast enough

The warm tier is where most of the data actually lives. It still uses SSD (because seek latency on HDD is unbearable for any indexed query), but a cheaper class — AWS gp3 instead of io2, or local commodity NVMe in your own datacenter. The hot-warm boundary is typically 7 days at consumer-internet scale, 1 day at fintech scale, depending on how much "recent" data the dashboards actually need to be snappy.

The compression regime is light and fast: Snappy or LZ4 wrapped around already-delta-encoded columns. Snappy decompresses at roughly 2 GB/s per core, LZ4 at roughly 4 GB/s — both fast enough that a millisecond-latency read on a 100 MB compressed chunk is plausible. Compression ratios are modest, perhaps 3x to 5x, but that is enough to fit a year of data into a few terabytes of SSD. Why not heavier compression here: the warm tier still has to support occasional ad-hoc queries — "compare last month to this month" — and those queries should return in seconds, not minutes. Heavier codecs (ZSTD at level 19, for instance) would push decompression to 200 MB/s/core, multiplying query time by 10x. Snappy/LZ4 hit the sweet spot of "compresses meaningfully but disappears from the latency budget."
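A small demonstration of the warm-tier recipe on the friendliest column, using the lz4 Python package (python-snappy slots in the same way); the function name is ours, and real engines apply this per column inside the chunk:

import struct
import lz4.frame   # pip install lz4

def compress_timestamp_column(timestamps):
    # Delta-encode a sorted int64 timestamp column, then LZ4 it.
    # Regular 1-second timestamps delta-encode to a run of 1s, which
    # LZ4's repetition matching then collapses almost to nothing.
    deltas = [timestamps[0]] + [b - a for a, b in zip(timestamps, timestamps[1:])]
    raw = struct.pack(f"<{len(deltas)}q", *deltas)
    return lz4.frame.compress(raw)

ts = list(range(1_700_000_000, 1_700_000_000 + 86_400))   # one day at 1s
blob = compress_timestamp_column(ts)
print(f"{86_400 * 8} B raw -> {len(blob)} B compressed")

The timestamp column compresses far better than the quoted 3-5x — the chunk-wide average is pulled down by noisier float value columns.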

Critically, the warm tier still has indexes. Each chunk still maintains its per-column min/max metadata, its dictionary mapping for low-cardinality strings, and its time-range header. A query that asks "average latency in March" can still prune to the right chunks and within each chunk skip blocks whose min/max bracket doesn't include the queried time range. The index lookup is identical to the hot tier — only the actual read of the data block involves a decompression step.
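A sketch of that block-skipping step, assuming each chunk carries a small in-memory array of per-block min/max headers (field names are illustrative):

from dataclasses import dataclass

@dataclass
class BlockMeta:
    offset: int      # byte position of the compressed block in the chunk file
    ts_min: int      # smallest timestamp in the block
    ts_max: int      # largest timestamp in the block

def blocks_to_read(blocks, q_start, q_end):
    # Skip every block whose [ts_min, ts_max] misses the query range.
    # This is the same pruning the hot tier does; compression changes
    # only what happens after a block is selected, not how it is found.
    return [b for b in blocks if b.ts_max >= q_start and b.ts_min <= q_end]

index = [BlockMeta(i * 4096, i * 3600, (i + 1) * 3600 - 1) for i in range(24)]
hits = blocks_to_read(index, q_start=7 * 3600, q_end=9 * 3600)
print(len(hits), "of", len(index), "blocks decompressed")   # 3 of 24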

Writes into the warm tier are almost always out-of-band: a background process picks up an old chunk from the hot tier, recompresses it, writes it to the warm directory, updates the metadata to point at the new path, and finally deletes the old hot copy. Direct writes to warm are forbidden in most systems, because they would require recompressing entire blocks for every insert. The exception is late-arriving data — if a metric from yesterday only just shows up, some systems (TimescaleDB, with its compressed-chunk-update support) will decompress, append, and recompress, but it is an expensive operation explicitly marked as such.

The cold tier — everything turned on

The cold tier is where you stop worrying about read latency entirely and start worrying only about how much storage you are paying for. The medium of choice today is object storage — Amazon S3, Google Cloud Storage, MinIO on bare metal — at 1.5-2 INR per GB per month for standard tier and 0.3-0.5 INR per GB for infrequent-access classes like S3 Glacier Instant Retrieval. Object storage is roughly a fifth the cost per GB of commodity SSD and a fiftieth the cost of premium NVMe, and true archival classes cut another 4-5x below that.

The price you pay is latency: an S3 GET typically completes in 50-300 milliseconds for the first byte, with throughput then ramping up to a few hundred MB/s per parallel connection. Random access patterns are punishing. So cold-tier reads are designed around bulk fetches: download the entire chunk (perhaps hundreds of MB compressed) into local cache, decompress, run the query, possibly cache the chunk locally for a while. A single cold query might take 5-30 seconds. The user should be told.

Compression at this tier turns everything on. The columnar layout is full — every column stored separately, dictionary-encoded if low-cardinality, RLE on top, delta-of-delta on timestamps, Gorilla XOR on floats, bit-packing on bool columns, all then wrapped in ZSTD level 19 or 22 with a trained dictionary — a small precomputed table (a few KB) that ZSTD uses to compress small blocks far better than it could without prior knowledge of the data shape. Why a dictionary: ZSTD's compression ratio degrades sharply on small inputs because the back-references that LZ-family algorithms rely on cannot find anything to reference inside a 4 KB block. A dictionary essentially gives the compressor a synthetic "history" to back-reference into, which can double the compression ratio on small blocks. Time-series chunks have very repetitive structure — the same metric names, the same hostnames, the same value patterns — so a trained dictionary is a near-free 2x.
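A sketch of dictionary training with the zstandard Python package (train_dictionary and ZstdCompressor are its real APIs; the sample telemetry lines are fabricated). On real chunks you would train on a few MB of blocks sampled from existing cold data:

import zstandard as zstd   # pip install zstandard

# Telemetry blocks repeat the same metric names, hosts, and layouts,
# so even a small trained dictionary pays for itself immediately.
samples = [
    (f"metric=api.latency,host=host-{i % 40},region=ap-south-1,"
     f"ts=170000{i:04d},p50={40 + i % 9}ms,p99={200 + i % 17}ms").encode()
    for i in range(5_000)
]
dictionary = zstd.train_dictionary(4_096, samples)        # 4 KB trained dict

plain  = zstd.ZstdCompressor(level=19)
primed = zstd.ZstdCompressor(level=19, dict_data=dictionary)

block = samples[123]                                      # one small block
print(len(plain.compress(block)), "B without dict,",
      len(primed.compress(block)), "B with dict")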

Real-world cold-tier compression ratios on telemetry are 30x to 100x. A 1 GB hot chunk becomes 10-30 MB cold. That order-of-magnitude shrinkage is what makes 5-year retention financially possible.

[Figure: Compression regime per tier — match codec aggressiveness to read frequency.]

  • HOT — no compression: row layout (or light columnar); raw IEEE 754 doubles, full 8 B; full 8 B timestamp per sample; strings stored verbatim; BTree index on time, in RAM; WAL on NVMe, fsync every block. Ratio: 1x (none). Write speed: max. Read latency: microseconds. Goal: never block writes.
  • WARM — Snappy + delta encoding: columnar within chunk; timestamps delta-encoded; floats light delta or raw; strings dictionary-encoded (no RLE); block-level compression with Snappy/LZ4; indexes still maintained. Ratio: ~5x. Decompress: 2-4 GB/s/core. Read latency: milliseconds. Goal: cheap, still snappy.
  • COLD — ZSTD + dictionary + everything: full columnar, per-column codec; timestamps delta-of-delta; floats Gorilla XOR; strings dictionary + RLE + ZSTD; booleans bit-packed + RLE; ZSTD level 19+ with trained dictionary. Ratio: 30-100x. Decompress: ~200 MB/s/core. Read latency: seconds. Goal: minimum bytes stored.

Promotion and demotion policies

The mechanism that moves a chunk between tiers is a background job — typically a cron-style task that runs hourly or daily, scans the chunk metadata table, and acts on chunks that have crossed an age threshold. TimescaleDB exposes this directly:

-- After 7 days, compress the chunk in place (hot -> warm)
SELECT add_compression_policy('metrics', INTERVAL '7 days');
-- After 90 days, move the chunk to S3 (warm -> cold)
SELECT add_tiering_policy('metrics', INTERVAL '90 days');
-- After 1 year, drop the chunk entirely (cold -> expired)
SELECT add_retention_policy('metrics', INTERVAL '1 year');

Each policy is independent. The compression policy reads the chunk row by row, builds the columnar representation, writes it to a sibling file, and atomically swaps the metadata pointer. The tiering policy uploads the compressed chunk to S3, updates the metadata path, and deletes the local file. The retention policy unlinks the file (or issues an S3 delete) and removes the metadata row. All three operations are at chunk granularity — there is never a row-by-row scan. That is what makes the whole hierarchy economically reasonable to operate.
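The same three policies as a single background pass, sketched in Python against a hypothetical chunk-catalog list — every function name here is a stand-in for the real storage operation, but note that each branch touches exactly one whole chunk:

from datetime import datetime, timedelta, timezone

HOT_DAYS, WARM_DAYS, RETAIN_DAYS = 7, 90, 365

# Stand-ins for the real operations (recompress, S3 PUT, S3 DELETE / rm).
def recompress_to_warm(path): return path + ".lz4"
def upload_to_s3(path): return "s3://tsdb-cold/" + path.rsplit("/", 1)[-1] + ".zst"
def delete_object(path): pass

def run_tiering_pass(catalog):
    # One pass of the hourly/daily job: act on whole chunks only.
    now = datetime.now(timezone.utc)
    for chunk in list(catalog):
        if chunk["pinned"]:
            continue                                   # manual override wins
        age_days = (now - chunk["created"]).days
        if age_days > RETAIN_DAYS:                     # any tier -> expired
            delete_object(chunk["path"])
            catalog.remove(chunk)                      # drop the metadata row
        elif age_days > WARM_DAYS and chunk["tier"] == "warm":
            chunk["path"] = upload_to_s3(chunk["path"])          # warm -> cold
            chunk["tier"] = "cold"
        elif age_days > HOT_DAYS and chunk["tier"] == "hot":
            chunk["path"] = recompress_to_warm(chunk["path"])    # hot -> warm
            chunk["tier"] = "warm"

catalog = [
    {"id": 1, "tier": "hot", "path": "/nvme/chunk-1", "pinned": False,
     "created": datetime.now(timezone.utc) - timedelta(days=12)},
    {"id": 2, "tier": "warm", "path": "/ssd/chunk-2.lz4", "pinned": False,
     "created": datetime.now(timezone.utc) - timedelta(days=400)},
]
run_tiering_pass(catalog)   # chunk 1 demoted to warm, chunk 2 expired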

Three policy kinds show up across the ecosystem:

Time-based is the default and the one that handles 99% of real workloads. Promote/demote based purely on chunk age. Simple, deterministic, easy to reason about for capacity planning.

Size-based kicks in when ingest rate spikes and the hot tier risks filling its allocated SSD. The policy says "if hot tier > 80% full, demote the oldest chunks early until it drops below 60%." VictoriaMetrics and InfluxDB both support this as a safety valve. Why size-based matters: a sudden 3x traffic spike (sale day at the e-commerce site, a viral event at the streaming service) can fill the hot tier in hours. Without size-based eviction the writes start to fail. With it, the system gracefully degrades — recent reads get slightly slower because chunks moved to warm earlier than scheduled, but writes never stall.
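A sketch of that safety valve, against the same hypothetical catalog shape as the sketch above (plus an assumed size_gb field):

from datetime import datetime, timedelta, timezone

def enforce_hot_budget(catalog, used_gb, budget_gb, high=0.80, low=0.60):
    # Demote the oldest unpinned hot chunks early when NVMe nears full:
    # degrade read latency on old chunks rather than ever fail a write.
    if used_gb < high * budget_gb:
        return used_gb                       # normal: age-based policy only
    victims = sorted((c for c in catalog
                      if c["tier"] == "hot" and not c["pinned"]),
                     key=lambda c: c["created"])         # oldest first
    for chunk in victims:
        if used_gb <= low * budget_gb:
            break
        chunk["tier"] = "warm"               # same path as age-based demotion
        used_gb -= chunk["size_gb"]          # chunk leaves the hot SSD entirely
    return used_gb

now = datetime.now(timezone.utc)
catalog = [{"tier": "hot", "pinned": False, "size_gb": 250,
            "created": now - timedelta(days=d)} for d in (6, 4, 2, 1)]
print(enforce_hot_budget(catalog, used_gb=900, budget_gb=1000))
# 400: the two oldest chunks were demoted early to get back under 60%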

Manual pinning is the escape hatch for ops teams. "Pin chunk X hot for the next 30 days" tells the policy machinery to skip this chunk during demotion. Common uses: an incident investigation needs to keep the affected day's data hot for repeated forensic queries; a regulatory request needs the data immediately available for the duration of the audit. Almost every TSDB exposes some form of manual override.

How real systems implement this

TimescaleDB has the cleanest model because it runs on top of Postgres. The compression policy converts a row-oriented chunk to a columnar one in place; the chunk continues to be a Postgres child table, just with different storage. The tiered storage extension (released in 2024) adds the S3 leg. The metadata catalog (_timescaledb_catalog.chunk and friends) tracks each chunk's tier, and the planner's chunk-exclusion machinery transparently routes reads to whichever tier holds the chunk. From the application's view, SELECT avg(latency) FROM metrics WHERE ts > now() - INTERVAL '6 months' works identically whether the data is hot, warm, cold, or split across all three. TimescaleDB tiered storage docs give the operational details.

InfluxDB uses the TSM (time-structured merge tree) engine, which has its own compaction levels. Newly-written data lands in the WAL and the in-memory cache. A first-level compaction packs WAL into a TSM file (still relatively uncompressed). Subsequent compaction levels merge TSM files together and apply heavier compression. Retention policies delete shards when their shard group expires. The hot/warm/cold split is implicit in the compaction level: low-level TSM files are effectively warm, deeply-compacted ones are effectively cold. InfluxDB's downsampling docs cover the related continuous-query mechanism.

Prometheus is the interesting outlier — it deliberately does not implement tiered storage. Its local TSDB keeps 15 days of data on local disk by default (configurable up to a few months) and that's it. For longer retention you bolt on Thanos, Cortex, or Mimir — projects whose entire reason for existing is to provide the cold-tier object-storage backing that Prometheus refuses to implement natively. They sit alongside Prometheus — ingesting its remote-write stream (Cortex, Mimir) or uploading its completed blocks (the Thanos sidecar) — and store everything beyond the hot window in S3 (or compatible) with a query layer that federates across hot Prometheus and cold object storage. The Prometheus retention documentation explains the design choice and the boundary.

VictoriaMetrics has object-storage support built in. The vmstorage component writes to local disk; a configurable backup process (vmbackup) replicates to S3-compatible storage; old data can be pulled back from that backing when queries need it. VictoriaMetrics is also notable for using extremely aggressive compression by default — its on-disk representation is roughly 1-2 bytes per data point even in the warm tier, which is comparable to what other systems achieve only in cold.

A 1 TB/day Indian fintech bill

A real fintech in Mumbai runs a payments and risk platform with the following telemetry profile: 200,000 metric writes per second sustained — on the order of 200,000 active time series (hosts x metrics x regions) each sampled once a second — at roughly 60 bytes per uncompressed row. That works out to roughly 1 TB of raw metric data per day.

The retention requirements are driven by a mix of operational, business, and regulatory needs:

  • 30 days of full 1-second resolution for incident investigation, autoscaling, and dashboards. Anything within 30 days might be queried by an SRE during an outage.
  • 1 year of 1-minute resolution for capacity planning, month-over-month comparison, and quarterly reviews.
  • 5 years of 1-hour resolution for RBI compliance and historical fraud-pattern analysis.

Without tiering (everything on NVMe at 1-second resolution): keeping 5 years would mean 5 x 365 x 1 TB = ~1.8 PB of NVMe. At 80 INR/GB/month for cloud NVMe, that is roughly 15 crore INR per month — economically impossible. Even keeping just 30 days flat is 30 TB on NVMe = roughly 24 lakh INR per month (~$29K), and that already discards everything older than a month.

With tiering, the full retention picture costs barely more than the 30-day flat baseline (a short script after this list reproduces the arithmetic):

  • Hot (NVMe, 30 days, 1-second raw, no compression): 30 TB. At ~80 INR/GB/month: ~24 lakh INR/month.
  • Warm (commodity SSD, 1 year, 1-minute downsampled and Snappy-compressed): 1-minute resolution shrinks raw 1-sec data by 60x, then Snappy gives another 4x. So 1 year of 1-minute data is ~365 TB / 60 / 4 ≈ 1.5 TB. At 8 INR/GB/month for commodity SSD: ~12,000 INR/month.
  • Cold (S3 standard, 5 years, 1-hour rolled up and ZSTD+Gorilla compressed): 1-hour resolution shrinks raw 1-sec by 3600x, full cold compression another 30x. So 5 years of hourly data is roughly 5 x 365 TB / 3600 / 30 ≈ 17 GB. At 1.7 INR/GB/month: trivially small, ~30 INR/month.
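The promised arithmetic check — prices per GB-month are the chapter's working assumptions, not quotes:

# Reproduces the tiered bill above.
RAW_TB_PER_DAY = 1.0
NVME, SSD, S3 = 80, 8, 1.7          # INR per GB-month: hot / warm / cold

hot_gb  = 30 * RAW_TB_PER_DAY * 1000                      # 30 days raw
warm_gb = 365 * RAW_TB_PER_DAY * 1000 / 60 / 4            # 1 yr, 1-min, Snappy 4x
cold_gb = 5 * 365 * RAW_TB_PER_DAY * 1000 / 3600 / 30     # 5 yr, 1-hr, cold 30x

bill     = hot_gb * NVME + warm_gb * SSD + cold_gb * S3
flat_5yr = 5 * 365 * RAW_TB_PER_DAY * 1000 * NVME
print(f"hot {hot_gb:,.0f} GB, warm {warm_gb:,.0f} GB, cold {cold_gb:,.1f} GB")
print(f"tiered: {bill/1e5:.1f} lakh INR/month vs flat: {flat_5yr/1e7:.0f} crore")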

Total: roughly 24 lakh INR/month against the impossible 15 crore for flat 5-year NVMe — and against the 30-day-only NVMe baseline you pay about 12,000 rupees more per month and gain the entire long-tail history. The dashboard for the last 30 days remains exactly as snappy as before — it is hitting hot-tier NVMe with no compression overhead. The ad-hoc "compare to last quarter" query takes a few extra milliseconds because it's reading warm-tier 1-minute aggregates. The compliance query "show me hourly metrics from January 2023" takes 2-3 seconds because it's pulling from S3, but compliance tolerates that easily.

The chunk geometry that makes this work: one chunk per 6 hours of raw data (≈250 GB, hot and uncompressed), one chunk per day of 1-minute aggregates (≈4 GB Snappy-compressed), one chunk per week of 1-hour aggregates (≈65 MB fully compressed). Each is a separately-tiered, separately-retained object — and the whole thing is configured in TimescaleDB with add_compression_policy, add_tiering_policy, and add_retention_policy calls per aggregate level. Fewer than a dozen SQL statements give you the full hot/warm/cold/expired pipeline.

The cost-vs-latency curve

Plotting the trade-off shows why tiered storage is the dominant strategy rather than just one option among many. The cost-per-GB axis spans roughly two orders of magnitude (NVMe at ~80 INR vs S3 Glacier at ~0.4 INR), and the read-latency axis spans roughly seven orders of magnitude (tens of microseconds for hot to minutes for archival retrieval). No medium gives you both low cost and low latency simultaneously. That asymmetry is what makes tiering not just useful but inevitable.

[Figure: Cost per GB vs read latency, both axes log scale. HOT (NVMe): ~80 INR/GB, ~10 us reads. WARM (commodity SSD): ~8 INR/GB, ~5 ms reads. COLD (S3): ~1.7 INR/GB, ~500 ms reads. ARCHIVE (Glacier): ~0.4 INR/GB, reads in minutes. The four points trace a Pareto frontier — every point on the curve is best at its price/latency.]

The arithmetic the curve embodies: each step down the hierarchy costs roughly 5-10x less per GB and serves reads one to three orders of magnitude more slowly. So the fair allocation rule is: data read far more often than average belongs on hot, data read far less often belongs on cold. Time-series access patterns satisfy that split almost perfectly, with recent data being read 1000x more often than data a year old.
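One way to see the allocation rule as code — the per-read cost figures below are invented proxies for latency pain, chosen only to make the break-even points visible:

TIERS = {                  # (INR/GB-month storage, INR-equivalent cost per read)
    "hot":  (80.0, 0.0001),
    "warm": (8.0,  0.01),
    "cold": (1.7,  1.0),
}

def best_tier(reads_per_month: float) -> str:
    # Minimise storage cost plus read-frequency-weighted latency cost.
    cost = {name: store + reads_per_month * per_read
            for name, (store, per_read) in TIERS.items()}
    return min(cost, key=cost.get)

for reads in (100_000, 1_000, 10, 0.1):   # today / last week / last month / last year
    print(f"{reads:>10} reads/month -> {best_tier(reads)}")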

Going deeper

Erasure coding

Object storage providers internally use erasure coding rather than full replication: a single 100 MB chunk gets split into N+K fragments where any N can reconstruct the original, giving durability with much less than 3x storage overhead. Some TSDBs are starting to adopt the same idea for their warm tier — store each chunk as 6 data fragments and 3 parity fragments across machines and survive any 3 losses. The storage overhead drops from 3x (triple replication) to 1.5x with similar durability guarantees, and reads can reconstruct from any 6 of the 9 fragments. The trade-off is reconstruction CPU cost whenever a read hits a degraded chunk.

Query federation across tiers

A query that spans the hot-warm-cold boundary (say, "average latency over the last 6 months") has to read from three different storage media in one operation. The execution model is a federated scan: the planner decomposes the query into per-tier sub-scans, dispatches each to the appropriate engine (in-memory hot scanner, mmap-based warm scanner, S3 client cold scanner), and merges the results. Latency is dominated by the slowest tier — even if hot returns in 1 ms and warm in 10 ms, the query waits the 500 ms for cold to land before returning. Why federation rather than copy-up: copying cold chunks to warm on demand would defeat the point of tiering — you'd thrash the warm tier with old data nobody will read again. Federation accepts the latency cost on the rare cross-tier queries in exchange for keeping the tier sizes stable.
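A sketch of the federated-scan shape — one worker per tier with a merge that waits on the slowest; the sleeps stand in for real scanner latencies:

from concurrent.futures import ThreadPoolExecutor
import time

# Stand-ins for the real in-memory, mmap-based, and S3 scanners.
def scan_hot(q):  time.sleep(0.001); return [("hot", 12.5)]
def scan_warm(q): time.sleep(0.010); return [("warm", 11.9)]
def scan_cold(q): time.sleep(0.500); return [("cold", 10.2)]   # S3 first byte

def federated_scan(query):
    # Dispatch one sub-scan per tier in parallel, merge when all land.
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(s, query)
                   for s in (scan_hot, scan_warm, scan_cold)]
        return [row for f in futures for row in f.result()]

t0 = time.time()
rows = federated_scan("avg latency, last 6 months")
print(rows, f"in {time.time() - t0:.3f}s")   # ~0.5s: the cold leg dominates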

Late-arriving data

What happens when a sample with timestamp from yesterday arrives now, but yesterday's chunk has already been compressed and tiered? Three approaches: reject it (Prometheus's default — late samples beyond a small window are dropped), accept it into a recent chunk with a flag (InfluxDB-style — the sample lands in today's chunk with the original timestamp, queries on yesterday's range will catch it via timestamp predicate), or rewrite the old chunk (TimescaleDB-style — decompress yesterday's chunk, append, recompress, retier). Each has tradeoffs around correctness, ingestion latency, and re-tiering cost. Most production systems pick "accept with flag" because it keeps the write path fast and the storage cost predictable.

Cold-tier compaction

Even cold chunks benefit from periodic recompression as more data accumulates and trained dictionaries can be improved. A monthly batch job that re-encodes all of last year's chunks with a freshly-trained dictionary can squeeze out another 10-20% — material at petabyte scale. The compute cost is a few hours of spot instances; the storage savings recur every month thereafter. The VictoriaMetrics cost-optimisation writeup [5] discusses this and similar background-optimisation patterns.

What this chapter buys you

You can now look at any time-series storage system and read its tier structure off the configuration: where is the hot/warm boundary? What compression turns on at warm? Where does cold live and how long until expiry? What are the downsampling rules and which aggregation level lives at each tier? The pieces are universal even though the names differ. TimescaleDB calls it compression policies and tiering policies; InfluxDB calls it TSM compaction levels and retention policies; Prometheus calls it remote storage; VictoriaMetrics calls it native S3 backing. The math (cost ratio vs latency ratio) and the architecture (chunk as the unit of tiering) are the same everywhere.

The next chapter compares the four major TSDBs head-to-head — InfluxDB, TimescaleDB, QuestDB, VictoriaMetrics — on the dimensions established by the last few chapters: ingest rate, query latency, compression ratio, retention model, operational complexity. Each system ships different defaults and makes different trade-offs, and the choice for any given workload comes down to which trade-offs match your read/write profile.

References

  1. TimescaleDB — Data Tiering and Compression Policies — official documentation for add_compression_policy, add_tiering_policy, add_retention_policy, and the S3 cold-tier integration.
  2. InfluxDB — Downsampling and Retention Policies — covers the TSM cache, retention policies, and continuous queries used to roll up data before tier demotion.
  3. Prometheus — Storage and Long-term Retention — describes the local TSDB design, the block layout, and the deliberate choice to delegate long-term storage to remote-write integrations.
  4. Pelkonen et al. — Gorilla: A Fast, Scalable, In-Memory Time Series Database (VLDB 2015) — the original Gorilla paper that introduced the XOR-based float compression now used by every cold-tier float column.
  5. VictoriaMetrics — Time-Series Storage Cost Optimization — practical writeup of compression, downsampling, and tiering trade-offs from a production TSDB vendor.
  6. Thanos — Object Storage and Tiered Block Layout — documents how Thanos extends Prometheus with S3-backed cold storage, including block compaction and downsampling tiers.