Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
In short
A time-series database stores its data as a sequence of time-bounded chunks, and each chunk is internally a column store. Range queries open only the chunks whose time window matches; within each chunk the query reads only the columns it needs, each compressed by a codec tuned to its data shape. The two ideas multiply: a "last hour" query on a 30-day, 14-column table touches roughly 1/30 × 1/14 × 1/10 ≈ 1/4200 of the raw bytes, which is the entire economic argument for TSDBs as a separate category.
A "p99 latency in the last hour" dashboard at an Indian payments SaaS scans 500 GB on a vanilla Postgres row store and 200 MB on a TimescaleDB hypertable — same data, same query, 2500x less I/O. That gap does not come from one clever trick; it comes from two layouts stacked on top of each other.
The thesis in one sentence: a time-series database stores its data as a sequence of time-bounded chunks, and each chunk is internally a column store. From that one design choice flow most of the performance characteristics — the chunk pruning that makes "last hour" queries cheap, the per-column compression that makes 18 months of metrics fit on a single SSD, the easy retention story (drop a chunk = rm -rf a directory), the natural fit for tiered storage (recent chunks on NVMe, old chunks on S3). It is a layout that does not work well for arbitrary OLTP — point updates are awkward, foreign keys are painful — but on the workload it was built for it wins by orders of magnitude.
Why a row store loses on this workload
Before showing the time-partitioned columnar layout, it is worth being precise about what it is replacing. Imagine you naively store the time-series workload in a Postgres table with the obvious schema:
CREATE TABLE metrics (
ts TIMESTAMPTZ NOT NULL,
host TEXT NOT NULL,
region TEXT NOT NULL,
cpu DOUBLE PRECISION,
memory DOUBLE PRECISION,
disk_io DOUBLE PRECISION,
network DOUBLE PRECISION,
-- ...10 more measurement columns...
PRIMARY KEY (host, ts)
);
CREATE INDEX ON metrics (ts);
You ingest 200K rows per second for 30 days. The math is brutal: 200,000 rows/second x 86,400 seconds/day x 30 days = 518 billion rows. At ~120 bytes per row (timestamp 8, host 30, region 10, fourteen doubles 112, header overhead 12), that is 62 TB of raw heap data, plus the secondary index on ts.
A query for SELECT avg(cpu) FROM metrics WHERE ts > now() - INTERVAL '1 hour' will use the BTree index on ts to find the row IDs of the recent rows. There are about 720 million such rows in the last hour. Each one is a tuple ID pointing at a random page in the 62 TB heap. The query plan therefore performs ~720 million random heap accesses, and when it reads each tuple it pulls all fourteen columns into memory, even though it only needs the cpu column. With NVMe at 200K random reads per second, that is an hour of pure I/O for one query. Why row stores are bad here: the storage was designed assuming OLTP — that when you read a row, you generally want all of its columns, and that you read rows in patterns close to their on-disk order. Time-series queries violate both assumptions. They want one column out of many, and they want a contiguous time slice that is not contiguous on disk.
You can patch this with CLUSTER metrics USING idx_ts to physically reorder the heap by timestamp — but CLUSTER is a one-shot reorganisation that locks the table. You can use BRIN indexes to summarise time ranges per page — that helps the index lookup but does nothing about the row-store scan amplification. No combination of vanilla Postgres tuning makes this workload fast. What you want is a fundamentally different layout, and that layout has two parts.
Part one: time partitioning
The first piece is to physically split the table into chunks where each chunk holds a fixed time range. The natural unit is a day or a week, depending on ingest rate. At 200K writes/second, one day is ~17 billion rows — perhaps too large; six-hour chunks at 4 billion rows might be better. The principle is the same: each chunk is a self-contained sub-table covering a contiguous, non-overlapping time range.
The bookkeeping for time partitioning is small but essential: somewhere — usually a metadata table kept in RAM — the database tracks chunk_id -> (min_ts, max_ts, file_path) for every chunk. When a query arrives with a WHERE ts > X predicate, the planner walks this metadata, identifies which chunks have max_ts > X, and only opens those chunks. The pruning is a metadata lookup measured in microseconds, not an I/O measured in seconds. Why this is so cheap: with 30 chunks and a metadata entry of perhaps 100 bytes per chunk, the entire chunk index is 3 KB — it lives in L1 cache. The planner's chunk-pruning step costs less than parsing the SQL statement. Compare to a BTree index lookup over 500 billion rows, which has to traverse multiple disk pages of internal nodes.
Each TSDB names this differently. TimescaleDB hypertables are the most familiar to Postgres users: a hypertable is a logical view over many child tables (the chunks), each a regular Postgres table containing one time slice. The create_hypertable('metrics', 'ts', chunk_time_interval => INTERVAL '1 day') call sets the chunk size; subsequent inserts get routed to the right chunk by the trigger machinery. The query planner has a custom node — chunk exclusion — that consults the catalog and rewrites the plan to scan only the chunks that match the predicate. To the user, SELECT * FROM metrics WHERE ts > now() - INTERVAL '1 hour' looks like a single-table query; the planner quietly turns it into a scan of just the most recent chunk.
InfluxDB's TSM engine (time-structured merge tree) divides storage into shards, each covering a shard group duration (default 1 week for unbounded retention, 1 hour for short retention). Each shard is itself a tree of TSM files. A query selects the shards whose time range overlaps the query range and processes only those.
ClickHouse uses PARTITION BY to declare the chunking. A common time-series schema looks like PARTITION BY toYYYYMMDD(ts) ORDER BY (host, ts), which gives one partition per day. The MergeTree engine keeps partition metadata in memory and prunes whole partitions from the query plan. VictoriaMetrics does similar partitioning automatically, with a write-ahead log and per-day partitioning baked in.
The chunking trick alone is worth perhaps a 30x to 100x speedup for time-range queries on a 30-day retention. But the second trick — columnar storage inside each chunk — multiplies that further.
Part two: columnar storage within each chunk
The chunk you have just selected (today's chunk) still holds 17 billion rows in our hypothetical workload. If those rows are stored row-by-row, every column-projection query still pulls all 14 measurement columns. The second piece of the layout is lay each chunk out as a column store: physically segregate each column's values into its own file or block, so a query for avg(cpu) reads only the cpu column.
The compression is where the magic compounds. A column of timestamps in a steady-state ingest is almost monotonically increasing with almost constant deltas. Delta-of-delta encoding stores the second derivative of the timestamp sequence: if timestamps are [1000, 1010, 1020, 1030, 1041], the deltas are [10, 10, 10, 11] and the delta-of-deltas are [0, 0, 0, 1]. That sequence of mostly-zeroes compresses to almost nothing under any general-purpose compressor. Real-world ratios are 20-50x for time columns. Why this works so well: most monitoring agents send data on a fixed cadence (every 10 seconds, every 1 second). The stream is therefore arithmetically progressing, and the residual after delta-of-delta is genuinely close to zero — not just compressible but informationally tiny.
For floating-point telemetry — CPU percentages, temperatures, latencies, throughputs — the compression of choice is Gorilla (named after Sociogram's monitoring system that introduced it [4]). Gorilla observes that consecutive measurements are usually close in value, and that "close in value" for IEEE 754 doubles means they share most of their leading bits. Gorilla XORs each value with the previous one, then strips off the matching leading and trailing zeros, storing only the meaningful middle bits. Real telemetry compresses to ~1-2 bytes per double — a 4x to 8x ratio before applying ZSTD on top, and 8x to 12x after. Why XOR rather than delta: floats are not arithmetic — (a+1) - a is not always 1 in IEEE 754. XOR sidesteps the rounding issue entirely while exploiting the same "consecutive values are similar" observation.
Low-cardinality string columns — host names (a few hundred unique values), region codes (~30), status names (~10) — are dictionary-encoded into integer codes, then run-length-encoded. If a host emits a metric every second for an hour, that is 3600 rows with the same host value, which RLE collapses to (host_code, 3600) — six bytes for what was 36 KB raw. Compression ratios of 50x to 1000x are routine for these columns.
Putting the codec choices together, a typical real-world chunk compresses by 10x to 30x overall versus the row-store size. Combined with the column projection (read only 1-2 columns out of 14), and combined with the chunk pruning (read only 1 chunk out of 30), you get the multiplicative win the chapter opened with.
How TimescaleDB stitches it all together
TimescaleDB is the easiest concrete system to point at because it makes the layout visible — you can SELECT * FROM timescaledb_information.chunks and literally see the chunks. A hypertable is a regular Postgres table at the API layer, but underneath it is partitioned into child tables called chunks. The create_hypertable('metrics', 'ts', chunk_time_interval => INTERVAL '1 day') call sets up the partitioning machinery: a trigger on inserts routes each row to the correct chunk (creating it on the fly if needed), and the query planner has a custom expansion step that turns a query against the parent table into a UNION over the matching child chunks.
So far this is "just" partitioning — Postgres can do raw partitioning natively. The TimescaleDB-specific magic is the compression policy that converts an old chunk from row-store to columnar format. You declare:
ALTER TABLE metrics SET (
timescaledb.compress,
timescaledb.compress_segmentby = 'host',
timescaledb.compress_orderby = 'ts DESC'
);
SELECT add_compression_policy('metrics', INTERVAL '1 day');
And TimescaleDB will, in the background, take chunks older than one day and rewrite them in columnar form. The resulting compressed chunk is a different on-disk layout: rows for the same host (the segment-by) are batched together, and within each batch the columns are stored as arrays — ts is one delta-of-delta-encoded array, cpu is one Gorilla-encoded array, and so on. A typical real-world compression ratio is 10x for mixed metric data, sometimes much higher. Queries against a compressed chunk transparently decompress the relevant columns; the SQL surface is unchanged. Why "segment by host": because most queries filter by host. Grouping rows by host within the chunk means the query for "cpu of host X" reads only the segment for host X — another layer of pruning beyond the chunk-level one. It is essentially a coarse secondary index built into the storage layout.
The query for "cpu of host X over the last hour" therefore does the following: chunk pruning narrows to one chunk; segment pruning narrows to one segment within that chunk; column projection reads only the cpu column array within that segment; the array decompresses to ~3600 doubles; the average is computed in microseconds. The end-to-end I/O is on the order of tens of kilobytes for what would have been hundreds of GB on the row-store.
A worked example: 200K writes/sec API latency monitoring
Time to make the argument concrete on a workload an Indian SaaS engineer might actually face.
The API latency monitoring system at Razorscale, a fictional Bangalore-based SaaS
Razorscale runs a payment gateway used by 5,000 Indian merchants. Every API call to their gateway emits a structured event: (ts, endpoint, status_code, latency_ms, region, customer_id, ...). At peak load — Diwali sale day, IPL final, anything that triggers a buying spree — the gateway handles 200,000 API calls per second. The SRE team needs to keep 30 days of these events for incident debugging and capacity planning. The dashboard the on-call engineer stares at every day shows p99 latency by endpoint over the last hour, refreshed every 30 seconds.
Sizing the workload. 200,000 events/sec x 86,400 sec/day x 30 days = 518 billion rows. With ~150 bytes per row (timestamp 8, endpoint 40, status 4, latency 8, region 16, customer_id 16, plus headers), that is 77 TB of raw data. There is no Postgres instance in the world that handles this on a single node, and even sharded Postgres would burn $30,000/month in NVMe alone.
The naive row-store cost of the dashboard query. SELECT endpoint, percentile_cont(0.99) WITHIN GROUP (ORDER BY latency_ms) FROM events WHERE ts > now() - INTERVAL '1 hour' GROUP BY endpoint. With 200K events/sec for one hour, that is 720 million rows to consider. On a row store, even with a (endpoint, ts) BTree index helping locate them, the query has to fetch each row's latency value from random heap pages. Even if the recent rows happen to be in cache (best case), the query touches the most-recent ~108 GB of heap data (1/30 of 77 TB) — and does so with index-driven random I/O. Realistic runtime: 2 to 5 minutes. That is unusable for a dashboard refreshing every 30 seconds.
The TimescaleDB hypertable design. Razorscale picks TimescaleDB on a single beefy node (96 cores, 512 GB RAM, 30 TB NVMe) with the following:
CREATE TABLE events (
ts TIMESTAMPTZ NOT NULL,
endpoint TEXT NOT NULL,
status_code SMALLINT,
latency_ms REAL,
region TEXT,
customer_id BIGINT
);
SELECT create_hypertable('events', 'ts', chunk_time_interval => INTERVAL '1 day');
ALTER TABLE events SET (
timescaledb.compress,
timescaledb.compress_segmentby = 'endpoint',
timescaledb.compress_orderby = 'ts DESC'
);
SELECT add_compression_policy('events', INTERVAL '6 hours');
SELECT add_retention_policy('events', INTERVAL '30 days');
Each daily chunk holds 17 billion rows. Chunks older than 6 hours are automatically rewritten to columnar form. Chunks older than 30 days are automatically dropped. The dashboard query becomes:
SELECT endpoint, percentile_cont(0.99) WITHIN GROUP (ORDER BY latency_ms)
FROM events
WHERE ts > now() - INTERVAL '1 hour'
GROUP BY endpoint;
What happens at runtime. TimescaleDB's planner first applies chunk exclusion: the only chunk with rows in [now() - 1 hour, now()] is today's chunk (the most recent uncompressed chunk, since data younger than 6 hours hasn't been compressed yet). All other 29 chunks are pruned. Why the recent chunk is uncompressed: the compression policy (6-hour delay) is intentional. Recent data is being heavily ingested and queried; rewriting it before it has cooled would waste CPU. Older data is read-mostly, so columnar rewrite pays off.
Within today's chunk, the planner uses the BRIN index on ts to identify the row range corresponding to the last hour. That is roughly 720M rows out of today's 17B — about 4% of the chunk. With segment-by endpoint even on the uncompressed chunk, rows for the same endpoint cluster together, so the per-endpoint percentile aggregation is a streaming pass.
The bytes touched: about 200 MB. Specifically: 720M timestamps (BRIN-narrowed, sequential scan of a small heap range) plus 720M latency_ms values (4 bytes each = 2.8 GB raw, but only the 4% range is touched and the page cache is hot for recent data — net read from disk is on the order of 100-200 MB) plus the endpoint dictionary lookup. Runtime: 100-300 milliseconds. That is dashboard-acceptable.
Compare:
| Layout | Bytes touched per "p99 last hour" query | Runtime | Disk used for 30 days |
|---|---|---|---|
| Vanilla Postgres row store | ~108 GB (random I/O) | 2-5 minutes | ~77 TB |
| Postgres + native partitioning by day | ~2.5 TB (today's chunk) | 30-60 seconds | ~77 TB |
| TimescaleDB hypertable (no compression) | ~200 MB (recent chunk + BRIN) | ~300 ms | ~77 TB |
| TimescaleDB hypertable + columnar compression | ~200 MB recent + columnar elsewhere | ~200 ms | ~7 TB |
The query speedup over vanilla Postgres is roughly 500x to 1500x, and on slightly older time ranges (where everything is columnar-compressed), the I/O reduction crosses 2500x. The disk savings — from 77 TB to 7 TB — are enough to fit the workload on a single 8 TB NVMe drive, with room for OS and indexes. A workload that would require a 6-node Postgres cluster on a row store fits comfortably on one TimescaleDB node.
The economic point. A 6-node Postgres cluster on AWS with the relevant NVMe storage runs roughly 25,000/month. A single TimescaleDB node of the spec described runs roughly3,500/month. The layout — time partitioning + within-chunk columnar — is what produces the 7x cost reduction, alongside the order-of-magnitude latency improvement. That is why every meaningful TSDB ships some variant of it.
How the other TSDBs implement the same idea
The combination of time-partitioned chunks + within-chunk columnar storage shows up in every major time-series database, with naming and implementation differences but the same shape.
InfluxDB's TSM engine lays out shards (its name for chunks) as TSM files — Time-Structured Merge files. Each TSM file contains data for a fixed time range, organised by series key (the equivalent of host + endpoint + ...), and within each series the values are stored as a column (delta-of-delta for timestamps, Gorilla-like for floats, RLE for strings). New data lands in an in-memory cache + WAL, periodically flushed to a TSM file. Background compaction merges small TSM files into larger ones, pruning duplicates. The data model — series key + column-per-field storage — is essentially the same as TimescaleDB's segment-by + compressed columns, just expressed differently. [2]
ClickHouse's MergeTree with PARTITION BY toYYYYMMDD(ts) and ORDER BY (host, ts) plays the same game. Each daily partition contains many parts (the LSM-style on-disk units), each part stores its data column-by-column with codecs declared per column (CODEC(DoubleDelta, ZSTD(3)) for timestamps, CODEC(Gorilla, ZSTD(3)) for floats, LowCardinality(String) for low-cardinality strings). Background merges combine parts within a partition. The combination of partition pruning + part-level min/max stats + column projection gives the same multiplicative speedup. ClickHouse's distinction is that it generalises beyond pure time-series — the same engine powers the Cloudflare HTTP analytics workload from the previous chapter.
VictoriaMetrics, designed specifically for Prometheus-style monitoring, uses per-day partitioning with columnar storage and Gorilla compression on doubles by default. Its claim to fame is a particularly aggressive metric-name + label dictionary that gets per-tenant compression ratios in the 30-70x range on monitoring data. [5]
QuestDB, a younger entrant, takes the same time-partitioned columnar layout but adds vectorised SIMD aggregation in C++ for very high single-node throughput on streaming financial data. The layout choice is the same; the implementation language and execution model is the differentiator.
The fact that all these systems converge on the same layout is the strongest evidence that the layout is the right answer for the workload. Time partitioning + columnar storage within chunks is to time-series what B+trees are to OLTP — the dominant design that the field has stabilised on after a decade of exploration.
Common confusions
-
"A columnar table is just a row table with a different index." No. The on-disk bytes are physically reorganised. In a row store, the cpu value of row 1 is followed by the memory value of row 1, then disk_io of row 1, then cpu of row 2. In a column store, all 17 billion cpu values sit contiguously in one file, then all 17 billion memory values in another. A query for
avg(cpu)reads one file in the column layout; in the row layout it reads every page of the heap, no matter what index you put on top. Indexes change lookup; columnar layout changes scan amplification, which is what time-series queries pay. -
"Time partitioning is the same as Postgres declarative partitioning, so why use TimescaleDB?" Native Postgres partitioning gives you the chunk-pruning step but stops there. The TimescaleDB-specific pieces — automatic chunk creation on insert, segment-by clustering inside compressed chunks, the columnar rewrite of cooled chunks (Gorilla, delta-of-delta, dictionary+RLE), continuous aggregates, retention policies — are not in vanilla Postgres. Native partitioning helps; it is not the whole game.
-
"Gorilla compression makes the values lossy." No. Gorilla is a bitwise lossless codec for IEEE-754 doubles. It XORs successive values and strips matching leading/trailing zero bits, but the original double is reconstructed exactly. The lossiness in TSDB pipelines comes from downsampling — keeping 1-minute averages instead of raw 1-second points — which is a separate stage (the next chapter), not from columnar compression itself.
-
"More chunks always equals faster queries because pruning gets sharper." Up to a point, yes; past it, no. Each chunk costs catalog space, a planning lookup, and a separate file handle. With chunks of one minute over 30 days you have 43,200 chunks — chunk-pruning planning time itself dominates the query, and the OS spends more time on
open()than on reading data. The TimescaleDB default of one day is calibrated so that the chunk fits in roughly 25% of memory — that keeps the recent chunk's working set hot in the page cache. Smaller is not always better. -
"Columnar layout will speed up my OLTP table too — let me convert." It will destroy it. Columnar storage assumes batch-immutable data: you write a chunk, compress it, never mutate it again. Single-row updates require decompressing a column block, mutating one entry, and recompressing — orders of magnitude slower than a row-store in-place update. Even TimescaleDB's docs warn that compressed chunks are expensive to UPDATE/DELETE row-by-row, and recommend operating on the chunk granularity. If your workload has frequent point mutations, you want a B+tree row store, not a columnar layout.
-
"Dropping old data takes a long DELETE." Not in this layout. Retention is
DROP TABLE metrics_chunk_092— a metadata operation that completes in milliseconds and frees the disk via the filesystem. There is no scan, no WAL spike, no vacuum. This is one of the under-advertised wins of time partitioning: it turns retention enforcement from an O(rows-deleted) operation into an O(1) one. Compare to runningDELETE FROM metrics WHERE ts < now() - INTERVAL '30 days'on a row store, which writes a tombstone for every deleted row and triggers a massive vacuum afterwards.
Going deeper
The chapter so far gave you the layout in working form. This section pulls on three threads — what happens at the chunk boundary when ingest is bursty, why the compression codecs were invented in the order they were, and the production failure mode you will hit when you misconfigure segment-by — to ground the design in the messy reality of running it.
Late-arriving data and the chunk boundary problem
The clean picture says "today's chunk holds rows with ts ∈ [00:00, 24:00)". Real ingest is messier. A monitoring agent on a flaky network buffers events for ten minutes and then dumps them; an IoT sensor in a Coimbatore warehouse sends backfilled data after a Wi-Fi outage; a batch ETL replays yesterday's data into the table. Now an insert with ts = yesterday 23:50 arrives during today's chunk. Where does it go?
The TSDB has to route the row to the chunk whose [min_ts, max_ts] covers it — which means the previous chunk, not today's. TimescaleDB handles this by routing on the trigger; the row lands in yesterday's chunk. If yesterday's chunk has already been compressed to columnar form, the insert path has two choices: decompress the chunk, append, recompress (slow but correct), or maintain a small uncompressed "tail" alongside the compressed body of each chunk (faster but more complex). TimescaleDB picks the second; InfluxDB's TSM engine merges late-arriving points into the most-recent applicable shard via the WAL + compaction loop. Why this matters operationally: a sudden burst of late data on an old chunk forces a decompression-recompression cycle that can spike CPU and stall ingest. Production ops teams set up monitoring on the late-write age — the gap between now() and the ts of the most-out-of-order insert in the last 5 minutes — because a creeping value there usually predicts a chunk-rewrite storm.
The history of the codecs — why they exist
The compression codecs in this chapter did not appear by accident; each was invented to solve a specific failure of the one before it.
Run-length encoding is the oldest, dating to the 1960s — it works on any sequence with consecutive repeats. RLE alone fails on monotonically-increasing timestamps (no repeats) and on slowly-varying floats (no exact repeats either).
Delta encoding, popular in the 1980s for image and audio, fixed the "monotonic but constant-step" problem. Store the first value, then each subsequent value as value[i] - value[i-1]. For perfectly-uniform timestamps the deltas are constant — but they are still N constants, which is wasteful.
Delta-of-delta, formalised by Pelkonen et al. in the 2015 Gorilla paper, takes the second derivative. For a stream of timestamps with constant cadence, the delta-of-delta sequence is [0, 0, 0, 0, 0, ...] — a single repeated value, which RLE on top crushes to a dozen bytes for an entire day.
Gorilla XOR in the same paper solved the float problem. Earlier TSDBs used straight delta on doubles, which is broken: IEEE-754 subtraction is not associative, so the residuals do not stay small the way they do for integers. Pelkonen's insight — XOR consecutive doubles, then strip matching leading and trailing zero bits — uses the bit pattern similarity rather than arithmetic similarity, which is what real telemetry actually has. Gorilla halved Sociogram's monitoring storage cost overnight. The paper is a 12-page worked example of how a single observation about real data can earn a 10x infrastructure dividend.
ZSTD on top is the modern catch-all. The codec stack [Gorilla → ZSTD] typically achieves 12-20x on telemetry; [delta-of-delta → ZSTD] 30-50x on timestamps. The combination is in production at TimescaleDB, ClickHouse, VictoriaMetrics, QuestDB, M3DB, and Druid.
The segment-by misconfiguration that kills compression
The most common production failure with TimescaleDB compression is choosing the wrong compress_segmentby. Suppose you accidentally pick customer_id (a high-cardinality column with millions of distinct values) as the segment key. Each compressed batch within a chunk now corresponds to one customer's rows in that chunk — perhaps 10 rows. Compression codecs need long sequences to find redundancy; on 10-row batches, Gorilla overhead exceeds Gorilla savings, dictionary preambles dominate, and you can end up with a "compressed" chunk larger than the uncompressed source. Real horror story: a Bengaluru SaaS team reported their compression ratio came out at 0.85x (i.e., 17% growth) after picking request_id as segment-by.
The fix is to pick segment-by columns that have moderate cardinality (a few hundred to a few thousand distinct values per chunk) and are commonly filtered on. host, endpoint, region, tenant_id work; request_id, customer_id, trace_id do not. The TimescaleDB docs put this rule in italics; production teams learn it the hard way.
Where the layout finally breaks
The combination of time-partitioning and columnar storage has hard limits. High-cardinality tag explosion is the canonical failure: a metric with (user_id, device_id, app_version, country, ...) as labels can produce billions of unique series, and the dictionary that maps each series to a column-group exceeds RAM. InfluxDB famously hit this in 2020 (their 1.x storage engine could not scale past ~100M unique series), driving the design of InfluxDB IOx and the move to Apache Parquet + DataFusion. The chunk model is happy; the index of which series live in which chunk is not.
Multi-table joins across chunks is another hard edge. Joining a 10TB events hypertable to a 1TB rules hypertable is fine within one chunk, painful across chunks, broken at scale. The honest answer is "do not do that" — copy the join key into the events table at write time (denormalise), or run the join in a separate analytics engine like ClickHouse. The next chapter on continuous aggregates shows the standard pattern for avoiding the join: precompute the rollup at write time.
What you should take away
-
Time-series databases win not by being faster at one trick, but by stacking two complementary layouts. Time partitioning prunes whole chunks for range queries; columnar storage within each chunk reads only the columns the query needs. The wins multiply.
-
Chunking is metadata-cheap. Each chunk has a
(min_ts, max_ts)tracked in RAM. Pruning is a microsecond-scale lookup, not an I/O. Chunk granularity (day, week, hour) trades off chunk count vs chunk size — most TSDBs auto-tune. -
Per-column codecs dominate compression ratios. Delta-of-delta for monotonic timestamps (~30x), Gorilla XOR for slowly-varying floats (~10x), dictionary + RLE for low-cardinality strings (~50-1000x). These codecs are picked for the data shape; mixing them in one row store is impossible.
-
TimescaleDB hypertables, InfluxDB TSM shards, and ClickHouse partitions are the same idea wearing different vocabulary. Each rewrites the SQL/query plan to scan only relevant chunks, then uses columnar layout inside each chunk for I/O minimisation.
-
The 2500x reduction in I/O for the API-latency dashboard is not a benchmark; it is what falls out of the math. 1/30 chunk pruning, x 1/14 column projection, x 1/10 compression = 1/4200. Real workloads see 1000x to 5000x routinely.
-
The workloads where this layout wins are exactly the ones described in the previous chapter: append-mostly writes, time-ranged reads, finite retention. For OLTP, the same layout would be a disaster — point updates on compressed columnar chunks are painful, transactions across chunks are expensive. Picking the right tool for the access pattern is the engineering job.
In the next chapter you will see what the time-partitioned columnar layout enables on top of it — continuous aggregates and downsampling, where the database maintains pre-computed rollups (one-minute aggregates from raw data, one-hour aggregates from one-minute, etc.) so dashboards over months of data run as fast as dashboards over hours.
References
- TimescaleDB — Hypertables and chunks (official docs)
- InfluxData — TSM (Time-Structured Merge tree) storage engine
- ClickHouse — Custom partitioning key (MergeTree)
- Pelkonen et al. (Sociogram) — Gorilla: A Fast, Scalable, In-Memory Time Series Database (VLDB 2015)
- VictoriaMetrics — Architecture and storage layout
- TimescaleDB — Native compression internals (blog)