In short
The previous chapter described what makes time-series workloads structurally different — append-mostly writes, time-ranged reads, finite retention. This chapter is about the storage layout that takes advantage of that shape: time partitioning combined with within-chunk columnar storage. Used together, they convert a "scan the whole table" workload into one that touches a few megabytes per query, even when the table holds tens of terabytes.
The two ideas, separately, are old. Time partitioning means physically splitting the table into chunks where each chunk holds a contiguous time range — typically one day or one week. A query that filters on time > now() - 1 hour checks chunk metadata, identifies the one or two chunks whose time range overlaps that window, and does not open any of the other chunks at all. TimescaleDB calls them chunks (grouped under a hypertable); InfluxDB calls them shards; ClickHouse calls them partitions (though it adds another layer of parts inside). The names differ; the principle is identical.
Columnar storage within each chunk means each chunk physically lays out its data column by column rather than row by row. A query for "average CPU last hour" reads only the cpu column of the recent chunk — not the memory column, not the disk-io column, not the hostname column. Each column compresses independently with codecs tuned to its data type: Gorilla for floats (XOR successive values, leading-zero-strip), delta-of-delta for monotonically-increasing timestamps, run-length encoding for low-cardinality strings like hostnames or status codes. Compression ratios of 8x to 20x are routine on real telemetry.
The combination is multiplicative. If a 30-day, 1-billion-row table is split into 30 daily chunks, each in columnar form, then a query for "average CPU last hour" reads roughly (1 chunk out of 30) x (1 column out of 12) x (1/10 compression) = 1/3600 of the raw data size. A 100 GB table answers the query reading on the order of 30 MB. This is the entire economic argument for time-series databases as a separate category: not "they are mildly faster," but "they are three to four orders of magnitude faster on the workload they were built for."
This chapter walks through the chunking model, the column layout, the metadata index that makes chunk pruning a near-zero-cost operation, and the worked example of a 200K-writes-per-second Indian SaaS monitoring workload where the design wins by 2500x.
The previous chapter ended with the observation that time-series workloads are different in three ways simultaneously: writes are append-only and ordered (or nearly so) by time, reads almost always filter on a recent time range, and the data has a finite useful lifespan. Each of those properties, on its own, would point toward a different storage trick. Append-only writes suggest LSM trees. Time-range reads suggest some kind of time-based index. Finite lifespans suggest aggressive compression and easy bulk deletion. Time-series databases use all three at once, and the layout this chapter describes is what binds them together.
The thesis in one sentence: a time-series database stores its data as a sequence of time-bounded chunks, and each chunk is internally a column store. From that one design choice flow most of the performance characteristics — the chunk pruning that makes "last hour" queries cheap, the per-column compression that makes 18 months of metrics fit on a single SSD, the easy retention story (drop a chunk = rm -rf a directory), the natural fit for tiered storage (recent chunks on NVMe, old chunks on S3). It is a layout that does not work well for arbitrary OLTP — point updates are awkward, foreign keys are painful — but on the workload it was built for it wins by orders of magnitude.
Why a row store loses on this workload
Before showing the time-partitioned columnar layout, it is worth being precise about what it is replacing. Imagine you naively store the time-series workload in a Postgres table with the obvious schema:
CREATE TABLE metrics (
ts TIMESTAMPTZ NOT NULL,
host TEXT NOT NULL,
region TEXT NOT NULL,
cpu DOUBLE PRECISION,
memory DOUBLE PRECISION,
disk_io DOUBLE PRECISION,
network DOUBLE PRECISION,
-- ...10 more measurement columns...
PRIMARY KEY (host, ts)
);
CREATE INDEX ON metrics (ts);
You ingest 200K rows per second for 30 days. The math is brutal: 200,000 rows/second x 86,400 seconds/day x 30 days = 518 billion rows. At ~170 bytes per row (timestamp 8, host 30, region 10, fourteen doubles 112, header overhead 12), that is roughly 89 TB of raw heap data, plus the secondary index on ts.
A query for SELECT avg(cpu) FROM metrics WHERE ts > now() - INTERVAL '1 hour' will use the BTree index on ts to find the row IDs of the recent rows. There are about 720 million such rows in the last hour. Each one is a tuple ID pointing at a random page in the 89 TB heap. The query plan therefore performs ~720 million random heap accesses, and when it reads each tuple it pulls all fourteen columns into memory, even though it only needs the cpu column. With NVMe at 200K random reads per second, that is an hour of pure I/O for one query. Why row stores are bad here: the storage was designed assuming OLTP — that when you read a row, you generally want all of its columns, and that you read rows in patterns close to their on-disk order. Time-series queries violate both assumptions. They want one column out of many, and they want a contiguous time slice that is not contiguous on disk.
You can patch this with CLUSTER metrics USING idx_ts to physically reorder the heap by timestamp — but CLUSTER is a one-shot reorganisation that locks the table. You can use BRIN indexes to summarise time ranges per page — that helps the index lookup but does nothing about the row-store scan amplification. No combination of vanilla Postgres tuning makes this workload fast. What you want is a fundamentally different layout, and that layout has two parts.
Part one: time partitioning
The first piece is to physically split the table into chunks where each chunk holds a fixed time range. The natural unit is a day or a week, depending on ingest rate. At 200K writes/second, one day is ~17 billion rows — perhaps too large; six-hour chunks at 4 billion rows might be better. The principle is the same: each chunk is a self-contained sub-table covering a contiguous, non-overlapping time range.
The bookkeeping for time partitioning is small but essential: somewhere — usually a metadata table kept in RAM — the database tracks chunk_id -> (min_ts, max_ts, file_path) for every chunk. When a query arrives with a WHERE ts > X predicate, the planner walks this metadata, identifies which chunks have max_ts > X, and only opens those chunks. The pruning is a metadata lookup measured in microseconds, not an I/O measured in seconds. Why this is so cheap: with 30 chunks and a metadata entry of perhaps 100 bytes per chunk, the entire chunk index is 3 KB — it lives in L1 cache. The planner's chunk-pruning step costs less than parsing the SQL statement. Compare to a BTree index lookup over 500 billion rows, which has to traverse multiple disk pages of internal nodes.
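A minimal Python sketch of that metadata walk helps make the cost concrete. The names here (ChunkMeta, prune_chunks) are illustrative, not any engine's real internals:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class ChunkMeta:
    chunk_id: int
    min_ts: datetime
    max_ts: datetime
    file_path: str

def prune_chunks(chunks: list[ChunkMeta],
                 query_start: datetime, query_end: datetime) -> list[ChunkMeta]:
    """Return only the chunks whose [min_ts, max_ts] overlaps the query window.

    With ~30 entries this is a scan over a few KB of RAM, far cheaper than
    opening even one chunk file.
    """
    return [c for c in chunks if c.max_ts >= query_start and c.min_ts <= query_end]

# 30 daily chunks; a "last hour" query keeps exactly one of them.
now = datetime(2024, 11, 1, 12, 0)
chunks = [
    ChunkMeta(i, now - timedelta(days=i + 1), now - timedelta(days=i), f"/data/chunk_{i}")
    for i in range(30)
]
hot = prune_chunks(chunks, query_start=now - timedelta(hours=1), query_end=now)
assert len(hot) == 1  # 29 chunks never get opened
```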
Each TSDB names this differently. TimescaleDB hypertables are the most familiar to Postgres users: a hypertable is a logical view over many child tables (the chunks), each a regular Postgres table containing one time slice. The create_hypertable('metrics', 'ts', chunk_time_interval => INTERVAL '1 day') call sets the chunk size; subsequent inserts get routed to the right chunk by the trigger machinery. The query planner has a custom node — chunk exclusion — that consults the catalog and rewrites the plan to scan only the chunks that match the predicate. To the user, SELECT * FROM metrics WHERE ts > now() - INTERVAL '1 hour' looks like a single-table query; the planner quietly turns it into a scan of just the most recent chunk.
InfluxDB's TSM engine (time-structured merge tree) divides storage into shards, each covering a shard group duration (default 1 week for unbounded retention, 1 hour for short retention). Each shard is itself a tree of TSM files. A query selects the shards whose time range overlaps the query range and processes only those.
ClickHouse uses PARTITION BY to declare the chunking. A common time-series schema looks like PARTITION BY toYYYYMMDD(ts) ORDER BY (host, ts), which gives one partition per day. The MergeTree engine keeps partition metadata in memory and prunes whole partitions from the query plan. VictoriaMetrics does similar partitioning automatically, with a write-ahead log and per-day partitioning baked in.
The chunking trick alone is worth perhaps a 30x to 100x speedup for time-range queries on a 30-day retention. But the second trick — columnar storage inside each chunk — multiplies that further.
Part two: columnar storage within each chunk
The chunk you have just selected (today's chunk) still holds 17 billion rows in our hypothetical workload. If those rows are stored row-by-row, every column-projection query still pulls all 14 measurement columns. The second piece of the layout is to lay each chunk out as a column store: physically segregate each column's values into its own file or block, so a query for avg(cpu) reads only the cpu column.
The compression is where the magic compounds. A column of timestamps in a steady-state ingest is almost monotonically increasing with almost constant deltas. Delta-of-delta encoding stores the second derivative of the timestamp sequence: if timestamps are [1000, 1010, 1020, 1030, 1041], the deltas are [10, 10, 10, 11] and the delta-of-deltas are [0, 0, 1]. That sequence of mostly-zeroes compresses to almost nothing under any general-purpose compressor. Real-world ratios are 20-50x for time columns. Why this works so well: most monitoring agents send data on a fixed cadence (every 10 seconds, every 1 second). The stream is therefore arithmetically progressing, and the residual after delta-of-delta is genuinely close to zero — not just compressible but informationally tiny.
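A small Python sketch of the encoding, assuming integer timestamps; real codecs bit-pack the residuals with variable-length control bits, but the point is that the residual stream is nearly all zeros:

```python
def delta_of_delta_encode(timestamps: list[int]) -> tuple[int, int, list[int]]:
    """Store the first timestamp, the first delta, then the second differences.

    For a fixed-cadence stream the residuals are almost all zero, which a
    bit-packer or general-purpose compressor squeezes to nearly nothing.
    """
    first = timestamps[0]
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    residuals = [b - a for a, b in zip(deltas, deltas[1:])]
    return first, deltas[0], residuals

def delta_of_delta_decode(first: int, first_delta: int, residuals: list[int]) -> list[int]:
    out, delta = [first], first_delta
    out.append(out[-1] + delta)
    for r in residuals:
        delta += r
        out.append(out[-1] + delta)
    return out

ts = [1000, 1010, 1020, 1030, 1041]
enc = delta_of_delta_encode(ts)          # (1000, 10, [0, 0, 1])
assert delta_of_delta_decode(*enc) == ts
```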
For floating-point telemetry — CPU percentages, temperatures, latencies, throughputs — the compression of choice is Gorilla (named after Facebook's monitoring system that introduced it [4]). Gorilla observes that consecutive measurements are usually close in value, and that "close in value" for IEEE 754 doubles means they share most of their leading bits. Gorilla XORs each value with the previous one, then strips off the matching leading and trailing zeros, storing only the meaningful middle bits. Real telemetry compresses to ~1-2 bytes per double — a 4x to 8x ratio before applying ZSTD on top, and 8x to 12x after. Why XOR rather than delta: floating-point subtraction rounds, so (a + 1) - a is not always exactly 1 in IEEE 754, whereas XOR on the raw bit patterns is exact and exploits the same "consecutive values are similar" observation.
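The core observation can be sketched in a few lines of Python. This is not the full Gorilla bit format, only the XOR step and a count of how many meaningful bits survive it:

```python
import struct

def float_to_bits(x: float) -> int:
    """Reinterpret an IEEE 754 double as an unsigned 64-bit integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

def xor_stream(values: list[float]) -> list[int]:
    """XOR each double with its predecessor; similar values share leading bits."""
    bits = [float_to_bits(v) for v in values]
    return [b ^ a for a, b in zip(bits, bits[1:])]

def meaningful_bits(x: int) -> int:
    """Width of the non-zero middle after stripping leading and trailing zeros."""
    if x == 0:
        return 0  # identical reading: Gorilla spends a single control bit
    leading = 64 - x.bit_length()
    trailing = (x & -x).bit_length() - 1
    return 64 - leading - trailing

readings = [42.1, 42.1, 42.3, 42.2, 42.2]   # slowly-varying telemetry
print([meaningful_bits(x) for x in xor_stream(readings)])
# Repeated readings XOR to 0; changed readings keep only a short middle run
# of bits (the sign, exponent, and top mantissa bits cancel out).
```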
Low-cardinality string columns — host names (a few hundred unique values), region codes (~30), status names (~10) — are dictionary-encoded into integer codes, then run-length-encoded. If a host emits a metric every second for an hour, that is 3600 rows with the same host value, which RLE collapses to (host_code, 3600) — six bytes for what was 36 KB raw. Compression ratios of 50x to 1000x are routine for these columns.
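A toy version of dictionary plus run-length encoding, assuming rows from the same host arrive consecutively (which is how a host-ordered or host-segmented chunk lays them out):

```python
from itertools import groupby

def dict_rle_encode(values: list[str]) -> tuple[dict[str, int], list[tuple[int, int]]]:
    """Dictionary-encode strings to small integer codes, then run-length encode.

    A host reporting every second for an hour becomes a single (code, 3600) pair.
    """
    dictionary = {v: i for i, v in enumerate(dict.fromkeys(values))}
    runs = [(dictionary[v], sum(1 for _ in grp)) for v, grp in groupby(values)]
    return dictionary, runs

hosts = ["web-01"] * 3600 + ["web-02"] * 3600
dictionary, runs = dict_rle_encode(hosts)
assert runs == [(0, 3600), (1, 3600)]   # tens of KB of raw strings -> two pairs
```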
Putting the codec choices together, a typical real-world chunk compresses by 10x to 30x overall versus the row-store size. Combined with the column projection (read only 1-2 columns out of 14), and combined with the chunk pruning (read only 1 chunk out of 30), you get the multiplicative win the chapter opened with.
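The same multiplication, written out as a back-of-envelope helper with the chapter's assumed round numbers rather than measured values:

```python
def bytes_touched(table_bytes: float, n_chunks: int, cols_read: int,
                  total_cols: int, compression: float) -> float:
    """Rough I/O for a one-chunk, few-column query on the partitioned columnar layout."""
    return table_bytes * (1 / n_chunks) * (cols_read / total_cols) * (1 / compression)

# The running example: 100 GB table, 30 daily chunks, 1 of 12 columns, ~10x codec.
print(bytes_touched(100e9, 30, 1, 12, 10) / 1e6, "MB")   # ~28 MB
```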
How TimescaleDB stitches it all together
TimescaleDB is the easiest concrete system to point at because it makes the layout visible — you can SELECT * FROM timescaledb_information.chunks and literally see the chunks. A hypertable is a regular Postgres table at the API layer, but underneath it is partitioned into child tables called chunks. The create_hypertable('metrics', 'ts', chunk_time_interval => INTERVAL '1 day') call sets up the partitioning machinery: a trigger on inserts routes each row to the correct chunk (creating it on the fly if needed), and the query planner has a custom expansion step that turns a query against the parent table into a UNION over the matching child chunks.
So far this is "just" partitioning — Postgres can do declarative range partitioning natively. The TimescaleDB-specific magic is the compression policy that converts an old chunk from row-store to columnar format. You declare:
ALTER TABLE metrics SET (
timescaledb.compress,
timescaledb.compress_segmentby = 'host',
timescaledb.compress_orderby = 'ts DESC'
);
SELECT add_compression_policy('metrics', INTERVAL '1 day');
And TimescaleDB will, in the background, take chunks older than one day and rewrite them in columnar form. The resulting compressed chunk is a different on-disk layout: rows for the same host (the segment-by) are batched together, and within each batch the columns are stored as arrays — ts is one delta-of-delta-encoded array, cpu is one Gorilla-encoded array, and so on. A typical real-world compression ratio is 10x for mixed metric data, sometimes much higher. Queries against a compressed chunk transparently decompress the relevant columns; the SQL surface is unchanged. Why "segment by host": because most queries filter by host. Grouping rows by host within the chunk means the query for "cpu of host X" reads only the segment for host X — another layer of pruning beyond the chunk-level one. It is essentially a coarse secondary index built into the storage layout.
The query for "cpu of host X over the last hour" therefore does the following: chunk pruning narrows to one chunk; segment pruning narrows to one segment within that chunk; column projection reads only the cpu column array within that segment; the array decompresses to ~3600 doubles; the average is computed in microseconds. The end-to-end I/O is on the order of tens of kilobytes for what would have been hundreds of GB on the row-store.
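A toy model of that read path, with made-up values and the compressed chunk reduced to plain Python lists so the segment-then-column narrowing is visible:

```python
from statistics import mean

# A compressed chunk, sketched as: segment key (the segmentby column) ->
# per-column arrays. Real chunks hold encoded bytes; plain lists stand in here.
compressed_chunk = {
    "host-a": {"ts": [1000, 1010, 1020], "cpu": [41.0, 42.5, 40.8], "memory": [70.1, 70.3, 69.9]},
    "host-b": {"ts": [1000, 1010, 1020], "cpu": [12.0, 11.8, 12.4], "memory": [55.0, 54.8, 55.2]},
}

def avg_cpu_for_host(chunk: dict, host: str, t_start: int, t_end: int) -> float:
    """Segment pruning, then column projection, then the aggregate.

    Only the matched segment is opened, and within it only the ts and cpu
    arrays are touched; memory, disk_io, etc. are never read.
    """
    segment = chunk[host]                      # segment pruning
    ts, cpu = segment["ts"], segment["cpu"]    # column projection
    return mean(v for t, v in zip(ts, cpu) if t_start <= t <= t_end)

print(avg_cpu_for_host(compressed_chunk, "host-a", 1000, 1020))
```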
A worked example: 200K writes/sec API latency monitoring
Time to make the argument concrete on a workload an Indian SaaS engineer might actually face.
The API latency monitoring system at Razorscale, a fictional Bangalore-based SaaS
Razorscale runs a payment gateway used by 5,000 Indian merchants. Every API call to their gateway emits a structured event: (ts, endpoint, status_code, latency_ms, region, customer_id, ...). At peak load — Diwali sale day, IPL final, anything that triggers a buying spree — the gateway handles 200,000 API calls per second. The SRE team needs to keep 30 days of these events for incident debugging and capacity planning. The dashboard the on-call engineer stares at every day shows p99 latency by endpoint over the last hour, refreshed every 30 seconds.
Sizing the workload. 200,000 events/sec x 86,400 sec/day x 30 days = 518 billion rows. With ~150 bytes per row (timestamp 8, endpoint 40, status 4, latency 8, region 16, customer_id 16, plus headers), that is 77 TB of raw data. There is no Postgres instance in the world that handles this on a single node, and even sharded Postgres would burn $30,000/month in NVMe alone.
The naive row-store cost of the dashboard query. SELECT endpoint, percentile_cont(0.99) WITHIN GROUP (ORDER BY latency_ms) FROM events WHERE ts > now() - INTERVAL '1 hour' GROUP BY endpoint. With 200K events/sec for one hour, that is 720 million rows to consider. On a row store, even with a (endpoint, ts) BTree index helping locate them, the query has to fetch each row's latency value from random heap pages. Even if the recent rows happen to be in cache (best case), the query touches the most-recent ~108 GB of heap data (one hour out of 30 days, roughly 1/720 of 77 TB) — and does so with index-driven random I/O. Realistic runtime: 2 to 5 minutes. That is unusable for a dashboard refreshing every 30 seconds.
The TimescaleDB hypertable design. Razorscale picks TimescaleDB on a single beefy node (96 cores, 512 GB RAM, 30 TB NVMe) with the following:
CREATE TABLE events (
ts TIMESTAMPTZ NOT NULL,
endpoint TEXT NOT NULL,
status_code SMALLINT,
latency_ms REAL,
region TEXT,
customer_id BIGINT
);
SELECT create_hypertable('events', 'ts', chunk_time_interval => INTERVAL '1 day');
ALTER TABLE events SET (
timescaledb.compress,
timescaledb.compress_segmentby = 'endpoint',
timescaledb.compress_orderby = 'ts DESC'
);
SELECT add_compression_policy('events', INTERVAL '6 hours');
SELECT add_retention_policy('events', INTERVAL '30 days');
Each daily chunk holds 17 billion rows. Chunks older than 6 hours are automatically rewritten to columnar form. Chunks older than 30 days are automatically dropped. The dashboard query becomes:
SELECT endpoint, percentile_cont(0.99) WITHIN GROUP (ORDER BY latency_ms)
FROM events
WHERE ts > now() - INTERVAL '1 hour'
GROUP BY endpoint;
What happens at runtime. TimescaleDB's planner first applies chunk exclusion: the only chunk with rows in [now() - 1 hour, now()] is today's chunk (the most recent uncompressed chunk, since data younger than 6 hours hasn't been compressed yet). All other 29 chunks are pruned. Why the recent chunk is uncompressed: the compression policy (6-hour delay) is intentional. Recent data is being heavily ingested and queried; rewriting it before it has cooled would waste CPU. Older data is read-mostly, so columnar rewrite pays off.
Within today's chunk, the planner uses the index on ts (a BTree by default; a BRIN also works well for append-ordered data) to identify the row range corresponding to the last hour. That is roughly 720M rows out of today's 17B — about 4% of the chunk, and because the chunk fills in time order, those rows sit in one contiguous band of recently written pages. The per-endpoint percentile aggregation is then a single streaming pass over that band with a hash aggregate keyed on endpoint. (The segment-by setting shapes only the compressed chunks; the recent chunk is still a plain row store.)
The bytes read from disk: on the order of 100-200 MB. The last hour of heap pages was written minutes ago, so most of it is still in the 512 GB page cache; the scan is a sequential read over the index-narrowed band, pulling the timestamps and the 4-byte latency_ms values (about 2.8 GB of latency data in memory terms), and only the cold tail actually hits the NVMe. Runtime: 100-300 milliseconds. That is dashboard-acceptable.
Compare:
| Layout | Bytes touched per "p99 last hour" query | Runtime | Disk used for 30 days |
|---|---|---|---|
| Vanilla Postgres row store | ~108 GB (random I/O) | 2-5 minutes | ~77 TB |
| Postgres + native partitioning by day | ~2.5 TB (today's chunk) | 30-60 seconds | ~77 TB |
| TimescaleDB hypertable (no compression) | ~200 MB (recent chunk + time index) | ~300 ms | ~77 TB |
| TimescaleDB hypertable + columnar compression | ~200 MB recent + columnar elsewhere | ~200 ms | ~7 TB |
The query speedup over vanilla Postgres is roughly 500x to 1500x, and on slightly older time ranges (where everything is columnar-compressed), the I/O reduction crosses 2500x. The disk savings — from 77 TB to 7 TB — are enough to fit the workload on a single 8 TB NVMe drive, with room for OS and indexes. A workload that would require a 6-node Postgres cluster on a row store fits comfortably on one TimescaleDB node.
The economic point. A 6-node Postgres cluster on AWS with the relevant NVMe storage runs roughly $25,000/month. A single TimescaleDB node of the spec described runs roughly $3,500/month. The layout — time partitioning + within-chunk columnar — is what produces the 7x cost reduction, alongside the order-of-magnitude latency improvement. That is why every meaningful TSDB ships some variant of it.
How the other TSDBs implement the same idea
The combination of time-partitioned chunks + within-chunk columnar storage shows up in every major time-series database, with naming and implementation differences but the same shape.
InfluxDB's TSM engine lays out shards (its name for chunks) as TSM files — Time-Structured Merge files. Each TSM file contains data for a fixed time range, organised by series key (the equivalent of host + endpoint + ...), and within each series the values are stored as a column (delta-of-delta for timestamps, Gorilla-like for floats, RLE for strings). New data lands in an in-memory cache + WAL, periodically flushed to a TSM file. Background compaction merges small TSM files into larger ones, pruning duplicates. The data model — series key + column-per-field storage — is essentially the same as TimescaleDB's segment-by + compressed columns, just expressed differently. [2]
ClickHouse's MergeTree with PARTITION BY toYYYYMMDD(ts) and ORDER BY (host, ts) plays the same game. Each daily partition contains many parts (the LSM-style on-disk units), each part stores its data column-by-column with codecs declared per column (CODEC(DoubleDelta, ZSTD(3)) for timestamps, CODEC(Gorilla, ZSTD(3)) for floats, LowCardinality(String) for low-cardinality strings). Background merges combine parts within a partition. The combination of partition pruning + part-level min/max stats + column projection gives the same multiplicative speedup. ClickHouse's distinction is that it generalises beyond pure time-series — the same engine powers the Cloudflare HTTP analytics workload from the previous chapter.
VictoriaMetrics, designed specifically for Prometheus-style monitoring, uses per-day partitioning with columnar storage and Gorilla compression on doubles by default. Its claim to fame is a particularly aggressive metric-name + label dictionary that gets per-tenant compression ratios in the 30-70x range on monitoring data. [5]
QuestDB, a younger entrant, takes the same time-partitioned columnar layout but adds vectorised SIMD aggregation in C++ for very high single-node throughput on streaming financial data. The layout choice is the same; the implementation language and execution model are the differentiators.
The fact that all these systems converge on the same layout is the strongest evidence that the layout is the right answer for the workload. Time partitioning + columnar storage within chunks is to time-series what B+trees are to OLTP — the dominant design that the field has stabilised on after a decade of exploration.
What you should take away
- Time-series databases win not by being faster at one trick, but by stacking two complementary layouts. Time partitioning prunes whole chunks for range queries; columnar storage within each chunk reads only the columns the query needs. The wins multiply.
- Chunking is metadata-cheap. Each chunk has a (min_ts, max_ts) entry tracked in RAM. Pruning is a microsecond-scale lookup, not an I/O. Chunk granularity (day, week, hour) trades off chunk count vs chunk size — most TSDBs auto-tune it.
- Per-column codecs dominate compression ratios. Delta-of-delta for monotonic timestamps (~30x), Gorilla XOR for slowly-varying floats (~10x), dictionary + RLE for low-cardinality strings (~50-1000x). The codecs are picked per column for its data shape; a row store, which interleaves all the columns, cannot apply them.
- TimescaleDB chunks, InfluxDB TSM shards, and ClickHouse partitions are the same idea wearing different vocabulary. Each system rewrites the query plan to scan only the relevant chunks, then uses a columnar layout inside each chunk to minimise I/O.
- The 2500x reduction in I/O for the API-latency dashboard is not a benchmark; it is what falls out of the math: 1/30 chunk pruning x 1/14 column projection x 1/10 compression = 1/4200. Real workloads see 1000x to 5000x routinely.
- The workloads where this layout wins are exactly the ones described in the previous chapter: append-mostly writes, time-ranged reads, finite retention. For OLTP the same layout would be a disaster — point updates on compressed columnar chunks are painful, and transactions across chunks are expensive. Picking the right tool for the access pattern is the engineering job.
In the next chapter you will see what the time-partitioned columnar layout enables on top of it — continuous aggregates and downsampling, where the database maintains pre-computed rollups (one-minute aggregates from raw data, one-hour aggregates from one-minute, etc.) so dashboards over months of data run as fast as dashboards over hours.
References
- [1] TimescaleDB — Hypertables and chunks (official docs)
- [2] InfluxData — TSM (Time-Structured Merge tree) storage engine
- [3] ClickHouse — Custom partitioning key (MergeTree)
- [4] Pelkonen et al. (Facebook) — Gorilla: A Fast, Scalable, In-Memory Time Series Database (VLDB 2015)
- [5] VictoriaMetrics — Architecture and storage layout
- [6] TimescaleDB — Native compression internals (blog)