In short

A time-series workload looks like this: ten thousand servers in three Indian data-centres each emit one CPU sample per second, plus one memory sample, plus one disk-IO sample, plus another two hundred metrics. That is two million writes per second, every second, forever. Every write is an INSERT — there is never an UPDATE, almost never a DELETE until the daily retention sweep. Every query is some variant of "average CPU per minute, last 24 hours, grouped by region" — a tight range filter on the timestamp followed by a bucketed aggregate. Nothing about this resembles the OLTP shape that Postgres and MySQL were designed for, and nothing about it resembles the analytical shape that Snowflake and BigQuery were designed for either.

If you try to run this workload on Postgres without thinking, three things break in order. First, insert throughput plateaus around 80–120K/sec on a single node because every insert touches a B-tree on (timestamp, host_id) and the WAL fsyncs serialise. Second, the index on (host_id, timestamp) becomes larger than the table itself within a week, because every row contributes a 32-byte index entry to a table whose rows are themselves 24 bytes. Third, the range query "give me hourly averages for host A over the last day" needs to read 86,400 rows, build a hash group, and aggregate — which is fine for one query, but ten thousand engineers refreshing dashboards every fifteen seconds turns the database into a tar pit.

The fix is not to throw a bigger machine at the problem. The fix is a dedicated time-series database — Prometheus for metrics scraping, InfluxDB or QuestDB for SQL-like ingest, TimescaleDB as a Postgres extension when you want to keep your relational habits, ClickHouse when the volume is genuinely huge, VictoriaMetrics when you outgrow Prometheus's single-node store. All of them, despite very different surfaces, share the same five tricks: time-partitioned chunks so the optimiser can prune to the relevant minutes; columnar layout inside each chunk so a query that touches one column reads one column; delta + Gorilla compression that shrinks (timestamp, double) pairs to under two bytes each on average; continuous aggregates that pre-compute the per-minute and per-hour rollups so dashboards do not re-aggregate raw points; and tiered retention that drops or down-samples data older than N days automatically.

This chapter is the workload-characterisation chapter: it says what makes time-series different and why general-purpose databases lose. The next four chapters of Build 21 build the optimisations one at a time: chapter 165 the time-partitioned columnar storage, chapter 166 the downsampling and continuous-aggregate machinery, chapter 167 retention and tiered storage to S3, chapter 168 the alerting integration that turns a TSDB into a production observability stack.

You have spent the previous twenty builds learning the shape of two database families. The OLTP family — Postgres, MySQL, Aurora — is optimised for point queries and small transactions: read this row by primary key, update this account balance, insert this order. The OLAP family — Snowflake, BigQuery, ClickHouse — is optimised for scans and aggregates over wide tables: total revenue by region by quarter, joining a fact table to four dimensions.

Time-series is a third shape. It looks superficially like OLTP because the writes arrive constantly in small pieces, and superficially like OLAP because the queries are always aggregates. But it is neither. The writes are append-only and never updated, which OLTP storage engines do not exploit. The queries always filter on a single column (the timestamp) in a tight range, which OLAP planners do not specifically optimise for. The data shape — a few tag columns repeating for billions of rows, plus one or two narrow value columns — is wasteful in row format and only makes sense in column format with heavy compression.

The thesis of this chapter, and of Build 21 as a whole, is that the workload pattern is so peculiar and so common that it deserves its own engine. You will see that engine come together in the next four chapters. First you have to see why the obvious choice — "just use Postgres, we already know it" — falls apart at scale.

The shape of a time-series workload

Three properties define a time-series workload, and together they form a complete test: if all three hold, you have a time-series workload, and a TSDB will beat a general-purpose database. If any one is missing, you probably do not have a time-series workload, and a TSDB may be the wrong tool.

Property 1 — write-heavy and append-only. The writes vastly outnumber the reads. A single Kubernetes cluster with five hundred pods, each exporting two hundred metrics every fifteen seconds, generates 500 × 200 / 15 ≈ 6,667 writes per second from one cluster alone. Multiply by the dozens of clusters a serious engineering org runs and the number is in the hundreds of thousands per second. The writes are pure appends — the timestamp on each new row is monotonically the largest the system has ever seen. There is no contention on hot rows because there are no hot rows; every write goes to the end. Why this matters: an MVCC engine like Postgres pays for row-level concurrency control on every insert — checking that no concurrent transaction holds the row, taking a tuple lock, stamping an xmin for visibility. If your workload guarantees no two writers ever touch the same row, you are paying for protection you do not need. A TSDB drops the protection and gets 10–30× more inserts per CPU second.

Property 2 — queries are range-scans on the timestamp. Every interesting query has a WHERE timestamp BETWEEN x AND y predicate, and the range is usually short (the last hour, the last day) relative to the table (which holds years). Crucially, the query almost never asks for individual rows — it asks for an aggregate over a bucketed window: avg(cpu_pct) GROUP BY time_bucket('1 minute', ts). Why this matters: the planner can prune to the chunks that overlap the time range without consulting any other index. A B-tree on (host, timestamp) works, but the TSDB approach — partition the table by time, then keep only a small per-chunk min/max index — is dramatically cheaper because the index itself shrinks from gigabytes to kilobytes.

Property 3 — high cardinality on tags, low cardinality on schema. Each row is (timestamp, host, region, metric_name, value) or similar. The tag columns (host, region, metric_name) repeat for billions of rows but take only a few hundred or a few thousand distinct values across the dataset. The value columns are usually one or two doubles. The schema barely changes — you add a new metric every few weeks at most. Why this matters: dictionary encoding turns the tag columns from 32-byte strings into 1- or 2-byte integers, and run-length encoding on the time-sorted data compresses runs of the same tag value to single entries. Together they shrink the on-disk footprint by 10–50× compared to a row-oriented Postgres table with the same rows.
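The two encodings named above can be sketched in a few lines of Python. This is a deliberately extreme toy (two hosts, perfectly sorted data), not any real TSDB's format; the helper names are invented for illustration, and real-world ratios are lower than what this toy produces:

```python
def dictionary_encode(values):
    """Map repeated strings to small integer codes."""
    dictionary, codes = {}, []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)
        codes.append(dictionary[v])
    return dictionary, codes

def run_length_encode(codes):
    """Collapse runs of identical codes into [code, run_length] pairs."""
    runs = []
    for c in codes:
        if runs and runs[-1][0] == c:
            runs[-1][1] += 1
        else:
            runs.append([c, 1])
    return runs

# One million samples from two hosts; time-sorted data makes the tag
# column one long run per host.
tags = ["web-mumbai-37"] * 500_000 + ["web-mumbai-38"] * 500_000
dictionary, codes = dictionary_encode(tags)
runs = run_length_encode(codes)

raw_bytes = sum(len(t) for t in tags)   # 13 MB of repeated strings
encoded_bytes = len(runs) * 6           # 2-byte code + 4-byte run length
print(len(dictionary), len(runs), raw_bytes // encoded_bytes)
```

Two dictionary entries and two runs replace a million strings; on this (extreme) toy column the ratio is about a million to one, and even on messy production data the same pair of tricks routinely delivers the 10–50× quoted above.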

[Figure: three panels, Writes / Storage / Reads]

Writes: 2M/sec sustained, append-only, monotone timestamps.

    10:00:01.001  host-3  cpu  41.2
    10:00:01.001  host-3  mem  62.0
    10:00:01.002  host-7  cpu  88.4
    10:00:01.002  host-7  net  142M
    10:00:01.003  host-1  cpu  12.7
    10:00:01.004  host-9  cpu  74.1
    10:00:01.004  host-9  mem  81.5
    10:00:01.005  host-2  cpu  55.3
    ...continuous stream...

    No UPDATE; no DELETE (until retention).

Storage: time-partitioned chunks, columnar inside each chunk.

    chunk: 2026-04-25 (today)
    chunk: 2026-04-24
    chunk: 2026-04-23
    chunk: 2026-04-22 (compressed)
    ...older chunks compressed...
    chunk: 2026-01-25 (S3 tier)

    Retention drops data older than 90 days.

Reads: range-scan + aggregate, last hour / last day.

    SELECT time_bucket('1m', ts), avg(cpu_pct)
    FROM metrics
    WHERE ts > now() - '1h' AND region = 'mumbai'
    GROUP BY 1 ORDER BY 1;

    Touches 1 chunk, scans 60 buckets, returns 60 rows.

The three panels are the entire mental model of a TSDB. Writes pour in monotonically and are appended to today's chunk. Storage keeps recent chunks uncompressed and writeable, compresses older chunks aggressively, ships truly old chunks to object storage. Reads almost always touch one or two chunks, scan a small window, return a small aggregate.

The data shape: timestamp + value, with tags

Compare a row from a time-series table with a row from an OLTP table. The OLTP row — say, an order in an e-commerce system — has fifteen columns of varied types: customer ID, items, totals, addresses, statuses, timestamps, references to other rows. Each column is meaningful, each row is unique, each row may be updated as the order moves through fulfilment. Storing it row-oriented in Postgres is correct because every read or write touches the whole logical entity.

A time-series row is the opposite. It has, conceptually, two parts: the measurement (a timestamp and one or two numeric values) and the series identity (a small set of tag columns identifying which sensor or host or device produced the measurement). The measurement changes every row. The series identity repeats unchanged across millions of consecutive rows.

[Figure: a time-series row is two halves glued together]

Measurement (per row): changes every single row.

    timestamp: 2026-04-25 10:00:01.234
    value:     41.2 (cpu_pct)

    ~16 bytes per row; monotone timestamps delta-encode to ~2 bytes,
    and Gorilla XOR compresses the value.

Series identity (per series): same values for millions of rows.

    host   = "web-mumbai-37"
    region = "ap-south-1"
    env    = "production"
    metric = "cpu_pct"

    Dictionary-encoded tags collapse to a 4-byte series_id, RLE-encoded.

In Postgres you would store this as one wide row per measurement: (ts, host_str, region_str, env_str, metric_str, value). The string columns alone consume 30–60 bytes each, so a 16-byte logical measurement balloons to 100–200 bytes on disk, repeated for every sample. Across a billion samples that is 100–200 GB of repeated tag strings. Why this matters: storage cost scales linearly with how repetitive the columns are. Time-series data is deeply repetitive in tags and minimally repetitive in values. Row format wastes space proportional to the gap. Column format with dictionary + RLE on tags and Gorilla on values closes the gap to the information-theoretic minimum — typically 1–2 bytes per measurement on disk.
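The measurement half compresses because of two regularities: sampled timestamps have near-constant deltas, and consecutive readings of a slowly-moving metric have nearly identical bit patterns. A minimal Python sketch of both ideas (a simplification of Gorilla's scheme, which goes on to bit-pack the deltas and the XOR residues):

```python
import struct

def delta_encode(timestamps):
    """Store the first timestamp, then successive differences.
    Regular sampling makes every delta small and constant."""
    deltas = [timestamps[0]]
    for prev, cur in zip(timestamps, timestamps[1:]):
        deltas.append(cur - prev)
    return deltas

def xor_floats(values):
    """Gorilla-style: XOR each double's 64-bit pattern with its
    predecessor. Unchanged readings XOR to exactly zero."""
    bits = [struct.unpack("<Q", struct.pack("<d", v))[0] for v in values]
    return [bits[0]] + [a ^ b for a, b in zip(bits, bits[1:])]

ts = [1_714_000_000_000 + i * 1_000 for i in range(5)]  # 1 sample/sec, in ms
cpu = [41.2, 41.2, 41.3, 41.3, 41.2]

print(delta_encode(ts))   # [1714000000000, 1000, 1000, 1000, 1000]
print([x == 0 for x in xor_floats(cpu)[1:]])
```

After the first entry, every timestamp is the constant 1000, and every repeated reading XORs to zero; it is those long runs of identical small values that bit-packing then squeezes to under two bytes per (timestamp, value) pair.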

A TSDB exploits this by separating the two halves at the storage layer. InfluxDB and Prometheus store a series as a unit: one entry per (host, region, env, metric) combination, identified by a small integer ID, plus an array of (timestamp, value) measurements. TimescaleDB compresses chunks into arrays per column so the tag column is effectively a single dictionary entry per chunk plus a tiny RLE bitmap. The two-half structure is the data model that the storage layout follows; you will see the layout in detail in chapter 165.
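The series-as-a-unit model described above reduces to a small amount of bookkeeping. A hedged sketch, with invented names rather than InfluxDB's or Prometheus's actual internals:

```python
# Each distinct tag combination is assigned a small integer ID exactly
# once; measurements then append to that series' own array, so the tag
# strings are stored one time per series, not one time per sample.
series_ids = {}     # (host, region, env, metric) -> series_id
measurements = {}   # series_id -> list of (ts, value)

def write(tags, ts, value):
    sid = series_ids.setdefault(tags, len(series_ids))
    measurements.setdefault(sid, []).append((ts, value))

for i in range(1_000):
    write(("web-mumbai-37", "ap-south-1", "prod", "cpu_pct"), i, 40.0 + i % 5)
    write(("web-mumbai-38", "ap-south-1", "prod", "cpu_pct"), i, 50.0 + i % 5)

print(len(series_ids), len(measurements[0]))  # 2 series, 1000 points each
```

Two thousand writes produced exactly two tag entries; everything else is (timestamp, value) pairs, which is the half that delta and Gorilla encoding then compress.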

Why general-purpose databases lose

Engineers reach for three families of databases when they first hit time-series traffic, and each fails in its own way.

Postgres. A single Postgres node on modern hardware sustains roughly 80–120K row inserts per second on a wide table with two indexes, before WAL fsyncs and B-tree page splits start dominating CPU. You can push to 200K with synchronous_commit = off, larger WAL buffers, and unlogged tables, at the cost of durability. Past that you must shard, which means an application-layer router, schema discipline across shards, and pain when a query needs cross-shard aggregation. The deeper problem is that the indexes grow faster than the table: a 24-byte row plus a 32-byte B-tree entry per index means two indexes weigh 2.6× as much as the data. By month three the database is mostly indexes, and the index pages dominate the buffer cache, evicting the data pages that queries actually want. Why this matters: Postgres' page-based storage and B-tree indexes are designed for read-mostly OLTP workloads where a few percent of rows are hot. Time-series has no hot rows — every row is read at most once, soon after being written. The indexes are pure overhead because a time-partitioned chunk plus a min/max index gives the same pruning benefit at one-thousandth the space.

MongoDB. Document stores look attractive because "my measurement is just a small JSON object". But a BSON document of {"ts": ..., "host": "web-mumbai-37", "region": "ap-south-1", "metric": "cpu_pct", "value": 41.2} is about 110 bytes on the wire and 130 bytes on disk, plus the _id index (24 bytes) and any secondary indexes. The repeated key names alone ("host", "region", "metric", "value") consume 30 bytes per document. Multiply by a billion documents and you have 30 GB of JSON keys, repeated. MongoDB later added a time-series collection feature in 5.0 that internally bucketed measurements and shrank this overhead, which is implicit acknowledgment that the document model loses on this workload.

Snowflake / BigQuery. These are excellent for analytical batch queries on terabytes of data. They are not built for sustained millions-of-rows-per-second ingest — Snowflake recommends micro-batches via Snowpipe with seconds-to-minutes of latency, BigQuery has a streaming insert quota measured in MB/sec per table. They also charge per query, which makes the "every dashboard refreshes every fifteen seconds" pattern of a TSDB workload economically painful. Why this matters: a TSDB serves dashboards and alerts; the queries are tiny (one minute of data) but extremely frequent. An OLAP warehouse serves analysts; the queries are huge (a quarter of data) but infrequent. Pricing models follow the workload — and the wrong tool for one becomes ruinously expensive at scale.

The TSDB optimisations, in one preview

Every dedicated TSDB — Prometheus, InfluxDB, TimescaleDB, ClickHouse, QuestDB, VictoriaMetrics — exploits the three workload properties via the same five mechanisms. The next four chapters of Build 21 dig into each in turn; here is the map.

Time partitioning into chunks. Split the logical table into physical pieces, one per day or week. The planner prunes the irrelevant chunks immediately on the WHERE ts BETWEEN ... predicate, without consulting any index — just a chunk-level min/max metadata entry. Inserts always go to the most recent chunk, which fits in memory and accepts writes at memory speed. Old chunks are read-only and can be aggressively compressed. Chapter 165 builds this.
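The pruning step can be sketched in a few lines; the chunk layout and the `prune` helper here are illustrative, not TimescaleDB's actual catalogue:

```python
from datetime import datetime, timedelta

# Each chunk carries only min/max timestamp metadata. The planner
# compares the query range against that, never touching a per-row index.
chunks = [
    {"name": f"chunk_{d}",
     "min_ts": datetime(2026, 4, 20) + timedelta(days=d),
     "max_ts": datetime(2026, 4, 20) + timedelta(days=d + 1)}
    for d in range(6)   # six daily chunks, 2026-04-20 .. 2026-04-26
]

def prune(chunks, lo, hi):
    """Keep only chunks whose [min_ts, max_ts) interval overlaps [lo, hi)."""
    return [c for c in chunks if c["min_ts"] < hi and c["max_ts"] > lo]

# A "last hour" query at 10:00 on 2026-04-25 touches exactly one chunk.
hi = datetime(2026, 4, 25, 10)
survivors = prune(chunks, hi - timedelta(hours=1), hi)
print([c["name"] for c in survivors])  # ['chunk_5']
```

Six metadata comparisons replace an index probe over every row in the table, and the cost of pruning grows with the number of chunks, not the number of samples.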

Columnar layout inside each chunk. Once a chunk is closed for writing, store each column as a contiguous array. A query that touches cpu_pct reads only the cpu_pct column, not the seven other metric columns in the row. Combine with dictionary encoding on tag columns (the four hosts that appear in this chunk become integers 0..3) and Gorilla XOR encoding on the value column (most consecutive doubles XOR to a few zero bits) and the on-disk size shrinks 10–30× from the row format. Chapter 165 again, second half.

Continuous aggregates. Most dashboard queries ask for minute averages, not raw second-by-second points. Pre-compute the per-minute, per-five-minute, per-hour aggregates as the data arrives, store them alongside the raw chunks, route queries to whichever resolution matches their time range. A 24-hour query reads 1,440 minute-aggregates instead of 86,400 raw points — a 60× speedup for free. Chapter 166.
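The bookkeeping behind a continuous aggregate is small: keep a running (count, sum) per bucket as samples arrive. A toy Python sketch with invented names, assuming one-second samples rolled up to minutes:

```python
from collections import defaultdict

# minute bucket -> [count, sum]; updated incrementally on ingest so the
# dashboard reads pre-computed buckets instead of raw points.
minute_agg = defaultdict(lambda: [0, 0.0])

def ingest(ts_seconds, value):
    bucket = ts_seconds // 60
    minute_agg[bucket][0] += 1
    minute_agg[bucket][1] += value

# One day of one-second samples: 86,400 raw points...
for s in range(86_400):
    ingest(s, 40.0 + (s % 10))

# ...but the 24-hour average needs only 1,440 pre-computed buckets.
print(len(minute_agg))  # 1440
avg = (sum(s for _, s in minute_agg.values())
       / sum(c for c, _ in minute_agg.values()))
print(round(avg, 1))    # 44.5
```

Count and sum compose across buckets, which is why the rollup can answer any coarser query exactly; order statistics like p95 need a sketch structure instead, a wrinkle chapter 166 deals with.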

Tiered retention. Define a policy: keep raw data for 7 days, keep minute-aggregates for 90 days, keep hour-aggregates for 5 years, drop everything older. The TSDB enforces this automatically, dropping old chunks at chunk granularity (which is much cheaper than DELETE WHERE ts < ...). Optionally tier truly old chunks to S3 — they are still queryable, just slower. Chapter 167.
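Why chunk-granularity dropping is cheap can be shown directly: expiry becomes one metadata deletion per chunk rather than one visit per row. A sketch under invented names:

```python
from datetime import datetime, timedelta

# Dropping a whole chunk is a metadata operation (unlink one file /
# drop one table), unlike DELETE WHERE ts < ..., which must visit,
# mark, and later vacuum every expiring row.
chunks = {f"chunk_day_{d}": datetime(2026, 1, 1) + timedelta(days=d)
          for d in range(120)}   # 120 daily chunks, keyed by start date

def apply_retention(chunks, now, keep_days):
    cutoff = now - timedelta(days=keep_days)
    expired = [name for name, start in chunks.items()
               if start + timedelta(days=1) <= cutoff]  # wholly past cutoff
    for name in expired:
        del chunks[name]   # one O(1) drop per chunk, not per row
    return expired

dropped = apply_retention(chunks, now=datetime(2026, 5, 1), keep_days=90)
print(len(dropped), len(chunks))  # 30 chunks dropped, 90 kept
```

Note the guard: a chunk is dropped only when its entire day lies past the cutoff, so retention is enforced at chunk granularity and a chunk straddling the boundary survives until it fully expires.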

Alerting integration. A TSDB sits next to an alerting engine — Prometheus has Alertmanager built in, Grafana has Grafana Alerting, InfluxDB has Kapacitor. The alerting engine runs the same time-bucketed aggregate queries every fifteen seconds, compares against thresholds, fires notifications. The TSDB and alerting are co-designed because the query patterns are identical. Chapter 168.

An Indian fintech monitors transaction latency

You are running the payments backend for a Bengaluru-based fintech that processes 100,000 UPI-style transactions per second at peak (Diwali week — every shop is doing cash-on-delivery refunds, every gig worker is settling end-of-day, every parent is sending money to children studying abroad). For each transaction you want to record the latency: how many milliseconds between the API gateway accepting the request and the database commit returning. That is 100,000 (timestamp, latency_ms) rows per second, plus a few tag columns: endpoint, merchant_category, region, payment_method.

You start on Postgres. The schema is obvious:

CREATE TABLE tx_latency (
  ts                TIMESTAMPTZ NOT NULL,
  endpoint          TEXT,
  merchant_category TEXT,
  region            TEXT,
  payment_method    TEXT,
  latency_ms        DOUBLE PRECISION
);
CREATE INDEX ON tx_latency (ts);
CREATE INDEX ON tx_latency (region, ts);

You load-test at 100K inserts/sec. Two things happen:

  1. CPU sits at 50%, all of it on insert overhead. pg_stat_statements shows 30% on the INSERT itself (tuple construction, MVCC bookkeeping), 12% on B-tree maintenance for the two indexes, 8% on WAL fsync. Aggregate scans on the latency dashboard (the SRE team's "p95 latency by region, last 1 hour" query) take 18 seconds because Postgres reads 360 million rows from the table heap to compute the percentile.
  2. Disk usage grows 240 GB per day. Of that, roughly 90 GB is the table heap (24 bytes per row average, after the wide tag strings) and 150 GB is the two indexes. Two weeks in, you have 3.4 TB of database and you have not even configured retention.

You switch to TimescaleDB — same Postgres, just the extension installed and the table converted to a hypertable:

SELECT create_hypertable('tx_latency', 'ts', chunk_time_interval => INTERVAL '1 day');
ALTER TABLE tx_latency SET (
  timescaledb.compress,
  timescaledb.compress_segmentby = 'region, payment_method'
);
SELECT add_compression_policy('tx_latency', INTERVAL '2 days');

Three things change:

  1. Insert throughput climbs to 500K rows/sec on the same hardware. The hypertable routes inserts to today's chunk, which is small enough to fit in memory; B-tree depth on a per-chunk index is 2 instead of 5; WAL volume drops because columnar staging reduces redundant writes.
  2. The 1-hour p95 query drops from 18 seconds to 0.18 seconds. The planner prunes to the single chunk that overlaps the last hour without looking at any other chunk, and the shallow per-chunk index on ts narrows the scan to the 360 million rows of that hour. On compressed chunks the win is larger still: segmenting by region and reading only the latency_ms column (one of six) turns an 80 GB row scan into a ~2 GB columnar one.
  3. Disk usage drops to 30 GB per day after compression kicks in. Chunks older than 48 hours compress to roughly 8× smaller. After three months with a retention policy of 90 days, total storage is 2.7 TB instead of 21 TB.

The Postgres schema, the SQL queries, the connection driver, the Grafana dashboard — all unchanged. The optimisation is purely below the storage layer, and it is purely a matter of recognising that this workload is time-series-shaped and treating it accordingly.

When you outgrow a single TimescaleDB node, the next steps are: VictoriaMetrics or Mimir for horizontal-scale Prometheus-style metrics, ClickHouse for SQL queries on tens of TB, or InfluxDB Cloud's clustered tier. The optimisations are the same; the scale-out story differs.

Real systems and where to use them

The TSDB ecosystem in 2026 is mature but fragmented. A brief field guide:

The right choice depends on volume, ecosystem, and team familiarity. For most startups the answer is "Prometheus until it hurts, then TimescaleDB or VictoriaMetrics". For trillion-row scale the answer is "ClickHouse". For an existing Postgres shop the answer is "TimescaleDB from day one".

What Build 21 builds next

You now have the workload model: write-heavy, append-only, range-scan-with-aggregate, two-halves data shape. You also have the high-level menu of optimisations: time partitioning, columnar inside chunks, compression, continuous aggregates, retention. The remaining four chapters of Build 21 turn the menu into an engine.

By the end of Build 21 you will have built, conceptually, the storage and query path of TimescaleDB plus the alerting path of Prometheus — the full observability stack that every modern engineering org runs.

References

  1. TimescaleDB blog — Time-series data: Why and how to use a relational database instead of NoSQL
  2. InfluxDB documentation — Key concepts and terminology
  3. Akhanda, Mohammad Sazid. Time-Series Databases: Concepts, Design and Implementation. 2023.
  4. Prometheus documentation — Storage
  5. OutSystems blog — What is a Time-Series Database?
  6. Pelkonen, Franklin, Teller, et al. Gorilla: A Fast, Scalable, In-Memory Time Series Database. VLDB 2015.