In short
A time-series workload looks like this: ten thousand servers in three Indian data-centres each emit one CPU sample per second, plus one memory sample, plus one disk-IO sample, plus another two hundred metrics. That is two million writes per second, every second, forever. Every write is an INSERT — there is never an UPDATE, almost never a DELETE until the daily retention sweep. Every query is some variant of "average CPU per minute, last 24 hours, grouped by region" — a tight range filter on the timestamp followed by a bucketed aggregate. Nothing about this resembles the OLTP shape that Postgres and MySQL were designed for, and nothing about it resembles the analytical shape that Snowflake and BigQuery were designed for either.
If you try to run this workload on Postgres without thinking, three things break in order. First, insert throughput plateaus around 80–120K/sec on a single node because every insert touches a B-tree on (timestamp, host_id) and the WAL fsyncs serialise. Second, the index on (host_id, timestamp) ends up larger than the table itself, because every row contributes a 32-byte index entry to a table whose rows are themselves only 24 bytes. Third, the range query "give me hourly averages for host A over the last day" needs to read 86,400 rows, build a hash group, and aggregate — which is fine for one query, but ten thousand engineers refreshing dashboards every fifteen seconds turns the database into a tar pit.
The fix is not to throw a bigger machine at the problem. The fix is a dedicated time-series database — Prometheus for metrics scraping, InfluxDB or QuestDB for SQL-like ingest, TimescaleDB as a Postgres extension when you want to keep your relational habits, ClickHouse when the volume is genuinely huge, VictoriaMetrics when you outgrow Prometheus's single-node store. All of them, despite very different surfaces, share the same five tricks: time-partitioned chunks so the optimiser can prune to the relevant minutes; columnar layout inside each chunk so a query that touches one column reads one column; delta + Gorilla compression that shrinks (timestamp, double) pairs to under two bytes each on average; continuous aggregates that pre-compute the per-minute and per-hour rollups so dashboards do not re-aggregate raw points; and tiered retention that drops or down-samples data older than N days automatically.
This chapter is the workload-characterisation chapter. It says what makes time-series different and why general-purpose databases lose. The next four chapters of Build 21 build the optimisations one at a time: ch.165 the time-partitioned columnar storage, ch.166 the downsampling and continuous aggregate machinery, ch.167 retention and tiered storage to S3, ch.168 the alerting integration that turns a TSDB into a production observability stack.
You have spent the previous twenty builds learning the shape of two database families. The OLTP family — Postgres, MySQL, Aurora — is optimised for point queries and small transactions: read this row by primary key, update this account balance, insert this order. The OLAP family — Snowflake, BigQuery, ClickHouse — is optimised for scans and aggregates over wide tables: total revenue by region by quarter, joining a fact table to four dimensions.
Time-series is a third shape. It looks superficially like OLTP because the writes arrive constantly in small pieces, and superficially like OLAP because the queries are always aggregates. But it is neither. The writes are append-only and never updated, which OLTP storage engines do not exploit. The queries always filter on a single column (the timestamp) in a tight range, which OLAP planners do not specifically optimise for. The data shape — a few tag columns repeating for billions of rows, plus one or two narrow value columns — is wasteful in row format and only makes sense in column format with heavy compression.
The thesis of this chapter, and of Build 21 as a whole, is that the workload pattern is so peculiar and so common that it deserves its own engine. You will see that engine come together in the next four chapters. First you have to see why the obvious choice — "just use Postgres, we already know it" — falls apart at scale.
The shape of a time-series workload
Three properties define a time-series workload. Together they form a complete test: if all three hold, you have a time-series workload, and a TSDB will beat a general-purpose database. If any one is missing, you probably do not have a time-series workload, and a TSDB might be the wrong tool.
Property 1 — write-heavy and append-only. The writes vastly outnumber the reads. A single Kubernetes cluster with five hundred pods, each exporting two hundred metrics every fifteen seconds, generates 500 × 200 / 15 = 6,666 writes per second from one cluster alone. Multiply by the dozens of clusters a serious engineering org runs and the number is in the hundreds of thousands per second. The writes are pure appends — the timestamp on each new row is monotonically the largest the system has ever seen. There is no contention on hot rows because there are no hot rows; every write goes to the end. Why this matters: an MVCC engine like Postgres pays for row-level concurrency control on every insert — checking that no other transaction holds the row, taking a tuple lock, writing an xmin for visibility. If your workload guarantees no two writers ever touch the same row, you are paying for protection you do not need. A TSDB drops the protection and gets 10–30× more inserts per CPU second.
Property 2 — queries are range-scans on the timestamp. Every interesting query has a WHERE timestamp BETWEEN x AND y predicate, and the range is usually short (the last hour, the last day) relative to the table (which holds years). Crucially, the query almost never asks for individual rows — it asks for an aggregate over a bucketed window: avg(cpu_pct) GROUP BY time_bucket('1 minute', ts). Why this matters: the planner can prune to the chunks that overlap the time range without consulting any other index. A B-tree on (host, timestamp) works, but the TSDB approach — partition the table by time, then keep only a small per-chunk min/max index — is dramatically cheaper because the index itself shrinks from gigabytes to kilobytes.
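As a concrete sketch — the metrics table and column names here are illustrative, and time_bucket is TimescaleDB's bucketing function (date_bin or date_trunc plays the same role in vanilla Postgres) — the canonical dashboard query looks like this:
-- Average CPU per minute, last 24 hours, for one region.
SELECT time_bucket('1 minute', ts) AS minute,
       avg(cpu_pct) AS avg_cpu
FROM metrics
WHERE ts >= now() - INTERVAL '24 hours'
  AND region = 'ap-south-1'
GROUP BY minute
ORDER BY minute;
Everything interesting about TSDB query planning is about making the WHERE ts >= ... line cheap and the GROUP BY line pre-computed.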
Property 3 — high cardinality on tags, low cardinality on schema. Each row is (timestamp, host, region, metric_name, value) or similar. The tag columns (host, region, metric_name) repeat for billions of rows but take only a few hundred or a few thousand distinct values across the dataset. The value columns are usually one or two doubles. The schema barely changes — you add a new metric every few weeks at most. Why this matters: dictionary encoding turns the tag columns from 32-byte strings into 1- or 2-byte integers, and run-length encoding on the time-sorted data compresses runs of the same tag value to single entries. Together they shrink the on-disk footprint by 10–50× compared to a row-oriented Postgres table with the same rows.
Ingest, storage, read — those three paths are the entire mental model of a TSDB. Writes pour in monotonically and are appended to today's chunk. Storage keeps recent chunks uncompressed and writeable, compresses older chunks aggressively, ships truly old chunks to object storage. Reads almost always touch one or two chunks, scan a small window, return a small aggregate.
The data shape: timestamp + value, with tags
Compare a row from a time-series table with a row from an OLTP table. The OLTP row — say, an order in an e-commerce system — has fifteen columns of varied types: customer ID, items, totals, addresses, statuses, timestamps, references to other rows. Each column is meaningful, each row is unique, each row may be updated as the order moves through fulfilment. Storing it row-oriented in Postgres is correct because every read or write touches the whole logical entity.
A time-series row is the opposite. It has, conceptually, two parts: the measurement (a timestamp and one or two numeric values) and the series identity (a small set of tag columns identifying which sensor or host or device produced the measurement). The measurement changes every row. The series identity repeats unchanged across millions of consecutive rows.
In Postgres you would store this as one wide row per measurement: (ts, host_str, region_str, env_str, metric_str, value). The string columns alone consume 30–60 bytes each, so a 16-byte logical measurement balloons to 100–200 bytes on disk, repeated for every sample. Across a billion samples that is 100–200 GB of repeated tag strings. Why this matters: storage cost scales linearly with how repetitive the columns are. Time-series data is deeply repetitive in tags and minimally repetitive in values. Row format wastes space proportional to the gap. Column format with dictionary + RLE on tags and Gorilla on values closes the gap to the information-theoretic minimum — typically 1–2 bytes per measurement on disk.
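A quick sanity check of the per-row cost — a sketch, not a measurement from the original text — is to ask Postgres for the size of one such composite value; the real heap tuple adds a 23-byte header plus alignment padding on top:
-- Roughly what one wide measurement row costs before the tuple header.
SELECT pg_column_size(
         ROW(now()::timestamptz,
             'web-mumbai-37'::text,        -- host
             'ap-south-1'::text,           -- region
             'prod'::text,                 -- env
             'cpu_pct'::text,              -- metric
             41.2::double precision)       -- value
       ) AS bytes_per_row;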
A TSDB exploits this by separating the two halves at the storage layer. InfluxDB and Prometheus store a series as a unit: one entry per (host, region, env, metric) combination, identified by a small integer ID, plus an array of (timestamp, value) measurements. TimescaleDB compresses chunks into arrays per column so the tag column is effectively a single dictionary entry per chunk plus a tiny RLE bitmap. The two-half structure is the data model that the storage layout follows; you will see the layout in detail in chapter 165.
Why general-purpose databases lose
Three database families engineers reach for when they first hit time-series traffic, and three matching failure modes.
Postgres. A single Postgres node on modern hardware sustains roughly 80–120K row inserts per second on a wide table with two indexes, before WAL fsyncs and B-tree page splits start dominating CPU. You can push to 200K with synchronous_commit = off, larger WAL buffers, and unlogged tables, at the cost of durability. Past that you must shard, which means an application-layer router, schema discipline across shards, and pain when a query needs cross-shard aggregation. The deeper problem is that the indexes grow faster than the table: a 24-byte row plus a 32-byte B-tree entry per index means two indexes weigh 2.6× as much as the data. By month three the database is mostly indexes, and the index pages dominate the buffer cache, evicting the data pages that queries actually want. Why this matters: Postgres' page-based storage and B-tree indexes are designed for read-mostly OLTP workloads where a few percent of rows are hot. Time-series has no hot rows — every row is read at most once, soon after being written. The indexes are pure overhead because a time-partitioned chunk plus a min/max index gives the same pruning benefit at one-thousandth the space.
MongoDB. Document stores look attractive because "my measurement is just a small JSON object". But a BSON document of {"ts": ..., "host": "web-mumbai-37", "region": "ap-south-1", "metric": "cpu_pct", "value": 41.2} is about 110 bytes on the wire and 130 bytes on disk, plus the _id index (24 bytes) and any secondary indexes. The repeated key names alone ("host", "region", "metric", "value") consume 30 bytes per document. Multiply by a billion documents and you have 30 GB of JSON keys, repeated. MongoDB later added a time-series collection feature in 5.0 that internally bucketed measurements and shrank this overhead, which is implicit acknowledgment that the document model loses on this workload.
Snowflake / BigQuery. These are excellent for analytical batch queries on terabytes of data. They are not built for sustained millions-of-rows-per-second ingest — Snowflake recommends micro-batches via Snowpipe with seconds-to-minutes of latency, BigQuery has a streaming insert quota measured in MB/sec per table. They also charge per query, which makes the "every dashboard refreshes every fifteen seconds" pattern of a TSDB workload economically painful. Why this matters: a TSDB serves dashboards and alerts; the queries are tiny (one minute of data) but extremely frequent. An OLAP warehouse serves analysts; the queries are huge (a quarter of data) but infrequent. Pricing models follow the workload — and the wrong tool for one becomes ruinously expensive at scale.
The TSDB optimisations, in one preview
Every dedicated TSDB — Prometheus, InfluxDB, TimescaleDB, ClickHouse, QuestDB, VictoriaMetrics — exploits the three workload properties via the same five mechanisms. The next four chapters of Build 21 dig into each in turn; here is the map.
Time partitioning into chunks. Split the logical table into physical pieces, one per day or week. The planner prunes the irrelevant chunks immediately on the WHERE ts BETWEEN ... predicate, without consulting any index — just a chunk-level min/max metadata entry. Inserts always go to the most recent chunk, which fits in memory and accepts writes at memory speed. Old chunks are read-only and can be aggressively compressed. Chapter 165 builds this.
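The same idea exists in vanilla Postgres as declarative range partitioning — the difference is that a TSDB creates, closes, and drops the partitions for you. A minimal sketch (table and partition names are illustrative):
CREATE TABLE metrics (
    ts      TIMESTAMPTZ NOT NULL,
    host    TEXT,
    cpu_pct DOUBLE PRECISION
) PARTITION BY RANGE (ts);

-- One physical chunk per day; a WHERE ts BETWEEN ... predicate lets
-- the planner skip every partition outside the range.
CREATE TABLE metrics_2026_01_15 PARTITION OF metrics
    FOR VALUES FROM ('2026-01-15') TO ('2026-01-16');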
Columnar layout inside each chunk. Once a chunk is closed for writing, store each column as a contiguous array. A query that touches cpu_pct reads only the cpu_pct column, not the seven other metric columns in the row. Combine with dictionary encoding on tag columns (the four hosts that appear in this chunk become integers 0..3) and Gorilla XOR encoding on the value column (most consecutive doubles XOR to a few zero bits) and the on-disk size shrinks 10–30× from the row format. Chapter 165 again, second half.
Continuous aggregates. Most dashboard queries ask for minute averages, not raw second-by-second points. Pre-compute the per-minute, per-five-minute, per-hour aggregates as the data arrives, store them alongside the raw chunks, route queries to whichever resolution matches their time range. A 24-hour query reads 1,440 minute-aggregates instead of 86,400 raw points — a 60× speedup for free. Chapter 166.
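In TimescaleDB terms a continuous aggregate looks roughly like this — a sketch against a hypothetical metrics hypertable; the view name and columns are illustrative, the two functions are real TimescaleDB API:
CREATE MATERIALIZED VIEW metrics_1min
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 minute', ts) AS bucket,
       host,
       avg(cpu_pct) AS avg_cpu,
       max(cpu_pct) AS max_cpu
FROM metrics
GROUP BY bucket, host;

-- Keep the rollup fresh incrementally as new data arrives.
SELECT add_continuous_aggregate_policy('metrics_1min',
    start_offset      => INTERVAL '1 hour',
    end_offset        => INTERVAL '1 minute',
    schedule_interval => INTERVAL '1 minute');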
Tiered retention. Define a policy — keep raw data for 7 days, minute-aggregates for 90 days, hour-aggregates for 5 years, drop everything older. The TSDB enforces this automatically, dropping old chunks at chunk granularity (which is much cheaper than DELETE WHERE ts < ...). Optionally tier truly old chunks to S3 — they are still queryable, just slower. Chapter 167.
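Again in TimescaleDB terms, retention is one policy call per table — a sketch, reusing the hypothetical metrics hypertable and metrics_1min rollup from above:
-- Raw data: keep 7 days, then drop whole chunks.
SELECT add_retention_policy('metrics', INTERVAL '7 days');

-- The minute-level rollup outlives the raw data it was computed from.
SELECT add_retention_policy('metrics_1min', INTERVAL '90 days');

-- Manual, one-off equivalent of the first policy.
SELECT drop_chunks('metrics', older_than => INTERVAL '7 days');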
Alerting integration. A TSDB sits next to an alerting engine — Prometheus has Alertmanager built in, Grafana has Grafana Alerting, InfluxDB has Kapacitor. The alerting engine runs the same time-bucketed aggregate queries every fifteen seconds, compares against thresholds, fires notifications. The TSDB and alerting are co-designed because the query patterns are identical. Chapter 168.
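The query an alert rule evaluates has the same shape as the dashboard query, just with a threshold bolted on — a sketch in SQL rather than PromQL, with an invented 90% CPU threshold:
-- Evaluated every fifteen seconds by the alerting engine.
SELECT host,
       avg(cpu_pct) AS avg_cpu
FROM metrics
WHERE ts >= now() - INTERVAL '5 minutes'
GROUP BY host
HAVING avg(cpu_pct) > 90;   -- any row returned means "fire the alert"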
An Indian fintech monitors transaction latency
You are running the payments backend for a Bengaluru-based fintech that processes 100,000 UPI-style transactions per second at peak (Diwali week — every shop is processing cash-on-delivery refunds, every gig worker is settling end-of-day, every parent is sending money to children studying abroad). For each transaction you want to record the latency: how many milliseconds between the API gateway accepting the request and the database commit returning. That is 100,000 (timestamp, latency_ms) rows per second, plus a few tag columns: endpoint, merchant_category, region, payment_method.
You start on Postgres. The schema is obvious:
CREATE TABLE tx_latency (
ts TIMESTAMPTZ NOT NULL,
endpoint TEXT,
merchant_category TEXT,
region TEXT,
payment_method TEXT,
latency_ms DOUBLE PRECISION
);
CREATE INDEX ON tx_latency (ts);
CREATE INDEX ON tx_latency (region, ts);
You load-test at 100K inserts/sec. Two things happen:
- CPU sits at 50%, all of it on insert overhead. pg_stat_statements shows 30% on the INSERT itself (tuple construction, MVCC bookkeeping), 12% on B-tree maintenance for the two indexes, and 8% on WAL fsync. Aggregate scans on the latency dashboard (the SRE team's "p95 latency by region, last 1 hour" query, sketched below) take 18 seconds because Postgres reads 360 million rows from the table heap to compute the percentile.
- Disk usage grows 240 GB per day. Of that, roughly 90 GB is the table heap (24 bytes per row average, after the wide tag strings) and 150 GB is the two indexes. Two weeks in, you have 3.4 TB of database and you have not even configured retention.
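The dashboard query in question is a plain percentile aggregate (a sketch; percentile_cont is standard Postgres, the table is the one defined above):
-- p95 latency by region, last 1 hour — scans every row in the window.
SELECT region,
       percentile_cont(0.95) WITHIN GROUP (ORDER BY latency_ms) AS p95_ms
FROM tx_latency
WHERE ts >= now() - INTERVAL '1 hour'
GROUP BY region;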
You switch to TimescaleDB — same Postgres, just the extension installed and the table converted to a hypertable:
SELECT create_hypertable('tx_latency', 'ts', chunk_time_interval => INTERVAL '1 day');
ALTER TABLE tx_latency SET (
timescaledb.compress,
timescaledb.compress_segmentby = 'region, payment_method'
);
SELECT add_compression_policy('tx_latency', INTERVAL '2 days');
Three things change:
- Insert throughput climbs to 500K rows/sec on the same hardware. The hypertable routes inserts to today's chunk, which is small enough to fit in memory; B-tree depth on a per-chunk index is 2 instead of 5; WAL volume drops because columnar staging reduces redundant writes.
- The 1-hour p95 query drops from 18 seconds to 0.18 seconds. The query planner prunes to the single chunk that overlaps the last hour, which holds 360M rows compressed to ~12 GB columnar. Reading the latency_ms column alone (one of six) and the region segmentby column gives the percentile from a 2 GB scan instead of an 80 GB scan.
- Disk usage drops to 30 GB per day after compression kicks in. Chunks older than 48 hours compress to roughly 8× smaller. After three months with a retention policy of 90 days, total storage is 2.7 TB instead of 21 TB.
The Postgres schema, the SQL queries, the connection driver, the Grafana dashboard — all unchanged. The optimisation is purely below the storage layer, and it is purely a matter of recognising that this workload is time-series-shaped and treating it accordingly.
When you outgrow a single TimescaleDB node, the next steps are: VictoriaMetrics or Mimir for horizontal-scale Prometheus-style metrics, ClickHouse for SQL queries on tens of TB, or InfluxDB Cloud's clustered tier. The optimisations are the same; the scale-out story differs.
Real systems and where to use them
The TSDB ecosystem in 2026 is mature but fragmented. Brief field guide:
- Prometheus — the de-facto standard for metrics in cloud-native stacks. Pull-based scraping, single-binary deploy, PromQL query language, retention typically 15–30 days on local disk. Scales vertically to a few million active series per node, not designed to scale horizontally on its own.
- VictoriaMetrics, Mimir, Thanos — horizontal-scale Prometheus-compatible engines. Use when you outgrow a single Prometheus node or need multi-tenancy.
- InfluxDB — older, commercial-friendly, custom Flux query language (or InfluxQL/SQL in newer versions). Good for IoT and DevOps overlap.
- TimescaleDB — Postgres extension. Ideal when the team already runs Postgres and wants to keep relational habits, JOINs, full SQL, and pgAdmin. The cleanest migration path from a Postgres-based monolith.
- ClickHouse — general-purpose OLAP that is exceptionally good at time-series at huge scale. Used by Cloudflare, Uber, and many ad-tech firms for trillion-row time-series workloads. Lower-level tuning required; retention is expressed through MergeTree TTL clauses rather than a turnkey policy layer.
- QuestDB — Java-based, SQL surface, single-binary, optimised for high-cardinality ingest. Strong on financial tick data.
The right choice depends on volume, ecosystem, and team familiarity. For most startups the answer is "Prometheus until it hurts, then TimescaleDB or VictoriaMetrics". For trillion-row scale the answer is "ClickHouse". For an existing Postgres shop the answer is "TimescaleDB from day one".
What Build 21 builds next
You now have the workload model: write-heavy, append-only, range-scan-with-aggregate, two-halves data shape. You also have the high-level menu of optimisations: time partitioning, columnar inside chunks, compression, continuous aggregates, retention. The remaining four chapters of Build 21 turn the menu into an engine.
- Chapter 165 — Time-partitioned columnar layout. How the chunk boundary is chosen, how rows are written to today's chunk in row format and re-encoded into columnar arrays when the chunk closes, the dictionary + RLE + Gorilla compression pipeline that gives 10–30× shrinkage. You will end the chapter with a working Python prototype that ingests 1M rows/sec into row-format chunks and compresses closed chunks to columnar.
- Chapter 166 — Continuous aggregates and downsampling. How to maintain per-minute, per-five-minute, per-hour rollups incrementally as data arrives, how the query planner picks the right resolution for a given time range, how late-arriving data is handled. The trick is that the rollups are themselves hypertables stored in the same engine.
- Chapter 167 — Retention and tiered storage. Dropping old chunks atomically, downsampling-in-place, shipping cold chunks to S3 while keeping them queryable. The economics: hot SSD storage at ₹8/GB/month vs. S3 at ₹2/GB/month makes tiering pay for itself within weeks.
- Chapter 168 — Alerting integration. How Prometheus' Alertmanager and Grafana Alerting evaluate rules over the same time-bucketed aggregates, the deduplication and grouping logic that keeps you from being woken up 200 times by one outage, the multi-window multi-burn-rate SLO alerts that Google's SRE handbook recommends.
By the end of Build 21 you will have built, conceptually, the storage and query path of TimescaleDB plus the alerting path of Prometheus — the full observability stack that every modern engineering org runs.
References
- TimescaleDB blog — Time-series data: Why and how to use a relational database instead of NoSQL
- InfluxDB documentation — Key concepts and terminology
- Akhanda, Mohammad Sazid. Time-Series Databases: Concepts, Design and Implementation. 2023.
- Prometheus documentation — Storage
- OutSystems blog — What is a Time-Series Database?
- Pelkonen, Franklin, Teller, et al. Gorilla: A Fast, Scalable, In-Memory Time Series Database. VLDB 2015