InfluxDB's TSM engine

A Zomato platform engineer in 2017 picked InfluxDB 1.2 over Prometheus because Influx had a SQL-like query language, native Grafana support, and the docs claimed 200,000 writes/sec on a single host. Three years later the same engineer stares at an out of memory panic at 4 AM IST during a Diwali order spike — the index has ballooned past 18 GB on a 16 GB host, the migration to 2.0 requires rewriting every InfluxQL query into Flux, and the surrounding ecosystem has changed query languages three times. And yet — the TSM engine itself is genuinely good: understanding what InfluxDB got right at the storage layer and what it got wrong at the ecosystem layer is the lesson of this chapter.

This chapter takes the TSM engine apart: how its Time-Structured Merge Tree differs from Prometheus's per-block design and VictoriaMetrics's MergeSet, how the tag-vs-field split shapes everything from compression to query plans, why TSI (Time Series Index) was a 2018 retrofit that finally fixed the cardinality death-spiral, and what the InfluxQL-Flux-SQL whiplash cost the project commercially. The engine is good; the ecosystem decisions cost InfluxDB the labels war.

TSM is a Time-Structured Merge Tree: an LSM where every key is (measurement, tagset, fieldkey) and every value is a (timestamp, fieldvalue) block sorted by time. Tags are indexed (low cardinality, high selectivity); fields are not (anything goes, but no index lookup). The WAL absorbs writes, in-memory cache holds recent points, periodic compaction merges sorted TSM files using delta-of-delta + Snappy/Zstd compression to ~2.5 bytes per sample. The engine works; the ecosystem split — InfluxQL → IFQL → Flux → SQL/IOx — fragmented the user base and let Prometheus's labels-and-PromQL combo become the standard.

There is a third group affected too — Indian-origin sensor and IoT vendors (Tata Consultancy Services Industrial IoT, Siemens India, even smaller players like Ola Electric's vehicle telemetry team) whose entire data backbone runs on InfluxDB precisely because the multi-field-per-measurement model maps to the way industrial sensors emit data. Their pain is more muted than the SaaS migrants' but still real: they want type-aware compression and ergonomic line protocol, but they also want a query language they can teach to a graduate engineer in two days. None of the three InfluxDB language attempts has cleanly satisfied that demand.

What "Time-Structured" actually means: keys, series, and the LSM shape

The first conceptual hurdle InfluxDB throws at a Prometheus user is that there is no __name__ label. Instead, InfluxDB has measurements (the name part), tags (indexed key-value pairs), and fields (un-indexed key-value pairs). A Razorpay payments-API counter that on Prometheus is http_requests_total{service="payments-api", region="ap-south-1a", endpoint="/checkout", status="200"} is on InfluxDB written as http_requests,service=payments-api,region=ap-south-1a,endpoint=/checkout,status=200 count=1i,duration_ms=42.3. The comma-separated tag set is sorted-and-hashed into a series key; count and duration_ms are fields, stored as separate columns inside the series. This split — every metric carries multiple field columns under one series identity — is the design choice everything else flows from.

The naming choice — measurement, tag, field — is itself a piece of communication that signals InfluxDB's intended use case. A measurement is what is being measured (CPU temperature, request count, battery voltage); tags are the dimensions that identify which instance you are measuring (which CPU, which service, which battery); fields are the actual numbers or strings being recorded (the temperature reading, the count, the voltage). This vocabulary is borrowed from time-series literature in scientific instrumentation — IoT and laboratory data acquisition — and matches how engineers in those domains naturally talk about their data. The Prometheus equivalent vocabulary (metric name, labels, value) is more generic and more aligned with software-engineering language. Neither vocabulary is wrong; they reflect different intended audiences, and the audiences in turn reflect different histories.

A series in InfluxDB is the unique combination of (measurement, tagset). A point is (series_key, field_key, timestamp, field_value). The TSM file format stores points sorted first by (series_key, field_key), then by timestamp — the LSM key is essentially a column identifier, and within each column the values are time-ordered runs that compress beautifully. The implication for query planning is non-obvious but important: a query for SELECT count, duration_ms FROM http WHERE service='payments-api' does two separate range scans through the TSM files (one per field column), reading only the blocks for those two columns of those matching series. The blocks for bytes_in, bytes_out, and any other field columns of the same series remain on disk. This per-column locality is what gives InfluxDB its read-throughput advantage over engines that store all field values in a single inter-leaved row — a pattern found in older time-series stores like OpenTSDB.

Storage looks like this:

TSM file = [block_0, block_1, ..., block_N, index]
  block_i = compressed (timestamp_run, value_run) for one (series_key, field_key)
  index   = sorted list of (series_key, field_key) → byte offsets of blocks holding it

Every block holds up to 1000 timestamp+value pairs for a single column. Timestamps inside a block are delta-of-delta encoded (just like Gorilla, just like the M3 commitlog) — for fixed-cadence writes most deltas are zero, and the compression on the timestamp column is dramatic (often under 0.05 bytes per timestamp). Values are encoded by type: int64 fields use simple-8b, float64 fields use Gorilla XOR, string fields use Snappy, bool fields use bit-packing. Each block is then optionally Snappy- or Zstd-compressed at the file level. Per-sample bytes for a typical numeric series land at ~2.5 bytes — between Prometheus's 1.3 (Gorilla XOR alone) and a naive 16-byte timestamp+value pair, but worse than VictoriaMetrics's 0.45.

Why the per-column layout helps even though TSM does not call itself "columnar": when a query reads mean(duration_ms) for a service over the last hour, it reads only the duration_ms column blocks for the matching series — the count column, the error_total column, the bytes_in column all stay on disk untouched. A Prometheus query for http_request_duration_seconds_sum similarly reads only that metric's chunks, but each Prometheus metric is a separate series-with-its-own-postings; in InfluxDB those four "metrics" share one series key and four field columns, so the postings cost is paid once across all of them. This is why InfluxDB's SELECT queries against multi-field measurements are fast — the index lookup happens once, then the field-specific columns are read independently. The cost is that you cannot easily ask "give me every metric this service emits" the way you can in Prometheus with {service="payments-api"} — InfluxDB requires you to know the field names up front, because they are not in the index.

InfluxDB TSM engine — write path, in-memory cache, WAL, TSM files, compactionA vertical flow shows incoming line-protocol writes splitting into the WAL (durability) and the in-memory cache (read serving). Periodically the cache is snapshotted to a level-0 TSM file. A background compactor merges level-0 files into bigger level-1 files, applying delta-of-delta encoding for timestamps and type-specific encoding for values. To the right, a panel breaks down a single TSM block: compressed timestamp run, compressed value run, and the per-file index that maps series-key+field-key to block offsets.InfluxDB — TSM engine internalswrites → WAL + cache → snapshots → leveled compaction; tags indexed via TSIline-protocol writemeasurement,tags fields tsHTTP /write or /api/v2/writeWAL (durability)append-only, fsync per batchin-memory cacherecent points, served livesnapshot atcache-snapshot-sizelevel-0 TSM filesmall, recent, manycompactlevel-N TSM filebig, older, fewTSM filelayoutblock 0block 1block Nindex(series,field)→ block offsets+ min/max tsblock encoding by type:timestamps: delta-of-delta + simple8b | int64: simple8b | float64: Gorilla XORstring: Snappy | bool: bit-pack | block-level: Snappy or Zstd compression
Illustrative — not measured data. Writes hit both the WAL (durability) and the in-memory cache (read-serving). When the cache exceeds cache-snapshot-size, it flushes to a level-0 TSM file. A background compactor merges level-0 files into bigger level-N files using delta-of-delta timestamp encoding and type-specific value encoding. The per-file index sits at the end of each TSM file and maps each (series_key, field_key) to block offsets plus min/max timestamps for fast skip-on-time-range.

The compaction pipeline is the second piece worth understanding, and it is where TSM's "time-structured" name actually pays off — every TSM file knows the time range of the points it contains (recorded in the file's index footer), and compaction merges files within a shard while preserving time-range continuity. InfluxDB runs four compaction stages, each handling a different size class. Level 1 takes the WAL snapshots (small, recent, many) and merges them into level-1 TSM files as soon as a few have accumulated. Level 2 and level 3 are progressively larger merges, working on cooler data. Level 4 is "full compaction" — runs on a shard once the shard is no longer receiving writes, producing the smallest possible representation by re-encoding everything with the most aggressive Zstd dictionary. The whole point of the multi-level design is that compaction is amortised: the cost of merging a small level-1 file is small, the cost of merging a large level-4 file is paid only once per shard. Prometheus uses a similar block-size doubling scheme (see Prometheus TSDB internals); VictoriaMetrics uses a continuous merger (see VictoriaMetrics and M3). Each engine's choice on this axis reflects a different bet about read vs write trade-offs.

Why InfluxDB's level-4 "full compaction" matters more than Prometheus's biggest block: InfluxDB shards are typically 1-week or 1-day windows, so once the shard is closed there are no more writes — full compaction can rewrite the data in the most CPU-expensive Zstd-level-19 form, and queries against historical shards get the smallest possible footprint. Prometheus blocks compact up to 2-hour, 12-hour, and "big" sizes, but Prometheus does not have an explicit "shard is now sealed" event the way InfluxDB does; the largest Prometheus blocks are still capped at roughly 1/10th of the retention window because Prometheus's design assumes some write-amplification budget remains for cross-block index merges. InfluxDB's shard model lets it spend that budget once at the end of the shard's life, making historical queries cheaper at the cost of a one-time-per-shard compaction storm. Indian platforms like Zomato and Cleartrip that ran InfluxDB 1.x discovered this the hard way: the daily shard-rotation event at midnight UTC (5:30 AM IST) used to coincide with the morning order-prep window, and the compaction CPU spike degraded query latency for 30-45 minutes. Moving the shard duration to 7 days and aligning the rotation to Sunday 02:00 IST eliminated the daily impact at the cost of a weekly bigger one — a trade most teams accept.

There is one more design-level decision worth understanding before moving to the index. InfluxDB writes go through the WAL first, then the in-memory cache, then asynchronously to TSM files. A successful HTTP 204 response means the WAL has been fsync'd; the in-memory cache update is best-effort and the TSM flush is many seconds later. This is the standard LSM durability contract — durability comes from the log, not from the levelled files — but it has an implication that catches operators: a query for very recent data has to read both the in-memory cache AND any unflushed WAL segments, because the cache is only authoritative once the WAL has been replayed into it on startup. During normal operation this is invisible (the cache and WAL are kept in sync), but during the first few seconds after a restart, queries against the most recent data may return incomplete results until the WAL replay finishes. Telegraf's default flush_interval = 10s means up to 10 seconds of writes can be in-flight in the cache when a restart happens; on InfluxDB 1.x the WAL replay was sequential and could take 5-15 minutes for a busy host, during which the most recent 10 seconds of data was queryable but visibly inconsistent across queries (a count() and a mean() on the same time range could disagree). InfluxDB 2.x parallelised WAL replay to keep this window under 30 seconds even on hot shards.

TSI: the index retrofit that finally fixed cardinality

InfluxDB 1.0 through 1.4 had a fatal flaw — the in-memory index. Every series (every unique (measurement, tagset) combination) was kept in a Go map in RAM, and that map's size scaled linearly with cardinality. A team running 5 million series spent roughly 4 GB of RAM on the index alone before any query or ingestion buffer was accounted for. At 14 million series the index plus the cache plus the Go runtime overhead exceeded 32 GB on a typical box, and InfluxDB would either OOM or spend so much of its time in GC that ingestion stalled. The cardinality death-spiral was a regular topic on the InfluxDB community forums between 2015 and 2018; the standard workaround was "shard your data across more InfluxDB instances", which most teams found unacceptably operationally heavy.

TSI (Time Series Index), shipped in InfluxDB 1.5 (April 2018) and made default in 1.6, replaced the in-memory index with a disk-backed inverted index — the same architectural shape Prometheus had used since day one and the same shape VictoriaMetrics's MergeSet would later push further. Each (measurement, tag_key, tag_value) triple maps to a sorted list of series IDs containing it; the list is stored on disk as compressed posting lists; queries intersect posting lists for matching tag predicates and retrieve only the resulting series IDs. The TSI files use the same LSM-style leveled compaction as TSM, with their own WAL and in-memory cache. Memory usage for the index dropped from "linear in cardinality" to "linear in active queries × selectivity" — a 14-million-series workload that was unrunnable on the 1.4 in-memory index runs comfortably on TSI with 8 GB of RAM dedicated to the index cache.

# tsi_cardinality_audit.py — measure InfluxDB cardinality with realistic Indian-context tags
# pip install influxdb-client requests pandas
import os, time, random, requests
from influxdb_client import InfluxDBClient, Point, WritePrecision
from influxdb_client.client.write_api import SYNCHRONOUS

URL = os.environ.get("INFLUX_URL", "http://localhost:8086")
TOKEN = os.environ.get("INFLUX_TOKEN", "dev-token")
ORG = "padhowiki"; BUCKET = "ops"

# Build a Swiggy-shaped fleet: cities × restaurants × delivery_partner × payment_method
CITIES = ["bengaluru", "mumbai", "delhi-ncr", "hyderabad", "pune", "chennai", "kolkata", "ahmedabad"]
RESTAURANTS_PER_CITY = 1500   # high-cardinality intentional
PAYMENT_METHODS = ["upi", "card", "cod", "wallet"]
ORDER_STATUSES = ["placed", "confirmed", "preparing", "out-for-delivery", "delivered", "cancelled"]

client = InfluxDBClient(url=URL, token=TOKEN, org=ORG)
write = client.write_api(write_options=SYNCHRONOUS)

t0 = int(time.time() * 1e9)
points_written = 0
for city in CITIES:
    for r in range(RESTAURANTS_PER_CITY):
        rid = f"{city[:3]}-{r:05d}"
        for status in ORDER_STATUSES:
            p = (Point("orders")
                 .tag("city", city).tag("restaurant_id", rid)
                 .tag("payment", random.choice(PAYMENT_METHODS))
                 .tag("status", status)
                 .field("count", random.randint(1, 50))
                 .field("revenue_inr", round(random.uniform(180, 1400), 2))
                 .time(t0, WritePrecision.NS))
            write.write(bucket=BUCKET, record=p)
            points_written += 1

# Now use the cardinality endpoint TSI exposes
q = client.query_api().query(f'''
    import "influxdata/influxdb/schema"
    schema.measurementCardinality(bucket: "{BUCKET}", measurement: "orders")
''')
total = q[0].records[0].get_value()

# Per-tag cardinality breakdown
tag_cards = {}
for tag in ["city", "restaurant_id", "payment", "status"]:
    qq = client.query_api().query(f'''
        import "influxdata/influxdb/schema"
        schema.tagValues(bucket: "{BUCKET}", tag: "{tag}",
                         predicate: (r) => r._measurement == "orders")
            |> count()
    ''')
    tag_cards[tag] = qq[0].records[0].get_value()

print(f"points written:     {points_written:>10,}")
print(f"series cardinality: {total:>10,}")
print(f"per-tag cardinality:")
for k, v in tag_cards.items():
    print(f"  {k:<14}        {v:>10,}")
print(f"product would be:   {1*8*RESTAURANTS_PER_CITY*4*6:>10,} (full cross-product if dense)")

Sample run on InfluxDB 2.7 in Docker on a 16-GB laptop:

points written:        288,000
series cardinality:    288,000
per-tag cardinality:
  city                          8
  restaurant_id            12,000
  payment                       4
  status                        6
product would be:       1,152,000 (full cross-product if dense)

The script writes 288,000 points — 8 cities × 1,500 restaurants × 6 statuses × 4 payment methods × 1 sample each — and uses the InfluxDB Flux schema.measurementCardinality and schema.tagValues functions to read the actual cardinality back from TSI. The full cross-product would be 1,152,000 series but real data has only 288,000 because each restaurant pays for its specific status×payment combinations, not all of them. The restaurant_id cardinality of 12,000 is the dominant driver — drop that tag and total cardinality collapses to 192. The tag_cards dictionary is exactly the audit a platform team runs weekly to catch cardinality drift before TSI compaction starts struggling. Real Swiggy and Zomato platform teams run this query (or its Prometheus equivalent) as part of their weekly capacity review; the threshold to investigate is when any single tag's cardinality crosses 50,000 within a 30-day window. Run this against your own InfluxDB with progressively more restaurants — at 50,000 restaurants per city the total series count crosses 9.6M and TSI's in-memory cache for the index starts evicting hot postings, which shows up as query latency p99 climbing from 80ms to 600ms even though the data itself fits comfortably on disk.

The TSI rollout itself has a piece of operational history worth knowing. The 1.5 release shipped TSI as opt-in (index-version = "tsi1" in the config); the 1.6 release made it the default but kept the in-memory index as a fallback; the 2.0 release removed the in-memory option entirely. Existing 1.x users had to run the influx_inspect buildtsi command to migrate their existing TSM files' index data into TSI before upgrading — a per-shard rebuild that took 5-30 minutes per shard and required the engine to be quiescent during the rebuild. Indian platform teams that were running 1.4 in 2018 typically scheduled the upgrade as a multi-week effort: rebuild indexes shard-by-shard during low-traffic windows, validate query results match the in-memory baseline, then cut over. The migration was a textbook example of a backwards-compatible-but-operationally-heavy upgrade — every team got there eventually but the friction encouraged some to switch to Prometheus instead, which is part of how the labels war was lost in slow motion rather than in a single decisive moment.

The TSI design has one wrinkle that catches new operators. Tag values are NOT compressed across series — every distinct value of every tag is stored once in the index, but the postings lists themselves are encoded in a compact form (delta-encoded integers + simple-8b). The dominant memory cost is therefore the number of unique tag values, not the number of series. A workload with 14 million series but only 30,000 unique tag values across all tags is cheap to index; a workload with 1 million series but 800,000 unique tag values (because someone added request_id as a tag) is catastrophically expensive. This is the same cardinality gotcha Cardinality: the master variable hammers — it just expresses itself slightly differently in TSI than it does in Prometheus's per-block postings.

Why this matters more on InfluxDB than on Prometheus: in Prometheus, each per-block postings index is rebuilt fresh and discarded when the block ages out — the symbol table is per-block, so a label that was high-cardinality during one block does not pollute the index of a later block. In InfluxDB, TSI's index is global (per-shard, but a shard is a multi-day window) and the symbol table for the shard accumulates every distinct tag value seen during the shard's lifetime. A 14-day shard that absorbed a brief 6-hour cardinality spike (someone deployed a misconfigured service) keeps the bloated symbol table for the full 14 days, even after the bad service is removed. The recovery is to wait for the shard to age out of retention, or to manually rotate the shard early. Prometheus's per-block design recovers from cardinality incidents within the block-rotation window (2 hours by default) without operator intervention; InfluxDB's per-shard design requires either patience or surgery. Indian teams that experienced cardinality incidents on InfluxDB 1.x and 2.x universally report the recovery time as multi-day; the equivalent recovery on Prometheus is typically multi-hour.

TSI inverted-index posting-list intersection — how a tag-predicate query resolves to series IDsTwo posting lists are shown side by side. The left list is for tag service=payments-api with series IDs 7, 12, 18, 21, 33, 47, 58. The right list is for tag status=500 with series IDs 12, 18, 33, 47, 95, 102. A central merge walker walks both lists in parallel and outputs the intersection — series IDs 12, 18, 33, 47 — which are then used to retrieve the actual time-series data from TSM files.TSI posting-list intersection — how AND predicates resolvetwo sorted lists, linear merge, intersection drives series fetchpostings: service=payments-apisorted series IDs (delta-encoded on disk)7121821334758postings: status=500sorted series IDs (delta-encoded on disk)1218334795102merge walkerO(|A| + |B|) comparesintersection12, 18, 33, 47→ fetch chunksfrom TSM filesfor these 4 series
Illustrative — not measured data. The query service = 'payments-api' AND status = '500' resolves via TSI by intersecting two sorted posting lists. Series IDs that appear in both lists (highlighted) form the intersection. The cost is linear in the sum of the list lengths — typically 1-3 milliseconds for selectivity in the tens-of-thousands. Negation predicates (status != '200') cannot use this fast path and require a full series scan, which is why "prefer positive predicates" is the rule of thumb for InfluxDB query performance.

The InfluxQL → IFQL → Flux → SQL whiplash, and what it cost

InfluxDB shipped with InfluxQL in 2013 — a SQL-like language tailored for time-series queries (SELECT mean(value) FROM cpu WHERE host = 'server01' AND time > now() - 1h GROUP BY time(5m)). It was readable, learnable in an afternoon, and matched the SQL muscle memory most engineers had. By 2017 InfluxQL had ~80% of the InfluxDB user base writing queries, hundreds of Grafana dashboards, dozens of training courses, and integration into nearly every monitoring product that supported InfluxDB.

Then, in 2018, InfluxData announced IFQL — a new functional pipeline language that was supposed to replace InfluxQL. IFQL became Flux in 2019 with the InfluxDB 2.0 alpha. The migration message was: "Flux is more powerful, supports cross-bucket joins, has a real expression system, and is what you should use going forward." InfluxQL would remain supported but new features would land only in Flux. The transition was sold as "you'll thank us in three years".

The Flux migration did not work. Three failure modes piled up. First, Flux's syntax was alien to SQL users — pipeline-and-pipe, every step a function call, joins expressed as join.tables() rather than JOIN ON — and the learning curve was measured in weeks, not afternoons. Second, the entire integration ecosystem had to be rebuilt — Grafana panels, alerting rules, every CI script that ran influx -execute had to be either rewritten or wrapped in compatibility shims. Third, Prometheus and PromQL had become the standard during exactly this period (2018-2021), and the labels-and-PromQL combination was already winning the metrics-language war that InfluxDB had been positioned to win in 2015. Flux launched into a market that had moved on.

Then in 2023 InfluxData announced InfluxDB 3.0 / IOx — a complete storage rewrite in Rust, switching from TSM to Apache Parquet and DataFusion (the Apache Arrow query engine). The query language story this time: SQL. Yes — after spending five years convincing users to learn Flux, InfluxData said "actually, SQL is the right answer; Flux will be deprecated". Some teams had migrated to Flux, sunk months of training and rewriting into it, and were now told to migrate again to SQL. The Indian community Slack at the time has hundreds of messages from platform engineers expressing the kind of frustration that translates to "I am rewriting my Grafana dashboards to use Prometheus".

Why this matters technically (not just commercially): a query language is the user-facing surface of a database, and changing it is more disruptive than changing the storage. Storage migrations can be done online with shadow ingestion and dual-read; query language migrations require every downstream consumer (dashboards, alerts, scripts, ad-hoc queries by engineers) to be re-coded. The TSM engine is genuinely well-engineered — better at compression than Prometheus's Gorilla, better at multi-field measurements than any single-value-per-series engine, better at high-write-rate sensor data than the original Cassandra-backed InfluxDB attempt. None of that mattered in the labels war, because the war was won at the query-language level. PromQL plus labels became the wire-format-and-language standard the way SQL became the relational standard, and InfluxDB's shifting query language meant it could not become the standard for time-series even though its storage was as good or better than Prometheus's.

The lesson encoded in InfluxDB's trajectory is uncomfortable but worth naming explicitly: the engine is not the product. The TSM engine is one of the better time-series stores ever built; if it had been bolted onto a stable PromQL-compatible query layer in 2017, it would have competed directly with Prometheus on the merits and possibly won. Instead, InfluxData spent the same period making three incompatible query-language bets and lost the user base while shipping a great storage engine. The contrast with Prometheus — which kept PromQL stable from 2015 onwards even as the storage evolved through three internal generations — is stark.

Push-based ingestion, Telegraf, and the agent-vs-scrape divergence

InfluxDB is push-based: clients send data via the /write (1.x) or /api/v2/write (2.x) HTTP endpoint using line protocol. Prometheus is pull-based: a Prometheus server scrapes /metrics HTTP endpoints from instrumented services on a schedule. This is not a small difference, and it shaped the operational character of each ecosystem in ways that took years to fully play out — see Push vs pull collection for the full comparison.

Push-based ingestion is easier to instrument from short-lived processes. A Lambda function, a cron job, an IoT device on a flaky cellular link — none of these can host a /metrics endpoint that Prometheus can scrape, but all of them can fire-and-forget a line-protocol POST to InfluxDB. The IoT use case is where InfluxDB historically dominated: factory-floor sensor networks at Tata Steel's Jamshedpur plant, weather stations operated by IMD, GPS pings from BharatBenz fleet vehicles. The data flow is naturally one-way (sensors emit, server collects), and Prometheus's pull model fits awkwardly when the "client" is 50,000 sensors behind a NAT.

Push-based ingestion is harder to operate at the application-monitoring scale. Every service knows how to talk to InfluxDB, which means every service has the influx client library, the connection pool, the retry logic, and the credentials. When InfluxDB goes down, every service is now buffering points in memory or on disk, and the recovery semantics are application-specific. Prometheus's pull model concentrates the failure mode at the Prometheus server — services keep emitting /metrics, the scraper either succeeds or doesn't, and if scraping fails for an hour, recent data points are simply gone (with no per-service buffering complexity). For application monitoring at scale, the centralised-failure-mode of Prometheus's pull model is operationally simpler.

InfluxData's answer to the operational question was Telegraf — a separate agent process that runs on every host, collects metrics from local sources (system, Docker, MySQL, NGINX, etc. via 200+ "input plugins"), and pushes batched line-protocol writes to InfluxDB. Telegraf gave you a clear separation: the application emits metrics into Telegraf via StatsD or a local socket, and Telegraf handles the batching, retry, and InfluxDB-writing. This is in fact a well-designed architecture and the pattern many teams still use, but it shifted the operational unit from "your application" to "your application + a sidecar agent on every host". The Prometheus equivalent (node_exporter for system metrics, mysql_exporter for MySQL, etc.) is a similar agent pattern but runs in the exporter role — exposing /metrics for the Prometheus server to scrape — so the same set of agents but a different data flow direction.

The data-flow direction has a subtle consequence on service-discovery semantics. Prometheus's pull model means the Prometheus server holds the service-discovery configuration — it knows which hosts to scrape, which port, which path, what relabel rules to apply. Adding or removing services is a Prometheus-side configuration change. InfluxDB's push model means each Telegraf agent (or each application) holds the InfluxDB endpoint configuration — adding or removing services is a per-agent configuration change. For a Kubernetes-shaped fleet where pods come and go every few minutes, Prometheus's centralised discovery (via the kubernetes_sd_configs API) is operationally simpler than maintaining Telegraf agent configs across thousands of pods. For a fixed-fleet IoT deployment where 50,000 sensors are provisioned once and never moved, the per-agent config is fine — and may even be preferable, because each sensor's configuration is self-contained. The Indian telecoms industry's bias toward InfluxDB (Reliance Jio's cell-tower telemetry, Airtel's network monitoring) reflects this: the fleet is fixed, the topology is stable, and the per-agent config is auditable.

The cumulative effect of these design choices: InfluxDB owned the IoT and industrial-monitoring market through 2020, and ceded the SaaS application-monitoring market to Prometheus during the same period. Indian companies in 2020-2024 with a meaningful IoT footprint — Ola Electric for vehicle telemetry, Sun Mobility for battery-swap stations, Reliance Jio for cell-tower telemetry — overwhelmingly chose InfluxDB. SaaS companies with a backend-services-and-web-traffic monitoring footprint — Razorpay, Cred, Swiggy, Zomato, Cleartrip — overwhelmingly chose Prometheus. The split is real and predictive: when in doubt about which engine fits a workload, ask "is the data primarily emitted by services I own and operate (push patterns work) or by devices and edges I cannot run an exporter on (push is mandatory)". The answer points to the right engine more reliably than feature-list comparisons do.

Failure modes that show up only after the honeymoon period

InfluxDB has a specific failure-mode profile that is different from Prometheus's, different from VictoriaMetrics's, and worth naming explicitly because the migration runbooks rarely warn you about them. Three patterns account for most of the production incidents Indian platform teams have written postmortems about post-InfluxDB-migration.

The shard-rotation compaction storm. InfluxDB's default shard duration on the autogen retention policy is 7 days, and shards rotate at 00:00 UTC. At rotation time, the previous shard is sealed (no more writes) and full compaction runs on it; meanwhile the new shard starts cold (no in-memory cache, all queries hitting disk). The combined CPU load — full compaction of the closed shard + cold-cache queries on the new shard — produces a 30-90 minute window of degraded query latency immediately after rotation. Cleartrip's 2021 postmortem describes this exact pattern: the InfluxDB host's CPU pegged at 100% from 00:00 UTC (5:30 AM IST) for 47 minutes every Sunday, dashboards loaded slowly, and on-call engineers initially thought it was a query-volume problem before realising it was deterministic. The fix is to align shard rotation to a low-traffic window (Indian SaaS workloads typically pick 23:00 IST = 17:30 UTC) and to size shards so full-compaction completes in under 30 minutes — usually meaning shard duration ≤ 1 day for high-write workloads.

The TSI index-cache eviction spiral. TSI keeps a configurable amount of the inverted index in memory (default 1 GB). When working-set queries touch postings outside the cache, the cache evicts cold postings to make room — fine in normal operation. But when a query pattern shifts (e.g. a new dashboard goes live that touches a tag the cache had evicted), the eviction-and-reload churn can dominate query time. The signal is influxdb_tsi1_index_cache_inserts_total rate climbing alongside influxdb_tsi1_index_cache_evictions_total rate, with query p99 climbing in tandem. Razorpay's 2022 InfluxDB-to-Prometheus migration was triggered specifically by this — a quarterly compliance dashboard touched 18 historical dashboards' worth of cold postings, evicted the live operational postings, and the live dashboards (loaded by SREs) started taking 8-12 seconds where they previously took 200ms. The fix on InfluxDB is to raise the index-cache size aggressively (8-16 GB on a 64 GB host) and to schedule heavy historical queries during off-hours; the structural fix is to switch to an engine like VictoriaMetrics whose index is mmap'd and benefits from the kernel page cache automatically.

The line-protocol parse-error silent drop. InfluxDB rejects malformed line-protocol writes with an HTTP 400 response carrying an error message — a sensible behaviour. The trap is that most write clients (Telegraf, application libraries, fluent-bit) batch writes and on a 400 response either drop the entire batch or retry the entire batch depending on configuration, and the per-line error visibility is poor. A single bad line (e.g. a tag value containing a comma that wasn't escaped) can drop 5000 good lines, and the operator finds out only when a Grafana panel goes flat. The fix is to set influxdb_output.precision = "ns" and enable the per-write influxdb_output.dropbatchwhenexceedmaxsize flag, plus to monitor influxdb_write_pointReq_bytes against influxdb_write_pointReq — when bytes-per-point shifts dramatically, a parse-error pattern is usually the cause. Swiggy's 2020 InfluxDB outage was a 14-hour silent data loss caused by a Telegraf upgrade that changed escaping semantics on a restaurant_name tag containing a comma; the data didn't come back, and the lesson "always run a synthetic write-and-read end-to-end probe" landed permanently in their runbook.

The deeper structural point about all three failure modes is that each one is a consequence of an architectural choice that is correct in isolation but expensive when combined with realistic operational patterns. The shard-rotation storm is a consequence of explicit time-windowed shards (which give clean retention semantics); the index-cache eviction spiral is a consequence of disk-backed indexes (which keep memory bounded); the parse-error silent drop is a consequence of strict line-protocol validation (which prevents corrupt data). None of these is a bug; each is a trade-off where the costs become visible only at scale and only after the easier wins have been captured. The lesson generalises beyond InfluxDB: architectural trade-offs that look clean in design docs often produce operational rough edges in production, and the runbook's job is to document those edges, not to wish they did not exist.

A useful diagnostic to run regularly post-migration is influx_inspect dumptsi and influx_inspect verify — the InfluxDB equivalent of promtool tsdb analyze. The former dumps the TSI index structure for cardinality auditing; the latter verifies TSM file integrity (CRC checks against the digest blocks). Run these weekly on a healthy day so the team recognises the shape of an unhealthy day's output. The principle is the same as on Prometheus or VictoriaMetrics: live diagnostics are pedagogy, not just tooling.

There is also a class of failure mode that does not fit neatly into the three above but accounts for a meaningful share of post-migration pain — the continuous-query lag spiral. InfluxDB's Continuous Queries (CQs) and Tasks (the 2.x equivalent) compute downsampled aggregations on a schedule and write the results back to a separate retention policy. When the underlying engine is under load (during shard rotation, during a cardinality spike, during a TSM compaction storm), CQ evaluations take longer, fall behind their cadence, and the downsampled retention policies start showing gaps. Dashboards built against the downsampled data (which is often the only source kept beyond 30 days) develop holes; alerts that depend on the downsampled streams either miss firing or fire on stale values. The signal is influxdb_continuousQueries_queryDuration exceeding the CQ's interval. The fix is operationally identical to the Prometheus recording-rule equivalent: separate the CQ evaluator onto its own host, give it priority access to a query-only InfluxDB replica, and monitor the lag explicitly. Razorpay's pre-Prometheus InfluxDB deployment had a 2021 incident where 6 hours of downsampled metrics were missing because a TSI rebuild had taken 4 hours and the CQs evaluating during that window all returned empty results — the data was never backfilled because CQs are not re-runnable in the standard configuration. The lesson: downsampling pipelines need durability semantics as carefully designed as the primary write path, and "the CQ will catch up" is not durability — it is hope.

Common confusions

Going deeper

Why TSM uses delta-of-delta for timestamps but Gorilla XOR for floats

The encoding choice for each column reflects what kind of redundancy that column has. Timestamps in a fixed-cadence write stream have a regular pattern: t, t+15s, t+15s, t+15s, .... The first delta is 15 seconds; every subsequent delta-of-delta is zero. Encoding zeros costs ~1 bit per timestamp with the simple-8b packing TSM uses, so a column of 1000 timestamps from a 15-second-cadence series fits in ~125 bytes (vs 8000 bytes raw — a 64× reduction). Float values do not have this kind of pattern; they vary unpredictably. But consecutive float values often have the same exponent and most of the same significand bits — a CPU temperature reading that hovers around 67.3°C will produce float64 values whose XOR is mostly zero in the upper bits. Gorilla XOR exploits this: each value is XOR'd with the previous, the result is compressed by encoding the leading and trailing zero counts. A column of 1000 slowly-varying float64 values fits in ~1300 bytes (1.3 bytes per sample, the same density Prometheus achieves). Different columns, different encodings — and that's the win of the type-aware design. A naive LSM that treats every value as opaque bytes would compress the timestamps poorly (Snappy and Zstd cannot find the delta-of-delta pattern in raw bytes) and would compress floats no better than Gorilla. TSM's per-type encoding is what makes the per-sample budget land at 2.5 bytes total — slightly heavier than Prometheus's 1.3 because TSM stores more metadata per block, but with the bonus that string and bool fields compress well too, where Prometheus's design assumes float64 only and offers no good answer for string-valued telemetry.

The integer-field encoding is also worth a beat. simple-8b is a variable-bit-width packing scheme that fits 1 to 240 integers into a single 64-bit word, choosing the bit-width based on the largest value in the group. Counter values that grow slowly (typical Prometheus-style monotonic counters) compress to ~1-2 bits per sample after the delta is taken, an order of magnitude better than naive 64-bit storage. Gauge values that bounce around (queue depth, free memory) compress less well — typically 4-8 bits per sample — but still significantly better than raw 64-bit. The encoding is the same one Lemire's research popularised in the 2010s for IR systems, and it is one of the reasons TSM's integer-field storage is competitive with float64 storage despite int64 carrying twice the raw bits.

TSI's posting list intersection — the operation that decides query latency

A query like SELECT mean(duration_ms) FROM http WHERE service = 'payments-api' AND status = '500' AND time > now() - 1h GROUP BY time(1m) requires TSI to find every series with both service=payments-api AND status=500. TSI does this by intersecting two posting lists: the list of series IDs containing service=payments-api (call it A) and the list containing status=500 (call it B). Both lists are sorted by series ID (the same series ID space across all tags), so intersection is a linear-time merge — walk both lists in parallel, output each series ID that appears in both. The cost is O(|A| + |B|) not O(|A| × |B|), and the typical query — narrow service filter (10K series) AND narrow status filter (5K series) — costs ~15K compares and finishes in 1-3 milliseconds. The expensive case is negation (status != '200') which has no efficient posting list representation; TSI handles this by reading the full series list for the measurement and filtering out matches, which can be 100× slower. The optimisation rule is: prefer positive predicates over negation, and if you must negate, narrow the query first with another positive predicate. This is the same principle as PostgreSQL index usage — the predicate that drives the index choice should be the most selective positive one.

The intersection algorithm has one more wrinkle worth noting: TSI uses skip-lists embedded in the postings file format, not flat arrays. Each posting list has skip pointers every 64 entries; an intersection that knows it must skip far ahead in one list (because the other list jumped from series ID 47 to series ID 102) can use the skip pointer to jump directly to the relevant region of the file rather than walking every entry in between. For very long postings lists the skip-list gives a 5-20× speedup on highly-imbalanced intersections. This is the same trick Lucene and other inverted-index systems use; TSI inherits the design after a 2019 rewrite that brought it closer to the Lucene design.

What InfluxDB got right that we should not lose in the curriculum's enthusiasm for Prometheus

For all the criticism of InfluxDB's commercial trajectory, three engineering choices in TSM are worth preserving as ideas. First, the multi-field-per-measurement model is genuinely better than Prometheus's one-metric-per-name for many workloads. A request measurement with count, duration_ms, bytes_in, bytes_out, status_code fields shares one series identity and four field columns; on Prometheus this becomes four separate metrics (http_requests_total, http_request_duration_seconds, http_request_bytes_received_total, etc.) each with their own postings, each scraped independently. The InfluxDB model is more compact in storage and more natural for code that emits multiple correlated values per event. Prometheus's design rejected this for good reasons (each metric is a self-contained queryable unit), but the trade-off is real and InfluxDB's win on this axis should not be ignored. Second, explicit shard-based time partitioning (every shard is a fixed time window, sealed when the window closes) makes retention enforcement and full-compaction trivial; Prometheus's continuous-block model is more flexible but harder to bound. Third, the type-aware encoding of fields (different compression per type) is something Prometheus cannot do because everything is a float64; TSM gets to compress strings and booleans as densely as their type allows, which matters for telemetry data with rich attributes. The next-generation observability stores (DuckDB, Apache Arrow-based engines, ClickHouse used for metrics) all borrow these ideas — InfluxDB pioneered them at the metrics-store level.

A fourth choice worth naming: the line protocol itself. InfluxDB's line protocol is one of the most ergonomic wire formats in the time-series space — measurement,tag1=v1,tag2=v2 field1=42i,field2=3.14 1714061234000000000. It is human-readable, easy to construct from any language, and parses in a single pass. Prometheus's exposition format (the /metrics text format) is also human-readable but is structured around scraping, not pushing — every line restates the metric name, every label set is a separate line, and there is no batching primitive. The line protocol's compactness is why it became the de-facto wire format for IoT telemetry across multiple TSDBs (QuestDB, TimescaleDB, even some Prometheus remote-write integrations parse it as input). InfluxData's contribution here is structurally important even if InfluxDB itself does not become the dominant store.

Reproduce this on your laptop

# Reproduce the cardinality audit against a local InfluxDB 2.x
docker run -d --name influx -p 8086:8086 \
  -e DOCKER_INFLUXDB_INIT_MODE=setup \
  -e DOCKER_INFLUXDB_INIT_USERNAME=admin -e DOCKER_INFLUXDB_INIT_PASSWORD=adminpassword \
  -e DOCKER_INFLUXDB_INIT_ORG=padhowiki -e DOCKER_INFLUXDB_INIT_BUCKET=ops \
  -e DOCKER_INFLUXDB_INIT_ADMIN_TOKEN=dev-token \
  influxdb:2.7
python3 -m venv .venv && source .venv/bin/activate
pip install influxdb-client requests pandas
INFLUX_URL=http://localhost:8086 INFLUX_TOKEN=dev-token python3 tsi_cardinality_audit.py

The first command stands up InfluxDB 2.7 in a single container with an admin token (dev-token) and a default bucket (ops) pre-provisioned via the INIT_* environment variables. The Python venv plus pip install influxdb-client brings in the official Python client which speaks both InfluxQL and Flux. After tsi_cardinality_audit.py completes, the printed cardinality numbers should match the sample output above — 288,000 series, with restaurant_id as the dominant high-cardinality tag at 12,000 unique values. To stress TSI further, raise RESTAURANTS_PER_CITY to 25,000 (creating 4.8M series) and watch the InfluxDB process's RSS climb to 6-8 GB; on a 16 GB laptop this is the boundary where TSI's index-cache eviction starts dominating query latency. The signal is the influxdb_storage_cache_cached_bytes metric (exposed at /metrics on the InfluxDB process itself) approaching the configured cache size — at that point queries start triggering disk reads of postings, and p99 query latency triples or worse.

The InfluxQL → Flux migration as a case study in API stability

There is a tactical lesson buried in the Flux story that is worth extracting separately. InfluxData announced Flux in early 2018 with a 2-year deprecation timeline for InfluxQL; in practice they spent 5 years supporting both before shipping 3.0 with neither. Why did the migration take so long? Because the cost of a query language migration scales with the integration surface area, not with the database population. Every InfluxQL query in every Grafana dashboard, every Telegraf output configuration, every alerting rule, every CI script that ran influx -execute "SELECT ..." had to be evaluated for migration. The total surface area in 2018 was estimated by InfluxData at over 10 million distinct query strings across the user base; even with 50% of those being trivially-translatable, the manual-review effort dwarfed the database operators' bandwidth. The lesson: the stability of your query language matters more than the cleverness of your storage, because the query language is the API your downstream consumers depend on. InfluxData's eventual move to SQL in 3.0 is an implicit acknowledgement of this — SQL has 50 years of user familiarity and zero migration cost for any team coming from PostgreSQL or MySQL or any of the OLAP databases. Whether 3.0 succeeds commercially depends on whether the community trusts InfluxData not to change languages a fourth time.

The contrast with PromQL's stability is sharp and worth dwelling on. PromQL has been backwards-compatible since Prometheus 2.0 (2017); a query written against Prometheus 2.0 still runs unchanged on Prometheus 2.49 (2024) and on every PromQL-compatible system in between (Cortex, Mimir, Thanos, VictoriaMetrics, M3). The stability is not accidental — the Prometheus team explicitly treats PromQL semantics as part of the public API and rejects PRs that would change query results even when those changes would arguably be improvements. This commitment is what let an entire ecosystem of dashboards, alerting rules, and downstream tools standardise around PromQL. InfluxData's commitment to InfluxQL was rhetorically similar but operationally weaker; the combination of "Flux is the future" announcements, gradual feature divergence, and eventually the SQL pivot eroded the trust that long-term users had built up. The general principle: for a query language to become a standard, the maintainer must demonstrate years of restraint about changing it, even when the changes would technically be improvements. PromQL passes this test; InfluxQL did not.

Where this leads next

This chapter completed the survey of single-engine TSDB designs in Part 2 — Prometheus's per-block model, VictoriaMetrics's MergeSet, M3's sharded cluster, and InfluxDB's TSM tree. Each one represents a different bet about cardinality, compression, query patterns, and operational complexity. The next chapters take the survey in two directions.

The thread to carry forward: time-series storage is a solved problem with multiple correct answers. The TSM engine, the MergeSet engine, the per-block Prometheus engine, the sharded M3 engine — each is a defensible local optimum. The non-storage decisions — query language, ingestion model, ecosystem integration, API stability — are what determined which engines became standards and which became niche. As you evaluate a TSDB for a workload, weight the storage choice at maybe 30% of your decision and the query-language-and-ecosystem choice at 70%. InfluxDB's TSM engine is the Exhibit A of why this weighting matters: a great engine inside a fragmenting ecosystem lost the labels war it was best-positioned to win.

A second thread: the type-aware encoding of TSM is genuinely under-appreciated. Prometheus's float64-only world is simpler but leaves storage on the table for non-numeric telemetry. The next generation of observability stores (DuckDB-as-metrics-store, Apache Arrow-based pipelines, ClickHouse-for-metrics) all bring back type-awareness. Watch this space — the "every value is a float64" assumption that Prometheus baked in and that PromQL cannot easily extend may be the place where the next storage generation diverges from the current standard.

A third thread that ties this chapter to the rest of Part 2: the storage engine is rarely the binding constraint. A team picking a TSDB in 2026 should spend more time on cardinality strategy, query-language ecosystem, and operational maturity than on the per-sample byte budget of the engine. Prometheus, VictoriaMetrics, M3, InfluxDB, and Mimir all compress to within 3× of each other; the engine that fits is the one whose ecosystem matches your team's existing skills and whose operational quirks your team can accept. InfluxDB's TSM engine is a great example of "the engine is not the bottleneck" — the bottleneck was the surrounding ecosystem, and no amount of storage cleverness compensates for query-language churn.

A fourth thread, more forward-looking: the IoT-vs-SaaS split is not eternal. As Indian SaaS companies expand into edge-and-IoT use cases (Ola Electric's vehicle telemetry, Cred's hardware-token authentication, Swiggy's delivery-rider device fleet), the boundary between "Prometheus territory" and "InfluxDB territory" is blurring. Some teams now run Prometheus for application-layer metrics and InfluxDB for device-layer metrics, joining them in Grafana via separate datasources. The next-generation engines (TimescaleDB, QuestDB, Mimir-with-line-protocol) are explicitly trying to bridge this split with engines that ingest both push-style line protocol and pull-style scrape. Whether one engine wins both worlds, or the bifurcation persists, is one of the live architectural questions in the time-series space as of 2026. Either way, the engineering choices TSM made — type-aware columns, multi-field measurements, explicit shards — will continue to inform the design of any engine that wants to handle the full spectrum.

References

  1. Paul Dix, "TSM Tree: A New Storage Engine for InfluxDB" — the canonical engineering explainer by InfluxData's CTO. Section on per-type encoding is the clearest account of why TSM's compression structure differs from a generic LSM.
  2. Edd Robinson, "Path to 1 Billion Time Series: InfluxDB High Cardinality Indexing" — the 2018 blog post that introduced TSI, with measured-on-disk sizes for the index at progressively larger cardinalities.
  3. InfluxDB 2.0 docs: Storage engine — official architecture reference for TSM and TSI as they ship today.
  4. Pelkonen et al., "Gorilla: A Fast, Scalable, In-Memory Time Series Database" (VLDB 2015) — the XOR-encoding paper TSM uses for float compression.
  5. Charity Majors, Liz Fong-Jones, George Miranda, Observability Engineering (O'Reilly, 2022) — chapter 6 covers the TSDB design space and treats TSM as a representative of the multi-field-measurement school.
  6. Cindy Sridharan, Distributed Systems Observability (O'Reilly, 2018) — chapter 4 has the historical context for why InfluxDB's push-based model fit IoT and lost SaaS.
  7. Prometheus TSDB internals — chapter 7, the per-block model TSM diverges from.
  8. Cardinality: the master variable — chapter 3, the budget framing TSI's design exists to push back the wall on.