Wall: metrics without a time-series store are useless
At 02:11 IST on a Saturday, a Swiggy backend engineer named Kiran is staring at a dashboard panel that shows a single number — orders_failed_total = 8,142,901. The panel was added to the dashboard in 2022 because somebody decided "we should track total failed orders". The number is bigger than it was an hour ago. Kiran has no idea whether that is normal, alarming, or routine. There is no rate, no graph, no comparison against last Saturday. The metric exists; the answer does not. The number on the wall is a counter without a clock — a wall, not a window.
Now imagine the same number, but plotted over the last seven days at fifteen-second resolution. The line was flat at ~6 failures per minute for six days, climbed to 14 per minute four hours ago, and is now climbing past 80. Kiran has the answer in three seconds: there is an outage, it started four hours ago, and it is accelerating. Same metric, same emission code, same name. The difference is a time-series database underneath that turned the raw counter into a function of time. This chapter is about why the storage substrate — not the counter — is what makes a metric useful, and why every conversation about observability silently depends on a TSDB that you may not have noticed you were using.
A metric is a function of time, not a number. Without a time-series database, a counter is a single sample that says nothing about rate, change, or comparison — the three questions every dashboard and alert actually asks. The TSDB (Prometheus, VictoriaMetrics, Mimir, M3) is the substrate that converts emitted samples into queryable functions; the choice of TSDB fixes your cardinality cost curve, retention horizon, and query latency, and therefore which dimensional questions you can afford to ask.
What a counter actually is — and what it isn't
A Prometheus counter is the simplest telemetry primitive in production: a single 64-bit float that increments on every event. The prometheus-client Python library exposes it as one line — c = Counter("orders_failed_total", "Failed orders") then c.inc() per failure. The wire format is even simpler: every fifteen seconds, when Prometheus scrapes the /metrics endpoint, it pulls the line orders_failed_total 8142901 along with a server timestamp. That is all the application emits. The application never computes a rate, never holds a window, never compares to yesterday. It only ever increments and exposes.
What turns this single number into something useful is what happens after the scrape. Prometheus stores the (timestamp, value) pair in its TSDB — a column-shaped, append-only structure that compresses successive samples of the same series with Gorilla XOR encoding to ~1.3 bytes per sample. After 24 hours of fifteen-second scrapes, the series has 5,760 samples; after 30 days it has 172,800; after a year it has 2.1 million. Every sample is timestamped, every sample is queryable, every sample is comparable to every other sample. The dashboard panel that shows "failures per minute over the last 24 hours" is the PromQL expression rate(orders_failed_total[1m]) — a function evaluated by the storage layer, not by the application. The application emits a wall; the TSDB turns it into a window. Why the application emits a counter and not a rate: rates require a window, and a window is a query-time decision. The application has no way to know whether the operator wants a 1-minute, 5-minute, or 1-hour rate; it has no way to know which timestamps the operator wants to compare. By emitting the cumulative counter and letting the TSDB compute rates at query time, the same emission supports every possible window — rate[1m], rate[5m], increase[24h], delta between two arbitrary timestamps — without redeploy. Pre-aggregating on the application side bakes one window into the emission and forecloses every other window.
This split — application emits cumulative, TSDB computes rates — is so fundamental that most observability conversations forget to mention it. The reason a Razorpay platform team can ask "what was the p99 latency for the checkout API last Tuesday at 09:15 IST?" five years after the metric was first emitted, and get an answer, is that the TSDB stored every sample at the resolution of the original scrape. The application code that emitted the histogram in 2021 has been deployed and rolled back hundreds of times since; the TSDB is what made the question askable across every one of those deploys. Take the TSDB away and the metric is meaningless — a single live value that cannot be compared to any past value, cannot be alerted on for "rate over threshold", cannot be used in any of the dashboards or runbooks the team relies on. The wall stays up; the window disappears.
The reverse failure mode is also worth naming: a team that pre-aggregates rates inside the application, exposes only orders_failed_per_minute as a gauge, and discovers two months later that the operator wants the 5-minute rate or the 1-hour delta — and the gauge cannot answer either. Pre-aggregation throws away exactly the property the TSDB exists to provide. The discipline prometheus-client enforces — counter for cumulative events, gauge for live readings, histogram for distributions — exists to keep the emission window-agnostic so that the storage layer can answer windowed questions later. Every time a team is tempted to "just emit the rate", they are choosing a wall over a window.
There is a deeper point hiding in the counter-vs-rate debate that becomes obvious only after you have lived through one production incident with the wrong shape. A counter emitted across a process restart does the right thing — it resets to zero, the TSDB's rate() function detects the reset (a sample lower than the previous one) and treats the gap as a missing window rather than as a negative rate. A pre-aggregated rate emitted across a restart does not do the right thing — the post-restart rate looks identical to the pre-restart rate, the operator has no way to know a restart happened, and the cumulative count of events that occurred during the restart window is silently lost. The counter shape preserves the truth that "the process restarted and N events were lost"; the gauge-of-rate shape erases it. Production telemetry is full of these subtle preservations that emerge only when the storage layer has the original shape to interpret. Why counters are robust to restart while gauges-of-rate are not: the TSDB's rate() function is defined as (latest_value − earliest_value_in_window) / window_seconds, with the special case that if a value drops (which only happens on a counter reset), the function treats the drop as a process restart and resumes from the new value as the new baseline. The restart is detected because the storage layer can see the entire history; the gauge has no history of its own internal pre-aggregation, so the operator cannot detect the same event. The counter stores the truth; the gauge stores the lie convenient at emission time.
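A few lines of Python make the reset rule concrete before the fuller demonstration below. This is an illustrative sketch over (timestamp, value) samples, not Prometheus's actual implementation — the real rate() additionally extrapolates the result toward the window boundaries:
# reset_aware_rate.py — sketch of a reset-aware rate over one counter series.
# Illustrative only; PromQL's rate() also extrapolates to the window edges.
def reset_aware_increase(samples):
    # samples: time-ordered list of (t, v) pairs from a cumulative counter
    total = 0.0
    for (t_prev, v_prev), (t_cur, v_cur) in zip(samples, samples[1:]):
        if v_cur >= v_prev:
            total += v_cur - v_prev     # normal monotonic increase
        else:
            total += v_cur              # value dropped => counter reset; count from the new baseline
    return total

def reset_aware_rate(samples):
    if len(samples) < 2:
        return None
    return reset_aware_increase(samples) / (samples[-1][0] - samples[0][0])   # events/second

# A restart between t=30 and t=45 (value drops 120 -> 5) contributes 5, not -115:
# increase = 100 + 20 + 5 + 10 = 135 over 60s  ->  2.25 events/second.
print(reset_aware_rate([(0, 0), (15, 100), (30, 120), (45, 5), (60, 15)]))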
A measurable demonstration — what the TSDB actually does
The cleanest way to prove the wall-vs-window argument is to emit a counter, scrape it for a few minutes, and then ask three questions: the live value, the rate, and the delta between two timestamps. Without a TSDB, only the first question is answerable; with one, all three are. The script below runs both sides — a Flask app that exposes the counter via prometheus-client, a scraper loop that captures samples, and a TSDB-shaped store (a list of (timestamp, value) tuples) that the same questions are asked of.
# wall_vs_window.py — show why a counter without a TSDB is useless.
# pip install prometheus-client requests flask
import threading, time, requests
from collections import deque
from flask import Flask
from prometheus_client import Counter, generate_latest, CONTENT_TYPE_LATEST

app = Flask(__name__)
orders_failed = Counter("orders_failed_total", "Failed orders, Swiggy demo")

# Simulate background failure events: 6/min baseline, then 80/min after 60s.
def emit_failures():
    started = time.time()
    while True:
        rate = 6 if time.time() - started < 60 else 80  # spike at t=60s
        time.sleep(60.0 / rate)
        orders_failed.inc()

@app.route("/metrics")
def metrics():
    return generate_latest(), 200, {"Content-Type": CONTENT_TYPE_LATEST}

threading.Thread(target=emit_failures, daemon=True).start()
threading.Thread(target=lambda: app.run(port=8000, use_reloader=False),
                 daemon=True).start()
time.sleep(1)

# Side A — WALL: only the live value, no history.
def wall_query():
    text = requests.get("http://localhost:8000/metrics", timeout=2).text
    for line in text.splitlines():
        if line.startswith("orders_failed_total "):
            return float(line.split()[1])
    return None

# Side B — WINDOW: scrape every 15s into a TSDB-shaped deque.
tsdb = deque(maxlen=10_000)  # (epoch_seconds, cumulative_value)

def scrape_loop():
    while True:
        v = wall_query()
        if v is not None:
            tsdb.append((time.time(), v))
        time.sleep(15)

threading.Thread(target=scrape_loop, daemon=True).start()
time.sleep(120)  # collect 2 minutes of samples

print(f"WALL live value : {wall_query():>10,.0f}")
print("WALL rate over last 1m? : <unanswerable — no history>")
print("WALL delta vs 1m ago? : <unanswerable — no history>")

# WINDOW queries — exactly what PromQL rate() and increase() compute.
def rate_per_min(window_s=60):
    now = time.time()
    samples = [s for s in tsdb if now - s[0] <= window_s]
    if len(samples) < 2:          # need at least two samples in the window
        return None
    dv = samples[-1][1] - samples[0][1]
    dt = samples[-1][0] - samples[0][0]
    return (dv / dt) * 60  # events / minute

def delta_between(t_old, t_new):
    old = min(tsdb, key=lambda s: abs(s[0] - t_old))
    new = min(tsdb, key=lambda s: abs(s[0] - t_new))
    return new[1] - old[1]

now = time.time()
print(f"\nWINDOW samples retained : {len(tsdb)}")
print(f"WINDOW rate[1m] : {rate_per_min(60):>10,.1f} /min")
print(f"WINDOW rate[15s] : {rate_per_min(15):>10,.1f} /min")
print(f"WINDOW delta(now, -90s) : {delta_between(now-90, now):>10,.0f}")
print(f"WINDOW delta(now, -120s) : {delta_between(now-120, now):>10,.0f}")
Sample run after letting the script execute for two minutes:
WALL live value : 149
WALL rate over last 1m? : <unanswerable — no history>
WALL delta vs 1m ago? : <unanswerable — no history>
WINDOW samples retained : 9
WINDOW rate[1m] : 73.2 /min
WINDOW rate[15s] : 80.0 /min
WINDOW delta(now, -90s) : 116
WINDOW delta(now, -120s) : 139
What the output is telling you, line by line. WALL live value: 149 is what an application without a TSDB can answer — a single integer with no semantic context. The next two lines are the two questions every operational dashboard depends on, and a metric without history cannot answer either. WINDOW samples retained: 9 is the TSDB doing its job — fifteen-second scrapes, two-minute window, nine (timestamp, value) pairs preserved in a deque that approximates Prometheus's chunk-encoded series. rate[1m] = 73.2 /min is computed by exactly the formula PromQL's rate() uses — (latest_value − oldest_value) / (latest_t − oldest_t), normalised to per-minute. The 1-minute window straddles the spike (started at t=60s, rate jumped from 6 to 80), so the average is ~73, which matches the linear blend of the two regimes. rate[15s] = 80.0 /min uses the most recent two samples and captures the post-spike rate cleanly — same metric, same TSDB, different window, different answer. delta(now, -90s) = 116 is what increase(orders_failed_total[90s]) computes — the absolute count of failures in the last 90 seconds. The application code never computed any of these numbers; every one of them is a query-time function over the (timestamp, value) pairs the TSDB retained.
Why the script's tsdb = deque(maxlen=10_000) is the smallest honest model of a TSDB: the TSDB's central job is to retain ordered (timestamp, value) tuples per series, indexed by series identity, with O(1) append and O(log n) lookup by time. Prometheus's chunk-encoded TSDB does this with Gorilla XOR compression and label-set inverted indices, but the interface the dashboard sees is exactly what the deque exposes — give me the samples in this time range. Every PromQL function (rate, increase, irate, delta, histogram_quantile, predict_linear) is a stateless transformation of that ordered sequence. The TSDB is not a smart query engine that knows about rates; it is a dumb but very efficient time-ordered sample store, and the smartness lives in the query layer that runs functions over it.
The cost numbers worth carrying away. A single Prometheus series at fifteen-second scrape interval costs ~0.067 samples/second × 1.3 bytes/sample ≈ 0.09 bytes/second of compressed storage, plus ~7 KB of in-memory overhead for the chunk head and label tuple, plus ~3-8 KB of indexed metadata. A million active series — typical for a mid-sized Indian SaaS like Khatabook or Cred — fits in roughly 10-15 GB of RAM and adds ~90 KB/second of compressed disk writes. The TSDB is cheap in absolute terms; what makes cardinality the master variable (chapter 3) is that the cost is per-series, not per-sample, and each new label combination spawns a new series. The script above runs one series; production runs millions.
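The arithmetic behind those figures is simple enough to keep in a scratch script. The per-series overhead below is an assumed midpoint of the ranges quoted above, not a measured constant:
# tsdb_cost_model.py — back-of-envelope cost shape: per-series, not per-sample.
# Assumptions: 15s scrape interval, ~1.3 bytes/sample after Gorilla compression,
# ~10 KB of in-memory + index overhead per active series (midpoint of the ranges above).
SCRAPE_INTERVAL_S = 15
BYTES_PER_SAMPLE = 1.3
OVERHEAD_PER_SERIES = 10 * 1024   # bytes

per_series_write = BYTES_PER_SAMPLE / SCRAPE_INTERVAL_S          # ~0.087 bytes/second
for active_series in (1, 1_000_000, 10_000_000):
    disk_kb_s = active_series * per_series_write / 1024           # compressed write rate
    ram_gib = active_series * OVERHEAD_PER_SERIES / 2**30         # resident overhead
    print(f"{active_series:>12,} series: {disk_kb_s:10.2f} KB/s writes, {ram_gib:8.2f} GiB RAM")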
A subtle property the script exposes by accident: the rate() formula is insensitive to the absolute value of the counter and sensitive only to the difference between samples. A counter that started at 0 emits the same rates as a counter that started at 8 million; the TSDB does not store nor care about the absolute origin. This is what makes Prometheus counters survive process restarts cleanly — the post-restart counter starts at zero, the rate over a window that straddles the restart correctly reflects the events that occurred after the restart, and the events that occurred during the restart window are gracefully reported as "missing data" rather than as a negative rate. A gauge-of-rate emission would have lied across the same restart; the substrate that consumes counters and computes rates is what makes the truth recoverable. Every PromQL function the dashboard ever uses is an exploitation of this same property — the storage layer holds the cumulative truth, the query layer reconstructs the windowed answer, and the application stays small.
Why the storage substrate fixes which questions you can afford
A TSDB is not a generic key-value store with timestamps. It is a purpose-built data structure whose design choices fix exactly which classes of question are cheap and which are expensive. Once you understand the substrate, the rest of the metrics curriculum is downstream of two or three structural decisions.
The first decision: storage is column-by-series, not row-by-event. A traditional row store would record one row per event — (timestamp, metric_name, label_tuple, value) — and a query for "rate of orders_failed_total over the last hour" would scan billions of rows. A TSDB inverts this: every series (a unique combination of metric name + label values) gets its own column of (timestamp, value) pairs, stored contiguously and compressed. The query for the rate scans only the column belonging to that series — typically 240 samples for an hour at fifteen-second resolution — and the cost is proportional to the length of the column, not the number of series in the database. This is why a Prometheus serving 12 million active series can answer a single-series rate query in under five milliseconds: the column being scanned has a few hundred entries, not a few billion.
The second decision: indexing is by label tuple, not by attribute. Every series is identified by its label tuple — {__name__="orders_failed_total", service="payments", region="ap-south-1"} — and the TSDB maintains an inverted index from individual labels to the set of series they appear on. A query orders_failed_total{service="payments"} resolves to "find every series whose label set contains service="payments" and __name__="orders_failed_total"", which the inverted index returns in milliseconds. The cost shape this creates: cheap to slice on labels you indexed; expensive — actually impossible — to slice on attributes that are not labels. This is the structural reason cardinality is a budget rather than a free variable, and the structural reason metrics cannot answer "which merchant caused the slow checkouts?" unless merchant_id is on the label tuple (which would be a cardinality bomb).
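A toy version of both structures — one sample column per label tuple, plus an inverted index from individual label pairs to the series that carry them — shows why label-sliced queries are cheap and non-label attributes are unreachable. This is a sketch of the idea, not Prometheus's on-disk index format:
# toy_index.py — a toy column-by-series store with a label inverted index.
from collections import defaultdict

columns = {}                       # label_tuple -> list of (timestamp, value)
inverted = defaultdict(set)        # ("label", "value") -> set of label_tuples

def append(labels, t, v):
    key = tuple(sorted(labels.items()))          # series identity = full label tuple
    columns.setdefault(key, []).append((t, v))   # O(1) append to that series' column
    for pair in key:
        inverted[pair].add(key)                  # index each label pair -> series

def select(matchers):
    # Resolve {k=v, ...} to matching series by intersecting inverted-index posting sets.
    sets = [inverted[pair] for pair in matchers.items()]
    return set.intersection(*sets) if sets else set()

append({"__name__": "orders_failed_total", "service": "payments", "region": "ap-south-1"}, 0, 10)
append({"__name__": "orders_failed_total", "service": "search", "region": "ap-south-1"}, 0, 3)

# The query orders_failed_total{service="payments"} touches only one column,
# however many other series exist in the store.
for series in select({"__name__": "orders_failed_total", "service": "payments"}):
    print(series, columns[series])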
The third decision: compression is sample-by-sample, not block-by-block. The Gorilla XOR encoding (Pelkonen et al., VLDB 2015) exploits the property that successive samples of the same series are usually nearly identical — counters increment slowly, gauges drift smoothly, latencies cluster. XOR-encoding the difference between successive values, then variable-length-encoding the result, achieves the famous ~1.3 bytes/sample compression rate. This makes a year of fifteen-second scrapes for a single series cost about 2.7 MB on disk — small enough that a Prometheus retains months of history at acceptable cost, large enough that a million series consumes terabytes and forces tiered storage (Mimir, Thanos, VictoriaMetrics with object-storage backends).
The fourth decision, less obvious but operationally critical: retention is per-series, not per-event. When a series is deleted (e.g. a pod terminates and its label tuple stops being scraped), the TSDB marks the series as "stale" but retains its samples for the configured retention window — usually 15 days for hot storage, plus tiered storage for longer horizons. This is why a Razorpay engineer can look at "what was the orders-failed rate for the pod that crashed last Tuesday?" two weeks later: the series outlives the pod. A row store with TTL-per-event would have purged those rows; a TSDB's per-series retention preserves the historical context. The trade-off, again, is cardinality: every short-lived series (a pod that ran for one hour and emitted 240 samples) consumes the same metadata overhead as a long-lived series, which is why "pod churn" is one of the silent cardinality killers Indian Kubernetes platforms (Hotstar, Flipkart, PhonePe) all eventually have to tame.
The fourth-and-a-half decision, often missed: histograms are encoded as multiple bucket counters, not as a single distribution. A Prometheus histogram with ten explicit buckets emits a dozen-plus series — one per bucket, the +Inf accumulator, and the _sum and _count counters — and each bucket series is a regular counter that grows monotonically as samples land at or below that boundary. The TSDB sees only counters; the histogram shape lives in the naming convention (http_request_duration_bucket{le="0.1"}, ..._bucket{le="0.5"}, ..._bucket{le="1.0"}, etc.), and histogram_quantile() reconstructs the distribution at query time by interpolating between bucket boundaries. This is why a Prometheus histogram with badly-chosen bucket boundaries lies about its quantiles — Part 7 of this curriculum dissects the interpolation in detail — and why every histogram you emit costs a series per bucket plus the bookkeeping pair, not one series, against your cardinality budget.
The encoding choice has a useful corollary that matters in practice: because every bucket is a counter, a histogram can be aggregated across pods, regions, or services simply by sum-ing the per-bucket counters and feeding the result through histogram_quantile(). There is no need for the application to ever compute a quantile; the TSDB does the math at query time, the storage layer holds the bucket counters, and the dashboard panel computes whatever quantile the operator happens to want. Quantile-of-quantile aggregation (averaging p99 across pods to get the "fleet p99") is a common and wrong pattern in non-Prometheus stacks; the bucket-counter encoding makes it easy to do the right thing instead. The substrate is uniform — counters all the way down — and the higher-level shapes are conventions over the substrate, but those conventions are exactly what enables the correct math the operator actually wants.
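A sketch of that sum-then-quantile math, with invented bucket boundaries and counts for two hypothetical pods (the linear-within-bucket interpolation is the same rule histogram_quantile() applies):
# fleet_quantile.py — sum per-pod cumulative bucket counters, then interpolate a quantile.
# Bucket boundaries and counts are invented for illustration.
BOUNDS = [0.1, 0.5, 1.0, float("inf")]       # the le= boundaries, in seconds

pod_a = [400, 480, 499, 500]                  # cumulative observations <= each bound
pod_b = [100, 300, 399, 400]

def fleet_quantile(q, *pods):
    cum = [sum(vals) for vals in zip(*pods)]          # aggregate: just add the counters
    rank = q * cum[-1]                                # target rank among all observations
    for i, c in enumerate(cum):
        if c >= rank:
            lo = 0.0 if i == 0 else BOUNDS[i - 1]
            prev = 0 if i == 0 else cum[i - 1]
            if BOUNDS[i] == float("inf"):
                return lo                             # quantile fell in +Inf: clamp to last finite bound
            return lo + (BOUNDS[i] - lo) * (rank - prev) / (c - prev)   # linear interpolation

print(fleet_quantile(0.99, pod_a, pod_b))     # fleet p99 ≈ 0.97s, from summed buckets
Averaging the two pods' individual p99s would land somewhere else entirely; summing the bucket counters first is what keeps the fleet-level quantile honest.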
The fifth decision, almost invisible until it bites you: the TSDB is single-process, single-tenant, single-replica by default. Vanilla Prometheus is an in-process database; high availability is bolted on via paired-Prometheus-with-Alertmanager-deduplication; long-term storage is separate (Thanos, Mimir, Cortex, VictoriaMetrics-cluster). Most large Indian platform teams discover this around the time their primary Prometheus crosses ~5 million active series — the process starts hitting OOM thresholds, scrape timeouts climb, query latency degrades. The fix is sharded ingestion (Mimir / VictoriaMetrics-cluster / M3DB) plus tiered storage on object stores like S3-equivalent. The transition is non-trivial — usually a quarter of platform-team work — and the time to plan it is before the active-series count crosses ~3 million, not after. Cred crossed that threshold in 2023, planned the Mimir migration during a quiet sprint, and avoided the cliff; an unnamed neobank crossed it during a marketing-driven traffic spike, did not have a plan, and ran a degraded Prometheus for six weeks while the migration was rushed.
What "metrics without a TSDB" actually looks like in production
The argument so far has been theoretical — a counter without a window cannot answer rate questions. The argument is also testable in a more uncomfortable way: every team has, somewhere in their stack, a piece of telemetry that is a counter without a TSDB, and looking at where those land is what makes the principle visceral.
StatsD-without-aggregation. A team running a raw StatsD daemon (no aggregation, dropping lines into a flat file) has a stream of events but no time-indexed store. A query like "what was the request rate at 09:15 IST yesterday?" requires re-parsing the file, sorting by timestamp, bucketing, and computing rates by hand. This is what most pre-2014 Indian engineering stacks looked like, and it is why teams that adopted Prometheus 2015-2017 felt like the lights had come on — not because the metrics were better, but because the storage finally let them ask windowed questions cheaply.
docker stats and kubectl top. Both commands show live values — current CPU, current memory, current network — with no history. The number is a wall. To see "did pod X use more memory at 09:15 than at 09:00?", you need a TSDB sampling the same numbers every fifteen seconds; the live commands cannot answer. cAdvisor + Prometheus is the canonical fix; without that stack, every diagnosis of "why did this pod OOM?" turns into a re-creation of the failure rather than a forensic walk through the history.
Application-emitted "current state" gauges. The most common in-application failure mode: a developer adds a gauge current_pending_orders that exposes a live value via the application's own admin endpoint, scraped manually with curl during incidents. Without a TSDB scraping the gauge every fifteen seconds, the gauge tells you only the current state — "100 pending orders" — and not the trend. A team at IRCTC discovered this during a Tatkal hour in 2023: the current_pending_bookings gauge showed 14,000 at 10:01:02 IST and 14,200 at 10:01:08 IST, and the on-call had no way to know whether the queue was growing, shrinking, or stable, because the gauge was being read live and not stored. Once the gauge was scraped into Prometheus, the same number became delta(current_pending_bookings[1m]) (rate() is reserved for counters; delta() and deriv() are the gauge equivalents), the queue-growth velocity was 1,200 / minute and climbing, and the diagnosis took three minutes instead of forty.
CSV / spreadsheet exports. A practice still surprisingly common in non-engineering-led teams: Ops exports a daily CSV of "events", opens it in Excel, computes rates by hand, sends a screenshot in Slack. This is a TSDB implemented in human time at human cost. The instinct is correct (we want a function of time), but the implementation throws away every property a real TSDB provides — sub-second query, retention beyond a day, alerting on thresholds, joining series across labels. Teams that move this practice into Prometheus + Grafana typically see a 50-100× reduction in time-to-answer, not because the math changed but because the storage moved from a filesystem + Excel pipeline to a purpose-built engine.
PagerDuty alerts on counter increments without a rate. A subtle and dangerous failure mode: an alert rule that fires when orders_failed_total > 1000. Because the counter only ever grows, the alert fires once when the threshold is crossed and stays firing forever — a permanent red light, no signal of recovery, no signal of acceleration. The fix is an alert on rate(orders_failed_total[5m]) > X, which is a rate question and therefore a TSDB question. A surprising number of "alert flapping" complaints in junior platform teams trace to the original alert rule being on a counter rather than on a counter's rate, and the fix is the storage layer, not the alert tooling.
OpenTelemetry metrics without a backend. A team that emits OpenTelemetry-format metrics from instrumented services to an OTLP collector, but has not wired the collector into a TSDB exporter, has a queue of samples piling up in the collector with nowhere to go. The OTel collector will drop oldest samples once its in-memory buffer fills (default 5,000 samples) and the operator will see neither the queue nor the drops unless the collector itself is monitored. This is the modern version of the "metrics without a TSDB" failure — the wire format is fine, the SDK is fine, the collector is fine, but the storage substrate at the end of the pipeline is missing, and the same wall-without-window failure mode emerges. The fix is the same: terminate the pipeline in a TSDB (Prometheus remote-write receiver, Mimir, VictoriaMetrics, M3DB) before any metric is considered "real".
The pattern across all six examples: every time a team discovers their metrics are "useless", the underlying problem is that the storage substrate underneath does not support the kind of question they want to ask. Adding more metrics, more dashboards, or more alerts on the same wall does not help; the work is to add a TSDB underneath the wall, at which point the same emissions become windowable. Why "add a TSDB" is the right framing rather than "switch to Prometheus": Prometheus is one TSDB among several (VictoriaMetrics, M3DB, Thanos, Mimir, InfluxDB, TimescaleDB), and the choice between them is a different conversation than the choice of whether to have one. Every modern observability stack has a TSDB underneath the metrics pillar — the conversation is which one and how it is sharded, not whether one is needed. Teams that try to do metrics without one end up rebuilding partial TSDB functionality (in CSVs, in flat files, in application memory) and discovering, slowly and expensively, that the existing TSDBs were the right answer.
The Hotstar IPL final and the TSDB that almost wasn't
A worked example to ground the substrate argument in concrete production stakes. Hotstar's 2024 IPL final reached ~32M concurrent viewers; the platform team monitored ~14 million active Prometheus series across roughly 80 microservices. The match started at 19:30 IST on 26 May 2024. At 21:08 IST, during the second innings, a playback_buffer_underrun_rate alert fired — the rate of buffering events had crossed 0.4% of viewers, which the SLO treated as a paging condition.
The on-call SRE, a senior engineer named Dipti, opened the dashboard and saw the rate climbing — not the absolute count of buffer underruns (which had been growing all match because viewers were growing) but the rate per viewer, computed by the TSDB as rate(buffer_underrun_total[1m]) / rate(playback_started_total[1m]). Both terms in the ratio were rates the TSDB computed at query time from cumulative counters the application emitted; neither term was a number the application ever computed itself. The dashboard panel was a single PromQL expression evaluated every fifteen seconds against the column-compressed series; the cost of the evaluation was sub-millisecond per series, and the answer arrived faster than the dashboard refresh.
What the rate actually showed: a step climb from 0.18% to 0.62% at 21:06 IST, two minutes before the page. The TSDB's two-week retention let Dipti compare the same metric to the previous Saturday at the same wall-clock time (0.21%, normal for a Saturday IPL match) and to the previous IPL match the day before (0.19% steady state); both comparisons were single PromQL queries — playback_buffer_underrun_rate offset 7d, playback_buffer_underrun_rate offset 24h — that the substrate made trivially affordable. The diagnosis ladder Dipti walked was: rate climb → comparison vs baseline → label slice by cdn_region (only ap-south-1b) → label slice by pop_id (only mumbai-2 and mumbai-7) → label slice by client_app_version (no concentration) → conclusion: regional CDN issue at two specific Mumbai POPs. Total time from page to diagnosis: 4m 12s. The fix (a TLS certificate rollback at the affected POPs) took another seven minutes; the incident closed at 21:19 IST with ~430,000 viewers having seen 8-30 seconds of buffering, well inside the SLO's allowable budget.
Now imagine the same incident with metrics emitted but without a TSDB. The application would have exposed buffer_underrun_total = 14,200,901 to a curl /metrics call. Dipti would have had no way to know whether that number was elevated, no way to compare to last Saturday, no way to slice by cdn_region (label-sliced queries are a TSDB-side operation), no way to evaluate any rate at all. The diagnosis would have required deploying a new metric, waiting for samples to accumulate, and re-checking — an hour, minimum, during which the SLO budget would have been entirely consumed and another two million viewers would have been affected. The TSDB is not the dashboard, not the alert engine, not the visualisation — it is the substrate that made the four-minute diagnosis possible, and removing it would have cost Hotstar roughly an order of magnitude more viewer-hours of degraded experience for the same underlying bug.
The lesson Dipti's team carried into the post-incident review: every metric on every dashboard, every alert in PagerDuty, and every SLO contract in the Hotstar reliability programme is a function over (timestamp, value) tuples retained by Mimir (the sharded TSDB they migrated to in early 2024). The application code is the small, dumb, fast part of the system; the storage substrate is the large, smart, expensive part. Reasoning about reliability without reasoning about the substrate underneath is reasoning about a half-built system — the half that would have been useless on its own.
Common confusions
- "A Prometheus counter is a rate." False. A counter is a monotonically increasing cumulative value; the rate is what the TSDB computes from successive samples of the counter. The application emits the cumulative value because that is the only emission shape that supports every possible windowed rate query. Pre-aggregating the rate inside the application bakes a single window into the emission and forecloses every other.
- "/metrics endpoints are the metric system." Misleading. The /metrics endpoint is the transport — a text format Prometheus scrapes — but the system is the storage layer that retains what was scraped. Without a TSDB scraping /metrics every fifteen seconds, the endpoint is a live snapshot, not a metric. The system exists when the snapshots are stored, indexed, and queryable as functions of time.
- "You can do metrics with logs." Possible but expensive. Counting lines in a structured log to compute a rate is technically a TSDB-shaped query implemented over a log store, and Loki + LogQL's rate() function does exactly this. The cost is roughly 10-100× more than a purpose-built TSDB because logs are stored as variable-length records with no compressed-column representation. For high-frequency questions ("rate over the last 1 minute, evaluated every 15 seconds") the TSDB wins by 1-2 orders of magnitude in cost. Use logs as a metric store only when the cardinality demands it — at which point you should consider whether the question is really a trace / event question.
- "InfluxDB / TimescaleDB / Prometheus / VictoriaMetrics are interchangeable." Mostly false. They share the TSDB shape — column-by-series, time-indexed, label-indexed — but differ on critical operational properties: InfluxDB is push-based with a different cardinality cost curve; TimescaleDB is a PostgreSQL extension with SQL semantics; Prometheus is pull-based with PromQL; VictoriaMetrics shifts the cardinality cliff from ~5M to ~50M active series but does not eliminate it. Choosing among them is a genuine engineering decision (cardinality budget, query-language familiarity, operator-team skills), and treating them as drop-in substitutes leads to predictable surprises at the upper end of the budget. None of them eliminates the per-series cost shape; they only relocate it.
- "The TSDB is just a database." False, in a useful way. The TSDB's data model — append-only, time-ordered, sample-compressed, label-indexed — is so specialised that retrofitting metric workloads onto a generic relational or KV database performs 1-2 orders of magnitude worse. The specialisation is what makes a single Prometheus serve millions of series at fifteen-second resolution; a generic database doing the same work would need ten times the hardware. The TSDB is a database, but it is a database whose entire design is a consequence of the workload.
- "rate() is the same as a derivative." Almost, but the TSDB's rate() has special-case logic that a mathematical derivative does not. It detects counter resets (a sample value lower than the previous one) and treats them as "the process restarted" rather than as a negative rate; it extrapolates over scrape gaps so a missed scrape does not produce a misleading zero; and it normalises by the actual time-delta between scrapes, not the configured scrape interval, which matters during scrape stalls. These behaviours look like quirks but are precisely what makes counter emission across restarts and across scrape misses a robust pattern; a naive derivative implementation would lie about every one of those edge cases.
Going deeper
The Gorilla XOR encoding — why 1.3 bytes per sample matters
Gorilla (Pelkonen et al., VLDB 2015) is the encoding that made modern TSDBs economically viable. The trick: successive samples of a series are usually close in value (a counter increments by small numbers, a gauge drifts smoothly), and if you XOR successive 64-bit floats you get a result that is mostly zeros. Run-length-encoding the leading and trailing zeros, then variable-length-encoding the meaningful middle bits, brings the cost from 8 bytes/sample (raw float64) to ~1.3 bytes/sample (XOR-encoded). The same trick works on timestamps via delta-of-delta encoding — successive timestamps are usually exactly fifteen seconds apart, so the delta-of-delta is zero and encodes in a single bit. Combined, a 240-sample chunk (one hour of fifteen-second scrapes) costs ~310 bytes on disk, including timestamps, including header.
That compression is what makes a million series at one-year retention fit in a couple of terabytes of object storage rather than the tens of terabytes that raw 8-byte samples plus timestamps would need. The encoding is open-source (every Prometheus-compatible TSDB implements it) and the paper is short enough to read in an evening; doing so is a cleaner introduction to "why the substrate matters" than any blog post. The compression assumes the workload it was designed for — slow-moving counters and gauges — and degrades gracefully but predictably when the assumption is violated. Series with high-entropy values (random-like floats with no temporal correlation) compress to roughly 6-7 bytes/sample, four to five times worse than the headline number. This is one of several reasons emitting raw event payloads as gauge values is wasteful: the gauge shape works well for slow-changing process state and badly for arbitrary event data, and the substrate's compression is one of the silent reasons for the difference.
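To get a feel for why the encoding works, the two tricks can be demonstrated in a few lines — XOR successive float64 bit patterns and take delta-of-delta on timestamps. This is a flavour of the idea only; the paper's actual variable-length bit packing is not reproduced here:
# gorilla_flavour.py — illustrate why successive samples compress well.
# XOR of successive float64 bit patterns is mostly zero bits for slow-moving series;
# delta-of-delta of timestamps is exactly zero for a steady scrape interval.
import struct

def float_bits(x):
    return struct.unpack("<Q", struct.pack("<d", x))[0]

values = [8142901.0, 8142903.0, 8142904.0, 8142906.0]   # a counter creeping upward
timestamps = [0, 15, 30, 45]                             # steady 15-second scrapes

for prev, cur in zip(values, values[1:]):
    xored = float_bits(prev) ^ float_bits(cur)
    print(f"xor has {bin(xored).count('1'):2d} set bits out of 64")   # few bits left to encode

deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
dod = [b - a for a, b in zip(deltas, deltas[1:])]
print("timestamp delta-of-delta:", dod)    # all zeros -> a single bit each in the real encoding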
The single-tenant Prometheus cliff and the path to Mimir / VictoriaMetrics
Vanilla Prometheus is single-process by design. The cliff lands at roughly 3-5 million active series for a typical 64GB-RAM box: scrape latency starts climbing past the scrape interval, query latency degrades from milliseconds to seconds, OOM kills become regular. The cliff is not a bug; it is the consequence of running every series, every label index, and every query in a single process. The path beyond is sharding: Mimir (Grafana Labs' open-source successor to Cortex) shards ingestion across a fleet, stores chunks on object storage, and runs queries on a separate fleet of queriers. VictoriaMetrics-cluster takes a similar approach with different operational ergonomics; M3DB (built at Uber) chose a different sharding model.
The choice between them is mostly operational — Mimir if you want Grafana-shop compatibility, VictoriaMetrics if you want lower cost-per-series and simpler ops, M3DB if you have an Uber-shaped workload with very high write rates. Razorpay made the Mimir transition in 2024 across roughly nine months; the planning documents emphasised that the right time to start was when single-Prometheus was at 60% of the cliff, not 100%. Indian platform teams that wait until the cliff is breached typically run a degraded TSDB through the migration window, which is exactly the operational risk a TSDB exists to prevent. The migration is non-trivial in another way: PromQL semantics differ slightly between vanilla Prometheus, Mimir, VictoriaMetrics, and M3DB — irate precision, rate() extrapolation behaviour, sub-query evaluation, recording-rule timing — and dashboards that worked perfectly on single-Prometheus may produce subtly different numbers on the sharded backend. The migration discipline is to validate every dashboard panel and every alert rule against the new backend with at least a week of overlap before cutover; teams that skip the validation step typically rediscover the differences during their next incident, when the alert that used to fire is silent and the dashboard that used to show the truth is showing a number that is 8% off.
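A sketch of that validation discipline — run the same expression against the old and new backends during the overlap window and flag divergences — needs little more than the standard /api/v1/query endpoint both expose. The URLs and the 1% tolerance below are placeholders:
# validate_backends.py — compare one PromQL expression across two Prometheus-compatible backends.
# Endpoint URLs and the tolerance are placeholders for illustration.
import requests

OLD = "http://prometheus.internal:9090"                  # single-node Prometheus (hypothetical URL)
NEW = "http://mimir-query.internal:8080/prometheus"      # sharded backend (hypothetical URL)

def instant_query(base, expr):
    r = requests.get(f"{base}/api/v1/query", params={"query": expr}, timeout=10)
    r.raise_for_status()
    # Return {label-tuple: value} for every series in the result vector.
    return {tuple(sorted(s["metric"].items())): float(s["value"][1])
            for s in r.json()["data"]["result"]}

def compare(expr, tolerance=0.01):
    old, new = instant_query(OLD, expr), instant_query(NEW, expr)
    for series in old.keys() & new.keys():
        a, b = old[series], new[series]
        if a and abs(a - b) / abs(a) > tolerance:
            print(f"DIVERGENT {expr} {dict(series)}: old={a:.4g} new={b:.4g}")
    for series in old.keys() ^ new.keys():
        print(f"MISSING ON ONE SIDE {expr} {dict(series)}")

compare("rate(orders_failed_total[5m])")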
Recording rules and the pre-computed view trade-off
PromQL rate() and histogram_quantile() are evaluated lazily — every dashboard refresh re-runs the function over the raw samples. For high-cardinality queries (e.g. a 99th-percentile latency aggregated across 10,000 pods every fifteen seconds), the lazy evaluation can be expensive: the dashboard query starts taking seconds, alerting rules race the scrape interval, the operator's experience degrades. Recording rules are the TSDB's answer: a periodically-evaluated rule writes a derived series back into the TSDB. A rule such as high_latency_pods = histogram_quantile(0.99, sum by (pod, le) (rate(http_request_duration_bucket[5m]))) evaluates every fifteen seconds and writes the result as a new series; dashboards that consume it pay only the cost of reading the pre-computed series, not of re-running the histogram quantile. (The le label has to survive the sum for histogram_quantile() to have buckets to interpolate over.)
The trade-off is sharp on both sides. Recording rules consume their own series cardinality (the derived series count against your budget); they fix the window at rule-design time (no more arbitrary windows on the recording, only the one you chose); and they introduce a fifteen-second-or-larger delay between the underlying samples and the rule's output, which in practice means an alert on a recording rule fires up to thirty seconds later than the same alert on the raw expression would. The discipline is to record only the queries that are evaluated frequently and expensively; recording every query turns the TSDB into a denormalised view database with all the consistency problems that implies. Razorpay's platform team in 2024 audited their recording rules and found 380-odd rules running every fifteen seconds, of which roughly 40% were never queried by any dashboard or alert — pure waste of CPU and series budget. Removing the unused rules dropped Prometheus CPU by 18% and freed enough series headroom to push the cardinality cliff back by two months. The recording-rule layer is, like every layer of the substrate, a budget; treating it as free creates the silent cost that nobody finds until it is examined.
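The mechanics are easy to model on top of the deque-shaped store from the demo script: a recording rule is a loop that evaluates one fixed expression on a schedule and appends the answer as a new series of its own. A sketch of the idea, not Prometheus's rule engine:
# recording_rule_sketch.py — model a recording rule over the deque-style store
# from the demo script: evaluate a derived expression periodically and write it
# back as its own series, so dashboards read the pre-computed result.
import time, threading
from collections import deque

raw_series = deque(maxlen=10_000)       # (t, cumulative_value) — what the app emits
recorded_rate = deque(maxlen=10_000)    # (t, rate_per_min)     — the derived series

def rate_per_min(store, window_s=60):
    now = time.time()
    samples = [s for s in store if now - s[0] <= window_s]
    if len(samples) < 2:
        return None
    return (samples[-1][1] - samples[0][1]) / (samples[-1][0] - samples[0][0]) * 60

def recording_rule_loop(interval_s=15):
    # Every evaluation interval, compute the expression once and store the answer.
    # Dashboards that read recorded_rate pay O(1) per refresh instead of re-deriving,
    # at the cost of one extra series and a window fixed at rule-design time.
    while True:
        v = rate_per_min(raw_series)
        if v is not None:
            recorded_rate.append((time.time(), v))
        time.sleep(interval_s)

threading.Thread(target=recording_rule_loop, daemon=True).start()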
Pull vs push, the up series, and why the substrate generates its own meta-telemetry
The architectural decision Prometheus made in 2012 — to pull samples from /metrics endpoints rather than have applications push them — is the source of many subtle operational properties. Pull means Prometheus knows whether each target is reachable (an unreachable target stops yielding samples and its up series goes to zero), it means timing is centralised at the scraper, and it means service discovery (Kubernetes, Consul, etc.) is the substrate of "what to monitor". Push (StatsD, Datadog Agent, OpenTelemetry's metric exporter pushing to a collector) inverts these: applications don't care if anything is listening, timing is decentralised, and the receiver has to handle late / out-of-order arrivals. Both work; Prometheus's pull model fits an "operations team owns the monitoring" culture, while push fits a "developers own the metrics" culture. Most modern Indian platform teams run hybrid stacks — Prometheus pulling Kubernetes-native services, OTel-collector receiving push from edge/serverless workloads — and the TSDB underneath does not particularly care which way the samples arrive; the same compressed-column storage receives both.
A consequence of the pull model worth its own paragraph: every Prometheus scrape automatically emits a synthetic series called up per target — up{job="payments-api", instance="10.0.0.42:8000"} 1 if the scrape succeeded, 0 if it failed. The up series is the cheapest, most underrated piece of telemetry in the entire stack: it is a binary time series of "is this target alive?", queryable across history, alertable, joinable with any other series in the same TSDB. The up series is what makes "is this service running?" a TSDB query rather than a dashboard refresh. Most production runbooks should start with up == 0 checks across the relevant target set; many do not, because the up series feels too obvious to mention. The pattern that catches teams off guard: a target that is partially responding — returning HTTP 200 but with an empty /metrics body — emits up=1 even though the actual metrics are missing. The fix is to also alert on scrape_samples_scraped == 0 for targets that should be emitting, which is a separate synthetic series the TSDB also retains. Both of these are properties only a TSDB can express; an application without one would have no way to encode "how do I know my own metrics endpoint is healthy?" in queryable form. The pull model and the synthetic up / scrape_* meta-series are two halves of the same design: the storage layer not only retains what the application emits, it generates its own telemetry about whether the emission worked at all.
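Both checks are single instant queries against the TSDB's HTTP API, cheap enough to be the literal first step of a runbook. A sketch, with the Prometheus URL a placeholder:
# first_runbook_step.py — start an incident by asking the TSDB which targets are
# down and which answered the scrape but emitted nothing. URL is a placeholder.
import requests

PROM = "http://localhost:9090"

def instant(expr):
    r = requests.get(f"{PROM}/api/v1/query", params={"query": expr}, timeout=5)
    r.raise_for_status()
    return r.json()["data"]["result"]

# Targets whose scrape failed outright.
for s in instant('up == 0'):
    print("DOWN   ", s["metric"].get("job"), s["metric"].get("instance"))

# Targets that answered the scrape but returned zero samples (empty /metrics body).
for s in instant('up == 1 and scrape_samples_scraped == 0'):
    print("HOLLOW ", s["metric"].get("job"), s["metric"].get("instance"))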
Reproduce this on your laptop
# Reproduce this on your laptop — wall vs window in two minutes.
# Prometheus has to be told to scrape the script's :8000 endpoint; a one-job config is enough.
printf 'scrape_configs:\n- job_name: wall_vs_window\n  scrape_interval: 15s\n  static_configs:\n  - targets: ["host.docker.internal:8000"]\n' > prometheus.yml
# On Linux, add --add-host=host.docker.internal:host-gateway (or use --network host and localhost:8000).
docker run -d --name prom -p 9090:9090 -v "$PWD/prometheus.yml:/etc/prometheus/prometheus.yml" prom/prometheus
python3 -m venv .venv && source .venv/bin/activate
pip install prometheus-client requests flask
python3 wall_vs_window.py
# Then in a separate terminal, scrape Prometheus directly to see the rate it computes:
curl 'http://localhost:9090/api/v1/query?query=rate(orders_failed_total[1m])'
After 2 minutes of runtime you should see the script's WINDOW outputs match Prometheus's rate() answer to within rounding, because both are computing the same function over samples of the same counter scraped at the same fifteen-second interval. The script is a few dozen lines that model the storage substrate every metrics dashboard secretly depends on.
Where this leads next
This chapter closes Part 1 of the curriculum. Every later part assumes a TSDB is in place underneath the metrics pillar — the question is what to put in it, how to query it, and how to keep it within budget. The thread continues directly into:
- Cardinality: the master variable — chapter 3, the prerequisite that names the budget every TSDB enforces. The TSDB cost shape (per-series, not per-sample) is what makes cardinality the master variable, and the cardinality budget is what makes some labels affordable and others a bomb.
- Metrics, logs, traces — what each is good at — chapter 1. The TSDB's column-by-series shape is what makes metrics cheap-to-store-and-query at low cardinality, which fixes their position in the three-pillar trade-off.
- The observability-vs-monitoring distinction — chapter 4. The TSDB is the substrate of the monitoring column; the column-store / event-store substrate behind traces is what makes the observability column work. The two storage primitives are what fix the operational role of each pillar.
- Part 2 of this curriculum (Metrics deep) is an extended walk through the TSDB internals this chapter sketched: counter / gauge / histogram semantics, the OpenMetrics text and protobuf wire formats, Gorilla XOR encoding line-by-line, quantile-from-histogram interpolation lies, the up and scrape_duration_seconds meta-series, recording rules, and the cardinality cliff in detail. Read it as the deep-dive into the substrate this chapter argued is non-negotiable.
- Parts 7 (Latency and tail) and 8 (Time-series compression) live close to the substrate. Part 7 is the chapter where the histogram-as-bucket-counters encoding meets coordinated omission and quantile-from-histogram interpolation; the lies the substrate tells about p99 are exactly the lies an inadequate emission shape bakes into storage. Part 8 is the longer Gorilla XOR walkthrough — a Python implementation, a byte-by-byte trace through a 10K-sample series, and a comparison against alternative encodings (delta-of-delta only, dictionary encoding, simple-8b). Both parts reward returning to this chapter as a foundation: the substrate decisions are the same; only the depth of focus changes.
The sentence to take into the next part: a metric is a function of time, not a number, and the TSDB is what makes that function affordable to query. Every dashboard panel and alert rule you ever write is a query against that function; every cardinality decision and retention window is a parameter of the storage layer that holds the function up. Metrics without a TSDB are a wall; metrics with one are a window onto the system you actually operate. The wall is cheap to build and useless to consult; the window costs more to install and is what your on-call life depends on.
The next time someone proposes "let's just emit a counter and read it during incidents", the question to ask back is "into what TSDB?". If the answer is "we will read it live", the proposal has built a wall and called it a window — and the next 2 a.m. page will be the one that demonstrates the difference.
There is also a closing point about cost-of-ownership worth holding onto. A TSDB is rarely the most expensive line in the observability bill — that title usually goes to logs (raw volume) or traces (storage of full attribute payloads). But the TSDB is the most load-bearing line: every alert, every dashboard, every SLO contract is a function over its samples. A team that under-invests in the TSDB tier (not enough RAM, not enough sharding, not enough retention) ends up with metrics that are technically present and operationally useless — exactly the wall-without-window failure mode this chapter has argued against. The right TSDB investment, even at the cost of being a slightly larger line item than seems necessary, is what makes every other observability investment usable. Spend the money where the substrate lives.
A final practical note for teams setting up their first metrics stack: do the storage decision before the emission decision. It is tempting to start by adding Counter and Histogram calls to the application code and figure out the storage later — and the result is usually a six-month delay during which the metrics are technically being emitted and practically being ignored, because no one can actually query them. Reverse the order: stand up Prometheus first, scrape a single trivial counter, watch the line draw on Grafana, and then expand the application instrumentation. The storage substrate is what makes the emission worthwhile; without it, the emission is throwaway. This is the same sequencing rule that data-engineering teams learn for data pipelines (write the warehouse before the ETL) and that database teams learn for schema design (define the query patterns before the table layout). In every domain where data has to outlive the process that produced it, the storage layer is the load-bearing decision and the emission is the consequence — and observability metrics are no different.
References
- Pelkonen et al., "Gorilla: A Fast, Scalable, In-Memory Time Series Database" (VLDB 2015) — the encoding paper. Foundational for understanding why 1.3 bytes/sample is achievable and what the TSDB is doing under the hood.
- Prometheus design documentation — the canonical description of the TSDB's choices: pull-based, single-process, label-indexed, sample-compressed.
- Björn Rabenstein, "Prometheus Storage" (PromCon 2017) — talk on the storage internals and the chunk format. Worth watching alongside the Gorilla paper.
- VictoriaMetrics architecture documentation — the alternative TSDB architecture, useful for seeing what happens to the cardinality cliff when you re-engineer the inverted index.
- Charity Majors, Liz Fong-Jones, George Miranda, Observability Engineering (O'Reilly, 2022) — Chapter 2's coverage of the column-store-vs-TSDB distinction is the cleanest published account of the substrate trade-off.
- Cindy Sridharan, Distributed Systems Observability (O'Reilly, 2018) — Chapter 4 ("The Three Pillars of Observability") lays out the metric / log / trace storage shapes and what each is good at.
- Cardinality: the master variable — chapter 3 of this curriculum, the budget that the TSDB substrate enforces.
- The observability-vs-monitoring distinction — chapter 4 of this curriculum, the role-of-the-substrate framing this chapter is the storage-side companion to.
- Brian Brazil, Prometheus: Up & Running (O'Reilly, 2018) — the practitioner's reference. Chapters 4 (counters / gauges / histograms) and 7 (rates and the rate-of-counter formula) are the cleanest published account of the emission-vs-query split that this chapter spent its argument unpacking.