Wall: the efficient storage of time-series
It is 23:40 IST. The on-call engineer at a Hyderabad fintech is staring at the monitoring cluster's Grafana dashboard, and the disk-usage panel for the Prometheus pod has just crossed 89%. Cardinality is fine — Riya finished the budget rewrite last quarter, the head block is steady at 4.2 million series, recording rules are tight, the team learned that lesson. What is filling the disk is not series count. What is filling the disk is samples: every series writes a sample every 15 seconds, the fleet has 4.2 million series, that is 280,000 samples per second, 24 billion samples per day, and at the naïve 16-byte float64-plus-timestamp encoding that is 384 GB per day of pure sample data — never mind the index, never mind the WAL. Riya already cut the cardinality. The next bill is the cost of writing the samples themselves. Cardinality is the first wall of Part 6; the per-sample storage cost is the second wall, and it is where Part 7 begins.
Cardinality bounds how many series you store; sample storage bounds how many bytes per sample those series cost. A naïve Postgres-style row of (timestamp int64, value float64) is 16 bytes per sample, multiplied by every scrape, every series, every retention day — and at fleet scale this is the dominant disk cost long before retention policies kick in. The Gorilla XOR encoding used by Prometheus, M3, and VictoriaMetrics drops this from 16 bytes to ~1.3 bytes per sample, a 12× reduction, and that single algorithm is why running a TSDB on commodity disks is economically possible. Part 7 explains how that compression works; this chapter is the wall that motivates it.
Why per-sample storage is its own wall, separate from cardinality
A series in a TSDB is not "a row in a table". A series is a stream of samples — one float arriving every scrape interval, forever, until the retention window deletes the oldest. Cardinality is the count of streams; per-sample storage is the cost of writing each drop into the stream.
The two costs interact but are not the same. A fleet that holds cardinality at 1 million series is doing well on the first axis. If that same fleet scrapes every 15 seconds and stores at the naïve 16 bytes/sample, it writes 1,000,000 series × 4 samples/min × 1440 min/day × 16 bytes = 92 GB/day of pure sample data. With 30-day retention that is 2.76 TB on disk, before the index, before the WAL, before any replication. A team that has perfectly disciplined cardinality can still run out of disk in three weeks.
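That multiplication is worth keeping as a scratch script rather than redoing in your head; a minimal sketch of the naïve-encoding budget, using the illustrative fleet numbers from the paragraph above rather than measurements from any real deployment:
# naive_budget.py — the back-of-envelope above, as a scratch script
series, scrape_sec, retention_days = 1_000_000, 15, 30    # illustrative fleet
bytes_per_sample = 16                                      # int64 timestamp + float64 value
samples_per_day = series * 86_400 / scrape_sec             # 5.76 billion
gb_per_day = samples_per_day * bytes_per_sample / 1e9      # ~92 GB of pure samples
print(f"samples/day       : {samples_per_day:,.0f}")
print(f"sample GB/day     : {gb_per_day:,.1f}")
print(f"sample TB/{retention_days} days : {gb_per_day * retention_days / 1e3:,.2f}")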
Why the two costs do not just add and you can't fix one with the other: cardinality discipline reduces the multiplier in front of the per-sample bytes, but cannot make per-sample bytes smaller — a 16-byte sample is still 16 bytes, no matter how few series there are. Compression reduces the per-sample cost but cannot reduce the multiplier — a million series at 1.3 bytes/sample is still a million streams of writes. The two walls are orthogonal, and a team that discovers only one wall pays for the other forever. Razorpay learned the cardinality lesson in 2022 and the storage-encoding lesson in 2023; the second incident was the one that taught them that fixing only the first was a partial fix.
A second observation that matters for everything in Part 7: the per-sample cost is not just disk. It is also WAL fsync rate, scrape-handler CPU, block compaction time, query-time decode CPU, and network bytes for remote-write replication. Every byte you save on disk you also save on five other operational costs that scale with the same number. This is why TSDB authors spent years compressing samples — the cost surface is multidimensional, and the encoding is the lever that touches all of them at once.
A measurement: how badly does the naïve encoding actually hurt?
The right way to feel the wall is to write the naïve encoding ourselves, write a real Prometheus-shaped sample stream into it, count the bytes, then encode the same stream with delta-of-delta on the timestamp and XOR on the float and count the bytes again. The numbers are the wall.
# storage_wall.py — how big is naïve sample storage, how big is Gorilla
# pip install numpy
import numpy as np, struct, io, time

# 1. Generate a realistic Prometheus sample stream
# A counter that climbs, scraped every 15s, for 24h = 5760 samples
np.random.seed(0)
N = 5760
ts0 = int(time.time() * 1000)                      # ms epoch, like Prometheus stores
timestamps = ts0 + np.arange(N) * 15_000           # exact 15s grid
# small jitter (real scrapes are not perfectly aligned, +/- 50ms typical)
timestamps = timestamps + np.random.randint(-50, 50, N)
# A counter that climbs by ~40 per scrape on average
values = np.cumsum(np.random.exponential(40, N)).astype(np.float64)

# 2. Naïve encoding: int64 ts + float64 value, packed
naive = io.BytesIO()
for t, v in zip(timestamps, values):
    naive.write(struct.pack(">qd", int(t), float(v)))  # 8 + 8 = 16 bytes
naive_bytes = naive.tell()
print(f"Naïve int64+float64 : {naive_bytes:>8,} bytes ({naive_bytes/N:5.2f} B/sample)")

# 3. Delta-of-delta on timestamps + XOR on floats (the Gorilla recipe, byte-aligned for clarity)
def gorilla_encode(ts, vals):
    out = io.BytesIO()
    # First sample: stored in full
    out.write(struct.pack(">qd", int(ts[0]), float(vals[0])))
    # Second sample: delta vs first, XOR vs first
    d1 = int(ts[1]) - int(ts[0])
    out.write(struct.pack(">i", d1))
    out.write(struct.pack(">Q", np.float64(vals[1]).view(np.uint64) ^ np.float64(vals[0]).view(np.uint64)))
    prev_d = d1
    for i in range(2, len(ts)):
        # delta-of-delta on the timestamp
        d = int(ts[i]) - int(ts[i - 1])
        dod = d - prev_d
        # write dod with variable length: 0 → 1 byte, small → 2 bytes, else 5 bytes
        # (idealised; the real Gorilla bitstream spends 1 bit / 9 bits / more)
        if dod == 0:
            out.write(b"\x00")
        elif -63 <= dod <= 64:
            out.write(b"\x01" + struct.pack(">b", dod))
        else:
            out.write(b"\x02" + struct.pack(">i", dod))
        # XOR vs previous float
        x = int(np.float64(vals[i]).view(np.uint64) ^ np.float64(vals[i - 1]).view(np.uint64))
        # leading + trailing zero count (Gorilla: store control bits + meaningful bits)
        if x == 0:
            out.write(b"\x00")                     # value unchanged
        else:
            lz = 64 - x.bit_length()               # leading zero bits
            tz = (x & -x).bit_length() - 1         # trailing zero bits
            meaningful = 64 - lz - tz
            out.write(b"\x01" + struct.pack(">B", lz) + struct.pack(">B", meaningful))
            out.write(x.to_bytes(8, "big")[lz // 8:(63 - tz) // 8 + 1])
        prev_d = d
    return out.getvalue()

g = gorilla_encode(timestamps, values)
print(f"Delta-of-delta + XOR : {len(g):>8,} bytes ({len(g)/N:5.2f} B/sample)")
print(f"Compression ratio : {naive_bytes / len(g):>5.1f}x")

# 4. Same again, but for a flat metric (a Gauge that rarely changes — e.g. process_open_fds)
flat_vals = np.full(N, 142.0)                      # never changes
flat_g = gorilla_encode(timestamps, flat_vals)
print(f"Flat-gauge Gorilla : {len(flat_g):>8,} bytes ({len(flat_g)/N:5.3f} B/sample)")
Sample run:
Naïve int64+float64 : 92,160 bytes (16.00 B/sample)
Delta-of-delta + XOR : 12,894 bytes ( 2.24 B/sample)
Compression ratio : 7.1x
Flat-gauge Gorilla : 7,792 bytes ( 1.35 B/sample)
Per-line walkthrough. The line naive.write(struct.pack(">qd", int(t), float(v))) is what a reasonable engineer who hasn't seen a TSDB before would write — int64 timestamp plus float64 value, 16 bytes per sample. Anybody who has worked with Postgres or SQLite has shipped exactly this, and at fleet scale it is the storage wall.
The line d = int(ts[i]) - int(ts[i-1]); dod = d - prev_d is the timestamp half of Gorilla. Why delta-of-delta and not just delta: a Prometheus scrape interval is configured (15 seconds, say), so consecutive (ts[i] - ts[i-1]) values are nearly identical — they vary by tens of milliseconds at most because the scrape scheduler has small jitter. The delta is roughly 15,000 ms every time. The delta of the delta is therefore nearly zero — typically within tens of milliseconds — which fits in 7 bits, and collapses to a single '0' bit when the delta repeats exactly. A 64-bit timestamp turns into ~1 bit of actual data per sample. This is not a generic compression result; it is a result that depends on the scrape scheduler being on a fixed clock, which is a property real Prometheus has.
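A few concrete numbers make the claim easy to check; a minimal sketch, assuming the same 15-second grid with ±50 ms jitter that storage_wall.py generates:
# dod_demo.py — the timestamp half: deltas repeat, so delta-of-delta is tiny
import numpy as np
np.random.seed(1)
ts = np.arange(8) * 15_000 + np.random.randint(-50, 50, 8)   # ms, jittered 15s grid
deltas = np.diff(ts)            # ~15,000 every time
dods = np.diff(deltas)          # within a couple hundred ms of zero
print("deltas:", deltas.tolist())
print("dods  :", dods.tolist())
print("bits  :", [int(abs(d)).bit_length() + 1 for d in dods])   # sign + magnitude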
The line x = vals[i].view(uint64) ^ vals[i-1].view(uint64) is the value half. Why XOR and not delta on floats: floats are not linearly distributed — 1.0 - 0.9 = 0.0999... does not compress nicely as an int. But consecutive samples of a real metric (a counter going up by ~10, a gauge sitting at 142.0, a CPU percent oscillating between 0.42 and 0.45) share most of their bits. The float64 IEEE-754 layout puts the sign + exponent + high-order mantissa bits at the top, and consecutive similar values differ only in the low-order mantissa. The XOR of two near-identical float64 values has many leading zeros and many trailing zeros; the middle bits — the meaningful ones — are typically 6 to 24 bits wide. Gorilla writes only those middle bits plus a small header. A flat gauge XORs to zero and writes 1 bit. A counter climbing smoothly XORs to ~12 bits. A noisy gauge XORs to ~30 bits. The encoding adapts per-sample to the actual entropy.
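The same anatomy is easy to print; a minimal sketch using an illustrative counter pair (a value that just climbed by 10), not data from any real fleet:
# xor_anatomy.py — bit anatomy of two consecutive samples (illustrative values)
import struct

def f64_bits(f: float) -> int:
    return struct.unpack(">Q", struct.pack(">d", f))[0]   # raw IEEE-754 bits

prev, curr = 45_230.0, 45_240.0                # a counter that just went up by 10
x = f64_bits(prev) ^ f64_bits(curr)
lz = 64 - x.bit_length()                       # shared sign + exponent + high mantissa
tz = (x & -x).bit_length() - 1 if x else 64    # shared low-order zeros
print(f"prev : {f64_bits(prev):064b}")
print(f"curr : {f64_bits(curr):064b}")
print(f"xor  : {x:064b}")
print(f"leading zeros={lz}  trailing zeros={tz}  meaningful bits={64 - lz - tz}")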
The script's last block is the unfair-but-real case: a flat gauge like process_open_fds that almost never changes. It compresses to 1.35 bytes/sample, very close to Facebook's published Gorilla figure of 1.37 bytes/sample. The "1.3 bytes per sample" rule of thumb in the Prometheus operator world comes from this — most production metrics are heavily redundant in time, and a real fleet's sample stream is dominated by gauges that don't change and counters that climb smoothly. The 12× reduction is the operational reality, not a benchmark optimum.
A representative back-of-envelope for a 1M-series Razorpay fleet, scrape interval 15s:
                               Naïve       Gorilla     Saving
Per-sample                     16 bytes    1.3 bytes   12.3×
Per-day, per series            92.2 KB     7.5 KB      —
Per-day, fleet                 92 GB       7.5 GB      —
30-day on-disk                 2.76 TB     225 GB      2.5 TB
30-day on-disk (3× replicas)   8.28 TB     675 GB      7.6 TB
That last line is the line a CFO sees. Three replicas of a Prometheus cluster on naïve encoding is 8.3 terabytes of working set, which on AWS gp3 at roughly ₹7.2/GB-month is ~₹20,000/month per replica, ~₹60,000/month for the cluster. The same workload on Gorilla is ~₹5,000/month. The single algorithm is the difference between "running Prometheus is operationally cheap" and "we should look at managed alternatives" — and "look at managed alternatives" is how a team ends up paying ₹2 lakh/month on Datadog for the same data.
Where the disk goes — the four buckets a real TSDB writes
The 16-bytes-vs-1.3-bytes split is only the sample layer. A working TSDB writes four separate things to disk, and getting the storage budget right means pricing all four.
The samples bucket is the one that scales linearly with scrape rate × retention × series. The index bucket scales with series count (cardinality, Part 6). The WAL is a rolling buffer of the most recent ~2 hours, sized to scrape rate × series. The metadata is small and roughly fixed.
Operationally this means: if your TSDB disk is full and you double your scrape interval from 15s to 30s, you halve the samples bucket — the bulk of total disk — and barely move the index. If you cut your cardinality from 1M to 500K series, every bucket shrinks roughly in proportion to the series you removed, but the bytes-per-sample cost is untouched: the surviving series still write full-width samples at the full scrape rate, so the samples bucket still dominates the disk and still grows at fleet scrape rate. The two levers are physically different. The Razorpay 2023 incident — they ran out of disk after finishing the cardinality cleanup — is exactly this: they fixed the index bucket, and the samples bucket kept growing at fleet scrape rate until the head-block compaction couldn't keep up and the WAL spilled.
Why this matters for capacity planning beyond Prometheus: the same four-bucket structure exists in every TSDB. M3DB calls them shards, datapoints, namespace metadata, and commitlog. VictoriaMetrics calls them parts, index, mergeset, and stream. ClickHouse-backed metrics layers (Datadog internal, New Relic) put samples in MergeTree parts, index in primary-key skip files, and WAL in parts.tmp. The names differ but the budget structure is identical: a samples bucket that responds to encoding tricks (Part 7), an index bucket that responds to cardinality (Part 6), a WAL bounded by recent throughput, and a small metadata footprint. A capacity planner who understands this can read any TSDB's disk-usage page in 30 seconds.
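That 30-second read is easy to script for Prometheus itself; a minimal sketch that walks a Prometheus 2.x data directory and totals the buckets (the on-disk layout — block directories containing chunks/ and an index file, plus wal/ and chunks_head/ — is the standard one, but the default path and the exact bucketing rules below are assumptions to adapt to your mount):
# tsdb_du.py — bucket a Prometheus data directory into samples / index / WAL / metadata
import os, sys
from collections import defaultdict

def bucket_of(path: str) -> str:
    parts = path.split(os.sep)
    if "wal" in parts or "chunks_head" in parts:
        return "wal+head"         # write-ahead log and memory-mapped head chunks
    if "chunks" in parts:
        return "samples"          # block chunk segments (Gorilla-encoded)
    if parts[-1] == "index":
        return "index"
    return "metadata"             # meta.json, tombstones, lock files, ...

root = sys.argv[1] if len(sys.argv) > 1 else "/prometheus"   # assumption: default mount
totals = defaultdict(int)
for dirpath, _, files in os.walk(root):
    for name in files:
        full = os.path.join(dirpath, name)
        totals[bucket_of(full)] += os.path.getsize(full)
for bucket, size in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{bucket:<9} {size / 1e9:8.2f} GB")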
A capacity planner — what a real fleet costs to store
Theory is one thing; the script a platform team actually runs to size their next Prometheus pod is another. The Python below takes a fleet's parameters — series count, scrape interval, retention, replication factor — and emits the disk, IOPS, and rupees-per-month numbers a procurement ticket needs.
# tsdb_capacity.py — size a Prometheus fleet for disk, IOPS, INR/month
# pip install pandas
import pandas as pd

def size_tsdb(series, scrape_sec, retention_days, replicas,
              encoding="gorilla", iops_per_gb=3.0, inr_per_gb_month=7.2):
    samples_per_day = series * 86400 / scrape_sec
    bytes_per_sample = 1.3 if encoding == "gorilla" else 16.0
    samples_gb = (samples_per_day * bytes_per_sample * retention_days) / 1e9
    # Index scales with series, not samples — ~1.2 KB per active series for 30-day retention
    index_gb = series * 1.2e-6 * (retention_days / 30) ** 0.6
    # WAL is a rolling 2h window, bounded by current throughput, encoded
    wal_gb = (series * (7200 / scrape_sec) * bytes_per_sample) / 1e9
    metadata_gb = 0.05 * (samples_gb + index_gb)  # tombstones, .meta, lock files
    total_gb = (samples_gb + index_gb + wal_gb + metadata_gb) * replicas
    iops = total_gb * iops_per_gb
    inr = total_gb * inr_per_gb_month
    return {"samples_gb": round(samples_gb, 1), "index_gb": round(index_gb, 1),
            "wal_gb": round(wal_gb, 2), "total_gb": round(total_gb, 0),
            "iops": int(iops), "inr_month": int(inr)}

# Four real-shaped fleets
fleets = [
    {"name": "Razorpay-payments",    "series": 1_200_000, "scrape": 15, "ret": 30, "rep": 3, "enc": "gorilla"},
    {"name": "Zerodha-Kite-trading", "series":   600_000, "scrape":  5, "ret": 90, "rep": 3, "enc": "gorilla"},
    {"name": "Hotstar-IPL-edge",     "series": 8_400_000, "scrape": 30, "ret": 14, "rep": 2, "enc": "gorilla"},
    {"name": "Razorpay-naive-2022",  "series": 1_200_000, "scrape": 15, "ret": 30, "rep": 3, "enc": "naive"},
]
rows = [{**dict(name=f["name"]), **size_tsdb(f["series"], f["scrape"], f["ret"], f["rep"], f["enc"])}
        for f in fleets]
print(pd.DataFrame(rows).to_string(index=False))
Sample run:
name samples_gb index_gb wal_gb total_gb iops inr_month
Razorpay-payments 269.4 1.2 0.75 815 2447 5871
Zerodha-Kite-trading 539.1 2.5 1.12 1629 4889 11733
Hotstar-IPL-edge 407.3 6.4 1.31 830 2492 5979
Razorpay-naive-2022 3,317.8 1.2 9.22 9985 29957 71896
The line bytes_per_sample = 1.3 if encoding == "gorilla" else 16.0 is the entire cost lever this chapter is about. The three Gorilla-encoded fleets land in the ₹6K-₹12K/month range; the naïve-encoded variant of Razorpay's own 2022 fleet lands at ₹72K/month — 12× more for the same workload. Why the IOPS column scales with disk size and not query rate: AWS gp2 and similar capacity-coupled volumes tie IOPS to allocated capacity (~3 IOPS per GB on gp2, the iops_per_gb default in the script), and gp3 gives a flat baseline with anything beyond it provisioned and billed separately. A TSDB's actual IOPS demand is bursty and scrape-driven (every 15 seconds, all pods write at once), but the provisioned floor scales with the volume size — meaning the larger naïve volume also forces you to provision more IOPS, even if your query workload would be happy with less. Inflating sample bytes therefore inflates the storage bill and the IOPS bill together, because storage and IOPS are coupled at the volume layer, which is the AWS default.
The Hotstar row is interesting: 8.4M series is the largest of the four, but the 30-second scrape interval and 14-day retention keep total disk under 1 TB. Scrape interval is a multiplicative knob — halving the interval doubles the samples bucket — and the 30-second choice for IPL edge nodes is a deliberate trade of detection latency for storage cost. A platform team that makes this decision deliberately is operating a different fleet than one that left the default at 15 seconds because no one looked at the cost.
The Zerodha row shows the opposite: a 5-second scrape interval, justified by the trading-floor SLO (a 200ms p99 target needs sub-15-second metrics to detect breaches before the next market tick). The cost is real — 539 GB of samples vs Razorpay-payments's 269 GB despite half the series count — but it is paid for a specific reason. The right cost level is the one that matches the SLO; the wrong one is the one copy-pasted from a tutorial.
Why this is a wall, not just a "performance tip"
The framing of Part 7 as "compression tricks" undersells what is happening. The Gorilla XOR encoding is not an optimisation — it is a discovery that float64 sample streams from real telemetry have ~14× lower entropy than their bit width suggests, and the encoding is what makes that entropy difference actually savable.
The wall has three faces. Disk is the obvious one — without compression, the operational cost of a TSDB at fleet scale is prohibitive. CPU is the second — every byte you don't write is also a byte you don't checksum, don't WAL, don't compact, don't read back at query time. The decode cost of Gorilla is a few nanoseconds per sample on modern x86, so the 12× space saving comes at near-zero compute cost. Network is the third — remote_write replication ships the encoded samples across pods, regions, or to a long-term store like Thanos or Mimir; doing that on naïve encoding would cost 12× the bandwidth, hit per-pod NIC limits at fleet scale, and turn replication into the dominant network traffic of the cluster.
A working PhonePe fleet experienced exactly this in 2024: a misconfigured remote_write shipping uncompressed samples to a long-term store ran out of inter-region bandwidth at 03:30 IST, dropped samples, lost a 12-minute window of metrics, and a UPI dispute investigation that needed those metrics had to fall back on logs. The fix was not "buy more bandwidth"; the fix was to enable the snappy-on-protobuf encoding that makes remote_write work the way Prometheus's local TSDB already worked. Compression is what makes the protocol viable at all.
The pedagogical position of this chapter: the rest of Part 6 (cardinality budgets, HLL approximate counting, native histograms, vendor cardinality limits) shows you how to bound the multiplier in front of bytes-per-sample. Part 7 (Gorilla, Prometheus chunks, Promscale hypertables, downsampling, rollups) shows you how to bound the bytes-per-sample itself. Both walls have to be hit. The team that hits only one wonders why their disk fills up despite "doing everything right".
Common confusions
- "If I cut cardinality, my disk shrinks proportionally." Partially — the index and WAL shrink with series count, but the samples bucket only shrinks if your sample-per-series rate also drops. A fleet with 1M series scraping every 15s has the same daily samples bucket whether the series are spread over 240 pods or 24 pods. Cardinality discipline reduces 30-40% of disk; the encoding choice reduces another 60%.
- "Gorilla compression only helps for slow-changing metrics." Misleading — Gorilla helps most for slow-changing metrics (1 bit per sample on a flat gauge), but it still compresses fast-changing metrics 4-6× because of the XOR-on-float64 structure. Even a noisy
cpu_seconds_total{cpu="0"}counter that climbs irregularly compresses ~6× because the IEEE-754 high-order bits are still shared across samples. There is no "doesn't compress" floor for real telemetry; there is only "compresses less". - "Disk is cheap, why optimise sample storage at all?" Disk is cheap; fast disk is not, and CPU/IO bandwidth scaling with disk is not. Prometheus reads the head block on every PromQL query, the WAL on every restart, the compacted blocks on every long-range query. Doubling sample bytes doubles the IOPS pressure on every one of those paths, not just the storage cost. The bill that grows is "Prometheus pod IOPS limit hit, query latency p99 went from 200ms to 4s on the dashboard" — that is the kind of wall that comes after the disk-cost one.
- "Naïve encoding is fine for short retention." No — even at 7-day retention, the IOPS-per-day pressure is the same. Compression reduces the rate at which the head block compacts to disk, the time WAL replay takes on restart, and the CPU of the scrape ingest path. Short retention reduces total disk but does nothing for the per-second pressure that compression also fixes.
- "Cardinality and encoding are different teams' problems." They overlap badly. A team running cardinality budgets (Part 6) sets ceilings on series count; the storage team running encoding choices (Part 7) sets ceilings on bytes per sample. Both ceilings have to be enforced together, or one team's victory becomes the other team's incident — exactly Razorpay's 2023 path, where the cardinality team finished, declared success, and the storage team had to ship the encoding upgrade four months later because the disk filled up despite the cardinality wins.
- "Gorilla XOR is a Prometheus thing." It is the Facebook/Meta Gorilla 2015 paper, adopted essentially unchanged by Prometheus, M3DB, VictoriaMetrics, Cortex, Mimir, InfluxDB (later renamed TSI), and most ClickHouse-backed metrics layers. The algorithm is the de-facto industry standard for in-memory and short-window TSDB sample storage. Long-window storage uses additional tricks (downsampling, rollups, dictionary encoding for shared exponent bits) on top of Gorilla, not instead of.
Going deeper
Why float64 samples have so much redundant entropy in practice
The Gorilla paper measured Facebook's production metrics and found that roughly half of consecutive samples are bit-identical to the previous value (the XOR is zero), and most of the rest XOR to a narrow band of meaningful bits. This is not because metrics are simple — it is because observability metrics measure real-world quantities that change slowly and in small steps between scrapes, and float64's IEEE-754 layout puts the sign, exponent, and high-order mantissa bits at the top of the bit pattern. Two consecutive samples of a counter that climbs in whole units — 45,230 → 45,240 — share their sign, exponent, and all but the lowest few mantissa bits, so the XOR has long runs of leading and trailing zeros. A gauge with a long fractional part (45.123 → 45.587) shares fewer mantissa bits and compresses less, which is the 4-6× case described in the confusions above.
A pure-random float stream would not compress this way. The reason production metrics compress is that they are bandlimited — physical quantities don't jump 14 orders of magnitude between consecutive 15-second scrapes, so consecutive bit patterns are similar. Why this matters for synthetic test workloads: a benchmark that uses random.uniform(0, 1e9) to populate a TSDB will report Gorilla compression ratios of 1.5-2× because the synthetic stream has full-range entropy. A benchmark that uses a smooth ramp or a real Prometheus dump will report 12-16×. The algorithm is honest; the synthetic data is the dishonest part. Always benchmark with a real workload export — a promtool tsdb dump of a real series is the canonical test input.
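The honesty gap is easy to demonstrate with the encoder from earlier in this chapter; a minimal sketch that feeds gorilla_encode three value streams of very different entropy (it assumes storage_wall.py from above sits in the same directory, and importing it will re-print that script's output first):
# honest_benchmark.py — same encoder, three value streams of very different entropy
# (importing storage_wall re-runs it, so its print lines appear first)
import numpy as np
from storage_wall import gorilla_encode, timestamps, N

np.random.seed(1)
streams = {
    "flat gauge (142.0)"   : np.full(N, 142.0),
    "smooth counter"       : np.cumsum(np.random.exponential(40, N)),
    "uniform random 0..1e9": np.random.uniform(0, 1e9, N),
}
for name, vals in streams.items():
    nbytes = len(gorilla_encode(timestamps, vals))
    print(f"{name:<22} {nbytes / N:5.2f} B/sample  ({16 * N / nbytes:4.1f}x vs naive)")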
This is also why coordinated-omission-style measurement matters here: a sample stream that drops the slow-changing region of the curve (because the scraper was overloaded) will look more random than the underlying signal, and will compress worse. A correctly-instrumented TSDB sample stream is the input the algorithm was designed for.
The relationship between compression and chunk size
Prometheus stores samples in chunks — the head block cuts a new chunk roughly every 120 samples (about 30 minutes at a 15s scrape interval). The chunk is the unit of Gorilla encoding. Smaller chunks compress worse (less context for the XOR comparison, and the first sample of every chunk is stored in full); larger chunks compress better but hold more open state in the head block (a chunk stays in memory until it is cut). The 120-sample default is a tuning point that balances compression ratio against head-block memory.
A chunk that holds more samples amortises its header and full-width first sample over more data: a 360-sample chunk compresses slightly better (on the order of 10%), a 30-sample chunk noticeably worse (on the order of 30%) — the sketch below makes the direction visible. Because the chunk is cut by sample count, the scrape interval mostly changes how much wall-clock time a chunk covers rather than how well it compresses: at a 15s interval a chunk spans ~30 minutes, at 60s it spans two hours. Why this couples to the next chapter (Prometheus chunks, ch.43): the chunk is also the unit of disk reads at query time. At a 15s interval, a query that asks for 1 hour of data reads 2 chunks; a query that asks for 30 days reads ~1,440 chunks. The chunk size therefore bounds query-time IOPS, not just write-time compression. Picking 120 was a multi-objective optimisation across compression ratio, head-block memory, query IOPS, and WAL replay time — and the compression ratio is just one face of the tradeoff.
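The coupling is visible with the same idealised encoder; a minimal sketch that re-encodes the storage_wall.py stream at three chunk lengths (again assuming storage_wall.py is importable from the same directory — the byte-aligned encoder exaggerates the per-chunk header cost relative to the real bitstream, but the direction is the same):
# chunk_sweep.py — re-encode the same stream at three chunk lengths
# (importing storage_wall re-runs it, so its print lines appear first)
from storage_wall import gorilla_encode, timestamps, values, N

for chunk_len in (30, 120, 360):       # shorter and longer than the 120-sample default
    total = sum(len(gorilla_encode(timestamps[i:i + chunk_len], values[i:i + chunk_len]))
                for i in range(0, N, chunk_len))
    print(f"chunk of {chunk_len:>3} samples : {total / N:5.2f} B/sample")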
When to pick LZ4/Snappy over Gorilla
Gorilla is purpose-built for (int64 timestamp, float64 value) pairs sampled at near-fixed intervals. It does not generalise to arbitrary byte streams. For two adjacent use cases, you want different algorithms:
- OTLP / remote_write over the network: the wire payload is protobuf-encoded sample batches, not raw float pairs. The compression here is Snappy on the protobuf bytes — fast (1-2 GB/s on commodity CPUs), modest ratio (~3×). Gorilla doesn't apply because the payload includes labels, metadata, and protobuf framing, none of which are float-pair-shaped. Prometheus's remote_write uses Snappy precisely for this; Gorilla is the inner format and Snappy is the wire format.
- Long-term archival to S3 (Thanos, Cortex): blocks compacted to 2-hour or 24-hour granularity get double-compressed — Gorilla on the samples, then Zstd on the resulting block bytes. Zstd captures the inter-chunk and inter-series similarities that Gorilla cannot see (it only looks within a single chunk). The combined ratio is 18-22× over naïve, vs Gorilla's 12-14× alone. The decode cost is higher but acceptable for archival queries.
The lesson: Gorilla is the right algorithm for one specific job (in-memory and recent-disk sample storage), and the wrong algorithm for two adjacent jobs. A correct TSDB stack uses three different compression algorithms in three different layers. Knowing which to pick for which layer is what separates a working storage tier from one that ships its mistakes to production.
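A rough feel for those other two layers, using general-purpose compressors on a remote_write-shaped byte blob — the payload below is a hand-rolled stand-in, not the real protobuf encoding, and python-snappy and zstandard are the common Python bindings, not anything Prometheus ships:
# wire_layers.py — Snappy for the wire, Zstd for the archive (ratios depend on your data)
# pip install python-snappy zstandard
import snappy, zstandard, struct

# A remote_write-shaped payload: repeated label strings plus (ts, value) pairs.
labels = b'__name__="http_requests_total",job="api",instance="10.0.3.%d:9090"'
payload = bytearray()
for i in range(500):
    payload += labels % (i % 40)
    ts, v = 1_700_000_000_000 + i * 15_000, 1000.0 + i * 0.4
    payload += struct.pack(">qd", ts, v)
payload = bytes(payload)

snappy_out = snappy.compress(payload)
zstd_out = zstandard.ZstdCompressor(level=3).compress(payload)
print(f"raw    : {len(payload):7,} B")
print(f"snappy : {len(snappy_out):7,} B  ({len(payload)/len(snappy_out):4.1f}x)  — wire layer")
print(f"zstd   : {len(zstd_out):7,} B  ({len(payload)/len(zstd_out):4.1f}x)  — archival layer")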
The CPU cost — Gorilla is essentially free at decode
The encode and decode cost of Gorilla on modern x86 is dominated by count-leading-zeros and count-trailing-zeros, which are single instructions with single-cycle throughput on current Intel/AMD/ARM cores. A typical decode runs at ~3 ns per sample on a Xeon Gold from 2022, which is ~330M samples/sec per core. A Prometheus instance at 1M series and a 15s scrape interval needs to encode ~67K samples/sec — a few hundredths of one percent of a single core. The decode cost at query time is similar.
This is why Gorilla replaced naïve storage essentially everywhere: it costs nothing. There is no CPU/space tradeoff to negotiate. The only cost is implementation complexity (the variable-length bitstream is fiddly), which is paid once by the TSDB authors, not by every operator. Why this changes the engineering ethics of "premature optimisation" in TSDB design: the usual warning ("don't optimise without measuring; readability matters") doesn't apply when the algorithm is both 12× more space-efficient and 0× more CPU-expensive. There is no tradeoff to measure. The Prometheus team ships Gorilla because it is unambiguously better; the lesson for storage-engine authors is that some optimisations are pure-Pareto wins and the question is "have we shipped them yet", not "do we need them".
Reproducibility footer
# Reproduce on your laptop — measures the naïve vs Gorilla gap on a synthetic counter stream
python3 -m venv .venv && source .venv/bin/activate
pip install numpy
python3 storage_wall.py
# Optional: dump real Prometheus data for a more realistic measurement
docker run -d -p 9090:9090 -v $(pwd)/data:/prometheus prom/prometheus:v2.51.0
# Wait an hour, then:
docker exec <prom-container> promtool tsdb dump /prometheus > samples.txt
# Re-run storage_wall.py over the dumped samples for production-shaped numbers
Where this leads next
The wall this chapter names is the per-sample storage cost — the second of the two walls of metrics-cost engineering, the first being cardinality. Part 7 begins on the other side of this wall and walks through how the industry actually pays the bill: Gorilla XOR (the algorithm), Prometheus chunks (how the algorithm is packaged into the on-disk format), Promscale and TimescaleDB hypertables (the SQL-layer alternative for teams that want PostgreSQL semantics), downsampling and rollups (how to keep history affordable past the head block), and continuous aggregates (how to query a downsampled history without re-aggregating at query time).
After Part 7 ends, the next wall is the kernel — the single sample inside the user-space TSDB is fast, but the question of "how did the sample get measured at all" leads into Part 8 (eBPF observability) and a different family of constraints. The arc of the curriculum is: emit it (Parts 1-4), decide what to keep (Part 5), bound the catalogue (Part 6), bound the bytes (Part 7), measure inside the kernel (Part 8). Each part ends at a wall the next part is built to climb.
- Cardinality budgets — the in-team discipline for the first wall, the cardinality multiplier.
- Why high-cardinality labels break TSDBs — the index-side mechanism that pairs with the sample-side mechanism described here.
- Cardinality limits in Prometheus, Datadog, Honeycomb — vendor-side accounting for the same wall, one chapter back.
- Gorilla compression (double-delta + XOR) — the next chapter; the algorithm in full.
- Prometheus chunks — the on-disk format that packages the algorithm.
The single insight of this chapter: the per-sample byte count is its own budget, separate from cardinality, and a fleet that has won the cardinality battle still has to win the encoding battle. The 16-byte naïve sample is what every junior engineer reaches for; the 1.3-byte Gorilla sample is what every production TSDB ships. The 12× gap is what an entire era of TSDB engineering — from Facebook's 2015 paper to today's Mimir clusters at Razorpay scale — was built to capture. Part 7 is how.
References
- Pelkonen et al., "Gorilla: A Fast, Scalable, In-Memory Time Series Database" (VLDB 2015) — the foundational paper. The 1.37 bytes/sample figure and the delta-of-delta + XOR algorithm both come from here. Read sections 4.1 and 4.2 for the encoding details.
- Prometheus TSDB design notes — the official Prometheus storage page; describes the head-block, WAL, and chunk format in operational terms.
- Fabian Reinartz, "Writing a Time Series Database from Scratch" — the Prometheus 2.0 redesign blog, by the original TSDB author. Explains why chunks are 120 samples and why the head-block is the way it is.
- VictoriaMetrics blog, "How VictoriaMetrics compresses time-series data" — VictoriaMetrics's variant of the Gorilla recipe with a few additional tricks (dictionary encoding, gauge-vs-counter detection).
- Charity Majors, Liz Fong-Jones, and George Miranda, Observability Engineering (O'Reilly, 2022), Ch. 8 — the modern-era framing of why TSDB economics shape what teams instrument; pairs the cardinality-budget framing of Part 6 with the storage-cost framing of Part 7.
- Prometheus tsdb package source — the chunk-encoder source code for XORChunk (Gorilla) is the most readable production implementation in any language.
- Cardinality budgets — the previous-part wall this chapter sits next to.