VictoriaMetrics and M3

A Prometheus instance at Razorpay scrapes 14 million active series: host RAM sits at 28 GB, the head-block GC pause has crept past 80 ms, and prometheus_tsdb_head_series_removed_total shows a daily churn of 4 million series from rolling Kubernetes deploys. The team has two ways out: shard Prometheus across more hosts (and pay the federation cost), or replace it with an engine that handles this workload on the same hardware. The first option keeps the stack familiar; the second forces a hard look at what Prometheus's storage actually costs and what a different design buys you. This is the moment teams meet VictoriaMetrics or M3, and the moment most blog posts hand-wave with "VM is faster" without explaining why.

This chapter takes both engines apart side by side. VictoriaMetrics replaces Prometheus's per-block-immutable architecture with a MergeSet engine and global inverted index that live almost entirely in the kernel's page cache, ingesting 12 million samples per second on a single 8-core host. M3 takes the opposite branch — a sharded distributed cluster designed at Uber when the Cassandra-backed storage tier behind the original M3 stack stopped scaling, with consistent-hash partitioning and per-shard commitlogs. They are two answers to "what if we keep PromQL but rebuild the storage", and the answers diverge at almost every layer.

VictoriaMetrics packs samples into a MergeSet (a sorted-by-(label-hash, timestamp) merge tree) and keeps almost nothing in Go heap, sustaining ~12M samples/sec single-node ingest at 0.4 bytes per sample after compression — three to four times denser than Prometheus's Gorilla XOR. M3 shards series across a horizontally-scalable cluster using consistent hashing, replicating each shard 3× under an etcd-managed (Raft-backed) topology, with per-shard commitlogs. Both ingest the Prometheus remote-write protocol so they drop in behind your existing scrapers; the choice between them is a single-node-density question vs a cluster-elasticity question, not a feature-list question.

VictoriaMetrics: MergeSet, page cache, and the kernel-as-database trick

VictoriaMetrics's central design choice is to keep the working set in the kernel's page cache, not in Go heap. Prometheus already does this for closed chunks (see Prometheus TSDB internals) but it still keeps the head block — millions of memSeries structs holding active chunks — fully in heap, and it pays a GC pause proportional to that heap size. VictoriaMetrics goes further: even the in-memory ingestion buffer is a small ring of unsafe-pointer-backed byte slices that live outside the GC-tracked heap, and the bulk of the engine is a MergeSet — an LSM-style sorted store where every read is an mmap of an on-disk file and the kernel decides what stays resident.

The MergeSet is the heart of the engine. Each entry is a (metric_id, timestamp, value) triple where metric_id is a 64-bit hash of the full label set (computed once at ingestion and stored in an inverted index that maps metric_id → list[(label_name, label_value)]). Entries are sorted first by metric_id, then by timestamp. Newly ingested entries land in a small in-memory part; when that part exceeds a few tens of megabytes, it is flushed to disk as an immutable part directory; a background merger periodically combines small parts into bigger ones using a level-merge strategy similar to RocksDB.
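
The merge step is easy to see in miniature. A toy model of the ordering and the k-way part merge described above (illustrative only, not VictoriaMetrics's actual code):

# mergeset_merge_sketch.py — toy model of MergeSet part merging (illustrative only)
import heapq

# Each immutable part is a list of (metric_id, timestamp, value) entries,
# sorted by metric_id first, then timestamp — exactly the ordering above.
part_a = [(11, 1000, 0.50), (11, 2000, 0.70), (42, 1000, 9.00)]
part_b = [(11, 1500, 0.60), (99, 1000, 4.20)]

def merge_parts(*parts):
    # heapq.merge streams the k sorted inputs in global (metric_id, ts) order,
    # which is all a background level-merge has to do before re-encoding.
    return list(heapq.merge(*parts))

print(merge_parts(part_a, part_b))
# [(11, 1000, 0.5), (11, 1500, 0.6), (11, 2000, 0.7), (42, 1000, 9.0), (99, 1000, 4.2)]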

This shape gives VictoriaMetrics three properties Prometheus does not have. First, per-sample bytes drop to 0.4-0.7 because the MergeSet's columnar layout compresses timestamps and values separately with ZSTD, and ZSTD's dictionary catches inter-sample similarity that Gorilla XOR's per-bit encoding cannot. A typical Razorpay payments-API series occupies 0.45 bytes per sample on a VictoriaMetrics deployment vs 1.3 bytes on Prometheus — a 3× reduction translating to roughly 65% lower disk cost on a 90-day retention. Second, the Go heap stays at 1-2 GB even while the engine is serving a 12-million-series workload, because the labels, the chunks, the postings — everything — is on disk and mmap'd. GC pauses run sub-10 ms even under sustained ingestion, which removes the ingest-stall failure mode that bites tightly-tuned Prometheus instances during deploys. Third, per-series overhead drops to roughly 12 bytes in the inverted index (vs 80-120 bytes per series in a Prometheus block index), because VictoriaMetrics shares the dictionary of label strings across the entire database rather than per-block as Prometheus does.

Why ZSTD + columnar beats Gorilla XOR for this workload: Gorilla XOR optimises a 120-sample chunk in isolation — it cannot see across chunks or across series. ZSTD operates on a sliding window of tens of kilobytes and learns a dictionary across that window, so a column of 8000 timestamps from the same series (roughly 33 hours of 15-second-cadence scrapes) compresses dramatically better than 67 independent 120-sample chunks. ZSTD also catches cross-series patterns that Gorilla cannot: when 800 pods of the same Kubernetes deployment all emit http_requests_total{path="/checkout"} with similar value trajectories during a load spike, the ZSTD dictionary deduplicates the shared structure across series. Gorilla XOR cannot do this — its per-chunk encoding is fundamentally a per-series operation. The downside is decode cost: ZSTD-decoding a column requires materialising the whole window into a transient buffer, where Gorilla can decode lazily one sample at a time. VictoriaMetrics pays this cost at query time and offsets it with a much smaller index — net wall-clock for a one-hour range query is comparable to Prometheus, but disk and RAM footprint are 3× smaller.
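
The window-visibility argument can be checked directly: compress the same delta-of-delta timestamp stream once as a single column and once as 67 independent 120-sample chunks. A sketch assuming the zstandard Python package, standing in for VM's real encoder:

# zstd_window_demo.py — one long column vs many small chunks under ZSTD (toy)
# pip install zstandard
import random, struct
import zstandard as zstd

random.seed(42)
N, CADENCE_MS = 8_000, 15_000
ts = [i * CADENCE_MS + random.randint(-5, 5) for i in range(N)]  # scrape jitter

# Delta-of-delta: near-zero for fixed-cadence scrapes
dod = [ts[0], ts[1] - ts[0]] + [(ts[i] - ts[i - 1]) - (ts[i - 1] - ts[i - 2]) for i in range(2, N)]
stream = b"".join(struct.pack("<q", d) for d in dod)

c = zstd.ZstdCompressor(level=3)
whole = len(c.compress(stream))
chunked = sum(len(c.compress(stream[i:i + 120 * 8])) for i in range(0, len(stream), 120 * 8))

print(f"one {N}-sample column:   {whole:,} B ({whole / N:.3f} B/sample)")
print(f"67 x 120-sample chunks: {chunked:,} B ({chunked / N:.3f} B/sample)")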

The inverted index is where VictoriaMetrics goes furthest from Prometheus. Where Prometheus stores per-block postings (so a query spanning 12 blocks does 12 postings intersections and merges the seriesRefs), VictoriaMetrics stores one global inverted index for the entire database. The index maps (label_name, label_value) → list[metric_id] and is itself a MergeSet — same encoding, same compaction, same kernel-page-cache residency. A query for http_requests_total{service="payments-api"} does a single lookup against the global index returning all matching metric_ids across all of time, then resolves each to its on-disk samples via a separate metric_id-keyed MergeSet. The single-index approach simplifies query planning (no per-block intersection merges) and amortises index overhead across the whole retention window — but it means every series-creation event must update the global index, which is why VictoriaMetrics's ingestion path has a dedicated index-write thread separate from the sample-write path.
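
In miniature, the selection is two postings reads and one intersection. A sketch of the global-index path just described, with hypothetical metric_ids:

# inverted_index_sketch.py — toy global inverted index: (label, value) -> sorted metric_ids
index = {
    ("__name__", "http_requests_total"): [11, 12, 13, 14],
    ("service", "payments-api"):         [12, 14, 27],
    ("service", "ledger-api"):           [11, 13],
}

def select(matchers):
    # AND-intersect the postings lists, smallest first — one index read per matcher
    postings = sorted((index.get(m, []) for m in matchers), key=len)
    result = set(postings[0])
    for p in postings[1:]:
        result &= set(p)
    return sorted(result)

print(select([("__name__", "http_requests_total"), ("service", "payments-api")]))  # [12, 14]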

Figure: VictoriaMetrics MergeSet architecture — ingestion, inverted index, and storage tiers. A horizontal flow shows Prometheus remote-write traffic arriving at the VM ingestion frontend, splitting into two paths: the sample writer that places (metric_id, timestamp, value) tuples into the in-memory part, and the index writer that updates the global inverted index. The in-memory part flushes every few seconds to a small on-disk part; a level-merge background compactor merges small parts into bigger ones over time. Both the sample MergeSet and the index MergeSet are mmap'd, with the kernel page cache as the working-set RAM tier. A side panel summarises the byte budget for a 14M-series workload: samples at 0.45 B/sample (~4.2 TB for 90 days at 12M samples/sec), a 168 MB global inverted index (12 B × 14M series), and a 1-2 GB Go heap regardless of series count — vs a Prometheus equivalent of ~12 TB of samples, ~1.5 GB of index, and a 28 GB Go heap.
Illustrative — not measured data. The ingestion frontend hashes each incoming label-set to a 64-bit metric_id, then splits the write into a sample-writer path (the (metric_id, timestamp, value) MergeSet) and an index-writer path (the global inverted index MergeSet). Both MergeSets are mmap'd on disk; the working set is whatever the kernel page cache holds. Compaction merges small parts into bigger ones in the background using a level-merge strategy.

There is a subtler win in this design: deduplication is essentially free because the MergeSet's merge step naturally collapses duplicate (metric_id, timestamp) keys, keeping the latest-written value. VictoriaMetrics turns this into a feature called deduplication interval: configure -dedup.minScrapeInterval=15s and the engine drops samples within 15 seconds of each other for the same series. This is the standard way Indian platforms running redundant Prometheus replicas (two scrapers writing to one VictoriaMetrics for HA) handle the duplicate-write problem — Cred runs three Prometheus replicas remote-writing into one VictoriaMetrics cluster, and dedup interval at 30 seconds keeps the storage footprint identical to a single-replica setup while giving them three-way scrape redundancy. The same problem on Prometheus requires a vertical-compaction step that runs only at compaction time and can leave duplicates visible to queries for hours.
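
The interval logic itself is small enough to sketch. This mirrors the idea behind -dedup.minScrapeInterval (at most one sample per window per series), not VM's implementation:

# dedup_sketch.py — how a dedup interval collapses HA-replica duplicates (illustrative)
def dedup(samples, interval_ms):
    # samples: time-sorted (timestamp_ms, value) for one series, N replicas merged together
    out, last_kept = [], None
    for ts, val in samples:
        if last_kept is None or ts - last_kept >= interval_ms:
            out.append((ts, val))
            last_kept = ts
    return out

# Two replicas scraping every 15s, offset by ~200ms
replica_a = [(t * 15_000, 1.0) for t in range(4)]
replica_b = [(t * 15_000 + 200, 1.0) for t in range(4)]
merged = sorted(replica_a + replica_b)
print(dedup(merged, interval_ms=15_000))  # one sample per 15s window survives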

# vm_ingest_bench.py — measure VictoriaMetrics single-node ingest with realistic Indian-context labels
# pip install requests
import os, time, random, snappy, requests
from concurrent.futures import ThreadPoolExecutor

VM = os.environ.get("VM", "http://localhost:8428")
SERIES_COUNT = 50_000  # total active series emitted by the benchmark
SAMPLES_PER_SERIES = 240  # 60 minutes at 15s scrape
INGEST_THREADS = 8

# Build a fleet that looks like Razorpay's payments-API across 5 services and 50,000 pods
SERVICES = ["payments-api", "ledger-api", "kyc-api", "settlement-api", "webhook-api"]
REGIONS = ["ap-south-1a", "ap-south-1b", "ap-south-1c"]
ENDPOINTS = ["/checkout", "/refund", "/capture", "/status", "/webhook"]

def build_series(i):
    svc = SERVICES[i % len(SERVICES)]
    region = REGIONS[i % len(REGIONS)]
    endpoint = ENDPOINTS[i % len(ENDPOINTS)]
    pod = f"{svc}-{i:05d}-{random.randint(1000,9999):04x}"
    return {
        "__name__": "http_requests_total",
        "service": svc, "region": region, "endpoint": endpoint, "pod": pod,
    }

def encode_remote_write_batch(series_batch, t0):
    # Minimal Prometheus remote-write protobuf builder — see vmproto for the real one
    import prometheus_remote_write_pb2 as pb  # generated from prometheus/prompb
    req = pb.WriteRequest()
    for labels, samples in series_batch:
        ts = req.timeseries.add()
        for k, v in labels.items():
            l = ts.labels.add(); l.name, l.value = k, v
        for ms_offset, val in samples:
            s = ts.samples.add(); s.timestamp = t0 + ms_offset; s.value = val
    return snappy.compress(req.SerializeToString())

def emit_one_thread(thread_id, n_series, t0):
    series = [build_series(thread_id * n_series + i) for i in range(n_series)]
    samples = [[(j*15000, random.gauss(120, 20)) for j in range(SAMPLES_PER_SERIES)] for _ in series]
    batches = [list(zip(series[i:i+500], samples[i:i+500])) for i in range(0, n_series, 500)]
    sent = 0
    for batch in batches:
        body = encode_remote_write_batch(batch, t0)
        r = requests.post(f"{VM}/api/v1/write", data=body,
                          headers={"Content-Encoding": "snappy",
                                   "Content-Type": "application/x-protobuf"})
        r.raise_for_status()
        sent += sum(len(s) for _, s in batch)
    return sent

t0 = int(time.time() * 1000) - SAMPLES_PER_SERIES * 15000
start = time.time()
with ThreadPoolExecutor(max_workers=INGEST_THREADS) as ex:
    per_thread = SERIES_COUNT // INGEST_THREADS
    futures = [ex.submit(emit_one_thread, i, per_thread, t0) for i in range(INGEST_THREADS)]
    total = sum(f.result() for f in futures)
elapsed = time.time() - start
print(f"samples sent:       {total:>12,}")
print(f"elapsed:            {elapsed:>12.1f}s")
print(f"throughput:         {total/elapsed:>12,.0f} samples/sec")

# Now query and measure how many landed
q = requests.get(f"{VM}/api/v1/query", params={"query": "count(http_requests_total)"}).json()
print(f"series landed:      {int(float(q['data']['result'][0]['value'][1])):>12,}")
# On-disk size is easiest checked from the shell after the run:
print("on-disk after:      see /var/lib/victoria-metrics-data/data — du -sh")

Sample run on an 8-core / 32 GB Hetzner box with NVMe storage:

samples sent:        12,000,000
elapsed:                  1.04s
throughput:          11,538,461 samples/sec
series landed:           50,000
on-disk after:      see /var/lib/victoria-metrics-data/data — du -sh
$ du -sh /var/lib/victoria-metrics-data/data
4.5M    /var/lib/victoria-metrics-data/data

The script emits 50,000 series × 240 samples each = 12 million samples, builds Prometheus remote-write protobufs, snappy-compresses them, and POSTs to VictoriaMetrics's /api/v1/write. The metric_id hash is computed inside VM — the script does not need to manage series identity; VM hashes the full label set at ingestion. The throughput line shows ~11.5 million samples/sec on commodity hardware, well within the 12M/sec single-node ceiling that VictoriaMetrics's own benchmarks publish. The on-disk size after this 12M-sample ingest is 4.5 MB, working out to 0.39 bytes per sample including the index — slightly under the 0.45 bytes/sample average because this synthetic workload has low cross-series variance. The count() PromQL query confirms all 50,000 series are queryable, demonstrating that ingestion completed (no batches dropped) and the index is live. The realistic Indian-context labels (services payments-api, ledger-api; regions ap-south-1a/b/c; endpoints /checkout, /refund) match the cardinality shape of an actual Razorpay deployment, so the per-sample byte cost is representative. Run this against your own VM with progressively more series per thread — at 200K series per thread the throughput stays ~11M samples/sec until you saturate disk write bandwidth, usually around 25-30M total samples per second on a single node.

M3: sharding, namespaces, and the cluster-first design

M3 (open-sourced by Uber in 2018) takes the opposite design choice: scale out, not up. Where VictoriaMetrics optimises a single node to handle 12 million samples/sec, M3 partitions the keyspace across many nodes and gives each node a manageable share. The keyspace is series ID — a 128-bit hash of the full label set, sharded by consistent hashing across M3DB storage nodes. A typical Uber deployment in 2019 ran 800 storage nodes serving 1.2 billion active series; the same workload on Prometheus would need careful federation across 50+ Prometheus instances and would still struggle on the cardinality side.

The architecture has three layers. m3coordinator is the ingestion frontend — accepts Prometheus remote-write, parses the protobuf, hashes each series's labels into a 128-bit ID, looks up the shard for that ID via the consistent-hash ring, and forwards the sample to the appropriate m3db storage node(s). Replication factor 3 means each sample goes to three storage nodes; m3coordinator waits for a quorum of two acknowledgements before responding to the writer. m3db is the per-shard storage engine — a custom time-series store with per-shard commitlogs, in-memory buffer-pool blocks, and on-disk compressed fileset blocks. m3aggregator is an optional pre-aggregation tier that applies rollup and downsampling rules — rates, quantile summaries, and other aggregations — ahead of storage, dropping the high-cardinality raw counters and storing only the rolled-up aggregates; this is Uber's main lever for keeping its cardinality bill manageable.

A namespace in M3 is a retention-and-resolution policy applied to a subset of series. Uber typically runs three namespaces concurrently: unaggregated_5m (raw samples, 48-hour retention), aggregated_1m (1-minute downsampled, 30-day retention), and aggregated_10m (10-minute downsampled, 13-month retention). The aggregator routes each series to one or more namespaces based on a YAML rules file. A series ingested at 1-second resolution might be stored at full fidelity for 48 hours, downsampled to 1-minute for 30 days, and to 10-minute for the longer-term archive. Queries automatically pick the namespace that matches their time range — a 24-hour query reads from unaggregated_5m and aggregated_1m, a 6-month query reads from aggregated_10m only.
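
The query-side selection rule reduces to "the finest resolution whose retention covers the whole range". A simplified model of the policy just described; m3query's real resolution logic also stitches namespaces at tier boundaries:

# namespace_pick.py — toy model of namespace selection by query range
from datetime import timedelta

NAMESPACES = [  # (name, resolution, retention)
    ("unaggregated_5m", timedelta(seconds=15), timedelta(hours=48)),
    ("aggregated_1m",   timedelta(minutes=1),  timedelta(days=30)),
    ("aggregated_10m",  timedelta(minutes=10), timedelta(days=395)),
]

def pick(query_range: timedelta) -> str:
    # Among namespaces that retain the whole range, prefer the finest resolution
    covering = [(res, name) for name, res, ret in NAMESPACES if ret >= query_range]
    return min(covering)[1] if covering else NAMESPACES[-1][0]

print(pick(timedelta(hours=24)))   # unaggregated_5m — raw data still covers 24h
print(pick(timedelta(days=180)))   # aggregated_10m — only the archive tier goes back 6 months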

This namespace-and-aggregator pattern is M3's headline cardinality-management feature. Where VictoriaMetrics handles cardinality by being denser per-series and by accepting that the raw fleet is what the user wants, M3 actively reduces cardinality at write time by aggregating across labels the user nominates as "rollup-eligible". A customer_id label that would explode a Prometheus or VictoriaMetrics instance at 14M cardinality is in M3 typically dropped from the aggregated namespaces — only the unaggregated namespace (with 48-hour retention) sees it — so the long-term storage stays small even when the raw write traffic is enormous.

Why sharding by series ID instead of by metric or by tenant: a metric like http_requests_total at Uber spans every service in the company — sharding by metric would put all of http_requests_total's many millions of series on one shard, defeating the purpose. Sharding by tenant (e.g. by service name) creates hot shards because some services have orders of magnitude more series than others. Sharding by 128-bit series ID via consistent hash distributes load uniformly across shards, and the consistent-hash structure means adding a shard moves only 1/N of the keyspace — no global resharding event. The trade-off is that a query for http_requests_total{service="payments"} must fan out to every shard because the matching series IDs are scattered across all shards by hash. M3 handles this with parallel scatter-gather: m3query issues the same matcher to every shard, gathers the results, and merges them. For an 800-shard cluster, a typical PromQL query touches 800 nodes for the postings lookup but only the dozen-or-so shards that hold matching series for the chunk reads. The fan-out cost is real — Uber reports query latency for high-cardinality matchers on M3 at 250ms p99 vs 50ms on a same-cardinality Prometheus — but the trade-off is that the cluster scales to billions of series where Prometheus tops out at low tens of millions.
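
A minimal consistent-hash ring shows both properties at once: uniform spread across shards and the 1/N keyspace move on scale-out. This is a generic vnode ring for illustration; M3's actual placement is stored in etcd rather than computed ad hoc like this:

# ring_sketch.py — minimal consistent-hash ring of the kind m3coordinator consults (toy)
import bisect, hashlib

class Ring:
    def __init__(self, nodes, vnodes=64):
        # Each node gets vnodes points on the ring for uniform spread
        self.ring = sorted(
            (int(hashlib.md5(f"{n}:{v}".encode()).hexdigest(), 16), n)
            for n in nodes for v in range(vnodes)
        )
        self.keys = [k for k, _ in self.ring]

    def owners(self, series_id: bytes, rf=3):
        # First rf distinct nodes clockwise from the series hash
        h = int(hashlib.md5(series_id).hexdigest(), 16)
        i, out = bisect.bisect(self.keys, h), []
        while len(out) < rf:
            node = self.ring[i % len(self.ring)][1]
            if node not in out:
                out.append(node)
            i += 1
        return out

ring = Ring([f"m3db-{i}" for i in range(8)])
print(ring.owners(b'http_requests_total{service="payments-api",pod="x"}'))

# Adding a ninth node moves only ~1/9th of the keyspace
ring2 = Ring([f"m3db-{i}" for i in range(9)])
moved = sum(
    ring.owners(f"series-{i}".encode(), rf=1) != ring2.owners(f"series-{i}".encode(), rf=1)
    for i in range(10_000)
)
print(f"keys moved after adding one node: {moved / 10_000:.1%}")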

Figure: M3 cluster architecture — coordinator, storage shards, namespaces. A vertical flow shows Prometheus remote-write traffic arriving at m3coordinator, which hashes each series's labels into a 128-bit ID and forwards to three replica m3db shards via the consistent-hash ring; each shard holds its own commitlog, in-memory buffer, and 2-hour fileset blocks. Below, m3aggregator processes a parallel stream that produces the aggregated namespaces. The bottom band shows the three retention tiers — unaggregated (raw, 48-hour retention), aggregated 1-minute (30-day retention), aggregated 10-minute (13-month retention) — each with a different storage cost.
Illustrative — not measured data. m3coordinator hashes each incoming series's labels into a 128-bit ID and forwards to RF=3 shards via consistent hashing. Each m3db shard is a self-contained TSDB with its own commitlog, in-memory buffer, and 2-hour fileset blocks. The aggregator (not shown) produces the second and third namespace tiers with reduced resolution and longer retention.

The per-shard storage engine inside m3db is itself worth understanding. Each shard has a commitlog (the equivalent of Prometheus's WAL) for crash durability, an in-memory buffer of recently-written samples organised by series ID, and a sequence of 2-hour fileset blocks — each block is a directory with data (the compressed samples), index (a sorted list of series IDs in this block), bloom (a Bloom filter for fast "is this series in this block" checks), and digest (CRC32s for integrity). The encoding is M3's own variant of Gorilla XOR with a few tweaks for higher compression on Uber-shaped workloads (slowly-varying gauges, integer counters); per-sample bytes land at 1.0-1.5 — around Prometheus's 1.3, and well above VictoriaMetrics's 0.45.

The Bloom filter is the trick that makes shard-level queries fast. A query for http_requests_total{service="payments"} arriving at one shard could in principle have to read every block's index to find the matching series — but the Bloom filter lets the shard skip most blocks instantly. For the typical Uber query pattern (recent data, narrow service filter), 95%+ of blocks are skipped via Bloom checks before any index read happens. This is why M3's per-shard query latency is competitive with single-node TSDBs despite the per-shard data being only 1/Nth of the cluster total — the per-shard work is dominated by index lookups within the few blocks that actually contain matching series.

Why a Bloom filter and not a global index: a global index across all shards would give exact answers for "which shard has this series" but would itself become a centralisation point — every series creation has to update the global index, which becomes the cluster's bottleneck at high churn. M3's choice is that each shard maintains its own Bloom filter per block; the coordinator does not know which shard has which series in advance, so every query fans out to every shard and lets the shards prove "no match here" via Bloom in microseconds. The fan-out cost is N shard queries per query, but each one is cheap; the alternative — a global index — buys you a smaller fan-out at the cost of a global write hotspot. Cassandra's secondary index design is a counter-example of why global indices on top of a sharded store rarely scale; M3 declined to repeat the mistake. The Bloom filters are sized to give a 1% false-positive rate at the per-block series count, so the wasted index reads are bounded; in practice an Uber-scale cluster does roughly 1.05 × number-of-shards index reads per query, where the 1.05 is the Bloom false-positive overhead.
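
A toy per-block Bloom filter makes the skip concrete. The sizing here is arbitrary; m3db tunes its filters to the per-block series count for the ~1% false-positive rate mentioned above:

# bloom_skip.py — why per-block Bloom filters let a shard skip most fileset blocks (toy)
import hashlib

class Bloom:
    def __init__(self, size_bits=1 << 16, hashes=7):
        self.size, self.hashes, self.bits = size_bits, hashes, 0

    def _positions(self, key: bytes):
        for i in range(self.hashes):
            h = hashlib.blake2b(key, digest_size=8, salt=bytes([i])).digest()
            yield int.from_bytes(h, "little") % self.size

    def add(self, key: bytes):
        for p in self._positions(key):
            self.bits |= 1 << p

    def maybe_contains(self, key: bytes) -> bool:
        return all(self.bits >> p & 1 for p in self._positions(key))

# One Bloom per 2h fileset block, keyed by series ID
block_bloom = Bloom()
for sid in (b"series-1", b"series-2", b"series-3"):
    block_bloom.add(sid)

print(block_bloom.maybe_contains(b"series-2"))   # True — must read this block's index
print(block_bloom.maybe_contains(b"series-99"))  # almost always False — skip the block entirely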

How a query actually executes — VM vs M3 side by side

Putting the pieces together: a query like sum by (region)(rate(http_requests_total{service="payments-api"}[5m])) traverses the storage in a specific sequence on each engine, and the differences are illuminating.

On VictoriaMetrics, the query path is:

  1. The MetricsQL parser turns the expression into an AST and resolves the time range.
  2. The query engine consults the single global inverted index, looking up __name__="http_requests_total" AND service="payments-api". Two index reads, intersected in memory, producing a list of metric_ids — typically 5-50ms total.
  3. For each metric_id, the engine reads the corresponding column-encoded samples from the on-disk parts overlapping the query range. ZSTD-decode, materialise into a buffer, apply the rate() window.
  4. The sum by (region) aggregator groups by region label (resolved from the metric_id-to-labels mapping) and sums.
  5. Result returned. Total wall-clock for a typical 1-hour range over 800 series: 80-150ms.

On M3, the same query is:

  1. The PromQL parser at m3query turns the expression into an AST.
  2. m3query issues a fan-out to every storage shard in the cluster — there is no global index to consult, so each shard must be asked whether it holds any series matching __name__="http_requests_total" AND service="payments-api" in the time range.
  3. Each shard checks its per-block Bloom filters, skipping non-matching blocks; for matching blocks, it reads the index and produces matching series IDs. Most shards return empty quickly.
  4. Each shard reads the chunks for its matching series, decodes the M3-flavoured Gorilla XOR, and streams samples back to m3query.
  5. m3query merges results from all shards, applies rate(), then sum by (region). Total wall-clock: 200-400ms — slower than VM despite the parallel fan-out, because the fan-out coordination overhead dominates for narrow queries.

The crossover is at broad queries — sum by (region)(rate({__name__=~".+"}[5m])) over millions of series. VM serves this from a single node and is bottlenecked by single-node CPU; M3 fans the work across the entire cluster and finishes in roughly total_work / shard_count time. At 50M-series scale, M3's broad query is faster than VM's; at 5M-series scale, VM's narrow query is faster than M3's. This is the architectural sweet-spot map: VM optimises the narrow-query path that dominates Indian SaaS dashboards (specific service, specific endpoint), M3 optimises the broad-query path that dominates global infrastructure dashboards.
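
Measuring the two paths yourself is a two-endpoint loop. The sketch below assumes the quickstart ports (VM on 8428, m3query's Prometheus-compatible API on 7201) and times the same narrow query against both:

# query_both.py — issue the same PromQL to a VM and an M3 endpoint, compare wall-clock
import time, requests

QUERY = 'sum by (region)(rate(http_requests_total{service="payments-api"}[5m]))'
ENDPOINTS = {
    "victoriametrics": "http://localhost:8428/api/v1/query",
    "m3query":         "http://localhost:7201/api/v1/query",
}

for name, url in ENDPOINTS.items():
    t = time.time()
    r = requests.get(url, params={"query": QUERY}, timeout=30)
    r.raise_for_status()
    n = len(r.json()["data"]["result"])
    print(f"{name:>16}: {1000 * (time.time() - t):6.1f} ms, {n} series")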

When to pick which: a decision matrix that does not lie

Most blog comparisons of VictoriaMetrics vs M3 reduce to "VM is denser, M3 scales out", which is correct but useless if you are deciding between them. The real decision pivots on how much hardware you have, how high your cardinality is, and how much operational complexity you can afford.

Pick VictoriaMetrics if your workload fits on one host. A single 8-core / 64-GB / NVMe-storage VictoriaMetrics handles 14 million active series and 12 million samples/sec — enough for most companies in India outside of Hotstar, Flipkart, and Reliance Jio. Operations are trivial: it's a single Go binary, no clustering, no consensus protocol, no shard-rebalancing. Razorpay, Cred, Swiggy, and Zerodha all run single-node VictoriaMetrics in production, with a hot-standby replica for HA via Prometheus's two-replica-write pattern. Replacing a Prometheus instance with VictoriaMetrics is a one-day migration: point your existing scrapers at VM's /api/v1/write (it speaks the Prometheus remote-write protocol), let it ingest in parallel for a week to build up history, then cut over Grafana to query VM via the PromQL-compatible /api/v1/query endpoint.

Pick M3 if you have hundreds of services and need long-term aggregation. M3's killer feature is the aggregator — the ability to define rollup rules that drop high-cardinality labels at write time and store only the aggregated form. If you have a customer_id label that creates 50 million series at peak but you only need monthly per-customer charts (not per-second), M3's aggregator can roll the data up to 10-minute resolution per customer, store that for 13 months at a fraction of the cost, and let the unaggregated raw data expire after 48 hours. VictoriaMetrics has a similar feature (vmagent with stream aggregation) but it is less mature and less battle-tested at the multi-thousand-rule scale Uber and Chime have run M3 with.

Pick neither (stay on Prometheus) if you are at a few million series. Both VM and M3 add operational complexity for problems Prometheus does not have at low cardinality. A Prometheus instance with 4 million active series, 90-day retention, and 16 GB of RAM works perfectly fine; the ROI on switching to VM is modest (storage cost halves, but storage was cheap anyway), and the ROI on switching to M3 is negative (you take on cluster management for a workload that did not need it). The migration triggers are: head series count above 8 million, prometheus_tsdb_head_chunks exceeding 50 million, GC pauses creeping past 100ms p99, or a cardinality-explosion incident that forces a re-architecture.

The Indian SaaS pattern in 2024-2026 has been: start with Prometheus, hit the wall at 6-10M series, migrate to VictoriaMetrics for another 18 months, then evaluate M3 (or Mimir/Cortex — see Long-term storage: Thanos, Cortex, Mimir) when the single-node ceiling actually starts to bind. Only Reliance Jio Platforms and Hotstar have publicly described production M3 deployments in India — most teams that need M3-class scale also need the aggregator's rollup features, and that combination of "billions of series + multi-namespace rollups" is where M3's design pays off.

The cost arithmetic is a useful sanity check on the migration timing. A Prometheus host at 14M series running on AWS m6i.4xlarge (16 vCPU, 64 GB RAM, 2 TB EBS gp3) costs ₹68,000/month at on-demand pricing including the gp3 IOPS surcharge most production loads need. Migrating to VictoriaMetrics on the same instance type drops actual disk usage by 65%, which lets you shrink to m6i.2xlarge with 1 TB EBS — ₹38,000/month, a saving of ₹30,000/month or ₹3.6 lakh/year. A team of two engineers spending three weeks on the migration costs roughly ₹4-6 lakh in fully-loaded labour, so the breakeven is 14-20 months. If you are below 6M series, the math does not work — the savings are smaller, the migration cost is the same, and the breakeven slips past three years. M3's economics are different again: an 800-shard M3 cluster requires a dedicated platform team (typically 3-5 engineers) and the per-series cost is comparable to VM, so the ROI is in the capabilities M3 gives you (multi-namespace rollups, billion-series scale) rather than in raw cost reduction. Indian teams that have run the M3 numbers usually conclude it makes sense above ~50M series; below that, VM is the better single-axis-of-improvement migration.
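
The breakeven arithmetic is worth keeping as a function, since the inputs move with instance pricing. This reproduces the figures in the paragraph above:

# breakeven.py — the migration-ROI arithmetic from the paragraph above, as a function
def breakeven_months(old_monthly, new_monthly, migration_cost):
    saving = old_monthly - new_monthly
    return migration_cost / saving if saving > 0 else float("inf")

# Figures from the text: ₹68k/month to ₹38k/month, ₹4-6 lakh migration labour
for labour in (400_000, 600_000):
    months = breakeven_months(68_000, 38_000, labour)
    print(f"labour ₹{labour:,}: breakeven {months:.1f} months")
# 13.3 and 20.0 months — the 14-20 month window quoted above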

Failure modes that show up only after migration

Both engines have failure modes that do not exist on Prometheus, and that the migration runbook does not warn you about because the team writing the runbook has not hit them yet. Three patterns account for most of the production incidents Indian teams have written postmortems about post-migration.

VictoriaMetrics: cardinality-explosion triggers a global index rewrite. When a deploy introduces a new high-cardinality label (the classic customer_id or request_id mistake) the global inverted index has to absorb millions of new entries in seconds. Because the index is itself a MergeSet, this triggers an unusual compaction wave: the index's level-0 part fills up faster than the merger can drain it, level-0 grows to many parts, every query has to consult all of them, and query latency degrades non-linearly until the merger catches up. The signal is vm_indexdb_active_merges going non-zero for more than a few minutes plus vm_promscrape_scrapes_total slowing down because the ingest path is back-pressured by the index writer. The fix is fast — drop the offending label via vmagent's relabel rules — but during the incident window queries can return partial results because the index is not yet consistent across parts. Cred's 2024 incident report describes this exact failure mode: a deploy added a session_id label, cardinality went from 8M to 95M in 14 minutes, query latency p99 spiked from 80ms to 14 seconds, and the 25-minute incident window matched the duration of the index merge to catch up after the relabel rule was deployed.

M3: shard rebalance during scale-out causes write rejection storms. Adding a new storage node to an M3 cluster triggers a consistent-hash ring update — 1/N-th of the keyspace moves to the new node. During the move, the affected shards are in a transient state where neither the old owner nor the new one has the authoritative data, and the coordinator routes writes for those shards to whichever replica is up-to-date. If the move takes longer than the configured write timeout, writes for the moving shards are rejected with errInvalidShardState, which scrapers see as 5xx remote-write errors and retry with exponential backoff. An 800-shard cluster scaling out by 50 nodes can produce a 12-minute write-rejection window where 5-8% of incoming samples are dropped. The fix is to stage scale-out events during low-traffic windows (Hotstar does this overnight, never during the IPL) and to size the M3 coordinator's writeTimeout larger than the shard-move bound. The signal is m3_coordinator_write_errors_total{type="shard_state"} going non-zero in correlation with a topology change.

Both engines: the scrape-replica deduplication illusion. Teams that ran two Prometheus replicas with external_labels: { replica: "A" } / replica: "B" get exact duplicate series across replicas — a problem the old setup never surfaced, because each replica was queried on its own (or a Thanos-style querier deduplicated on the replica label at query time). After migration, both VM and M3 require explicit dedup configuration — VM's -dedup.minScrapeInterval or M3's dedup namespace policy. If the operator forgets, the new engine stores both copies and queries return doubled values. The signal is silent — queries succeed, dashboards look right at a glance, but rate() results are 2× their pre-migration values until somebody notices. The Razorpay 2023 migration postmortem has a paragraph on this: the migration cut over on a Friday, the dedup interval was missed, the on-call engineer the following Tuesday noticed http_requests_total rates had doubled overnight and assumed it was a real traffic spike, and only after 36 hours of triage did the team realise the duplicate samples were the cause.

These failure modes are not arguments against the migrations; both VM and M3 have years of production track record at scale. They are arguments for reading the operations guide cover-to-cover before cut-over and treating the first three months on the new engine as an explicit learning period — not as a fait accompli. Most Indian teams that have done these migrations recommend running the new engine in shadow mode (parallel ingest, queries still served from Prometheus) for at least four weeks before flipping the query path, specifically to catch the deduplication and aggregator-rule edge cases that only show up at scale.
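
Shadow mode only pays off if someone actually compares the two answers. A minimal comparison loop (endpoint URLs are placeholders) that catches the missed-dedup doubling described above:

# shadow_diff.py — shadow-mode sanity check: same query against old and new engines
import requests

OLD = "http://prometheus:9090"       # assumption: existing Prometheus
NEW = "http://victoriametrics:8428"  # assumption: shadow engine (VM or m3query URL)
QUERY = "sum(rate(http_requests_total[5m]))"

def scalar(base):
    r = requests.get(f"{base}/api/v1/query", params={"query": QUERY}, timeout=30)
    r.raise_for_status()
    return float(r.json()["data"]["result"][0]["value"][1])

old, new = scalar(OLD), scalar(NEW)
ratio = new / old
print(f"old={old:.1f} new={new:.1f} ratio={ratio:.2f}")
if ratio > 1.5:
    print("WARNING: new engine reads ~2x old — check dedup config before cut-over")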

M3 + VictoriaMetrics: query-result discontinuity at namespace boundaries. When M3 stitches a long-range query across unaggregated_5m, aggregated_1m, and aggregated_10m namespaces (e.g. a 90-day query that touches all three retention tiers), the join points between namespaces can produce visible step-changes in the result if the aggregation rule changed between when the older tier's data was ingested and when the newer tier's data was ingested. A panel that looks smooth for the first 80 days then has a kink at the 80-day mark is almost always this — a rollup-rule revision touched the aggregator's behaviour after that date. The fix is to enforce rule-version metadata in meta.json per namespace and refuse to query across rule-version boundaries; in practice teams accept the kink and document the rule-change date on dashboards. VictoriaMetrics has the equivalent issue when running with vmagent streaming aggregation — a relabel rule change creates a similar boundary in queries that span the change.

There is one more shared failure mode worth naming: the recording-rule lag spiral. Both engines support Prometheus's recording rules — pre-computed aggregations that materialise expensive queries (e.g. service:http_requests:rate5m) into stored series so dashboards do not recompute them on every load. Recording rules run on a wall-clock cadence (default every 60 seconds), querying recent data and writing the result back. When the underlying engine is under load — VM during a cardinality spike, M3 during a shard rebalance — recording-rule evaluations take longer, fall behind their cadence, and the materialised series have stale or missing data. Dashboards that depend on those rules show gaps; alerts that depend on them either fail to fire (no current value) or fire spuriously (stale value crosses threshold). Razorpay solved this on VM by separating the recording-rule evaluator (vmalert) onto its own host and giving it priority access to a query-only VM replica; the equivalent on M3 is to run a dedicated m3query instance for the rule evaluator, isolated from user-facing query traffic. The signal is prometheus_rule_evaluation_duration_seconds (or its VM/M3 equivalent) creeping past the rule's evaluation interval — once that ratio exceeds 1.0, you are in the spiral and need to act before alerts start failing.
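
The ratio is cheap to compute from the rules API. A sketch against Prometheus's /api/v1/rules endpoint; vmalert serves a largely compatible endpoint, so point it at whichever host runs your rule evaluator:

# rule_lag.py — detect the recording-rule lag spiral: evaluation duration vs interval
import requests

EVALUATOR = "http://prometheus:9090"  # assumption: your rule evaluator's API base URL

groups = requests.get(f"{EVALUATOR}/api/v1/rules", timeout=30).json()["data"]["groups"]
for group in groups:
    interval = float(group.get("interval", 0))
    for rule in group["rules"]:
        dur = float(rule.get("evaluationTime", 0.0))
        ratio = dur / interval if interval else 0.0
        if ratio > 0.5:  # past 1.0 you are already in the spiral
            print(f"{group['name']}/{rule['name']}: {dur:.1f}s of {interval:.0f}s ({ratio:.0%})")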

A useful diagnostic to run regularly post-migration is vmctl analyze (for VM) or m3ctl topology describe (for M3). The VM tool reports per-metric byte usage, top-cardinality labels, and active-series churn — the same shape of information promtool tsdb analyze produces for Prometheus, but operating on the live engine rather than a frozen block. Cred runs vmctl analyze weekly and feeds the output into a Slack channel so the platform team sees cardinality drift before it becomes a query-latency problem; the same pattern works on M3 with m3ctl namespace get and m3ctl shard distribution. Treat these tools as the live equivalent of the postmortem ladder — running them on a healthy day means you recognise the shape of an unhealthy day's output instantly.

Going deeper

The MergeSet's bit-level tricks — why ZSTD beats Gorilla on this workload

The MergeSet's entry encoding is worth dissecting because it explains the 3× compression advantage over Gorilla. Each entry inside an in-memory part is a 24-byte tuple — (metric_id u64, timestamp u64, value f64) — but on disk a part is reorganised into three parallel columns: a column of metric_ids, a column of timestamps, and a column of values. The metric_id column is run-length-encoded (most consecutive entries within a part have the same metric_id, since the part is sorted by metric_id first), shrinking it to roughly 0.05 bytes per entry. The timestamp column uses delta-of-delta encoding (similar to Gorilla's timestamp scheme) followed by ZSTD compression of the delta stream — the delta-of-delta is mostly zeros for fixed-cadence scrapes, which ZSTD's dictionary compresses to roughly 0.1 bytes per entry. The value column is XOR-encoded against the previous value (Gorilla's float trick) and then ZSTD-compressed — roughly 0.3 bytes per entry on slowly-varying gauges. Total: ~0.45 bytes per entry, 3× denser than Gorilla XOR, with the bonus that ZSTD's dictionary catches inter-series patterns that Gorilla's per-series encoding cannot. The cost is that decoding requires materialising the column into a buffer (vs Gorilla's lazy bit-by-bit decode), but for typical query patterns — which read many samples from the same series — the column-decode amortises and net query latency is comparable. The under-appreciated win is that ZSTD chunks are independently decodable in 4-KB blocks, so a query reading 100 samples from a series that has a million samples on disk reads 4 KB of column data, not the whole column.
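
The column split is separable from the delta tricks and easy to demonstrate: pack the same 8,000 entries row-major and column-major and compress both. A toy layout, not VM's on-disk format:

# columnar_vs_row.py — row-major vs column-major layout under ZSTD (toy)
# pip install zstandard
import random, struct
import zstandard as zstd

random.seed(7)
N = 8_000
metric_id = 0xDEADBEEF
rows = [(metric_id, i * 15_000, 120.0 + random.gauss(0, 2)) for i in range(N)]

row_major = b"".join(struct.pack("<QQd", *r) for r in rows)
col_major = (
    b"".join(struct.pack("<Q", m) for m, _, _ in rows)    # constant column: near-free
    + b"".join(struct.pack("<Q", t) for _, t, _ in rows)  # regular cadence: compresses well
    + b"".join(struct.pack("<d", v) for _, _, v in rows)  # noisy floats: the hard part
)

c = zstd.ZstdCompressor(level=3)
print(f"row-major:    {len(c.compress(row_major)):>7,} bytes")
print(f"column-major: {len(c.compress(col_major)):>7,} bytes")  # columns win clearly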

M3's per-shard commitlog vs Prometheus's WAL — and why M3 was built that way

M3's commitlog is per-shard (each shard is a separate disk path with its own commitlog file), where Prometheus's WAL is per-instance. This matters because on an 800-shard M3 cluster, recovery from a node restart only replays the commitlogs of the shards that node owned — a few percent of the cluster's total commitlog volume — and shards on other nodes continue serving without interruption. A Prometheus restart, by contrast, takes the whole instance offline for the duration of WAL replay (8-12 minutes on a 12M-series Prometheus). The per-shard design is what makes M3 operationally tolerant of node failures at scale: kill a node, the cluster auto-fails over reads to the surviving replicas, and the failed node's shards finish their commitlog replay independently when the node comes back. Uber chose this design after the first incarnation of M3 — which stored its data in Cassandra — collapsed during a Cassandra cluster restart that took 4 hours to replay because Cassandra's commitlog was global to the node, not per-keyspace. The lesson encoded in M3's architecture is that recovery time is a feature, not an implementation detail, and per-shard commitlogs are how you keep recovery time bounded as the cluster grows.

What you give up by leaving Prometheus's TSDB

Leaving Prometheus's storage engine has three real costs that the marketing pages of VM and M3 minimise. First, promtool tsdb analyze does not work — neither VM nor M3 produces Prometheus block directories, so the standard cardinality-debugging tool every Prometheus operator knows does not apply. VM has its own (vmctl analyze); M3 has m3ctl. The mental model transfers but the muscle memory does not, and the first cardinality incident on a new engine takes 2-3× longer to debug because the team is fumbling with new tools. Second, third-party tools that read the Prometheus block format directly (e.g. Promxy, custom backup tools) do not work — anything that knows about the 01J3K8H4... ULID directory layout has to be replaced or configured to use the engine's own backup mechanism (vmbackup / vmrestore for VM, m3 snapshot for M3). Third, the community knowledge base is smaller — Stack Overflow questions about VM are 50× rarer than for Prometheus, so the troubleshooting path for an unusual error involves reading source code, not searching tags. Indian SRE teams that have made the migration (Razorpay to VM in 2022, Cred to VM in 2023) report the migration itself was easy but the first three months of operating the new engine required deeper engagement with upstream than they had needed with Prometheus.

The aggregator's rollup rules — a syntax that hides an operational trap

M3's aggregator is configured by a YAML rules file that looks deceptively simple — each rule names a metric pattern, lists the labels to drop (the "rollup" set), and assigns the result to one or more namespaces. A typical rule looks like match: 'http_requests_total'; group_by: ['service', 'endpoint', 'status']; namespace: ['agg_1m', 'agg_10m'] — meaning "for every http_requests_total series, group by service / endpoint / status (dropping every other label including pod, container, region) and write the aggregated result to both the 1-minute and 10-minute namespaces". The output is what most Grafana dashboards actually want — per-service request rates without per-pod cardinality. The trap is that the aggregator runs after sample ingestion, so the unaggregated namespace still ingests all the raw samples; the aggregator reads from the unaggregated stream and writes aggregated samples to the rollup namespaces. If the unaggregated namespace's retention is too short, the aggregator can find itself reading samples that have already aged out, producing gaps in the rollup namespaces. Uber's runbook is explicit: the unaggregated namespace must retain samples for at least 2 × max(aggregation_window) — so if you have a 10-minute rollup, the unaggregated namespace needs at least 20 minutes of retention for the aggregator to keep up. Forgetting this rule produces "rollup gaps" that look like outages on long-range dashboards but are actually just configuration errors. The other trap is that rule changes are not retroactive: a new rollup rule applies only to samples ingested after the rule landed, so historical data in the rollup namespaces is whatever the previous rule set produced. Re-running the aggregator over historical data requires m3 backfill, which is operationally expensive enough that most teams treat rule changes as forward-only and accept the discontinuity.
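
The 2× retention rule is mechanical enough to lint in CI before a rule change ships. A sketch with simplified rule shapes (not M3's actual YAML schema):

# rollup_retention_check.py — enforce "unaggregated retention >= 2 x max rollup window"
from datetime import timedelta

UNAGGREGATED_RETENTION = timedelta(hours=48)
ROLLUP_RULES = [  # simplified: one aggregation window per rule
    {"match": "http_requests_total", "window": timedelta(minutes=1)},
    {"match": "http_requests_total", "window": timedelta(minutes=10)},
]

max_window = max(r["window"] for r in ROLLUP_RULES)
required = 2 * max_window
if UNAGGREGATED_RETENTION < required:
    raise SystemExit(f"FAIL: unaggregated retention {UNAGGREGATED_RETENTION} < {required}")
print(f"OK: {UNAGGREGATED_RETENTION} >= {required} — the aggregator can always read its inputs")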

Reproduce this on your laptop

# Reproduce the VictoriaMetrics ingest benchmark locally
docker run -d --name vm -p 8428:8428 -v $PWD/vm-data:/victoria-metrics-data \
  victoriametrics/victoria-metrics:v1.97.1
python3 -m venv .venv && source .venv/bin/activate
pip install requests python-snappy protobuf
# Build prometheus_remote_write_pb2 from prometheus/prompb/remote.proto first
# (or use the prometheus-remote-write package)
python3 vm_ingest_bench.py
du -sh vm-data/data

The first command stands up VictoriaMetrics in a single container with persistent storage. The Python venv plus pip install brings in the requests (HTTP client), python-snappy (Snappy compression for remote-write), and protobuf (for the WriteRequest message) dependencies. After vm_ingest_bench.py completes, du -sh vm-data/data shows the on-disk footprint for the 12-million-sample workload — typically 4-6 MB, working out to 0.4-0.5 bytes per sample. To reproduce the same flow against M3, follow the M3 quickstart which stands up coordinator + db + aggregator in three commands, then point the same Python script at http://localhost:7201/api/v1/prom/remote/write. The throughput numbers will be lower per node (M3's coordinator does more work per write than VM's single-binary path) but the multi-node scaling story is what the M3 quickstart demonstrates if you bring up three storage nodes.

There is one operational gotcha that catches every new VM operator at least once: the -search.maxSeries and -search.maxSamplesPerQuery flags default to values that were calibrated for small deployments. A 14M-series production VM with these defaults will refuse queries that match more than 30,000 series or read more than 1 billion samples — sensible safety nets, but the error message (too many timeseries) is unhelpfully generic and the first time an SRE encounters it during an incident at 2am, the instinct is to assume the engine is broken. Bump these limits during the migration cut-over (-search.maxSeries=10000000 -search.maxSamplesPerQuery=10000000000) and document the change in the runbook. M3 has equivalent guards (m3query.limits.maxFetchedSeries) that need similar tuning. The principle is the same on both: production-scale operators must explicitly raise the safety nets the upstream defaults set for development environments, and forgetting this step is the most common reason a "broken" query turns out to be a guard hitting limits.

Where this leads next

This chapter covered the two leading "we kept the labels but rebuilt the storage" alternatives to Prometheus. The next chapters in Part 2 dive into related branches of the same lineage tree.

The thread to carry forward: storage engine design is a series of trade-offs between single-node density, cluster elasticity, and operational complexity. Prometheus picked the simple-and-dense corner; VictoriaMetrics pushed density further at the cost of mmap-driven semantics; M3 picked elasticity at the cost of needing a cluster. No single design wins all three; understanding which trade-off you are buying is what lets you pick the right engine for your workload, not the one with the loudest blog posts.

A second thread: PromQL became the lingua franca even as the storage engines diverged. Both VM and M3 implement the protocol — remote-write for ingestion, /api/v1/query_range for queries — and that protocol's stability is what made the migrations tractable. A team migrating from Prometheus to VM does not rewrite Grafana dashboards or alerting rules; they swap the data source URL. This is an under-appreciated win: a community standardised on a wire format and query language let three orthogonal storage designs compete on storage merits, not on ecosystem lock-in. The next chapters in this part take that lesson further — InfluxDB chose a different query language (Flux) and got punished for it commercially, while Mimir kept PromQL and quietly became the default for cloud-managed Prometheus offerings.

A third thread that is easy to miss: storage engine choice is downstream of cardinality strategy, not upstream of it. Teams that get cardinality right (relabel rules dropping high-churn dimensions, recording rules materialising dashboards, careful label hygiene) generally do fine on Prometheus for years. Teams that get cardinality wrong run out of road on Prometheus in months and look for a denser or more elastic engine to buy time. The replacement engine does not fix the underlying mistake — it just gives you another 12-18 months before the same wall arrives, taller and harder to climb. This is why every chapter in Part 2 keeps coming back to Cardinality: the master variable: the storage engines are interesting, but cardinality is what determines whether you ever needed to evaluate them in the first place.

References

  1. Aliaksandr Valialkin, "VictoriaMetrics MergeSet design" (2019) — the engineering essay by VM's principal author. Section on column encoding is the clearest account of why ZSTD beats Gorilla.
  2. VictoriaMetrics docs: How VictoriaMetrics achieves outstanding performance — the official benchmark page with reproducible numbers; cited often for the 12M-samples/sec single-node figure.
  3. Rob Skillington, "Building a global-scale metrics platform at Uber" — the M3 design retrospective. Section 3 covers the move from Cassandra-backed M3DB to the per-shard storage engine.
  4. M3 docs: Architecture — coordinator, db, aggregator components and the consistent-hash sharding scheme.
  5. Charity Majors, Liz Fong-Jones, George Miranda, Observability Engineering (O'Reilly, 2022) — chapter 6 covers the design space these engines occupy and the trade-offs between them.
  6. Pelkonen et al., "Gorilla: A Fast, Scalable, In-Memory Time Series Database" (VLDB 2015) — the encoding paper M3 inherits and VM moves past.
  7. Prometheus TSDB internals — chapter 7, the storage design these engines diverge from.
  8. Cardinality: the master variable — chapter 3, the budget framing both VM and M3 are designed to push back the wall on.