Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
Open-source stacks worth running
The break-even spreadsheet on Aditi's laptop has decided: Yatrika is self-hosting. It is 09:40 IST on a Wednesday, the migration kickoff is in 20 minutes, and the architecture document she is supposed to walk into the room with has a single line on it — "metrics: ?, logs: ?, traces: ?". She has read 14 blog posts in the last 36 hours. Eight of them say Mimir. Five say VictoriaMetrics. One says Cortex (the project Mimir was forked from, but the post is from 2022 and the author never updated it). For logs, the field splits between Loki and ClickHouse-direct, and a vocal minority insists Quickwit is the answer. For traces, Tempo and Jaeger have roughly equal advocates and absolutely incompatible operational shapes. By 09:50 she has decided she is making the wrong call no matter what — there is no "best" open-source observability stack, only stacks that fit different workload shapes. This article is the workload-to-stack map she did not have at 09:40.
Four open-source observability stacks dominate Indian production in 2026. Mimir + Loki + Tempo + Pyroscope is the Grafana-native default: the best documented and best supported by Helm charts, but it runs the most components and needs the most RAM. VictoriaMetrics + ClickHouse + Jaeger is the high-density alternative, 3–5× lower compute cost at the price of rougher ergonomics. SigNoz is the all-in-one ClickHouse-on-rails option for teams who want one process to install. OpenTelemetry Collector everywhere is the universal ingestion layer all four assume. Pick by workload shape, not popularity — high-cardinality metrics need VictoriaMetrics, structured-log analytics need ClickHouse, and the team's operational maturity decides whether you can run six processes (Mimir) or want one (SigNoz).
What "stack" actually means — three layers, not one box
A working open-source observability stack is three layers, not a monolithic install. The collection layer pulls telemetry off your services (OpenTelemetry Collector, Prometheus scrapers, Vector or Fluent Bit log shippers, Pyroscope agents). The storage layer is the part everybody argues about (Mimir / VictoriaMetrics / Cortex for metrics; Loki / ClickHouse / Quickwit for logs; Tempo / Jaeger / SigNoz for traces; Pyroscope / Parca for profiles). The query and visualisation layer sits in front (Grafana for dashboards, Alertmanager for alert routing, Sloth or Pyrra for SLO definitions). Most "stack debates" are actually storage-layer debates with the other two layers held fixed — Grafana wins the visualisation layer in 2026 by such a margin that it is uncontested, and the OpenTelemetry Collector wins the collection layer for similar reasons.
The mistake teams make at 09:50 IST is to treat "the stack" as a single decision. The right framing is: collection is OTel Collector, visualisation is Grafana, alerting is Alertmanager, and the four storage choices below are independent. You can run Mimir for metrics with ClickHouse for logs and Tempo for traces — that combination is rare but valid. The four canonical stacks are just the most common storage combinations that ship together because their failure modes and operator skills overlap.
The four canonical stacks — workload shape, operational cost, break-points
After two years of observing Indian platform-team migrations (Razorpay, Yatrika, Cred, several Series-C fintechs, two unicorn marketplaces), the same four stack archetypes emerge. Below are the four, with the workload shape each fits and the break-point at which it stops working.
Stack A — Grafana LGTM (Loki + Grafana + Tempo + Mimir + Pyroscope). This is the Grafana-Labs-native default. Six processes for Mimir alone (distributor, ingester, store-gateway, compactor, ruler, querier), four for Loki, three for Tempo, two for Pyroscope. Strengths: Grafana Labs upstream, well-documented, Helm charts that work, multi-tenant from day one, good for shared platforms. Weaknesses: RAM-hungry — Mimir ingester typically 24 GB resident at 4M active series, Loki ingester 16 GB at 18 TB/day. Break-point: stops being economical above ~20M active series unless you run dedicated SREs who like Mimir. The component count alone (15+ pods) means a non-trivial fraction of your team's mental cycles go to operating the stack rather than the services it observes.
Stack B — VictoriaMetrics + ClickHouse + Jaeger (the density stack). Picked by teams that have run Mimir and decided they want fewer processes and lower compute. Strengths: VictoriaMetrics single-binary deployment (or vmcluster with three components: vmstorage, vmselect, vminsert), 30–50% less RAM than Mimir at equal cardinality, faster ingest. Why VictoriaMetrics uses 30–50% less RAM than Mimir at equal cardinality: VictoriaMetrics' inverted index keeps label-value strings in a per-shard mergeset structure with prefix compression and aggressive interning, while Mimir's ingester keeps a per-series in-memory chunk buffer plus a label-set hashmap with less aggressive interning. On a 4M-series workload Yatrika measured Mimir ingester RSS at 24 GB vs VictoriaMetrics vmstorage RSS at 11 GB — same data, different in-memory shape. The difference is not the storage format on disk (both use Gorilla-style XOR + delta-of-delta) but the in-memory book-keeping: Mimir trades RAM for multi-tenancy machinery that VictoriaMetrics doesn't have. ClickHouse for logs gives full SQL on log data (massive analytics win — JIRA-style queries like "count distinct trace_ids per merchant per error_code per hour" are seconds, not minutes). Jaeger for traces is the older, stabler choice with a smaller operational surface than Tempo. Weaknesses: VictoriaMetrics is not multi-tenant out of the box (vmcluster has tenancy but the ergonomics are weaker than Mimir's). ClickHouse for logs requires a schema-design discipline most teams under-estimate (you must pick the right ORDER BY, the right partitioning, the right materialised views — get this wrong and queries are 100× slower). Break-point: stops being a good fit when the team needs strict tenant isolation between business units (multi-tenant Mimir wins there) or when the ClickHouse schema-design cost outweighs the SQL win.
Stack C — SigNoz (the all-in-one). A single open-source product that bundles ClickHouse-backed metrics, logs, and traces into one install with a built-in UI. Strengths: one container to deploy, one admin UI, one query language (ClickHouse SQL plus a layer of templated PromQL-ish), good for small teams that want observability working in a week. Weaknesses: less flexible — if you want a custom Grafana dashboard with complex transformations, you are fighting the SigNoz UI's opinions. The tracing UI is good but not Tempo-grade for fan-out > 50 services. Break-point: works beautifully up to ~50 microservices, ~2M active series, ~5 TB logs/day. Above that, the all-in-one nature becomes a constraint, and teams either migrate to Stack B (since SigNoz is ClickHouse-backed, the migration is mostly schema-and-query work) or to Stack A.
Stack D — Cloud-native managed (the AWS / GCP / Azure native stack). Not strictly "open source" in stacks-A/B/C terms but worth naming because it is what many Indian fleets actually run: AWS Managed Prometheus (AMP) for metrics, CloudWatch Logs Insights or OpenSearch for logs, AWS X-Ray for traces. Or the GCP equivalents (Cloud Monitoring, Cloud Logging, Cloud Trace). Strengths: zero operational surface — the cloud provider runs the storage. Weaknesses: vendor-locked storage (you cannot move 18 months of metrics to a new provider easily), priced 3–8× higher than self-hosted at the same volume, less flexible query language. Break-point: this is genuinely the right answer below ₹50 lakh/year observability spend (the Series A row from the previous chapter's spreadsheet) — it removes the team-cost floor of stacks A/B/C entirely. Above that band, it bleeds the savings the vendor vs self-hosted spreadsheet was supposed to capture.
A runnable comparison — how do these stacks shape your queries?
The real way to pick a stack is not from a feature matrix but from what your most-common query looks like. The script below writes out the same three "real Indian SRE" questions in each stack's query language and shows the result shape each returns; it does not need the stacks running (the synthetic emitter in the "Reproduce" section below generates a matching small dataset of 1000 traces with spans, 10000 log lines, and 50000 metric samples if you want to run the queries for real). It covers PromQL (Mimir / VictoriaMetrics), LogQL (Loki), TraceQL (Tempo), ClickHouse SQL, and SigNoz's hybrid.
# stack_query_compare.py — show what the same three queries look like across stacks.
# This script does not require the stacks to be running; it shows the query shape
# you'd write and what result-shape comes back. Replace the print() calls with
# requests.get() to live endpoints to actually run them.
# pip install requests pandas tabulate
from dataclasses import dataclass
from tabulate import tabulate
@dataclass
class StackQuery:
    stack_name: str
    metric_query_lang: str
    log_query_lang: str
    trace_query_lang: str
    notes: str
stacks = [
StackQuery("Mimir + Loki + Tempo (Stack A)",
"PromQL", "LogQL", "TraceQL",
"Three different DSLs; Grafana glues them at dashboard layer."),
StackQuery("VictoriaMetrics + ClickHouse + Jaeger (Stack B)",
"PromQL (VM superset)", "ClickHouse SQL", "Jaeger UI / API",
"SQL on logs is a power-tool; PromQL on metrics is familiar."),
StackQuery("SigNoz (Stack C)",
"PromQL-ish + CH SQL fallback", "ClickHouse SQL", "SigNoz UI / TraceQL-ish",
"One product, one UI, one schema — least flexible, fastest setup."),
StackQuery("AWS managed (Stack D)",
"PromQL via AMP", "Logs Insights query (regex-like)", "X-Ray query language",
"Fully managed; vendor-specific query DSLs everywhere."),
]
# --- Question 1: error rate of checkout-api over last 5 minutes ----------------
queries_q1 = {
"Mimir/VM (PromQL)":
'sum(rate(http_requests_total{service="checkout-api",status=~"5.."}[5m]))'
' / sum(rate(http_requests_total{service="checkout-api"}[5m]))',
"Loki (LogQL)":
'sum(rate({service="checkout-api"} |= "ERROR" [5m]))'
' / sum(rate({service="checkout-api"}[5m]))',
"ClickHouse SQL (logs)":
"SELECT countIf(level='ERROR')/count() AS err_rate "
"FROM logs WHERE service='checkout-api' AND ts > now()-300",
"Tempo (TraceQL)":
'{ resource.service.name="checkout-api" && status=error }'
' | count_over_time() by (5m)',
"Jaeger UI": "Service=checkout-api, Tags=error=true, Last 5m → counts",
}
# --- Question 2: top-10 merchants by p99 checkout latency last 1h --------------
queries_q2 = {
"Mimir/VM (PromQL)":
'topk(10, histogram_quantile(0.99, sum by (merchant_id, le) '
'(rate(checkout_latency_seconds_bucket[1h]))))',
"ClickHouse SQL (logs)":
"SELECT merchant_id, quantile(0.99)(latency_ms) AS p99 "
"FROM logs WHERE service='checkout-api' AND ts > now()-3600 "
"GROUP BY merchant_id ORDER BY p99 DESC LIMIT 10",
"Tempo (TraceQL)":
'{ resource.service.name="checkout-api" } '
'| histogram(duration_ms, 0.99) by (span.merchant_id)',
"SigNoz UI": "Service=checkout-api, GroupBy=merchant_id, Aggregate=p99(duration), Top=10",
}
# --- Question 3: distributed-trace timeline for trace_id 0xa3f1... -------------
queries_q3 = {
"Mimir/VM": "(not applicable — metrics don't carry trace_id well)",
"Loki (LogQL)":
'{trace_id="a3f1c890..."} | json | line_format "{{.span_name}} {{.duration_ms}}ms"',
"ClickHouse SQL (logs)":
"SELECT span_name, ts, duration_ms FROM logs "
"WHERE trace_id='a3f1c890...' ORDER BY ts ASC",
"Tempo (TraceQL)": '{ trace:id = "a3f1c890..." } → returns span tree JSON',
"Jaeger API": "GET /api/traces/a3f1c890... → span tree",
}
print("=== Stack identity ===")
print(tabulate([q.__dict__ for q in stacks], headers="keys", tablefmt="github"))
print("\n=== Q1: error rate of checkout-api last 5 minutes ===")
for k, v in queries_q1.items():
    print(f"\n{k}:\n {v}")
print("\n=== Q2: top-10 merchants by p99 checkout latency last 1h ===")
for k, v in queries_q2.items():
    print(f"\n{k}:\n {v}")
print("\n=== Q3: distributed trace timeline for one trace_id ===")
for k, v in queries_q3.items():
    print(f"\n{k}:\n {v}")
Sample run (truncated for the table):
=== Stack identity ===
| stack_name | metric_query_lang | log_query_lang | trace_query_lang |
|-----------------------------------------------------|-----------------------|-----------------------|----------------------|
| Mimir + Loki + Tempo (Stack A) | PromQL | LogQL | TraceQL |
| VictoriaMetrics + ClickHouse + Jaeger (Stack B) | PromQL (VM superset) | ClickHouse SQL | Jaeger UI / API |
| SigNoz (Stack C) | PromQL-ish + CH SQL | ClickHouse SQL | SigNoz UI |
| AWS managed (Stack D) | PromQL via AMP | Logs Insights | X-Ray query |
=== Q2: top-10 merchants by p99 checkout latency last 1h ===
Mimir/VM (PromQL):
topk(10, histogram_quantile(0.99, sum by (merchant_id, le)
(rate(checkout_latency_seconds_bucket[1h]))))
ClickHouse SQL (logs):
SELECT merchant_id, quantile(0.99)(latency_ms) AS p99
FROM logs WHERE service='checkout-api' AND ts > now()-3600
GROUP BY merchant_id ORDER BY p99 DESC LIMIT 10
Walking the key lines of this comparison:
- Q1's PromQL rate(...{status=~"5.."}[5m]) / rate(...[5m]) — this is the canonical metrics-driven error-rate query. It is fast (sub-second) on Mimir and VictoriaMetrics because the status label is part of the time-series identity; the =~"5.." regex matches at TSDB-index time, not at scan time. Why this is dramatically faster than the LogQL or ClickHouse equivalents: PromQL operates on pre-aggregated time-series — the cardinality of (service, status) is bounded (maybe 200 series for a checkout API), so the rate-over-5m query touches 200 × (5 min / 15 s) = 4000 sample points. The LogQL equivalent must scan every log line in the 5-minute window for the substring "ERROR", which is 5 min × 80 lines/sec × 1000 services = 24M lines on a busy fleet. ClickHouse-on-logs is faster than Loki for this query because of columnar pruning, but still 100× slower than PromQL because it is reading log lines, not pre-aggregated counters. (The arithmetic sketch after this list works through these numbers.)
- Q2's ClickHouse SQL quantile(0.99)(latency_ms) ... GROUP BY merchant_id — this is the query that cannot exist in a metrics-only world without exploding cardinality. If merchant_id were a Prometheus label, with 14M merchants it would create 14M series, OOM the ingester, and break the cluster. ClickHouse-on-logs handles this trivially because merchant_id is a column, not a label — there is no per-merchant-id time-series, just rows. This is the single most important reason teams add ClickHouse to their stack: high-cardinality analytics on per-request data, which the time-series database cannot do.
- Q3's TraceQL { trace:id = "a3f1c890..." } — Tempo's strength is exactly this: given a trace_id, return the full span tree fast. The Loki entry in queries_q3 works but reconstructs the trace from log lines, which only works if you have logged enough span-shape information into your log lines (most teams don't). The ClickHouse SQL entry works if your logs have a span_name column and a duration_ms column — Yatrika's ClickHouse schema does. The choice between Tempo and "Loki / ClickHouse with span-shaped logs" is a question of whether you maintain two trace storage systems (one purpose-built, one log-derived) or accept that your logs are also your traces.
- The notes field on stack A — "Three different DSLs; Grafana glues them at dashboard layer" — is the cost of stack A's flexibility. A senior engineer at Yatrika needs to be fluent in PromQL, LogQL, and TraceQL to debug a production incident across the three pillars. That fluency takes 6–12 months to build. Stack B simplifies to "PromQL + SQL" (which most engineers already know), at the cost of replacing LogQL with the schema-design overhead of ClickHouse logs.
- The cross-stack comparison reveals the trade-off shape: Stack A optimises for metric-query speed and ergonomic separation between pillars (each pillar has its own DSL designed for it). Stack B optimises for analytics power on logs (SQL gives you GROUP BY anything) at the cost of schema design. Stack C optimises for time-to-first-dashboard. Stack D optimises for zero ops. There is no stack that wins all four axes — pick the axis that maps to your team's current pain.
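To make the Q1 and Q2 arithmetic concrete, here is a small back-of-envelope sketch. All inputs are the hypothetical figures quoted in the bullets above (200 (service, status) series, a 15 s scrape interval, 80 lines/sec per service, 1000 services, 14M merchants), not measurements from any real cluster.
# query_cost_sketch.py — rough "data touched" estimates behind the Q1/Q2 bullets.
# All inputs are the article's hypothetical figures, not benchmarks.
SCRAPE_INTERVAL_S = 15
WINDOW_S = 5 * 60
# Q1 as PromQL: bounded (service, status) cardinality, pre-aggregated counters.
series = 200
promql_samples = series * (WINDOW_S // SCRAPE_INTERVAL_S)
print(f"PromQL touches ~{promql_samples:,} samples")          # ~4,000
# Q1 as LogQL: substring scan over every log line in the window.
lines_per_sec_per_service = 80
services = 1000
logql_lines = WINDOW_S * lines_per_sec_per_service * services
print(f"LogQL scans ~{logql_lines:,} log lines")               # ~24,000,000
# Q2 with merchant_id as a Prometheus label: per-merchant series explosion.
merchants = 14_000_000
print(f"merchant_id as a label ~ {merchants:,} time series, enough to OOM the ingester")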
When each stack breaks — the operational shapes
A working Mimir + Loki + Tempo cluster is 13 component types (Mimir's six, Loki's four, Tempo's three), plus the Grafana / Alertmanager / OTel Collector layer. The component inventory matters because operational complexity is component-type count, not pod count. Adding a 7th Mimir ingester replica is trivial; a 7th kind of Mimir process (say, a query-frontend added when read latency demands it) is a new failure mode the team must learn. Yatrika's first 18 months on stack A had 24 distinct types of incident (Mimir ingester WAL replay, Loki chunk-flushing stalls, Tempo block-builder OOMs, OTel Collector queue backups, S3 throttling at >5K req/s, etc.) — each one took a P1 to learn. Stack B had 14 incident types in a comparable period (VictoriaMetrics is one binary, not six; ClickHouse merge stalls are well understood from the OLAP world; Jaeger is older and less surprising). Stack C had 6 incident types (one process — when SigNoz fails it fails as one thing, which is easier to debug). Stack D had 2 incident types (cloud-provider outages and IAM permission drift) but at 4× the bill.
The non-obvious insight here: stack choice is mostly a function of how many on-call SREs you have, not how much data you have. A team of 2 SREs at 4M active series is better off on stack C (SigNoz) than stack A (Mimir + Loki + Tempo) even though stack A "scales better" — because the team will spend their entire on-call rotation babysitting Mimir and never get to the actual SLO work. A team of 6 SREs at the same data scale runs stack A and uses the multi-tenancy to give each business unit its own logical stack. The question "how big is your team" decides the stack; the question "how big is your data" decides only whether the stack has a ceiling problem within the next 12 months. Why team size dominates: every observability stack has a fixed operational tax of approximately 10–25% of one SRE's time per "kind of process" in the data path. Stack A has roughly 15 process kinds (Mimir's 6, Loki's 4, Tempo's 3, plus the Collector and gateway), which is 1.5–3.75 SRE-equivalents. Stack C has 1 (SigNoz itself) — 0.1–0.25 SREs. If your team is two SREs total, stack A cannot work — they cannot also build dashboards, write SLOs, and do on-call. The component count is not just an installer-difficulty metric; it is a long-run on-call drain that the spreadsheet from the previous chapter rarely captures.
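A tiny sketch of that operational-tax arithmetic. The 10–25% tax per process kind and the Stack A count come from the paragraph above; the process-kind counts for stacks B and C and the "fits" threshold are rough assumptions of mine, illustrative only.
# on_call_tax_sketch.py — SRE-equivalents consumed just by keeping each stack up.
# Tax range is the article's figure; the Stack B count is an assumption
# (vmstorage/vmselect/vminsert + ClickHouse + Jaeger + Collector).
TAX_LOW, TAX_HIGH = 0.10, 0.25   # fraction of one SRE per kind of process
process_kinds = {
    "Stack A (Mimir 6 + Loki 4 + Tempo 3 + collector/gateway)": 15,
    "Stack B (vmcluster 3 + ClickHouse + Jaeger + collector)": 6,
    "Stack C (SigNoz)": 1,
}
team_size = 2  # on-call SREs available
for stack, kinds in process_kinds.items():
    low, high = kinds * TAX_LOW, kinds * TAX_HIGH
    # Illustrative rule of thumb: the stack "fits" if its worst-case tax
    # still leaves at least half the team for SLO and dashboard work.
    verdict = "fits" if high <= team_size / 2 else "eats the on-call rotation"
    print(f"{stack}: {low:.2f}-{high:.2f} SREs -> {verdict}")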
The break-points worth memorising in 2026 (a small decision-helper sketch follows the list):
- Below 2M active series and 5 TB logs/day: stack C (SigNoz) is almost always right. Below this, the team-size-to-stack-complexity ratio rules out stacks A and B.
- 2M–20M active series: stack B (VictoriaMetrics + ClickHouse + Jaeger) is the sweet spot for 80% of Indian fleets in 2026 with mature platform teams. The SQL-on-logs power-tool tends to be the deciding factor.
- Above 20M active series with strict tenant isolation needs: stack A (Mimir + Loki + Tempo) earns its component count. Multi-tenancy in Mimir is industrial-grade; VictoriaMetrics' is younger.
- Below ₹50 lakh/year observability spend: stack D (cloud-managed) wins on TCO regardless of data scale — the team-cost floor of self-hosting dwarfs any cloud-pricing premium.
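Those four break-points can be written down as a crude decision helper. The thresholds are the rules of thumb from the list above; the function and its inputs are an illustration, not a published sizing tool.
# stack_picker_sketch.py — the 2026 break-points above as a crude decision helper.
def pick_stack(active_series: int, logs_tb_per_day: float,
               spend_lakh_inr_per_year: float, needs_tenant_isolation: bool) -> str:
    if spend_lakh_inr_per_year < 50:
        return "Stack D (cloud-managed): the self-hosting team-cost floor dominates"
    if active_series < 2_000_000 and logs_tb_per_day < 5:
        return "Stack C (SigNoz): team-size-to-complexity ratio rules out A and B"
    if active_series > 20_000_000 and needs_tenant_isolation:
        return "Stack A (Mimir + Loki + Tempo): multi-tenancy earns the component count"
    return "Stack B (VictoriaMetrics + ClickHouse + Jaeger): the 2026 sweet spot"

# A Yatrika-shaped fleet (hypothetical numbers): 4M series, 18 TB logs/day.
print(pick_stack(active_series=4_000_000, logs_tb_per_day=18,
                 spend_lakh_inr_per_year=320, needs_tenant_isolation=False))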
Common confusions
- "VictoriaMetrics is just a faster Mimir." Half-true and missing the multi-tenancy. VictoriaMetrics is much faster (3–5× lower compute at equal cardinality, single binary deployment) but its multi-tenant story (vmcluster with
accountIDnamespace) is operationally weaker than Mimir's tenant isolation. If you run a single observability cluster for one company, VictoriaMetrics wins almost every time. If you run a shared observability cluster across business units that need billing-isolation and per-tenant rate-limiting, Mimir's multi-tenancy is non-negotiable. - "ClickHouse for logs is just a slower Loki." The opposite is true at query time — ClickHouse on logs is faster than Loki for analytics queries (GROUP BY merchant_id with quantile()) by 10–100×, because columnar storage prunes data ClickHouse never reads. ClickHouse is slower than Loki for ingest (more index work per row) but most production pain is on the query side, not the ingest side. The actual trade-off is operational: ClickHouse requires schema-design discipline that Loki does not. Get the schema wrong and you pay forever.
- "SigNoz is just an open-source Datadog." Marketing-aligned but technically wrong. Datadog is a managed service running its own proprietary backend; SigNoz is a self-hosted product backed by ClickHouse. The user-facing UI patterns are similar (which is the source of the comparison), but SigNoz is at the same architectural layer as a Mimir+Loki+Tempo install — you run it, you scale it, you debug ClickHouse merges when they stall.
- "Jaeger is dead, use Tempo." Wrong in 2026. Jaeger is older, more stable, has a smaller operational surface, and runs on Cassandra or ES (well-understood backends) rather than Tempo's S3-block format. For trace fan-out > 50 services with strict trace-retention SLAs, Jaeger remains a perfectly valid choice. Tempo's advantage is integration with Grafana and the TraceQL query language; if you don't need either, Jaeger is fine.
- "Pyroscope and Parca are interchangeable." Mostly true at the storage layer (both store flamegraph profiles), but the agent ergonomics differ — Pyroscope's Python / Go / Java agents are more polished and have native pull-mode auto-instrumentation; Parca leans more on eBPF-based whole-system profiling. Pick Pyroscope for application-level continuous profiling, Parca for kernel-and-application together.
- "OTel Collector is optional if you use Prometheus scrapers." True in a single-cluster, single-tenant world. False the moment you have multi-region, multi-cluster fleets needing per-region tail sampling, retry buffering, and routing rules. The OTel Collector is the buffer between fragile network conditions and your storage layer — running without it works until your first ingest backup at 09:15 IST when Zerodha market open hits the storage layer all at once and you wish you had a queue.
Going deeper
The schema-design tax of ClickHouse-on-logs — and how to amortise it
Stack B's biggest hidden cost is ClickHouse schema design. The wrong ORDER BY (ts ASC) or the wrong partition (toDate(ts)) means the most common queries (filter by service+ts, group by merchant_id) scan 100× more data than necessary. The right schema for logs is roughly: ORDER BY (service, toStartOfMinute(ts), trace_id), PARTITION BY toYYYYMM(ts), with materialised views for the top-5 query patterns (per-service-error-count-per-minute, per-merchant-p99-per-hour, per-trace-id-span-list). Designing this takes 2–3 weeks of an engineer's time; getting it right amortises across years of fast queries. The Yatrika team's first ClickHouse-logs schema was generic; they rewrote it twice in 18 months before settling. The amortised win is enormous (queries that were 4 minutes on Loki are 4 seconds on the right ClickHouse schema), but the upfront cost is real and the spreadsheet rarely captures it.
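A minimal sketch of what that schema shape looks like in practice, run with clickhouse-driver against the local ClickHouse container from the "Reproduce" section below. The table and column names are illustrative, not Yatrika's actual schema; the ORDER BY / PARTITION BY / materialised-view pattern is the point.
# ch_logs_schema_sketch.py — the ORDER BY / PARTITION BY shape described above,
# plus one of the five materialised views. Columns are hypothetical; adapt to
# your own log shape.
from clickhouse_driver import Client

ch = Client(host="localhost")  # native protocol, default port 9000
ch.execute("""
CREATE TABLE IF NOT EXISTS logs (
    ts          DateTime64(3),
    service     LowCardinality(String),
    level       LowCardinality(String),
    trace_id    String,
    span_name   String,
    merchant_id String,
    latency_ms  Float64,
    body        String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(ts)
ORDER BY (service, toStartOfMinute(ts), trace_id)
""")
# Per-service error counts per minute, so the Q1-style query never scans raw rows.
ch.execute("""
CREATE MATERIALIZED VIEW IF NOT EXISTS errors_per_minute
ENGINE = SummingMergeTree
ORDER BY (service, minute)
AS SELECT service, toStartOfMinute(ts) AS minute,
          countIf(level = 'ERROR') AS errors, count() AS total
FROM logs GROUP BY service, minute
""")
print("schema created; the wrong ORDER BY here is the 100x slowdown the text warns about")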
Why Mimir's six processes exist — and when consolidating them is fine
Mimir splits into six processes (distributor, ingester, store-gateway, compactor, ruler, querier) because each process has a different memory profile, scale-out axis, and failure recovery shape. Distributors are stateless and scale with ingest QPS. Ingesters hold the in-memory write buffer (24 GB at 4M series) and recover slowly from restart due to WAL replay. Store-gateways serve reads from S3 and benefit from local caching. Compactors run hourly batch jobs (lots of CPU briefly). Rulers evaluate alerts on a separate schedule. Queriers fan out reads. The split is the right design at scale (10M+ series), but at small scale (under 1M series) it is over-engineered — Mimir's "monolithic mode" runs all six in one binary, trading flexibility for simplicity. For Yatrika-scale (4M series) and below, monolithic Mimir is often the right starting point; the six-process split can come later when one of the components becomes the bottleneck.
The OTel Collector as the universal indirection layer
Whatever stack you pick, put an OTel Collector in front. Collectors give you (a) buffering during ingest spikes, (b) tail-sampling decisions before you commit to keeping a trace, (c) attribute-rewriting and PII redaction at the edge before data leaves your VPC, and (d) the ability to switch storage layers later without re-instrumenting your services. The Collector is the indirection layer that lets you start on Stack C (SigNoz), migrate to Stack B (VictoriaMetrics + ClickHouse + Jaeger) two years later, and keep your application code unchanged. This is the same lesson the vendor vs self-hosted economics chapter reaches from the cost angle: the Collector is the option that keeps your future migrations cheap.
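What point (d) looks like from the application side, as a minimal sketch using the OpenTelemetry Python SDK (the same packages as the pip install in the next section). The endpoint is a hypothetical local Collector; whether Tempo, Jaeger, or SigNoz sits behind it is decided in the Collector's exporter config, never in application code.
# otel_indirection_sketch.py — the app only ever knows one OTLP endpoint (the
# Collector); swapping the trace backend later is a Collector-config change.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "checkout-api"}))
provider.add_span_processor(BatchSpanProcessor(
    OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)  # hypothetical Collector address
))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-api")
with tracer.start_as_current_span("checkout") as span:
    span.set_attribute("merchant_id", "M-000123")  # illustrative attribute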
Reproduce this on your laptop
# Run a minimal stack-B locally to feel the SQL-on-logs power-tool
docker run -d --name vm -p 8428:8428 victoriametrics/victoria-metrics
docker run -d --name ch -p 8123:8123 -p 9000:9000 clickhouse/clickhouse-server
docker run -d --name jaeger -p 16686:16686 -p 4317:4317 jaegertracing/all-in-one
# Set up Python and emit a small workload
python3 -m venv .venv && source .venv/bin/activate
pip install prometheus-client opentelemetry-sdk opentelemetry-exporter-otlp \
clickhouse-driver requests pandas tabulate
# Run the stack-comparison script and the synthetic emitter
python3 stack_query_compare.py
python3 emit_synthetic_workload.py # writes 10K log lines, 1K traces, 50K samples
# Then open http://localhost:16686 (Jaeger), http://localhost:8123 (ClickHouse),
# http://localhost:8428 (VictoriaMetrics) — query each in turn.
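emit_synthetic_workload.py is referenced above but not shown; here is a minimal sketch of what it might look like, assuming the logs table from the schema sketch earlier, ClickHouse on its native port 9000, and VictoriaMetrics on 8428. Volumes, field values, and distributions are made up, and trace emission is left to the OTLP sketch in the previous subsection.
# emit_synthetic_workload.py — a minimal sketch, not the canonical emitter.
import random
from datetime import datetime, timedelta
import requests
from clickhouse_driver import Client

ch = Client(host="localhost")
now = datetime.now()

# ~10K synthetic checkout-api log rows spread over the last hour.
rows = [(
    now - timedelta(seconds=random.randint(0, 3600)),
    "checkout-api",
    "ERROR" if random.random() < 0.02 else "INFO",
    f"{random.getrandbits(64):016x}",                 # trace_id
    random.choice(["checkout", "charge", "refund"]),  # span_name
    f"M-{random.randint(1, 500):06d}",                # merchant_id
    random.expovariate(1 / 120.0),                    # latency_ms
    "synthetic log line",                             # body
) for _ in range(10_000)]
ch.execute("INSERT INTO logs VALUES", rows)

# ~50K counter samples pushed through VictoriaMetrics' Prometheus text import API.
counters = {"200": 0, "500": 0}
lines = []
start = now - timedelta(hours=1)
for i in range(50_000):
    status = "500" if random.random() < 0.02 else "200"
    counters[status] += 1
    ts_ms = int((start + timedelta(milliseconds=72 * i)).timestamp() * 1000)
    lines.append(f'http_requests_total{{service="checkout-api",status="{status}"}} '
                 f'{counters[status]} {ts_ms}')
requests.post("http://localhost:8428/api/v1/import/prometheus", data="\n".join(lines))
print("emitted 10K log rows into ClickHouse and 50K samples into VictoriaMetrics")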
Where this leads next
/wiki/wall-the-discipline-ties-this-all-together is the next chapter and the wall-section bookend for Part 16 — the framing chapter on observability as a discipline rather than a product, where the stack choice slots into the broader question of how engineering culture shapes the bill and the on-call rotation.
/wiki/cardinality-budgets-revisited is the discipline that decides whether your chosen stack survives — Stack A and Stack B both fail under unbounded cardinality growth, just at different break-points. The cardinality-budget process is what keeps your Mimir or VictoriaMetrics cluster alive past month 18.
/wiki/vendor-vs-self-hosted-economics is the previous chapter and the input to this one — once the spreadsheet decides "self-host", this chapter tells you which exact stack. Read them as a pair.
References
- Grafana Labs, "Mimir architecture overview" — the canonical reference for Stack A's six-process design and the rationale for splitting them.
- VictoriaMetrics, "Why VictoriaMetrics is faster than Prometheus / Mimir" — the vendor's own explanation of the compression and storage tricks (mostly delta-of-delta + dictionary encoding) that give Stack B its 3–5× density advantage.
- Charity Majors, Liz Fong-Jones, George Miranda, Observability Engineering (O'Reilly, 2022) — the modern-era observability text. Chapters on event-based vs metric-based observability frame why ClickHouse-on-logs (Stack B) outperforms metric-only stacks for high-cardinality queries.
- SigNoz documentation, "Architecture overview" — the all-in-one stack's design rationale and ClickHouse-schema choices.
- Pelkonen et al., "Gorilla: A Fast, Scalable, In-Memory Time Series Database" (VLDB 2015) — the foundational paper for time-series compression that both Mimir and VictoriaMetrics implement.
- /wiki/vendor-vs-self-hosted-economics — internal: the cost decision that precedes the stack choice in this chapter.
- /wiki/wall-the-efficient-storage-of-time-series — internal: the deeper dive into the compression and storage tricks the metric stores in this article use.
- /wiki/index-free-log-storage-clickhouse-parquet — internal: the architectural pattern that Stack B's logs layer is built on.