Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
Vendor vs self-hosted economics
At 14:18 IST on 7 February, Aditi from the Yatrika platform team opens the Datadog renewal quote: ₹3.42 crore for the next 12 months, up from ₹1.84 crore the previous year. Engineering headcount grew 22%; the Datadog bill grew 86%. Her CTO forwards the quote with a single line: "Self-host?" Aditi's first instinct is yes — the marginal storage cost of metrics on a ClickHouse cluster on AWS is about ₹0.42/GB/month against Datadog's effective ₹16.40/GB/month after the custom-metric tier kicks in. The arithmetic looks like a 38× saving. By the time she finishes the spreadsheet two weeks later — operator salaries, on-call rotation, S3 egress, Aurora for alert-state, EBS for hot tier, plus the first six months of bugs while the team learns the operational shape of Mimir+Loki+Tempo — the saving is 2.1×, not 38×. The vendor bill is not the cost of observability; it is the price of skipping a team. This article is the arithmetic that tells you which year that becomes the wrong trade.
Vendor observability (Datadog, New Relic, Honeycomb, Grafana Cloud, Splunk) charges by ingestion volume, custom-metric cardinality, and seat count — pricing scales superlinearly with traffic and cardinality. Self-hosting (Prometheus + Mimir, Loki, Tempo, ClickHouse) trades a 4–10× lower marginal cost for a 2–4 person team and 6–18 months of operational learning. The break-even is not at a fixed bill amount but at a bill-to-team-cost ratio: when the vendor invoice exceeds the fully-loaded cost of two senior SREs (≈₹2 crore/year in India in 2026), self-hosting starts to win on cash — and the win compounds with traffic, but only if cardinality discipline is in place. Get cardinality wrong and self-hosting costs more than vendor at half the scale.
What the vendor invoice actually charges for — and why it scales worse than your traffic
A Datadog invoice has 14 line items by default; 4 of them dominate. Infrastructure host pricing (₹1,400–₹2,300 per host-month) bills per active container or VM, with a generous-looking 100 custom metrics included per host. Custom metrics (₹0.42 per metric per month after the included pool) is where the bill explodes — every distinct combination of metric name × label values is a "custom metric" once it crosses the included tier. APM (traces) bills per indexed span (₹0.95–₹1.40 per million spans after the included pool; ingested-but-not-indexed spans are billed separately via Trace Ingestion Control). Log ingestion (₹2.10–₹4.20 per GB ingested for 15-day retention) is the headline log charge but not the most expensive log tier — that title goes to log re-hydration at ₹6.30 per GB rehydrated, which costs more per GB than the original ingestion if you ever need to query old logs.
The first thing this pricing structure does is decouple the bill from your engineering quality. A team that emits 8000 distinct customer_id label values via a single histogram costs Yatrika's own Prometheus about 1.2 GB of memory — a real cost, but a bounded one — while the same histogram with 80 values is a rounding error. On the Datadog bill, the two differ by ~₹2.8 lakh/month, because every label-value combination is a "custom metric". Why custom-metric pricing is a cardinality tax that does not exist in self-hosted: in Prometheus, label cardinality drives memory and disk use, both of which have hard ceilings (a 32 GB box, an 8 TB disk) that you can engineer around — recording rules, label dropping, exemplars. The cost is bounded by the box. In Datadog the cost is unbounded — every new label value, every new pod with a new ID, every new merchant in your fleet adds ₹0.42/month forever. The pricing is not a bad model of compute cost; it is a deliberately worse one — Datadog charges you for the option to query the cardinality, even if you never do.
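Label values multiply rather than add, which is why the tax escalates so fast. A hypothetical sketch of the billing arithmetic, using the ₹0.42/series/month rate quoted above (the helper and the example label counts are illustrative, not from any real invoice):

```python
# Hypothetical illustration of the cardinality tax: every distinct
# combination of label values becomes one billable "custom metric".
RATE_INR_PER_SERIES_MONTH = 0.42  # the Datadog-style list rate used in this article

def billable_series(base_series: int, label_cardinalities: list[int]) -> int:
    """Series count after adding labels: cardinalities multiply, not add."""
    total = base_series
    for c in label_cardinalities:
        total *= c
    return total

# One counter, split by status_code (8 values) and region (4 values):
small = billable_series(1, [8, 4])          # 32 series — harmless
# The same counter with a customer_id label at 8,000 values:
large = billable_series(1, [8, 4, 8_000])   # 256,000 series

delta = (large - small) * RATE_INR_PER_SERIES_MONTH
print(f"monthly bill delta from one label: ₹{delta:,.0f}")
```

One label takes the monthly charge from pocket change to lakhs; on a self-hosted Prometheus the same change is a memory bump on a box you already pay for.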
There is a second cost dimension Datadog's pricing brochure does not advertise: the erosion of engineering review when costs are decoupled from emit decisions. When metrics are free at the margin (in the developer's mental model — "I'll just add a label"), label cardinality grows unbounded. When metrics cost ₹0.42/month each, a code-review question like "do you need that label?" actually gets asked. The discipline that vendor pricing forces is real value — but you pay for it monthly, and the bill keeps growing as the fleet grows while the discipline's value plateaus. The right framing is that the vendor bill is a cardinality tax that funds the discipline, and at small scale the tax is worth it. At large scale you can hire a platform engineer to enforce the same discipline (FinOps reviews, label budgets, recording-rule guidance) and capture the savings.
The seven-input break-even spreadsheet
The break-even decision has seven inputs. Most teams get it wrong by accounting for two or three and ignoring the rest. Here are the seven, with rough 2026 values for an Indian platform team:
- Vendor monthly bill (₹). Use the actual invoice, not the projected one. The "Year 2 with 30% growth" projection is what kicks off the conversation but the actual current bill is the comparison anchor.
- Self-hosted compute (₹/month). For Mimir+Loki+Tempo at 4M active series, 18 TB log/day, 8B spans/day: ~₹4.8 lakh/month on EC2 (3-AZ, m6i.4xl × 9 + d3.2xl × 6) plus ~₹1.2 lakh on S3 + EBS.
- Self-hosted team (₹/month). Two senior SRE/platform engineers fully loaded: ~₹14–18 lakh/month (₹1.7–₹2.1 crore/year), including benefits, equity vest, on-call top-ups. This is the input most often missed. Spreadsheet teams budget the salary; they forget benefits, ESOP vest, hardware, and the on-call rotation cost.
- One-time migration cost (₹). 6 person-months for the first 80% (about ₹35 lakh fully loaded), and a tail of integration work and dashboard rebuilds for another 4 person-months over the following year. Total: ~₹60 lakh, amortised over 24 months at ~₹2.5 lakh/month.
- Operational risk premium (₹/month). The vendor takes 24×7 SRE responsibility for the metrics-store. Self-hosting puts that on your on-call rotation. Price it as the expected cost of one P1 incident per quarter (4–8 hours of war-room, post-incident write-up, fix-it-Friday): ~₹2 lakh/month at typical Indian platform team rates.
- Future-proofing premium (₹/month). The cost of paying now to avoid being locked-in later. If the vendor doubles its prices in year 3 (which Datadog has done twice in the public record), how much is it worth paying to avoid that? Treat this as a real option — typically valued at 5–10% of the current bill.
- Strategic flexibility value (₹/month). What does owning your data buy? At Yatrika the answer was "an internal LLM-driven log-summarisation product that cost ₹14 lakh to build but saved 4 hours/week of on-call time" — possible only because logs lived in their own ClickHouse, not in Datadog. Hard to quantify ex-ante, but real.
The decision metric is then the net monthly saving:
net_saving = vendor_bill - (compute + team + amortised_migration + risk) + future_proofing + strategic_flex
Break-even is the bill level at which net_saving crosses zero.
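The same arithmetic as a runnable sketch, using rough Yatrika-scale values from the seven-input list above. The two option-value terms are illustrative guesses, not measured figures:

```python
# Net monthly saving from the seven inputs (₹/month).
# Values are this article's rough Yatrika-scale figures; future_proofing and
# strategic_flex are illustrative assumptions, the hardest inputs to estimate.
vendor_bill         = 2_850_000   # actual invoice: ₹3.42 crore/year ÷ 12
compute             = 600_000     # EC2 + S3/EBS for Mimir+Loki+Tempo
team                = 1_600_000   # two senior SREs, fully loaded
amortised_migration = 250_000     # ₹60 lakh spread over 24 months
operational_risk    = 200_000     # one P1 per quarter, priced
future_proofing     = 200_000     # ~7% of bill: option value of avoiding lock-in
strategic_flex      = 100_000     # owning the data (hard to quantify ex-ante)

net_saving = (vendor_bill
              - (compute + team + amortised_migration + operational_risk)
              + future_proofing + strategic_flex)
print(f"net saving: ₹{net_saving:,.0f}/month")  # positive → self-host wins on cash
```

With these inputs the saving is positive but thin — which is exactly why the inputs, not the formula, decide the answer.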
Below is the spreadsheet, in Python, runnable. It takes a fleet shape and produces the break-even with sensitivity bands.
# vendor_vs_selfhosted.py — compute break-even for Indian platform teams, 2026
# pip install tabulate
from dataclasses import dataclass

from tabulate import tabulate


@dataclass
class FleetShape:
    """Describes a platform's observability volume and ops profile."""
    name: str
    monthly_active_series: int      # e.g. 4_000_000
    log_gb_per_day: float           # e.g. 18_000 (= 18 TB)
    spans_per_day_billions: float   # e.g. 8.0
    seats: int                      # users on the vendor product
    growth_rate_pa: float           # e.g. 0.40 = 40% YoY


# --- Vendor pricing (Datadog-style, 2026 INR) ---------------------------------
def vendor_monthly_inr(f: FleetShape) -> dict:
    # custom metrics: included pool of 100 per host × 200 hosts = 20K free,
    # then ₹0.42/metric/month
    custom_metrics_billable = max(0, f.monthly_active_series - 20_000)
    custom_metrics_cost = custom_metrics_billable * 0.42
    log_ingest_cost = f.log_gb_per_day * 30 * 3.40          # ₹3.40/GB at 15-day retention
    apm_cost = f.spans_per_day_billions * 1000 * 30 * 1.10  # ₹1.10 per million indexed spans
    seat_cost = f.seats * 5800                              # ₹5800 per user-month
    host_cost = 200 * 1700                                  # 200 hosts at ₹1700/host
    return {
        "custom_metrics": round(custom_metrics_cost, -3),
        "logs": round(log_ingest_cost, -3),
        "apm_traces": round(apm_cost, -3),
        "seats": round(seat_cost, -3),
        "infra_hosts": round(host_cost, -3),
        "total": round(custom_metrics_cost + log_ingest_cost
                       + apm_cost + seat_cost + host_cost, -3),
    }


# --- Self-hosted cost model (Mimir+Loki+Tempo on AWS, 2026 INR) ---------------
def selfhosted_monthly_inr(f: FleetShape) -> dict:
    compute_inr = (
        9 * 38_000      # 9 × m6i.4xlarge for Mimir/Loki readers and ingesters
        + 6 * 22_000    # 6 × d3.2xlarge for Tempo + ClickHouse cold tier
        + 3 * 14_000    # 3 × t3.large for routers and gateways
    )
    storage_inr = (
        f.log_gb_per_day * 30 * 0.42           # ClickHouse warm tier, ₹0.42/GB-month
        + f.spans_per_day_billions * 100 * 30  # Tempo blocks, ~₹100 per billion spans/day
        + f.monthly_active_series * 0.012      # Mimir blocks, ~₹0.012/series/month
    )
    s3_egress_inr = 80_000             # cross-AZ + readback, varies by query volume
    team_inr = 2 * 850_000             # 2 senior SREs at ₹8.5 lakh/month fully loaded
    amortised_migration_inr = 250_000  # ₹60 lakh over 24 months
    operational_risk_inr = 200_000     # one P1/quarter
    return {
        "compute": compute_inr,
        "storage": round(storage_inr, -3),
        "s3_egress": s3_egress_inr,
        "team": team_inr,
        "amortised_migration": amortised_migration_inr,
        "operational_risk": operational_risk_inr,
        "total": round(compute_inr + storage_inr + s3_egress_inr
                       + team_inr + amortised_migration_inr
                       + operational_risk_inr, -3),
    }


# --- Run for three Indian-fleet shapes ----------------------------------------
fleets = [
    FleetShape("Series A startup", monthly_active_series=300_000,
               log_gb_per_day=500, spans_per_day_billions=0.4,
               seats=30, growth_rate_pa=1.20),
    FleetShape("Yatrika scale", monthly_active_series=4_000_000,
               log_gb_per_day=18_000, spans_per_day_billions=8.0,
               seats=180, growth_rate_pa=0.40),
    FleetShape("Hotstar IPL peak", monthly_active_series=18_000_000,
               log_gb_per_day=82_000, spans_per_day_billions=42.0,
               seats=420, growth_rate_pa=0.25),
]

rows = []
for f in fleets:
    v, s = vendor_monthly_inr(f), selfhosted_monthly_inr(f)
    rows.append({
        "fleet": f.name,
        "vendor_₹/mo": f"{v['total']:>14,.0f}",
        "selfhost_₹/mo": f"{s['total']:>14,.0f}",
        "savings_₹/mo": f"{v['total'] - s['total']:>14,.0f}",
        "savings_%": f"{100 * (v['total'] - s['total']) / v['total']:>5.1f}%",
        "ratio_v/s": f"{v['total'] / s['total']:>4.2f}×",
    })

print(tabulate(rows, headers="keys", tablefmt="github"))
Sample run on Aditi's laptop:
| fleet            |    vendor_₹/mo |  selfhost_₹/mo |   savings_₹/mo | savings_% | ratio_v/s |
|------------------|----------------|----------------|----------------|-----------|-----------|
| Series A startup |        696,000 |      2,757,000 |     -2,061,000 |   -296.1% |     0.25× |
| Yatrika scale    |      5,156,000 |      3,045,000 |      2,111,000 |     40.9% |     1.69× |
| Hotstar IPL peak |     20,078,000 |      4,121,000 |     15,957,000 |     79.5% |     4.87× |
Walking the key lines of this run:
- Series A startup row: the ratio is 0.25× — vendor is roughly 4× cheaper than self-hosting. The team-cost floor of ₹17 lakh/month dwarfs the vendor bill of ₹7 lakh. At 300K series and 500 GB/day of logs, you do not have the volume to justify two SREs running Mimir. The right answer at this scale is always vendor — not because the vendor is cheap, but because you cannot afford to dilute two senior engineers onto plumbing that an existing tool already plumbs.
- Yatrika scale row: the ratio is 1.69× — self-hosting saves about ₹21 lakh/month, roughly ₹2.5 crore/year. This is the canonical "you should self-host" zone. Even after team cost (₹17 lakh), migration amortisation (₹2.5 lakh), and the operational risk premium (₹2 lakh), the saving is still about 40% of the vendor bill. Why self-hosted scales economically here while vendor does not: self-hosted compute rises with cardinality only as far as the cluster shape needs more nodes (sub-linearly), and the team cost is fixed regardless of cardinality. Vendor cost is linear in cardinality, so adding another 4M series in year 2 adds about ₹2 crore/year to the vendor bill (4M × ₹0.42 × 12) but only a few lakh of block storage, plus perhaps one more ingester node, to the self-hosted side. The gap widens with growth, which is why the spreadsheet usually flips from "vendor" to "self-host" exactly when growth is highest — counter-intuitive but true.
- The `vendor_monthly_inr()` line `custom_metrics_billable = max(0, f.monthly_active_series - 20_000)` is where Datadog's bill explodes. The 20K-series included pool is the whole fleet's free allowance (100 per host × 200 hosts); everything beyond it bills at ₹0.42/series/month. At 4M series this is ~₹16.7 lakh/month from custom metrics alone — more than three times the entire self-hosted compute bill.
- The `selfhosted_monthly_inr()` line `team_inr = 2 * 850_000` is the input most teams under-state. Two senior SREs at a fully-loaded ₹8.5 lakh/month each is ₹2.04 crore/year, which is the floor below which self-hosting cannot land. If your vendor bill is below this floor, the spreadsheet has already decided.
- The `amortised_migration_inr = 250_000` line spreads the ₹60 lakh migration cost over 24 months. Many teams under-estimate the migration tail — the first 80% lands in 6 months, but the last 20% (dashboard parity, alert equivalents, custom-trace-query rebuilds) drags out for another 12 months while the team also runs the new system. The 24-month amortisation captures both phases.
- The `growth_rate_pa` field is unused in this snapshot but exists for the year-2 and year-3 projections (scale `vendor_monthly_inr(f)` by `(1 + f.growth_rate_pa) ** year`). The Yatrika vendor line in year 2 (40% growth) goes from ₹51.6 lakh/month to about ₹72 lakh/month — adding roughly ₹2.3 crore of annualised savings on top of the year-1 ₹2.5 crore. The decision becomes more obvious every year, not less.
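That projection can be sketched in a few lines. The split below is a simplification derived from the cost functions above: the self-hosted side has a fixed floor (team + migration amortisation + risk) and a variable part (compute + storage + egress) that scales with growth, while the vendor total scales almost entirely with growth. The Yatrika-scale figures are what those functions actually produce:

```python
# Year-by-year projection: vendor scales ~linearly with fleet growth,
# self-hosted splits into a fixed floor plus a growth-scaled variable part.
# This is a simplification of the fuller cost model in the main script.
def project(vendor_y0: float, selfhost_fixed: float, selfhost_var_y0: float,
            growth_pa: float, years: int) -> list[dict]:
    rows = []
    for year in range(years + 1):
        g = (1 + growth_pa) ** year
        vendor = vendor_y0 * g
        selfhost = selfhost_fixed + selfhost_var_y0 * g
        rows.append({"year": year, "vendor": round(vendor),
                     "selfhost": round(selfhost),
                     "ratio": round(vendor / selfhost, 2)})
    return rows

# Yatrika scale: vendor ₹51.6L/mo; self-hosted fixed ₹21.5L/mo, variable ~₹9L/mo
for row in project(5_156_000, 2_150_000, 895_000, growth_pa=0.40, years=3):
    print(row)
```

The ratio climbs from 1.69× in year 0 to past 2.5× by year 2 — the fixed team cost amortises over an ever-larger fleet while the vendor bill compounds.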
The spreadsheet's failure mode is fleet inputs that are wrong. Most teams over-state seats (every Slack-bot integration is not a "user"), under-state cardinality (the actual `prometheus_tsdb_head_series` gauge is 3–8× higher than what people quote from memory), and under-state log volume (15-day retention obscures the daily ingest rate; you must measure, not project). The correct inputs come from the existing Datadog admin UI (custom-metrics dashboard, log-volume by index, span-volume by service) and from the actual Prometheus or OTEL Collector counters in your fleet. Fight the temptation to round. ₹3.42 crore/year is a different decision from ₹3.6 crore/year if the break-even is ₹3.5 crore.
When self-hosting goes wrong — the three failure modes
The spreadsheet shows the happy path. The unhappy paths are real and they look like this:
Failure mode 1: cardinality-discipline regression. Self-hosting removes the per-metric tax. Within 6 months of migration, developers add labels they would not have added on Datadog. The cluster goes from 4M to 18M active series, Mimir starts OOMing the ingesters, the team adds nodes (₹2.4 lakh/month extra compute) and chases a moving target. The fix is not technical — it is a label-budget process owned by the platform team, with quarterly reviews of top-cardinality metrics and recording-rule rewrites for the worst offenders. Why removing the per-metric tax accelerates cardinality growth rather than leaving it unchanged: under vendor pricing, every PR that adds a label triggers a finance-side review ("this will add ₹0.42/series/month, multiplied by N values, multiplied by 12 months — is it worth it?"). The friction is not subtle, and it shapes engineering culture. Remove the tax and the friction disappears, but the behaviour the friction was preventing takes 6–9 months to manifest in the cluster — by which time the labels are deployed and the cardinality is already 4× larger. The platform team is then in a fight to roll back labels, which is far harder than not adding them in the first place. The fix has to start at the migration moment, not at the cardinality-crisis moment. Without that process, self-hosting underperforms vendor on cost within 18 months.
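The quarterly review can be mechanised. A minimal sketch of a label-budget check, assuming the per-metric series counts have already been pulled from Prometheus's `/api/v1/status/tsdb` endpoint (the metric names, snapshot values, and budget numbers here are hypothetical):

```python
# Label-budget check: flag metrics whose series count exceeds their budget.
# Input shape mirrors the seriesCountByMetricName data that Prometheus's
# /api/v1/status/tsdb endpoint reports; values below are hypothetical.
DEFAULT_BUDGET = 5_000  # hypothetical default per-metric series budget

def over_budget(series_by_metric: dict[str, int],
                budgets: dict[str, int]) -> list[tuple[str, int, int]]:
    """Return (metric, actual, budget) for every metric over its budget,
    worst offender first."""
    offenders = []
    for metric, count in series_by_metric.items():
        budget = budgets.get(metric, DEFAULT_BUDGET)
        if count > budget:
            offenders.append((metric, count, budget))
    return sorted(offenders, key=lambda t: t[1] - t[2], reverse=True)

# Hypothetical snapshot six months after migration:
snapshot = {
    "http_request_duration_seconds_bucket": 480_000,  # a customer_id label snuck in
    "payment_attempts_total": 3_200,
    "kafka_consumer_lag": 64_000,
}
for metric, actual, budget in over_budget(snapshot, {"kafka_consumer_lag": 70_000}):
    print(f"{metric}: {actual:,} series (budget {budget:,})")
```

Wire the report into the quarterly review (or a CI gate on recording-rule changes) and the cardinality regression surfaces months before the ingesters start OOMing.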
Failure mode 2: incident risk under-priced. The spreadsheet's ₹2 lakh/month operational risk assumes one P1 per quarter. Real Indian fleets that have migrated to self-hosted observability and tracked the metric report 1–3 P1s per quarter for the first 18 months, dropping to ~0.5/quarter once the team has a stable runbook and the ingesters have right-sized memory. The first 18 months are a 4–6× higher operational-risk premium than the steady state — about ₹8–₹12 lakh/month, not ₹2 lakh. Bake this into the migration plan, do not wave it away.
Failure mode 3: data-platform diversion. Once you have a multi-petabyte ClickHouse cluster, the data team wants to put product-analytics workloads on it. The observability team becomes the data-platform team — by accident, without a charter — and observability ownership erodes. The signal that this is happening: the team's headcount grows from 2 to 5, but the observability roadmap (new dashboards, OTEL upgrades, ingester right-sizing) stalls because the team is fighting analytics fires. The fix is a written charter at migration time that scopes the team to observability and pushes analytics workloads to their own cluster. The discipline is administrative, not technical, but it is the discipline that decides whether the saving compounds or evaporates.
The deeper truth all three failure modes point at: vendor pricing is also a forcing function for engineering discipline that you do not realise you were getting until it is gone. Cardinality discipline, on-call quality, team scoping — all three were partially provided by the constraint of paying-per-metric. Self-hosting needs an explicit replacement for that forcing function, and it is mostly cultural, not technical.
The hybrid pattern — and why pure self-hosted is rare in 2026
Few mid-to-large Indian fleets run pure self-hosted observability in 2026. The standard pattern is hybrid: self-hosted for the high-volume, low-criticality data (warm-tier logs, span dump, internal-team dashboards); vendor for the on-call critical path (alert routing, mobile push, RUM, and the executive-team status page). The hybrid splits the bill where the cost curves cross — typically 60–80% of data volume on self-hosted at ~₹0.42/GB-month, and the remaining 20–40% on vendor at ₹3+/GB-month — but those 20–40% are the highest-value GBs (alert telemetry, sampled error traces, executive dashboards). The total bill drops 40–60% from pure-vendor while keeping the vendor's 24×7 SRE responsibility on the on-call critical path.
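The split arithmetic can be sketched directly. The rates are this article's ballparks, and the sketch covers only the per-GB log charges — the self-hosted side's compute and team overhead (covered by the spreadsheet earlier) eats back some of the headline reduction:

```python
# Hybrid log bill: a fraction of daily volume lands on self-hosted storage,
# the high-value remainder stays on the vendor. Rates are this article's
# ballparks (₹/GB); the self-hosted compute/team overhead is deliberately
# excluded here and must be added back from the main spreadsheet.
def hybrid_monthly_inr(gb_per_day: float, frac_selfhosted: float,
                       vendor_rate_per_gb: float = 3.40,
                       selfhost_rate_per_gb: float = 0.42) -> float:
    monthly_gb = gb_per_day * 30
    vendor_part = monthly_gb * (1 - frac_selfhosted) * vendor_rate_per_gb
    selfhost_part = monthly_gb * frac_selfhosted * selfhost_rate_per_gb
    return vendor_part + selfhost_part

pure_vendor = hybrid_monthly_inr(18_000, 0.0)    # everything on the vendor
hybrid_70 = hybrid_monthly_inr(18_000, 0.70)     # 70% of GBs self-hosted
print(f"pure vendor: ₹{pure_vendor:,.0f}/mo, 70% hybrid: ₹{hybrid_70:,.0f}/mo "
      f"({100 * (1 - hybrid_70 / pure_vendor):.0f}% lower)")
```

At the 18 TB/day Yatrika volume, a 70% self-hosted split drops the per-GB log charge by roughly 60% — consistent with the 40–60% total-bill range above once the self-hosted overhead is added back.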
The hybrid is also the migration on-ramp. Year 0: pure vendor. Year 1: dual-write logs to Loki (self-hosted) and Datadog, query both. Year 2: migrate dashboards to Grafana pointing at Mimir for non-critical and Datadog for critical. Year 3: cut Datadog log retention to 7 days (keep alert routing), with all longer-term log queries hitting self-hosted. Year 4: evaluate full cut. The progression de-risks the migration by keeping the vendor as a fallback while self-hosted is being battle-tested. The data-engineering primer on /wiki/late-arriving-data-and-the-backfill-problem describes the dual-write pattern's analogue for transactional data — same shape, different domain.
The hybrid does have a cost: two systems is more than one system, and humans context-switch poorly. The on-call rotation now needs to know when an outage's signal is in Datadog vs Mimir, alert routing needs to be unified (PagerDuty as the single notification gateway), and run-books proliferate. The hybrid is a 2–4× operational complexity increase that buys a 40–60% bill reduction and migration optionality. For most Indian fleets in the ₹1–₹5 crore/year-bill bracket, this is the right trade. Below ₹1 crore the operational complexity is not worth it (stay vendor); above ₹5 crore the saving is large enough to commit to pure self-hosted (with the failure-mode mitigations from the previous section).
Common confusions
- "Self-hosted is always cheaper at scale." Half-true and missing the hard part. Self-hosted is cheaper on cash once the bill exceeds the team-cost floor (~₹2 crore/year for two senior SREs), but not cheaper in total cost of ownership at small scale, because the team-cost floor is large and lumpy. At Series A scale (a vendor bill well under ₹1 crore/year), self-hosting costs roughly 4× more, not less. The break-even is workload-specific.
- "Vendor pricing scales with my traffic." Wrong primitive. Vendor pricing scales with cardinality and event count, which can move 10× independently of traffic. Adding one label `customer_id` with 14 million distinct values to an existing counter takes traffic up by 0% and the vendor bill up by ~₹59 lakh/month (14M new series at ₹0.42/series/month). Forecast against cardinality, not QPS.
- "Once you migrate to self-hosted you cannot go back." Untrue and dangerous. Going back is annoying (3 person-months of re-instrumentation) but not fatal — the OpenTelemetry SDK already abstracts the exporter, so your application code does not change. What changes are dashboards, alert rules, and runbooks. Plan the option to roll back as part of the migration; do not commit to one-way doors.
- "Self-hosted means 100% open source." Not in 2026. The most common Indian-fleet self-hosted stack is the Grafana Labs open-source suite run on your own AWS account — Mimir, Loki, Tempo, Pyroscope. Grafana Labs sells support contracts (₹40 lakh/year at Yatrika scale) that buy you on-call escalation and bug fixes from the people who wrote the code. This is usually the right trade-off — pure FOSS without a support contract puts your on-call team alone with stack traces from inside Mimir at 2am, which the spreadsheet rarely accounts for.
- "The vendor's included pool makes them cheap for small teams." True for genuinely small teams (under 30 hosts). False for small teams that emit a lot of metrics — the included custom-metric pool of 100 per host hits zero very quickly when a team uses histograms with many buckets and label combinations. The actual line item to watch is `custom_metrics_overage`, not `infra_hosts`.
- "Self-hosting Tempo / Loki / Mimir is the same difficulty as self-hosting Postgres." No. Postgres is one process with 30 years of operational knowledge codified into AWS RDS. Mimir is six processes (distributor, ingester, store-gateway, compactor, ruler, querier) with five years of operational knowledge spread across blog posts and Grafana's docs. The team needs to learn the system, not just run it. Budget 6–12 months of learning curve before steady-state operations.
Going deeper
The cost-attribution problem — who pays for the platform team?
The cost model in this article treats the platform team's salary as a single line item against the observability bill. In practice, the platform team also runs the CI/CD pipeline, the deploy infrastructure, the secrets manager, and the internal developer portal — observability is one of 6+ workloads they own. Allocating their cost cleanly to "observability" requires a showback model that this curriculum picks up in /wiki/cost-attribution-and-showback-models. The right mental model is: the platform team's marginal cost for observability is not their full salary, only the fraction of their time the workload demands. For a stable Mimir+Loki+Tempo cluster post-stabilisation, that is 30–40% of two SREs (≈₹5–₹7 lakh/month), not 100%. The spreadsheet above uses 100% as a worst-case anchor; revise downward in steady state if the team genuinely owns other platforms.
Reserved instances, savings plans, and the AWS commitment trap
The compute numbers in the spreadsheet assume on-demand EC2 pricing. Three-year reserved instances or compute savings plans drop EC2 cost by 35–55%, which moves the Yatrika-scale ratio from 1.69× to roughly 1.8×. The catch: the commitment locks you into the cluster shape for three years. If you decide in year 2 that the right answer is to migrate from Mimir back to a different stack (or switch from EC2 to EKS-managed), the unused commitment becomes a sunk cost. The pattern Indian platform teams converge on by 2026 is savings-plan coverage of 60–70% of baseline (the part of compute that will not change shape) and on-demand for the remaining 30–40% (the part that grows with traffic and might shift architecture). This buys most of the discount with most of the flexibility.
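The coverage trade-off reduces to one line of arithmetic. A sketch, assuming a 45% savings-plan discount (the middle of the 35–55% range above) applied to the spreadsheet's EC2 line:

```python
# Blended compute cost under partial savings-plan coverage.
# discount and coverage values are illustrative, not quoted AWS rates.
def blended_monthly(on_demand_inr: float, coverage: float, discount: float) -> float:
    """coverage: fraction of baseline under the plan; discount: the plan's price cut."""
    return on_demand_inr * (coverage * (1 - discount) + (1 - coverage))

ec2 = 516_000  # the spreadsheet's on-demand EC2 line, ₹/month
for cov in (0.0, 0.65, 1.0):
    print(f"coverage {cov:.0%}: ₹{blended_monthly(ec2, cov, 0.45):,.0f}/mo")
```

Since the saving is linear in coverage, 65% coverage captures 65% of the full-commitment discount while leaving 35% of the fleet free to change shape — the asymmetry is in the downside, where uncovered compute can be re-architected and committed compute cannot.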
What changes if you are on GCP / Azure instead of AWS
The compute and storage numbers shift slightly: GCP's n2-standard is roughly equivalent to AWS m6i, but GCS object storage runs ~12% cheaper than S3 in Mumbai region for warm-tier observability data. Azure's blob storage is in the middle. The real difference is managed offerings: GCP's Cloud Logging is a Loki-equivalent at ~₹2.20/GB ingested (cheaper than Datadog logs but more expensive than self-hosted), and Azure Monitor's metric pricing is ~₹0.32/series (also cheaper than Datadog but a similar shape). The spreadsheet trick is to compare vendor (Datadog/New Relic) vs cloud-native managed (Cloud Logging, Azure Monitor) vs self-hosted (Mimir/Loki/Tempo on EC2/GCE/Azure VMs) as three options, not two. For some Indian fleets, the cloud-native managed tier is the right answer — it is 50% cheaper than vendor and avoids most of the team-cost floor of self-hosting.
Reproduce this on your laptop
# 1. Set up Python and dependencies
python3 -m venv .venv && source .venv/bin/activate
pip install pandas tabulate
# 2. Save the script as vendor_vs_selfhosted.py and run
python3 vendor_vs_selfhosted.py
# Expect: a 3-row markdown table with Series A, Yatrika, and Hotstar fleet shapes,
# showing vendor and self-hosted monthly costs in INR plus the ratio.
# 3. Modify the FleetShape values to match your actual fleet
#    (active_series from prometheus_tsdb_head_series,
# log_gb_per_day from your collector's outbound bytes,
# spans_per_day_billions from OTEL Collector exporter counters)
# Then re-run to see your break-even.
Where this leads next
/wiki/open-source-stacks-worth-running is the natural follow-up — once the spreadsheet says self-host, which exact stack do you run? Mimir vs VictoriaMetrics vs Cortex for metrics; Loki vs ClickHouse-direct vs Quickwit for logs; Tempo vs Jaeger-on-Cassandra for traces. Each combination has different operational shapes, support models, and per-petabyte cost — the article walks through the four canonical Indian-fleet stacks and which workloads each fits.
/wiki/wall-all-this-costs-a-fortune-tame-the-bill is the framing chapter for Part 16 that this article slots into — the broader story of why observability bills grow superlinearly and what levers you have to bend the curve. The vendor vs self-hosted decision is one of those levers; cardinality budgets, retention tiering, and sampling are the others.
/wiki/the-observability-bill-where-it-goes is the inventory chapter that names every line item this article cost-models. Read it first if the seven-input list above feels under-specified — it walks through the 14 line items of a typical Datadog or New Relic invoice and what each maps to in compute reality.
References
- Datadog public pricing page — the source for the per-host, per-custom-metric, and per-GB log pricing used in `vendor_monthly_inr()`. The 2026 prices used here are list prices; negotiated enterprise rates run 30–60% lower at the ₹2 crore/year+ band.
- Grafana Labs, "Cost of running Grafana Mimir at scale" — the architecture and component-level cost breakdown for self-hosted Mimir that the `selfhosted_monthly_inr()` model draws from.
- Charity Majors, "Vendor lock-in is a feature" — the case for vendor observability that the break-even should beat before self-hosting wins on more than cash. Worth reading even if you are committed to self-host, because the lock-in argument is real.
- USENIX SREcon 2024, "Self-hosting observability — the second-year story" — the failure-mode patterns in this article are drawn from the post-migration retrospectives presented at this conference.
- Honeycomb, "Why we charge per event, not per host" — the alternative vendor pricing model. Per-event scales differently from per-custom-metric and is worth understanding when comparing vendors against each other, not just against self-hosted.
- /wiki/the-observability-bill-where-it-goes — internal: the inventory of cost line items this article cost-models. Read first if the seven-input list needs grounding.
- /wiki/cost-attribution-and-showback-models — internal: the showback model that allocates the platform team's salary across the workloads they own, refining the 100% allocation used in the spreadsheet here.
- /wiki/cardinality-budgets-revisited — internal: the discipline that prevents failure mode 1 (cardinality regression) from eating the self-hosted saving in the first 18 months.