Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Wall: observability in distributed systems is a data problem

It is 02:14 IST on a Sunday and Kiran, the SRE on call at PaySetu, is staring at three Grafana tabs because the merchant-settlement p99 has just walked from 180ms to 4.2 seconds and stayed there. The service has 240 pods spread across three regions. The trace UI is timing out because Kiran's query — service=settlement AND status=5xx AND merchant_id=* — fans out across 14 days of trace storage and the backend is rate-limiting her. The logs UI works but returns 11 million lines for the last 10 minutes, and the search bar is grep-over-S3. The metrics UI is fast, but the only label that would tell her which merchant is broken — merchant_id — was dropped six months ago because it pushed Prometheus past 12 million active series and the cluster fell over. Kiran has all the data. None of it is in a shape she can ask the question with. The wall every distributed system hits, eventually, is that the moment you have more than a handful of services, observability is no longer a logging problem — it is a streaming, indexing, sampling, and cardinality-budget problem, and the architecture of your telemetry pipeline starts to dominate the architecture of your application.

A 10-service system makes a few GB/day of telemetry; a 400-service system makes 50–200 TB/day, and the difference is not a multiplier — it is a phase change. Logs, metrics, and traces each scale on a different bottleneck (storage cost for logs, label cardinality for metrics, sampling budget for traces), and a working observability stack is three independent pipelines tuned to those three bottlenecks. The instinct to "just keep everything" is what kills the budget and, eventually, what makes incidents undebuggable. Treat observability as a data engineering problem with a query SLA, not as a logging library you import.

Why "just log more" stops working at three or four services

A monolith on one host produces telemetry that one engineer can hold in their head: tail the syslog, watch the error rate, eyeball the request log. The cost of an extra logger.info(...) is negligible. The cost of "let's add a Prometheus counter" is a single label change. None of the trade-offs that dominate distributed-systems observability are visible at this scale, because the data volume per query stays in the gigabytes and the cardinality stays in the thousands.

That regime ends sharply. The moment you have ~30 services with ~10 instances each and any non-trivial traffic, three numbers cross over simultaneously: the total log volume goes from MB/day to TB/day, the total metric series goes from thousands to millions, and the total span volume goes from "every request" to "must sample". Each of those thresholds breaks a different part of the stack — the log indexer, the metric TSDB, and the trace store — and each breaks for a fundamentally different reason. The instinct that worked on the monolith ("emit more, search later") becomes the instinct that bankrupts the observability budget and locks the data in a shape no one can query.

[Figure: The phase change in telemetry volume. Stacked bars of telemetry per day by system size: 1 service / 1 host ≈ 5 GB/day (one tail call, one grep); 30 services / 300 hosts ≈ 2 TB/day (the log indexer breaks first); 400 services / 8000 hosts ≈ 120 TB/day (all three pipelines must be re-architected). Bottleneck markers show where each stack breaks: logs at storage cost (hot indexed retention priced per GB-month), metrics at cardinality (~5M active series saturates a single Prometheus shard), traces at sampling (~10B spans/day forces head-or-tail sampling decisions). Illustrative.]
The volume scales as services × hosts × requests; the bottleneck for each pillar scales differently. A 13× increase in services (30 to 400) rarely produces a 13× increase in pain — it produces a 100× increase, because two of the three pillars hit a non-linear wall (cardinality and sampling).

The "wall" framing matters because the failure mode is not "things get expensive". It is "the queries you most need during an incident become impossible". When merchant-settlement p99 jumps at 02:14, Kiran does not need average behaviour — she needs to slice by merchant_id, by region, by version, and by payment_method simultaneously, in the last 4 minutes, and get an answer in under 10 seconds. The observability stack either was designed to support that query or it was not, and the design decision that determines the answer was made months earlier when someone chose what to keep, what to drop, and what to sample.

The three pillars are three different data problems

Logs, metrics, and traces are three distinct data primitives with three distinct cost models. Treating them as one pipeline ("send everything to Splunk") is the architectural mistake that produces the bill nobody can defend. Treating them as three pipelines, each tuned to its own bottleneck, is what survives the phase change.

[Figure: Logs, metrics, and traces — three pillars, three bottlenecks. Illustrative.]

Logs: string events, time-ordered; schema optional (JSON/text); ~0.5–10 KB per request; indexed by time + source + body; bottleneck: $/GB stored; answers "What did this one request do, in narrative form?"; tools: ELK, Loki, ClickHouse, Splunk, Datadog; trap: keeping all DEBUG logs forever.

Metrics: (name, labels, value, t) samples; schema enforced and pre-aggregated; ~16 bytes/series per scrape; indexed by (name, label-set); bottleneck: label cardinality; answers "How is the system behaving in aggregate?"; tools: Prometheus, VictoriaMetrics, M3, Mimir, InfluxDB; trap: putting user_id or trace_id in a label.

Traces: parent-child span trees; schema: OpenTelemetry / Jaeger; ~0.5–5 KB per span; indexed by trace_id, service, operation; bottleneck: sampling rate; answers "Where in this 47-hop call tree did latency happen?"; tools: Jaeger, Tempo, Zipkin, Honeycomb, Lightstep; trap: head-sampling 1% and then missing every error trace.
The three pillars answer different questions and break for different reasons. A team that picks one tool to "do all three" eventually rebuilds it as three. PaySetu emits ~80 TB/day of logs (S3 + ClickHouse), ~3M active metric series (VictoriaMetrics), and ~12B spans/day with 0.5% head sampling plus 100% error tail-sampling (Tempo).

The bottleneck for each pillar is the load-bearing fact of its architecture. Why cardinality breaks metrics first: a Prometheus-style TSDB indexes by (metric_name, label_set). Each unique label combination is its own time series, with its own per-series memory footprint (~3 KB resident). 5 million active series ≈ 15 GB of RAM for the index alone, before any sample storage. Add merchant_id (200k values) as a label to a counter and the cardinality multiplies — 5 metrics × 200k merchants × 50 status codes = 50M series, which is ~150 GB of index. The cluster does not gracefully degrade; it OOMs.

Why storage cost dominates logs: a log line is ~1 KB compressed, indexed for full-text search at ~3× the raw size. At 50 TB/day raw → 150 TB/day indexed, retention of 30 days = 4.5 PB online. Hot indexed storage on commercial vendors runs $1.5–3/GB/month, so the bill is $7M–$13M/month for indexed retention alone. The fix is tiering: hot index for 24 hours, then move to columnar object-storage (ClickHouse/S3/Parquet) for 30 days, drop after.
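To make the tiering arithmetic concrete, a back-of-envelope sketch; the prices used ($2.50/GB/month hot-indexed, $0.03/GB/month columnar object storage) are illustrative assumptions, not vendor quotes.

# log_tiering_sketch.py: all-hot retention vs 24h-hot + 30d-columnar tiering.
RAW_TB_PER_DAY = 50
INDEX_BLOWUP = 3.0          # full-text index ≈ 3× raw size
HOT_PRICE = 2.50            # $/GB/month, hot indexed (assumed)
COLD_PRICE = 0.03           # $/GB/month, columnar on object storage (assumed)

def monthly_cost(stored_tb: float, price_per_gb_month: float) -> float:
    return stored_tb * 1024 * price_per_gb_month

# All-hot: 30 days of indexed logs kept in the hot tier.
all_hot = monthly_cost(RAW_TB_PER_DAY * INDEX_BLOWUP * 30, HOT_PRICE)
# Tiered: 1 day hot-indexed, then 30 days columnar at roughly raw size.
tiered = (monthly_cost(RAW_TB_PER_DAY * INDEX_BLOWUP * 1, HOT_PRICE)
          + monthly_cost(RAW_TB_PER_DAY * 30, COLD_PRICE))
print(f"all-hot: ${all_hot/1e6:.1f}M/month   tiered: ${tiered/1e6:.2f}M/month")
# → all-hot: $11.5M/month   tiered: $0.43M/month

The ~25× gap is why every surviving log pipeline is tiered; only the prices change.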

Why sampling dominates traces: a fully-sampled production system at 100k requests/sec with 30 spans per request emits 3M spans/sec, ~260 billion spans/day. At 1 KB per span that is 260 TB/day for traces alone. Head sampling (decide at the root) misses error traces because most errors happen on rare paths. Tail sampling (decide after the trace completes) requires buffering every span for the trace duration — solvable, but it pushes complexity into the collection tier.
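The same arithmetic for spans, as a sketch: the rates mirror the numbers above, and the 0.2% error rate is an assumption for illustration.

# span_budget_sketch.py: daily trace volume under three sampling policies.
RPS = 100_000
SPANS_PER_REQUEST = 30
SPAN_BYTES = 1_000
ERROR_RATE = 0.002          # assumed: 0.2% of requests fail

spans_per_day = RPS * SPANS_PER_REQUEST * 86_400        # ≈ 2.6e11 spans
full_tb = spans_per_day * SPAN_BYTES / 1e12             # ≈ 259 TB/day

head_1pct = full_tb * 0.01                  # cheap, but blind to rare errors
tail = full_tb * (0.005 + ERROR_RATE)       # 0.5% baseline + 100% of errors

print(f"100%: {full_tb:.0f} TB/day | head 1%: {head_1pct:.1f} TB/day | "
      f"tail 0.5%+errors: {tail:.1f} TB/day")
# → 100%: 259 TB/day | head 1%: 2.6 TB/day | tail 0.5%+errors: 1.8 TB/day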

A worked example: the cardinality math that decides what becomes a label

The most common production mistake is adding a label that "would be useful" without doing the cardinality multiplication. Most engineers do not have an intuition for it, because the numbers become terrifying surprisingly fast. The exercise below is the one to run before every metric design review.

# cardinality_estimator.py — what your label set will cost in active series.
# Run this before adding any label to a metric in production.
from math import prod

def cardinality(label_cardinalities):
    """Active-series count = the product of per-label cardinalities."""
    return prod(label_cardinalities.values())

def cost_estimate(series_count, bytes_per_series=3000):
    """Rough TSDB resident-memory estimate."""
    gb = (series_count * bytes_per_series) / (1024 ** 3)
    return gb

scenarios = {
    "before — sane labels": {
        "service": 400,        # 400 microservices
        "endpoint": 20,        # ~20 endpoints per service avg
        "method": 4,           # GET/POST/PUT/DELETE
        "status_class": 5,     # 1xx..5xx
        "region": 3,           # ap-south-1, ap-southeast-1, us-east-1
    },
    "after — engineer added merchant_id": {
        "service": 400, "endpoint": 20, "method": 4,
        "status_class": 5, "region": 3,
        "merchant_id": 200_000,  # PaySetu's onboarded merchants
    },
    "worse — also user_id": {
        "service": 400, "endpoint": 20, "method": 4,
        "status_class": 5, "region": 3,
        "merchant_id": 200_000,
        "user_id": 80_000_000,  # PaySetu's monthly active users
    },
}
for name, labels in scenarios.items():
    n = cardinality(labels)  # the labels proposed for http_requests_total
    print(f"{name:40s}  series={n:>20,}  RAM≈{cost_estimate(n):>12,.1f} GB")

Realistic output:

before — sane labels                      series=             480,000  RAM≈         1.3 GB
after — engineer added merchant_id        series=      96,000,000,000  RAM≈   268,220.9 GB
worse — also user_id                      series=7,680,000,000,000,000,000  RAM≈21,457,672,119,140.6 GB

Walkthrough, line by line: cardinality() multiplies all per-label cardinalities — every distinct combination is a separate time series. The "before" scenario keeps labels bounded: 400 services × 20 endpoints × 4 methods × 5 status classes × 3 regions = 480k series, well within a single Prometheus shard. Adding merchant_id (200k values) multiplies the count by 200,000 — to 96 billion series, requiring ~260 TB of RAM. Adding user_id (80M values) multiplies again, to 7.68 quintillion. The output is not a bug; the multiplication is what cardinality means, and the line "let's just add merchant_id as a label" is the line that destroys the cluster.

Why the answer is not "buy a bigger cluster": the cost grows multiplicatively with each high-cardinality label, and high-cardinality data is what you actually want to query during an incident — Kiran needs merchant_id. The architectural answer is to keep merchant_id out of metrics and put it into the trace tier (which samples) or the log tier (which is full-text-indexed but not pre-aggregated). The right tool for "show me p99 by merchant" is exemplars (link a metric bucket to a sampled trace_id) or a high-cardinality store (ClickHouse, Honeycomb), not a TSDB.
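In code, the discipline looks like this: a minimal sketch assuming opentelemetry-api and prometheus_client, where the metric and span names and the do_settlement() helper are hypothetical.

# label_discipline_sketch.py: merchant_id rides on the span, not the metric.
from opentelemetry import trace
from prometheus_client import Counter

tracer = trace.get_tracer("settlement")

# Bounded labels only: 5 status classes, nothing per-merchant. Cheap forever.
SETTLEMENTS = Counter("settlements_total", "Settlement attempts", ["status_class"])

def do_settlement(merchant_id: str, amount_paise: int) -> str:
    """Hypothetical business logic; returns a status class like "2xx"."""
    return "2xx"

def settle(merchant_id: str, amount_paise: int) -> None:
    with tracer.start_as_current_span("settle") as span:
        # High-cardinality detail goes to the (sampled) trace tier...
        span.set_attribute("merchant_id", merchant_id)
        status_class = do_settlement(merchant_id, amount_paise)
        # ...while the metric tier only ever sees the bounded label.
        SETTLEMENTS.labels(status_class).inc()

Exemplars (covered under "Going deeper") are what link the two tiers back together at query time.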

The collection pipeline is itself a distributed system

The naive picture is "applications emit telemetry, the observability backend stores it". The honest picture is that between the application and the backend sits a pipeline that is typically larger and more complex than several of the services it monitors — and when this pipeline degrades, you go blind exactly when you need to see most.

[Figure: The telemetry pipeline as a distributed system. App pods (8000 hosts) emit logs/metrics/spans via SDK (slog / OTLP / prom-client; stdout + in-process ring buffer; tail-drop on full) into a per-host agent (Vector / Fluent Bit / OTel Collector; batches, parses, enriches with k8s labels, redacts PII; ~1 GB disk buffer per host; disk-full means drop). Agents forward to a regional Kafka tier (three topics: logs, metrics, spans; 4h replay buffer; decouples ingest so the indexer can fall behind without losing data). Three independent consumers fan out: log indexer → S3 + ClickHouse (lag alert at >5 min), metric ingest → VictoriaMetrics (cardinality guard), trace processor → tail-sample → Tempo (~30 s buffer per trace), feeding the Grafana/Kibana, Grafana/Perses, and Jaeger/Tempo UIs. Back-pressure originates at the indexer and propagates upstream through Kafka. Illustrative.]
Six stages, three independent pipelines, one shared Kafka tier. The Kafka decoupling is what lets the log indexer take a 30-minute incident without dropping any data — but only if the host-agent disk buffers and the Kafka retention together cover the recovery time. PaySetu sizes Kafka retention to 4 hours after a 02:00 incident in 2025 where the indexer lagged for 90 minutes and 38 minutes of logs were lost to retention rollover.

The pipeline is itself subject to every distributed-systems failure mode covered earlier in this curriculum: partitions (the wall on partitions), back-pressure, clock skew between agents and ingest, and the load-balancing decisions inside each tier. The bitter pattern is that the observability stack degrades exactly during incidents: the same network event that broke your service breaks the agents reporting on the service, the surge of error-path logs spikes Kafka producer queues, and the dashboards are the first thing to time out at 03:00, when you most need them. Two specific design rules survive this:

  • Telemetry pipeline gets its own infrastructure budget and on-call. Co-tenanting telemetry collection on the same Kubernetes cluster as the applications it monitors means a cluster-control-plane failure takes both down. PaySetu runs the agent → Kafka → indexer chain on a dedicated VPC with separate IAM and a separate on-call rotation. The cost is real (~8% of total infra spend); the alternative is "no observability during the incident".

  • The replay buffer must outlast the worst expected indexer outage. If the log indexer falls behind by 30 minutes once a quarter, Kafka retention must be >30 minutes. If the worst observed outage is 4 hours, retention must be >4 hours plus margin. This is sized empirically, not by guess, and revisited every quarter against incident logs.
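A sizing sketch for that rule; the ingest rate and the 2× margin are assumptions to replace with your own incident history.

# replay_sizing_sketch.py: Kafka retention sized to the worst observed outage.
INGEST_GB_PER_HOUR = 120 * 1024 / 24   # 120 TB/day ≈ 5,120 GB/hour (assumed)
WORST_OUTAGE_HOURS = 4.0               # worst indexer lag seen in incident logs
MARGIN = 2.0                           # headroom multiplier (assumed policy)

retention_hours = WORST_OUTAGE_HOURS * MARGIN
retention_gb = retention_hours * INGEST_GB_PER_HOUR
print(f"retention: {retention_hours:.0f}h ≈ {retention_gb/1024:.0f} TB of Kafka "
      f"disk per copy (multiply by the replication factor)")
# → retention: 8h ≈ 40 TB of Kafka disk per copy (multiply by the replication factor)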

Common confusions

  • "OpenTelemetry replaces logs/metrics/traces" — OpenTelemetry is a wire format and SDK convention, not a backend. It standardises how you emit telemetry; the storage, indexing, and query bottlenecks still live in your TSDB, log indexer, and trace store. Migrating to OTLP without redesigning the backend pipeline solves nothing about cardinality or retention cost.

  • "Cardinality is just a Prometheus problem" — every label-indexed TSDB has the same multiplicative cardinality cost (VictoriaMetrics, M3, Mimir, InfluxDB, Datadog). The cost model is the same; the limit at which a single shard explodes varies by ~3×. The architectural decision (what becomes a label) is portable across vendors.

  • "Sampling means we lose data we need" — head sampling at 1% and tail sampling on errors at 100% covers the two queries that matter (aggregate p99 and per-error trace) at ~2% of the cost. The "we'll keep all traces" instinct is what prices teams out of distributed tracing entirely. The right question is "which traces are worth keeping", not "how do we keep them all".

  • "Logs and traces are duplicates" — they answer different questions. A trace shows you the call tree across services; a log shows you what happened inside one service in narrative form. A trace spans 47 services; a log line tells you why the database connection pool was full. You need both, but the right amount of each is wildly different — typically 10–50× more log volume than trace volume by bytes.

  • "Just send everything to a vendor and they'll handle scale" — vendors handle scale by charging for it. At 120 TB/day, vendor pricing for indexed retention runs $5–10M/year. The architectural decisions (sampling, label cardinality, retention tiers) are the same whether you self-host or vendor-host; the vendor just makes the bill explicit instead of buried in EC2.

  • "Observability is for SREs" — the highest-leverage observability decisions are made by application engineers when they choose what to instrument and what label to add. SRE teams can build the platform; they cannot retroactively make your payment_processed counter queryable by merchant_id if you put merchant_id in a label and OOM'd the cluster.

Going deeper

The exemplars trick — bridging the metric-trace gap

The right answer for "show me a sampled trace from any spike in p99 latency" is exemplars: a metric data point carries a pointer (one trace_id) to one trace that contributed to it. OpenMetrics defines exemplars as part of the wire format; Prometheus 2.26+ stores them; Grafana and Tempo render them as clickable points on a histogram. The mechanism: when a histogram bucket is incremented, the SDK optionally attaches the current span's trace_id to that increment. At query time, the user sees not just "p99=4.2s" but "p99=4.2s, click here for a representative trace". This is the dominant pattern that lets a sampled-1% trace store be useful for incident debugging — you do not need every trace, you need one trace per anomalous bucket. Honeycomb's BubbleUp and Lightstep's "change intelligence" are the same idea wrapped differently. Without exemplars, the metric tier and trace tier might as well be unrelated systems.
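A minimal sketch of the emit side, assuming a recent prometheus_client (which accepts an exemplar argument on observe()) and opentelemetry-api; the metric name is hypothetical. Note that exemplars are only exposed over the OpenMetrics exposition format, and Prometheus must run with --enable-feature=exemplar-storage to keep them.

# exemplar_sketch.py: attach the current trace_id to a histogram observation.
from opentelemetry import trace
from prometheus_client import Histogram

REQUEST_LATENCY = Histogram(
    "settlement_request_seconds", "Settlement request latency",
    ["endpoint", "status_class"],      # bounded labels only
)

def observe_with_exemplar(endpoint: str, status_class: str, seconds: float) -> None:
    ctx = trace.get_current_span().get_span_context()
    exemplar = {"trace_id": format(ctx.trace_id, "032x")} if ctx.is_valid else None
    # The exemplar rides along with this bucket increment; at query time the
    # UI renders it as a clickable point that opens the representative trace.
    REQUEST_LATENCY.labels(endpoint, status_class).observe(seconds, exemplar=exemplar)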

Tail sampling architecture — why it pushes complexity into the collector

Tail sampling decides "keep this trace?" after the trace completes, so the decision can use the actual outcome (was there an error, was latency anomalous). The mechanism: every span flows into a stateful collector that buffers spans by trace_id for a window (typically 30s — it must exceed the longest expected trace duration), and on root-span completion runs a sampling policy that may inspect any span in the trace. The trade-off vs head sampling is sharp: tail sampling needs ~30 GB of RAM per 1M traces in flight (1M traces × 30 spans × 1 KB per span, before headroom), and the collector cluster needs sticky routing so all spans from one trace land on the same collector node — in practice a routing tier partitioned by trace_id, which is how the OpenTelemetry Collector scales its tail_sampling processor behind a load-balancing exporter. Get this wrong and traces split across collectors lose half their spans. Get it right and you keep 100% of error traces, 100% of slow traces, and 0.5% of the rest — orders of magnitude cheaper than 100% sampling and orders of magnitude more useful than head-1%.
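The buffering core of that design, as a minimal single-node sketch. The span fields, thresholds, and keep policy are illustrative; a real deployment shards spans across nodes by trace_id as described above.

# tail_sampler_sketch.py: buffer spans by trace_id, decide at window close.
import time
import zlib
from collections import defaultdict
from dataclasses import dataclass, field

WINDOW_SECONDS = 30.0        # must exceed the longest expected trace duration
SLOW_THRESHOLD_MS = 2_000.0  # "anomalously slow" cut-off (assumed)
BASELINE_KEEP = 5            # keep 5 in 1000 boring traces = 0.5%

@dataclass
class Span:
    trace_id: str
    duration_ms: float
    is_error: bool

@dataclass
class TraceBuffer:
    spans: list = field(default_factory=list)
    first_seen: float = field(default_factory=time.monotonic)

class TailSampler:
    def __init__(self) -> None:
        self.buffers: dict[str, TraceBuffer] = defaultdict(TraceBuffer)

    def ingest(self, span: Span) -> None:
        # Every span is held until its trace's window closes; this buffer is
        # where the RAM cost of tail sampling lives.
        self.buffers[span.trace_id].spans.append(span)

    def flush_expired(self) -> list[list[Span]]:
        """Run the keep/drop policy on every trace whose window has closed."""
        now = time.monotonic()
        expired = [t for t, b in self.buffers.items()
                   if now - b.first_seen >= WINDOW_SECONDS]
        kept = []
        for trace_id in expired:
            spans = self.buffers.pop(trace_id).spans
            if self._keep(trace_id, spans):
                kept.append(spans)
        return kept

    @staticmethod
    def _keep(trace_id: str, spans: list[Span]) -> bool:
        if any(s.is_error for s in spans):                         # 100% of errors
            return True
        if any(s.duration_ms > SLOW_THRESHOLD_MS for s in spans):  # 100% of slow
            return True
        # Deterministic baseline sample keyed on trace_id (not random), so
        # every node makes the same decision for the same trace.
        return zlib.crc32(trace_id.encode()) % 1000 < BASELINE_KEEP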

Streaming aggregation — when metrics aren't enough and you can't afford logs

Between the cheap-aggregate world of metrics and the expensive-detail world of indexed logs sits streaming aggregation: read the log/event stream, keep a sliding-window sketch (HyperLogLog for unique counts, t-digest for percentiles, count-min sketch for top-k), and write the sketch out at low frequency. The sketch is bounded in memory (5 KB per HyperLogLog, ~2 KB per t-digest) regardless of input cardinality. ClickHouse, Apache Pinot, Druid, and Materialize are all variants of this idea. CricStream uses streaming aggregation to track "concurrent viewers per match per region" — the cardinality (~10k matches × ~30 regions = 300k series) is too high for direct metric labels, the per-event volume (5M events/sec) is too high for indexed logs, but a streaming HyperLogLog over a 30-second window gives a 5%-accuracy unique-viewer count at near-zero cost. The pattern is: when neither metrics nor logs fit, the data is asking to become a stream-processing job.
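A toy version of the windowed-HyperLogLog idea; the register count (2^12 ≈ 4 KB per sketch, ~1.6% typical error) and the CricStream-style (match, region) key are assumptions for illustration.

# hll_window_sketch.py: bounded-memory unique counts over a tumbling window.
import hashlib
import math

P = 12                             # 2^12 = 4096 one-byte registers ≈ 4 KB
M = 1 << P
ALPHA = 0.7213 / (1 + 1.079 / M)   # standard HLL bias-correction constant

class HyperLogLog:
    def __init__(self) -> None:
        self.registers = bytearray(M)

    def add(self, item: str) -> None:
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h >> (64 - P)                        # first P bits pick a register
        rest = h & ((1 << (64 - P)) - 1)           # remaining 52 bits
        rank = (64 - P) - rest.bit_length() + 1    # position of leftmost 1-bit
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self) -> float:
        est = ALPHA * M * M / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * M and zeros:               # small-range correction
            est = M * math.log(M / zeros)
        return est

# One ~4 KB sketch per (match, region) per window, regardless of how many
# distinct viewer_ids flow through the event stream.
windows: dict[tuple[str, str], HyperLogLog] = {}

def record_view(match: str, region: str, viewer_id: str) -> None:
    windows.setdefault((match, region), HyperLogLog()).add(viewer_id)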

Cost shape of an observability bill

A characteristic 400-service organisation in 2026 spends roughly: 45% of the observability bill on log retention (hot index 24h, warm columnar 30d, cold S3 90d), 25% on metric storage (active series + 90d retention), 15% on trace storage (sampled), 10% on the agent + Kafka tier, and 5% on the UI tier. Total tends to fall in the 5–8% of total infrastructure spend range — sometimes higher, but rarely lower without a deliberate retention policy. The biggest leverage point is almost always log retention (cut 30d hot retention to 7d hot + 30d cold, save ~30% of the bill); the biggest accidental cost driver is almost always a single high-cardinality metric label nobody has audited.
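The same shape as arithmetic: a sketch with an assumed $12M/year total and an assumed 80/20 hot/cold split inside the log line item.

# bill_shape_sketch.py: where the money goes, and what the retention lever moves.
TOTAL_M_PER_YEAR = 12.0                      # assumed example total
SHARES = {"log retention": 0.45, "metrics": 0.25, "traces": 0.15,
          "agents + Kafka": 0.10, "UI tier": 0.05}
for item, share in SHARES.items():
    print(f"{item:15s} ${TOTAL_M_PER_YEAR * share:4.1f}M/year")

# Lever: cut hot log retention from 30d to 7d, pushing the rest to cold
# columnar storage. If the hot index is ~80% of the log line item (assumed),
# dropping 23 of 30 hot days removes most of it.
log_item = TOTAL_M_PER_YEAR * SHARES["log retention"]
savings = log_item * 0.8 * (23 / 30)
print(f"saves ≈ ${savings:.1f}M/year ({savings / TOTAL_M_PER_YEAR:.0%} of the bill)")
# → saves ≈ $3.3M/year (28% of the bill)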

Where this leads next

This chapter closes Part 17 — the geo-distribution and edge-compute wall. The next part picks up where the operational-nightmare subsection of edge compute and this entire wall chapter leave off: a deep dive into observability as its own layer of the distributed system, with chapters on metrics design, trace propagation, structured logging conventions, and the SLO/SLI vocabulary that makes alerting actionable.

The thread that runs from here through Part 18 is unified by one claim: observability is not a tool you bolt on, it is a data engineering subsystem you design. The same things that make a distributed system hard — partial failure, async replication, sampling, clock skew, back-pressure — are the things that make its observability stack hard, because the observability stack is itself one of the larger distributed systems most organisations operate.

References

  1. Cindy Sridharan, Distributed Systems Observability (O'Reilly, 2018) — the canonical "three pillars" framing and its limits.
  2. Charity Majors, Liz Fong-Jones & George Miranda, Observability Engineering (O'Reilly, 2022) — the high-cardinality, event-based counter-argument to pillar-orthodoxy.
  3. Ben Sigelman et al., "Dapper, a Large-Scale Distributed Systems Tracing Infrastructure" (Google Technical Report dapper-2010-1) — origin of the modern trace.
  4. Brendan Gregg, "USE Method" (2013) — utilisation/saturation/errors, the metric-design framework that survives cardinality discipline.
  5. OpenTelemetry Collector (contrib), tail_sampling processor documentation — the canonical reference for stateful trace sampling.
  6. Cloudflare engineering, "How we built Cloudflare Logs" (2018) — a public post on log-pipeline scale at PoP-multiplied volume.
  7. Prometheus, Storage documentation — the cardinality-cost contract for label-indexed TSDBs.
  8. See also: edge compute: serverless at the edge, wall: clocks and NTP, wall: most of what you send breaks somewhere.