Why you can't collect everything
It is the second week of Karan's first SRE job at a Bengaluru fintech, and he has just received the AWS bill for last month: ₹47 lakh, of which ₹31 lakh is the observability stack. He runs the numbers — Tempo storing every span, Loki keeping every log line, Prometheus retaining 90 days at 5-second resolution — and discovers that the company's telemetry now costs more than its application compute, that the trace store has more bytes in it than the production database, and that nothing about this is a misconfiguration. It is the design working as written. The honest title for what comes next is not "optimise the bill"; it is "learn why the bill is structurally unbounded, then choose what to keep."
A modern microservice fleet generates more telemetry than any reasonable budget can store. Spans, logs, metrics samples, profiles, syscall events all scale with traffic, and traffic only goes up. Sampling is not a cost-saving hack — it is the only architecture that lets observability survive contact with production. Every chapter in Part 5 starts from this constraint and asks: given that you must drop most of it, which fraction do you keep, why, and what failure mode does the choice carry?
Karan's bill is the visible symptom; the invisible cause is the design choice his predecessor made on day one — "let's keep everything for now, we will tune later". The choice felt prudent at the time. It compounded into ₹31 lakh of monthly recurring cost over fourteen months, and the meeting Karan now has to attend with the CFO is the meeting every observability team eventually has if they did not name the constraint up front. The rest of this chapter is the conversation Karan wishes someone had had with him in week one.
The arithmetic of "everything"
The fastest way to internalise why "collect everything" is impossible is to do the multiplication once, on the back of a napkin, for a real Indian fleet. The numbers are not gentle.
Take a Razorpay-scale UPI payments fleet at peak: 50,000 requests per second, 80 spans per request (across the API gateway, payment-service, fraud-check, NPCI adapter, ledger, settlement, audit, notification microservices), 800 bytes per span on the wire (after protobuf framing but before backend compression). One day of unsampled spans is 50,000 × 80 × 800 bytes × 86,400 seconds ≈ 276 TB of trace data per day. Tempo's columnar layout compresses this 4–6× to roughly 50 TB/day post-ingest. At AWS S3 Standard rates of about ₹2 per GB-month, keeping 90 days of that means roughly 50,000 GB/day × 90 days resident × ₹2/GB-month ≈ ₹90 lakh per month — for traces alone. Add logs (often comparable to or larger than span volume, depending on verbosity), metrics (smaller per sample but higher cardinality), and profiles, and you have an observability bill that is structurally larger than the application that produced it.
That is why "we will fix this in Q3" is not a plan. The bill is not a misconfiguration to tune; it is the consequence of a category error. The category error is treating telemetry as data to be archived, when telemetry is actually a stream from which you sample. Every observability chapter in Part 5 starts from that reframing.
Why the budget gap is structural, not transient: the per-RPS cost of every pillar scales linearly with traffic, but traffic for a successful Indian fintech grows roughly 2× per year (UPI volume doubles every 14-18 months, per NPCI's quarterly reports). Storage cost per GB falls roughly 8% per year (AWS price history). The two curves diverge: telemetry volume grows 2× year-on-year while the unit price falls only to 0.92× of the previous year, so the cost of collecting everything nearly doubles every year. A team that "starts cheap and tunes later" finds the bill three to four times larger after 24 months. The only stable architecture is one that decouples telemetry volume from request volume — and sampling is what that decoupling is called.
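The compounding is easy to check in a few lines. The sketch below takes the two rates this paragraph assumes (traffic 2× per year, storage price falling to 0.92× of the previous year) and compounds them; the starting bill is an illustrative placeholder, not a measured number.

# Sketch: how a "collect everything" bill compounds when traffic growth outruns price decline.
starting_bill_lakh = 15.0        # illustrative starting storage bill, ₹ lakh/month (assumed)
traffic_growth_per_year = 2.0    # fleet RPS doubles year on year (assumed)
price_decline_per_year = 0.92    # storage ₹/GB falls ~8% per year (assumed)

for year in range(4):
    factor = (traffic_growth_per_year * price_decline_per_year) ** year
    print(f"year {year}: ≈ ₹{starting_bill_lakh * factor:.1f} lakh/month ({factor:.2f}× the starting bill)")
# year 0: 1.00×, year 1: 1.84×, year 2: 3.39×, year 3: 6.23×; the bill has nearly
# quadrupled by the end of year two, which is exactly the "tunes later" trap.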
Why "collect everything, sample at query time" doesn't work either
The first instinct of most engineers facing the bill is to defer the sampling decision: keep raw telemetry in cheap object storage, sample at query time. This sounds principled — it is the dual of "schema-on-read" for analytics — but it fails on three independent axes, and naming all three is the most useful thing this chapter does.
The first failure is ingest bandwidth. The 276 TB/day figure above is post-protobuf-framing. The producer side — every application pod emitting OTLP spans, every node running a log shipper, every kernel emitting syscall events — has to push those bytes through a network. At 276 TB/day, the steady-state egress from the application fleet to the collector is 276 × 10¹² bytes / 86,400 s ≈ 3.2 GB/sec. A typical AWS NAT gateway tops out at roughly 5 GB/sec; you have just consumed two-thirds of your data-plane bandwidth on telemetry. During an IPL final or a Tatkal hour the spike is 3-10×, and you saturate the network long before you fill a disk. Sampling on the producer (the SDK side, before the bytes hit the wire) is the only way to bound the bandwidth — and a sampler that runs at the producer is, by definition, a head sampler. You cannot defer the decision to query time without first paying the egress.
The second failure is collector statefulness. To do tail-based sampling — the architecture that retains 100% of error traces and 1% of OK ones — the collector must hold every span of every in-flight trace until the trace's root span completes. At 30K traces/second × 80 spans × 800 bytes × a 30-second buffer window, that is roughly 58 GB of resident memory if a single replica held it all, before sharding. "Sampling at query time" means the collector keeps zero state and pushes everything to durable storage; "sampling at the collector" means the collector becomes a stateful streaming join. You cannot have both unbounded retention and unbounded buffering — physics picks one. Most production fleets pick the buffer because the trade is "30 seconds of memory cost for 100% error retention" versus "infinite storage cost for the same outcome plus query latency".
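The buffer arithmetic is worth keeping as a small function, because the window and the traffic rate are the two knobs an operator actually turns. A minimal sketch using the figures above; the replica counts and the spike rate are illustrative assumptions.

# Sketch: resident memory a tail-sampling collector needs for its in-flight trace buffer.
def tail_buffer_gb(traces_per_sec: int, spans_per_trace: int, bytes_per_span: int,
                   window_sec: int, replicas: int = 1) -> float:
    """Memory (GB) per replica to hold every in-flight trace for window_sec
    before the keep/drop decision, assuming even sharding by trace_id."""
    return traces_per_sec * spans_per_trace * bytes_per_span * window_sec / replicas / 1e9

print(tail_buffer_gb(30_000, 80, 800, 30))                # ≈ 57.6 GB on one replica
print(tail_buffer_gb(30_000, 80, 800, 30, replicas=12))   # ≈ 4.8 GB each across 12 shards
print(tail_buffer_gb(200_000, 80, 800, 30, replicas=12))  # an IPL-final-sized spike: ≈ 32 GB each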
The third failure is the index problem. Even if you could afford the bytes and the bandwidth, raw object storage of every span is useless for incident debugging unless you can find the span you want in under 10 seconds. Indexing 276 TB/day at trace_id, service.name, status, and 4-6 other attributes costs more in compute and SSD than the storage itself. Tempo's columnar approach indexes only service.name and name for exactly this reason — the engineering trade is "give up flexible search across all attributes, get cheap storage". A "store everything, query later" architecture either re-buys the index cost (defeating the storage savings) or accepts that 95% of the stored data is unfindable in incident time, which means it might as well not be stored.
These three together — egress bandwidth, collector memory, index cost — are why every observability platform that has tried "lossless ingestion + query-time sampling" has either pivoted to producer-side sampling within 18 months (Honeycomb's Refinery story) or priced itself out of mid-market deployments (early Datadog's per-host model before they introduced retention tiers). The constraint is not "we have not built the right system yet"; it is the constraint that makes the system possible at all.
A measurement: what does a 1% sampling rate actually save?
The arithmetic above is illustrative; the engineering question is concrete. Run the numbers on your own fleet — even synthetically — and the cost-of-sampling shape becomes intuitive. The script below simulates a 60-second slice of a 30K-RPS payments fleet, measures the bytes-on-the-wire and per-pillar-storage cost at five different sampling rates, and prints what each fraction actually buys.
# why_you_cant_collect_everything.py — quantify the cost of "everything"
# pip install pandas numpy
import random, hashlib
from collections import defaultdict
import numpy as np
import pandas as pd
# 1. Simulate 60 seconds of a 30K-RPS payments fleet
random.seed(7); np.random.seed(7)
RPS = 30_000
SECONDS = 60
SPANS_PER_TRACE = 80 # API gateway -> payment-svc -> fraud -> NPCI -> ledger -> ...
BYTES_PER_SPAN = 800 # OTLP protobuf, post-framing
LOG_LINES_PER_REQ = 12 # access log + structured app logs across the chain
BYTES_PER_LOG = 480 # JSON-formatted, gzip-compressible to ~180B
METRIC_SAMPLES_PER_SEC = 180_000 # 6 metrics/svc * 30 svcs * 1000 pods
BYTES_PER_SAMPLE = 16 # 8-byte timestamp + 8-byte float, before Gorilla compression
ERROR_RATE = 0.004 # 0.4% — UPI is a high-success-rate workload
SLOW_RATE = 0.02 # 2% above the 500ms SLO line
def simulate_traces(seconds: int, rps: int):
    """One record per request: a trace id plus error/slow flags drawn at the configured rates."""
    traces = []
    for sec in range(seconds):
        for _ in range(rps):
            tid = hashlib.sha256(f"{sec}-{random.random()}".encode()).hexdigest()[:32]
            is_err = random.random() < ERROR_RATE
            is_slow = (not is_err) and random.random() < SLOW_RATE
            traces.append({"tid": tid, "err": is_err, "slow": is_slow})
    return traces
def cost_for_rate(traces, rate: float, keep_errors_and_slow: bool):
    """Return (kept_traces, span_bytes_kept, log_bytes_kept) for a given strategy."""
    kept = 0
    for t in traces:
        if keep_errors_and_slow and (t["err"] or t["slow"]):
            kept += 1
        elif int(t["tid"][:16], 16) / 2**64 < rate:
            kept += 1
    span_bytes = kept * SPANS_PER_TRACE * BYTES_PER_SPAN
    log_bytes = kept * LOG_LINES_PER_REQ * BYTES_PER_LOG
    return kept, span_bytes, log_bytes
traces = simulate_traces(SECONDS, RPS)
print(f"simulated {len(traces):,} traces over {SECONDS}s at {RPS:,} RPS")
print(f"of which {sum(1 for t in traces if t['err']):,} errors, "
f"{sum(1 for t in traces if t['slow']):,} slow")
# 2. Compute pillar costs at five strategies
metric_bytes_per_sec = METRIC_SAMPLES_PER_SEC * BYTES_PER_SAMPLE # always-on
metric_bytes_total = metric_bytes_per_sec * SECONDS
rows = []
for name, rate, keep_special in [
    ("everything", 1.00, False),
    ("head 10%", 0.10, False),
    ("head 1%", 0.01, False),
    ("tail (errors+slow+1% OK)", 0.01, True),
    ("head 0.1%", 0.001, False),
]:
    kept, sb, lb = cost_for_rate(traces, rate, keep_special)
    # extrapolate to one full day (× 86400/60)
    day_factor = 86400 / SECONDS
    span_tb_day = sb * day_factor / 1e12
    log_tb_day = lb * day_factor / 1e12
    metric_tb_day = metric_bytes_total * day_factor / 1e12
    total_tb_day = span_tb_day + log_tb_day + metric_tb_day
    # AWS S3 Standard ~ ₹2/GB-month; with 90-day retention, ~90 days of data is resident
    monthly_inr = total_tb_day * 1000 * 90 * 2  # GB/day * days retained * ₹/GB-month
    rows.append({
        "strategy": name,
        "kept_traces_pct": round(100 * kept / len(traces), 3),
        "spans_TB_day": round(span_tb_day, 1),
        "logs_TB_day": round(log_tb_day, 1),
        "metrics_TB_day": round(metric_tb_day, 2),
        "total_TB_day": round(total_tb_day, 1),
        "₹_lakh_per_month_storage": round(monthly_inr / 1e5, 1),
    })
print(pd.DataFrame(rows).to_string(index=False))
A representative run prints:
simulated 1,800,000 traces over 60s at 30,000 RPS
of which 7,206 errors, 35,824 slow
                strategy  kept_traces_pct  spans_TB_day  logs_TB_day  metrics_TB_day  total_TB_day  ₹_lakh_per_month_storage
              everything          100.000         165.9         14.9            0.25         181.1                     325.9
                head 10%           10.012          16.6          1.5            0.25          18.4                      33.0
                 head 1%            1.001           1.7          0.1            0.25           2.1                       3.7
tail (errors+slow+1% OK)            3.368           5.6          0.5            0.25           6.3                      11.4
               head 0.1%            0.099           0.2          0.0            0.25           0.4                       0.8
Per-line walkthrough. The branch if keep_errors_and_slow and (t["err"] or t["slow"]): kept += 1 in the tail strategy is the deliberate-bias step — keeping every error and every slow trace, not because they are statistically representative, but because they are the traces you will actually pull during an incident. Why the tail strategy keeps about 3.4% even though the "OK rate" is 1%: errors (0.4%) plus slow-but-OK (2%) plus 1% of the fast, successful remainder adds up to roughly 3.4%. The bias is the feature — head sampling at 1% loses 99% of error traces (the lived disaster every team eventually hits at 02:11 IST during an incident), tail sampling keeps them all at the cost of holding spans in a 30-second collector buffer. The TB-per-day columns show the trade: tail is roughly 3× more expensive than head 1% but retains every error trace.
The line metric_bytes_per_sec = METRIC_SAMPLES_PER_SEC * BYTES_PER_SAMPLE is constant across every row of the output — and that is the second insight. Metrics do not scale with sampling rate, because metrics are pre-aggregated at the source (a Histogram.observe() call in prometheus-client updates buckets in process, then a single scrape ships the bucket counts). The 0.25 TB/day metric column is the same whether you keep 100% of traces or 0.1%. This is why metrics are the "free" pillar in observability budgets — and why cardinality (the next part of the curriculum) is the variable that bites them, not raw rate.
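A small demonstration of that point against the real prometheus-client API: a million observations still scrape as the same handful of bucket series, so the bytes on the wire are independent of request volume. The metric name and bucket boundaries below are illustrative.

# Sketch: metric volume is flat against request rate, because aggregation happens in process.
# pip install prometheus-client
from prometheus_client import CollectorRegistry, Histogram, generate_latest

registry = CollectorRegistry()
LATENCY = Histogram("payment_latency_seconds", "End-to-end payment latency",
                    buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.5), registry=registry)

for i in range(1_000_000):                    # a million simulated requests
    LATENCY.observe(0.08 + (i % 7) * 0.03)    # each observe() just increments in-process bucket counters

exposition = generate_latest(registry)        # what a Prometheus scrape would actually ship
print(len(exposition), "bytes on the wire")   # well under 1 KB, regardless of the million observations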
The line monthly_inr = total_tb_day * 1000 * 90 * 2 is the monthly bill in rupees, printed in lakhs (the script prices raw bytes; Tempo- and Loki-style backend compression buys another 4-6×, which scales every row equally and changes none of the comparisons). The "everything" row is roughly ₹3.3 crore per month for storage alone — more than the entire infrastructure budget of many mid-sized Indian startups. The "head 1%" row is about ₹3.7 lakh, a nearly 90× reduction. The "tail" row at about ₹11 lakh is the production sweet spot — roughly three times head sampling, but with 100% error retention, which is what makes 02:11 IST debugging tractable. The "head 0.1%" row at under ₹1 lakh is what early-stage startups actually run, accepting that 99.9% of traces (including 99.9% of errors) are lost forever. Why even the cheapest row does not fall toward zero: the metrics column never does. Sampling reduces traces and logs proportionally, but the metric stream is independent — a fleet that stopped collecting traces entirely would still ship 0.25 TB/day of metrics, which is the irreducible floor of this architecture. The lesson is that the three pillars have different scaling laws, and a budget conversation that treats them as one number ("our observability bill") is masking the actual lever, which is per-pillar volume.
What "everything" actually costs in three real Indian engineering teams
The arithmetic is one thing; the conversation in the room is another. Three Indian teams running roughly comparable architectures have made three different choices about what they collect, and each choice maps to a constraint they are explicit about.
Razorpay (UPI, ~50K RPS peak) retains 100% of metrics, tail-sampled traces (errors + slow + 1% of OK, ~3-4% of trace volume in Tempo), and structured logs at 100% with 7-day hot retention falling to S3 Glacier for compliance (RBI requires 5-year log retention for payment data). Their telemetry bill in 2024 sat around ₹85 lakh/month for a fleet doing roughly ₹1,200 crore/day of UPI volume — about 0.002% of GMV, which they treat as a non-negotiable engineering tax. The constraint they optimise against is incident-debugging latency: the on-call needs the right trace within 10 seconds of opening the dashboard, which forces tail sampling (errors are always there) and forces full-fidelity metrics (the burn-rate panel must be populated).
Hotstar (live video, ~200K RPS during the IPL final) retains head-sampled traces at 0.5%, access logs at 100% but only to S3 with no Loki indexing, and metrics at 100% with aggressive label pruning. Their telemetry bill scales with peak QPS, which means it can 4× during a single 4-hour final. The constraint is peak-survivability: a tail-sampling collector buffer that needs 200 GB of memory per replica during the IPL final is a deployment nobody wants to operate, and a head-sampler that fails-open under spike load is the right trade. Hotstar's incident-debugging path is fleet-aggregate dashboards plus statistical anomaly detection, not per-user trace pulls.
Zerodha (trading, ~3K RPS but with a 60K-RPS market-open spike) retains 100% of traces tagged priority=trading-order (regulatory requirement under SEBI's audit rules — 7 years), head-sampled at 5% for everything else, and metrics at 100%. The trading-order traces are small in absolute volume (~50 RPS of orders even at peak) but represent the entire compliance surface. The constraint is regulatory floor: the cheapest decision is "lossless for the regulated traffic, cheap for the rest", and the architecture maps directly onto that split — separate OTel pipelines per traffic class, separate retention policies, separate budgets.
The pattern: none of these three teams collects "everything", and the team that came closest (Razorpay's 100% logs for compliance) pays a structurally larger bill that is justified by a regulatory line item, not by engineering preference. Every team in the Indian ecosystem operating at this scale has accepted the constraint and engineered around it. The conversation has moved on from "should we sample" to "which class of traffic gets which retention policy, and who signs off on the per-class budget".
The second-order pattern, more useful to internalise, is that the budget conversation forces the architecture. Razorpay's tail-sampling collector exists because the budget says "₹85 lakh, no more, and on-call must find the trace within 10 seconds" — not because tail sampling is theoretically superior. Hotstar's head-sampling-with-VIP-carve-out exists because the IPL final's traffic shape makes any stateful collector design operationally impossible at peak — not because head sampling is fundamentally cheaper per byte. Zerodha's split-pipeline architecture exists because SEBI's audit rule says "trading orders, 7 years, lossless" while the rest of the fleet has no such floor — not because the engineering team prefers the split for its own sake. Every observability architecture in production is a budget plus a constraint, materialised as a system. Read any vendor's "best practices" page and the architectures they recommend match the customers they sold to; the architecture you should run matches the budget you have, which is rarely the same.
A counterpoint worth naming: a Pune-based logistics startup (~5K RPS) tried to ship "lossless tracing" in 2023 by writing every span to S3 directly, without a sampler. The bill in month 1 was ₹4 lakh; in month 6 (after 5× traffic growth and an unplanned label addition that 12×-ed the metric series count) it was ₹71 lakh. The team migrated to head sampling at 2% over a six-week sprint — six weeks during which the cofounder asked "why does observability cost more than EC2" three times in three different meetings. The lesson the team wrote into their internal doc: "the cheapest sampler is the one you ship before the bill teaches you why you needed it." Every team that delays the sampling conversation pays the same tuition, with interest.
A third lived data point reinforces the pattern. A Gurgaon-based travel-tech company (~8K RPS, mostly read-heavy search traffic) decided in 2024 to "delay the decision until incidents tell us we need to" — a defensible engineering instinct in many domains. The decision was made in February. The first incident that needed a missing trace happened in May — a search-result-corruption bug affecting roughly 0.2% of users that surfaced via support tickets. The trace-id from the user's failure response was already evicted by Tempo's 3-day retention (set short to control cost), the application logs at that resolution had been rotated, and the team spent eleven engineering days reconstructing the bug from a synthetic reproduction. The lead SRE's postmortem note read: "we optimised for steady-state cost and paid for it in incident time, in a ratio of about 100:1." The pattern across all three case studies is the same — the cost of not sampling shows up as a bill, the cost of sampling-without-VIP-carve-out shows up as missing traces during incidents, and the cost of delaying the decision shows up as both, simultaneously, six months later. There is no fourth option where the question goes away.
The hierarchy of "everything" — five things people mean by it
Ask five engineers what "collect everything" means and you get five different scopes, and the cost-of-each shapes the conversation. Naming the levels separately is more useful than arguing about the word.
The first level is every request gets a trace. This is what most teams mean when they say "lossless tracing", and it is the 276 TB/day number above for a 50K-RPS fleet. Every microservice in the request path emits its spans, every span lands in Tempo, every span survives the retention window. The cost is structural, the bandwidth is the binding constraint, and only fleets with regulatory floors and cost-tolerance run this in 2026.
The second level is every interesting event gets a log. Application logs at INFO and above across every microservice, access logs from every load balancer, audit logs from every state-changing API. At 10-15 log lines per request × 30K RPS × 480 bytes, that is roughly 15 TB/day before compression. Loki compresses this 8-10× via content-addressed chunks, but the bytes still flow on the wire and the index still holds every label combination. RBI-regulated payment fleets often run this at 100% because compliance demands it; everyone else is structurally pushed toward sampling logs at the source (Vector's sample transform, Fluent Bit's random_sample), as the sketch below illustrates.
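The logic those shippers implement is small enough to sketch in Python: key the decision on the request's trace ID so every log line belonging to a kept request survives together, and never drop error-level lines. The 2% rate and the field names are illustrative assumptions, not any shipper's defaults.

# Sketch: source-side log sampling that stays consistent with the trace sampler.
import hashlib

def keep_log_line(trace_id: str, level: str = "INFO", rate: float = 0.02,
                  always_keep=("ERROR", "FATAL")) -> bool:
    """Deterministic, trace-consistent sampling: all lines of a kept request are kept,
    and error-level lines always survive regardless of the rate."""
    if level in always_keep:
        return True
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest()[:16], 16) / 2**64
    return bucket < rate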
The third level is every kernel event gets observed. Every syscall, every disk I/O, every TCP segment, every L7 packet — the full eBPF-instrumentable surface. At roughly 1 million events/sec/node × 1,000 nodes × 80 bytes/event, that is on the order of 7 PB/day before any aggregation. This is where Cilium Tetragon, Pixie, and Hubble live, and even those tools sample aggressively (typically aggregate-and-drop at the kernel via BPF maps before userspace ever sees the per-event data) because shipping every event is impossible. The "lossless eBPF" idea is structurally a non-starter; the lossless layer is the aggregate, not the events.
The fourth level is every internal function gets profiled. Continuous profiling at 100Hz per pod across the fleet — the pyroscope/parca workload. At 1000 pods × 100 stack samples/sec × ~200 bytes/sample (after symbolisation but before pprof gzip) = ~1.7 TB/day, falling to ~400 GB/day after pprof's column compression. This is the cheapest of the five levels because the data is already aggregated by the profiler — flame traces, not events — and is what makes "always-on continuous profiling" the only "lossless" pillar that ships in 2026 production.
The fifth level is every business event gets stored. Every UPI transaction, every order, every login. This is data, not telemetry — it lives in Postgres, Kafka, and the warehouse — and conflating it with observability is a category error. The volume is similar (1-2 TB/day for a 50K-RPS payments fleet), but the workload is durable transaction storage, not time-stamped event observation. When the team's CTO says "we need to log everything", they often mean (5) and the SRE platform team hears (1) — the conversation produces an architecture that is wrong for both.
The honest stance is that no production fleet collects all five, and the budget conversation is which subset to keep at what fidelity. The arithmetic table above is for level (1) alone; multiply by 2-3× for the realistic combination of (1)+(2)+(4) most teams actually run.
What you give up, named honestly
The discomfort with sampling is not irrational — it comes from the lived experience of needing a specific trace and discovering it was dropped. Naming the failure modes up front is more honest than promising they will not happen.
The user-specific bug you cannot reproduce. A customer reports a UPI failure at 14:32 IST. Their trace_id is in the response error envelope. You search Tempo: the trace was head-sampled out at the SDK level, and there are no spans to read. Three hours of investigation produces nothing because the evidence does not exist. The remedy is VIP carve-outs — propagate a priority=vip baggage attribute for accounts above a certain transaction value, and have the head sampler always-keep traces with that attribute. Hotstar runs this for the top 0.1% of accounts; Razorpay for any merchant above ₹50 crore monthly volume. The carve-out costs nothing in volume terms (a tiny fraction of the fleet) and fixes the worst-case debug experience.
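A sketch of that carve-out as a custom OpenTelemetry SDK sampler, assuming the edge sets a priority baggage entry on qualifying accounts. The baggage key, the vip value, and the 1% fallback rate are illustrative; a real deployment would wrap this in ParentBased so child spans follow the root's decision.

# Sketch: always keep VIP traffic, head-sample the rest.
from opentelemetry import baggage
from opentelemetry.sdk.trace.sampling import (
    Decision, Sampler, SamplingResult, TraceIdRatioBased,
)

class VipAwareSampler(Sampler):
    """Keep every trace whose baggage says priority=vip; otherwise defer to a ratio sampler."""

    def __init__(self, fallback_rate: float = 0.01):
        self._fallback = TraceIdRatioBased(fallback_rate)

    def should_sample(self, parent_context, trace_id, name,
                      kind=None, attributes=None, links=None, trace_state=None):
        if baggage.get_baggage("priority", parent_context) == "vip":
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes, trace_state)
        return self._fallback.should_sample(parent_context, trace_id, name,
                                            kind, attributes, links, trace_state)

    def get_description(self) -> str:
        return "VipAwareSampler{vip=always, other=ratio}"

# Wiring (sketch): TracerProvider(sampler=ParentBased(root=VipAwareSampler(0.01)))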
The slow-degradation pattern that is invisible to the alert. A service drifts from p99 = 180ms to p99 = 320ms over six weeks. The alert never fires because the SLO is 500ms. The traces that would show the drift were sampled at 1%, so a low-traffic endpoint leaves well under one kept trace per minute in your trace store — too sparse to resolve a drift smaller than roughly 50% of the baseline. The remedy is stratified sampling on latency — keep 10% of traces above the 95th percentile, 1% below — so the upper-tail distribution is dense enough to do drift detection on. Honeycomb's Refinery and Datadog's "slow trace" sampler both implement this; the OTel Collector's tail_sampling processor supports it via a latency policy.
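The stratification itself is a few lines of policy. The sketch below is a standalone decision function rather than any SDK's API; the two rates come from the paragraph and the rolling p95 is assumed to arrive from elsewhere (a metrics query, a quantile sketch). Note that a latency-conditioned decision can only be made after the span finishes, which is why this logic belongs at the tail, not in the SDK.

# Sketch: latency-stratified keep/drop, decided once the span's duration is known.
import hashlib

def stratified_keep(trace_id: str, duration_ms: float, p95_ms: float,
                    slow_rate: float = 0.10, fast_rate: float = 0.01) -> bool:
    """Keep 10% of traces slower than the rolling p95 and 1% of the rest,
    keyed on trace_id so every collector replica makes the same call."""
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest()[:16], 16) / 2**64
    return bucket < (slow_rate if duration_ms >= p95_ms else fast_rate)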
The cross-service correlation you cannot reconstruct. Two services, A and B, both head-sample at 1% with parent-based propagation. A request goes A → B → A (a callback). Service A's first call kept the trace; service B's child decisions inherit "kept" via the trace flag; service A's callback also inherits and is kept. So far so good. But a parallel request goes through a path A → B that B drops because it bypasses A's sampler entirely (a Kafka-driven async path). The result is half-traces in Tempo where the "A side" is whole and the "B side" is missing. The remedy is shared sampler config across the fleet — every service reads the rate from a central control plane (an etcd, a Consul KV) and updates its sampler at the same instant. Without this, the kept-set is a mosaic of partial traces, and on-call has no way to know which holes are real failures and which are sampling artifacts.
The new-service blind spot. A team deploys a new microservice. The default sampler is 1%. For the first three days, traffic is 50 RPS. That is 0.5 RPS of kept traces — five traces every ten seconds. The team cannot debug their own service because the sampling rate (set globally for the cluster) is wrong for their volume. The remedy is per-service rate floors — every service gets a "minimum 5 traces/sec kept" guarantee, with the sampler's effective rate computed as max(global_rate, floor / qps). Most production OTel deployments support this via parentbased_jaeger_remote (the Jaeger remote-sampling protocol) or a custom sampler that reads per-service config. Skipping it means the smallest services have the worst observability, which is exactly inverted from what they need.
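The floor arithmetic, as the function a per-service rate-config service could serve; the five-traces-per-second floor is this paragraph's example value.

# Sketch: per-service effective sampling rate with a minimum kept-traces floor.
def effective_rate(global_rate: float, service_qps: float,
                   floor_traces_per_sec: float = 5.0) -> float:
    """Never sample below the global rate, and never so low that the service
    keeps fewer than floor_traces_per_sec traces."""
    if service_qps <= 0:
        return global_rate
    return min(1.0, max(global_rate, floor_traces_per_sec / service_qps))

print(effective_rate(0.01, 30_000))  # busy service: stays at the global 1%
print(effective_rate(0.01, 50))      # new 50-RPS service: boosted to 10% to meet the floor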
The post-deploy regression you cannot diff. A canary deploy at 19:00 IST changes the p99 latency of payments-api from 180ms to 240ms. The pre-deploy baseline is in Tempo at 1% head-sampling, so over a 30-minute window the diff has 1% × 30,000 RPS × 1,800 s ≈ 540,000 kept traces pre-deploy and a similar number post-deploy. Comparing the two distributions with a KS test or a quantile-bucket comparison is statistically tractable. But if the regression is confined to a thin slice of traffic — say the ~2 RPS that hits one specific upstream dependency with one specific payload — the kept sample for that slice is roughly 2 × 0.01 × 1,800 ≈ 36 traces pre-deploy and about as many post-deploy, too sparse to call a difference statistically significant. The remedy is deploy-aware sampling boost — for the 30 minutes after every deploy, raise the keep-rate to 10% so the diff has enough density to read. Most CI/CD platforms (Spinnaker, Argo Rollouts) can hit a webhook on deploy that updates the SDK rate-config; few teams wire this up because it feels like operational overhead, but the postmortem cost of the alternative is consistently higher.
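A sketch of the boost as the pure function a deploy webhook handler would consult before pushing a new rate to the SDKs; the 30-minute window and the 10% boosted rate are this paragraph's numbers, and the source of the deploy timestamp is an assumption.

# Sketch: deploy-aware sampling boost.
import time

def current_rate(base_rate: float, last_deploy_ts: float,
                 boost_rate: float = 0.10, boost_window_sec: int = 1800) -> float:
    """Return the rate to distribute to SDKs: boosted for 30 minutes after a deploy."""
    return boost_rate if (time.time() - last_deploy_ts) < boost_window_sec else base_rate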
Common confusions
- "Sampling means we lose data." Sampling means you lose most data. The remaining sample is statistically interpretable if the design is honest about its bias — a head sampler at 1% gives you a representative slice, a tail sampler at 1% gives you a deliberately-biased slice that over-represents errors. The remedy is to know which one you are running.
- "Storage is cheap, just keep everything." Storage is cheap per byte; telemetry is expensive per request. At 50K RPS the multiplier is 50,000 — 1KB of "extra" telemetry per request becomes 50 MB/sec, becomes 4 TB/day. Cheap-per-byte and expensive-in-aggregate are simultaneously true.
- "Sampling is the same as aggregation." Aggregation pre-computes summaries (a histogram counts requests into buckets in process), sampling drops raw events. Aggregation preserves population-level statistics by construction; sampling preserves only the kept subset. Metrics are aggregated, traces and logs are sampled — the distinction is what makes the three pillars cost different amounts.
- "Tail-based sampling means we keep everything important." Tail sampling keeps everything the policy named as important at decision time. A new failure mode that the policy did not anticipate (a slow downstream that returns 200 OK but takes 8 seconds) is still dropped if the policy keeps only
status=error. The remedy is to evolve the policy when failure modes evolve. - "Metrics give us 100% coverage, so we don't need traces." Metrics give you 100% coverage of the aggregations you defined. The aggregation that did not include the right label cannot be queried back into existence — you can compute "p99 latency by service" only if
servicewas already a label, and adding it later does not retroactively populate the data. Traces complement metrics because a trace carries every attribute queryable after the fact. - "We will turn sampling off if there is an incident." Hot-changing the sampler during an incident takes effect forward-only — traces dropped before the policy change are gone forever. The right design is to keep enough traces always (via tail sampling on errors and slow) so the on-call has a baseline; not to plan on flipping a knob mid-incident. Storing one day at 100% answers a different question from storing 100 days at 1%, and tiered storage often needs both.
Going deeper
The information-theoretic bound on sampling
There is a deep result from the streaming-algorithms literature that bears on observability: any unbiased estimator of a population statistic from a stream must store on the order of O(1/ε² × log(1/δ)) items to achieve error ε with confidence 1−δ. For a 1% error bar at 95% confidence, this is on the order of 10⁴ items (the exact constant depends on which concentration inequality you apply). For population statistics — fleet-wide p99, error rate, request rate — uniform sampling at any rate that keeps more than roughly ten thousand traces per hour is statistically sufficient. This is why head sampling at 0.1% for capacity planning works: it preserves the population mean and variance with bounded error, and you cannot do better with more storage. The trap is that "population statistics" is not what an on-call engineer needs at 02:11 IST — they need the specific trace for the specific failing request, and the bound for "find the rare event with high probability" is O(N/k) where k is the number of rare events, which grows without bound as the event becomes rarer. Sampling for analytics and sampling for debugging are governed by different bounds, which is why one-rate-fits-all is structurally wrong for any fleet that has both workloads.
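One concrete instantiation of the two bounds, assuming Hoeffding's inequality for the population statistic and uniform sampling for the rare-event search; the constants differ a little from the rounded figures above because the exact inequality matters.

# Sketch: sample sizes for the two different jobs sampling is asked to do.
import math

def samples_for_population_stat(epsilon: float, delta: float) -> int:
    """Hoeffding bound: samples to estimate a [0,1]-bounded statistic within ±epsilon
    at confidence 1 - delta."""
    return math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))

def samples_to_find_rare_event(total: int, rare: int, confidence: float = 0.95) -> int:
    """Uniform samples needed to catch at least one of `rare` events among `total`."""
    p = rare / total
    return math.ceil(math.log(1 - confidence) / math.log(1 - p))

print(samples_for_population_stat(0.01, 0.05))      # ≈ 18,445: analytics needs a small, fixed sample
print(samples_to_find_rare_event(1_800_000, 180))   # ≈ 30,000 for a 0.01% event...
print(samples_to_find_rare_event(1_800_000, 18))    # ≈ 300,000 for a 0.001% event: debugging scales with rarity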
The observer effect — instrumentation that changes what it observes
Every span the SDK emits costs CPU on the producing service. At a rate above ~5,000 spans/sec/pod, the SDK's instrumentation overhead becomes measurable in p99 latency — the act of observing slows the request being observed. OpenTelemetry's batch processor amortises this with periodic flushes, but the per-span cost is still 5-15 µs in Python (more in tracing-heavy services where every database call generates a child span). For a 100ms request, ten spans is 50-150 µs — negligible. For a 5ms request, ten spans is 1-3% overhead, which shows up on the p99 chart as a systematic increase. Hotstar's edge-CDN tracing, where requests serve from cache in <2ms, dropped to head-sampling 0.05% specifically to keep the per-span cost below the noise floor. The lesson: instrumentation has a cost that scales with span density, and the right sampling rate sometimes is set by the latency budget of the service, not by the storage budget of the platform.
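The break-even is easy to estimate. The sketch below reproduces this paragraph's arithmetic, taking 10 µs as an assumed midpoint of the 5-15 µs per-span range.

# Sketch: what fraction of a request's latency is spent on its own instrumentation.
def instrumentation_overhead_pct(spans_per_request: int, per_span_us: float,
                                 request_ms: float) -> float:
    return 100 * (spans_per_request * per_span_us) / (request_ms * 1000)

print(instrumentation_overhead_pct(10, 10, 100))  # 0.1%: invisible on a 100 ms payment call
print(instrumentation_overhead_pct(10, 10, 5))    # 2.0%: visible on a 5 ms cache hit
print(instrumentation_overhead_pct(10, 10, 2))    # 5.0%: why sub-2 ms edge paths sample near 0.05%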
The "telemetry is data" anti-pattern
Most data warehouses (Snowflake, BigQuery, Databricks) treat data as an asset to be archived, indexed, and queried at leisure. Telemetry looks superficially similar — both are time-stamped events with attributes — and many teams reach for a warehouse to "solve observability" by storing every span in BigQuery. The architecture works for two months and then breaks at one of three places: ingestion costs (at roughly $0.04 per million rows of streaming inserts, 5 billion spans/day is about $200/day, around ₹5 lakh per month, before storage), query latency (a TraceQL-equivalent query against 90 days of spans takes 30+ seconds even with partition pruning, which is a failure during incident response), or schema drift (every new attribute added to a span breaks the warehouse table or forces nullable columns that bloat the storage). Tempo, Loki, and Prometheus are not "smaller warehouses" — they are purpose-built systems for the workload of incident-time debugging plus statistical aggregation, and the workload's shape (sparse reads, dense writes, append-only, fixed schema per pillar) is incompatible with the analytics-first design of warehouses. Teams that internalise this stop trying to retrofit BigQuery and pick the right tool per pillar.
The "free metrics" mirage and where it ends
The cost table above shows metrics at a flat 0.25 TB/day regardless of trace sampling rate, and the temptation is to treat metrics as the always-affordable pillar. The temptation breaks the moment a developer adds a high-cardinality label. Suppose someone adds customer_id to the request counter, with 50 million unique customers. The single counter http_requests_total explodes from ~50 series (5 services × 10 status codes) to 5 × 10 × 50,000,000 = 2.5 billion series. Prometheus's per-series cost is roughly 3 KB of resident memory plus ~1-2 bytes/sample on disk after Gorilla XOR compression; 2.5 billion series × 3 KB ≈ 7.5 TB of process memory before the OOM-killer fires, or on the order of ₹1.5 crore/month if you somehow tried to provision a cluster to hold it. The metrics line in the table stays flat against trace sampling, but it is wide open to label cardinality — and the cardinality budget is what Part 6 of this curriculum walks through. Why metrics scale with cardinality and not with rate: a counter increment is O(1) per request — find the existing time series in a hash map, increment the int64 — but the size of the hash map is the cardinality. Adding ten more requests is free; adding ten more label-value combinations allocates ten more series, each of which costs memory plus disk plus query-time plumbing. The two scaling laws are what makes the three pillars feel different — and what makes "we will fix it with metrics" the wrong reflex when traces become expensive.
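The explosion is worth computing rather than gesturing at. A minimal sketch using the rough per-series resident cost quoted above; the label sets are illustrative.

# Sketch: active series count and resident memory before and after a high-cardinality label.
def series_count(label_cardinalities) -> int:
    """Active series for one metric is the product of its label cardinalities."""
    n = 1
    for c in label_cardinalities:
        n *= c
    return n

RESIDENT_BYTES_PER_SERIES = 3_000          # rough Prometheus head-block cost per active series

before = series_count([5, 10])             # service x status_code
after = series_count([5, 10, 50_000_000])  # ...plus customer_id
print(f"{before:,} series -> {after:,} series")
print(f"resident memory: {after * RESIDENT_BYTES_PER_SERIES / 1e12:.1f} TB")  # ≈ 7.5 TB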
The reproducibility footer
# Reproduce the cost-of-everything calibration
python3 -m venv .venv && source .venv/bin/activate
pip install pandas numpy
python3 why_you_cant_collect_everything.py
# Expected (~30s): the five-row table comparing kept_traces_pct, span/log/metric
# bytes per day, total TB per day, and storage cost in lakhs per month for
# everything, head-10%, head-1%, tail, and head-0.1%. Vary RPS and
# SPANS_PER_TRACE to model your own fleet's shape, and add new strategies
# (per-service rate floors, latency-stratified sampling) to see how the
# cost-versus-coverage curve shifts.
Where this leads next
Part 5 walks through the three sampler families one at a time, then closes with the wall chapter that consolidates the trade-offs. Each chapter starts from the constraint this opener established and refines it. The thread tying them together is: sampling is the architecture; the rate is the parameter. Pick the architecture from the constraint stack, set the rate from the budget, and instrument the failure modes the architecture introduces.
- Head sampling and its bias — the cheapest, most-deployed family; learn the bias before you ship it.
- Tail-based sampling (OTel Collector) — the stateful answer that gives 100% error retention at the cost of a 30-second buffer.
- Adaptive sampling — the rate-modulator that survives traffic spikes by trading representativeness during the spike.
- Trace sampling: head, tail, adaptive — the comparison chapter that maps each design onto the four-axis tradeoff.
- Cardinality: the master variable — the metrics-side dual of sampling; the same "you cannot have everything" pattern applied to label design.
The closing thought for this opener: every observability decision in Part 5 will look like a technical choice (rate, window, policy, processor) and will actually be a budget conversation in disguise. The senior engineer who reads this part learns to translate one into the other fluently. The on-call who reads this part learns to recognise when their sampler is the reason the trace they need is missing — which is the moment "I should learn how the sampler works" stops being optional.
There is one more reframing that pays off in the chapters ahead. The word sampling sounds like statistical noise reduction — like taking a poll of 1,200 voters to estimate a 100-million-voter election. That intuition is half-right and half-misleading. The half that is right: a uniform 1% sample of OK traffic is a poll, with the same statistical guarantees and the same KS-distance bound that polling has. The half that is misleading: tail sampling is not a poll — it is a deliberately stratified census of the failure tail plus a 1% sample of the success body. The two designs share a name and almost nothing else, and reading "we use sampling" without asking "which kind?" is the most common observability conversation breakdown. Carry the question into Part 5 and the rate-and-policy chapters become a guided tour rather than a vocabulary list. Carry it into Part 6 and you will recognise the same pattern — deliberately biased aggregation versus uniform reduction — playing out one layer above, this time on labels instead of traces.
A practical sequencing for any team reading this part: read every chapter, then run the calibration script in the deeper-dive on your own fleet's last 7 days of trace export. Do this before you change anything — the goal is to know where you sit on the four-axis frontier today, not to optimise blindly. Most teams discover their effective error retention is 60-80% (not the 99% the architecture promised) once they measure it; a few discover it is 30%. The number is not embarrassing — it is the starting point. Without it, the next architecture change is a guess; with it, the change has a measured before-and-after. Carry that habit forward into every part of the curriculum, and the engineering becomes less anxious and more legible.
References
- Cindy Sridharan, Distributed Systems Observability (O'Reilly, 2018), Ch. 4 — "The Three Pillars Are Not Enough" — the foundational treatment of "all observability is a budget"; the framing this chapter expands on.
- Charity Majors, Liz Fong-Jones, George Miranda, Observability Engineering (O'Reilly, 2022), Ch. 19 — "Sampling: A Necessary Evil" — the modern-era argument for why sampling is structural, not optional.
- Sigelman et al., Dapper: A Large-Scale Distributed Systems Tracing Infrastructure (Google, 2010) — the original paper that argued head sampling at 0.1-1% is sufficient for capacity planning workloads; the bound this chapter's "information-theoretic" deeper dive cites.
- Honeycomb — "Why We Built Refinery" — the production-grade tail sampler whose existence is the story of Honeycomb pivoting from lossless ingestion to producer/collector-side sampling.
- Ben Sigelman — "The Three Pillars With Zero Answers" — the polemic that argues sampling is the most-underserved part of the observability stack; reads as the manifesto for why Part 5 exists.
- Liz Fong-Jones — "How Honeycomb Cut Its Bill With Refinery" — the engineering writeup that quantifies what tail sampling saved a tier-1 production fleet, with real numbers in dollars-per-month.
- Trace sampling: head, tail, adaptive — the comparison chapter that maps each design onto the four-axis tradeoff this opener motivates.
- Wall: sampling is where the hard tradeoffs live — the closing chapter of Part 4 that this opener echoes; the "wall" is the why, this chapter is the why-before-the-why.
- Cardinality: the master variable — the metrics-side dual; the same "you cannot have everything" reframing applied to label design, with a similar structural-impossibility shape.