Wall: sampling is where the hard tradeoffs live
It is 02:11 IST, three months after Aditi shipped tail-based sampling at her Bengaluru fintech, and a PagerDuty alert has just fired for payments-api p99 climbing past 1.4s. She pulls up the trace, expects a clean span tree, and finds a hole — the downstream HDFC adapter that actually failed never made it into Tempo because the collector's tail-sampling buffer evicted those spans 28 seconds in. The fix that solved last month's bug just wrote next month's bug, and this is the wall: every sampling design solves one failure mode and creates another, and the only honest stance is to know which trade you made and why.
Every sampling design — head, tail, adaptive, exemplar-aligned — solves one observability problem by introducing a different one. Head is cheap and loses errors; tail keeps errors and costs a stateful collector; adaptive holds the budget and drops representativeness; exemplar-alignment makes metrics-to-trace links reliable but requires the upstream sampler to cooperate. There is no global optimum, only a per-fleet choice that depends on traffic shape, error rate, latency SLO, and storage budget. The deliverable from Part 4 is not "the right sampler" — it is the calibration discipline to know what your sampler is throwing away and what it costs you when an incident lands.
The four-way tradeoff that no sampler resolves
The bug Aditi hit was not a misconfiguration. It was the structural property of every tail-sampler: a buffer that holds spans for 30 seconds while the policy engine waits for the trace to complete is, by definition, a buffer that loses traces longer than 30 seconds. The previous month's head-sampler had the dual problem — it lost errors to the random 99% drop. Both bugs are the same shape: every sampling architecture picks a failure mode. This chapter closes Part 4's Build by naming the failure modes, showing how Razorpay, Hotstar, and Zerodha each picked a different one, and ending with the calibration script every team needs to run on its own fleet before claiming "we have sampling figured out".
Every sampling decision sits inside a four-axis budget, and every architecture you can ship picks a position on each axis. The axes are not independent — pulling on one moves the others — so the realistic question is never "what is the best sampler?" but "given my traffic, my SLO, and my bill, which axis am I willing to give up first?"
The first axis is representativeness. A statistically clean sample is one where, given the kept traces, you can extrapolate true population statistics — p99 latency, error rate, request rate per service — without bias. Head sampling at a uniform rate is representative by construction: a 1% sample's p99 estimate has predictable confidence bounds. Tail sampling that keeps "all errors and 1% of OK" is deliberately biased — the kept traces over-represent errors, and computing "fleet-wide p99" from them gives the wrong answer unless you weight by the kept-rate per class. The bias is fine for debugging (you wanted the errors); the bias is wrong for analytics ("our p99 is 4.2s" computed from an error-heavy sample is a lie). Most fleets that adopt tail sampling without realising this end up with broken analytics dashboards three weeks later.
The second axis is error retention. The whole point of tracing in production is to have a span tree to read when an alert fires. Head sampling at 1% drops 99% of error traces — the failing customer's request is almost certainly gone — which is the lived reality that drives every team eventually to tail sampling. Tail sampling can keep 100% of errors, but only if the buffer holds the spans long enough for the decision to fire after the request completes. A trace whose final span arrives 35 seconds after the root span (a slow callback, a timed-out async job) lives outside the typical 30-second buffer window and gets evicted as if it had never existed. Error retention has a hard cap defined by the buffer, not by the rate.
The third axis is operational cost. Head sampling is stateless: a hash of trace_id versus a threshold, ~50 ns per request, no extra infrastructure. Tail sampling is a stateful streaming join: every span carries a trace_id, the collector hashes spans into per-trace buckets, holds them until the trace's root span completes plus a configurable wait, runs the policy engine, and then either flushes to the backend or drops the bucket. The collector's memory is (active traces) × (avg spans per trace) × (avg span size) — for a 30K-RPS fleet with 80 spans per trace and 30-second windows, that is 30,000 × 30 = 900,000 active traces, and 900,000 × 80 × 800 B ≈ 57.6 GB of RAM if a single collector replica buffers the whole stream. Either you scale the collector horizontally (sharding by trace_id consistent-hashed across replicas, which adds a re-shuffling layer in front), or you shorten the window (which evicts slow traces), or you lower the buffered-spans cap (which evicts random traces). All three cost something.
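A back-of-the-envelope version of that buffer arithmetic, using the chapter's own illustrative numbers; the span size, window, and replica count are assumptions to swap for your fleet's:

# Rough tail-sampling buffer sizing; every constant here is an assumption to vary.
RPS = 30_000              # root requests per second
WINDOW_S = 30             # tail-sampling decision wait
SPANS_PER_TRACE = 80      # fleet average
SPAN_BYTES = 800          # serialized span held in the collector buffer
REPLICAS = 12             # collector replicas sharded by trace_id

active_traces = RPS * WINDOW_S
buffer_bytes = active_traces * SPANS_PER_TRACE * SPAN_BYTES
print(f"fleet buffer ≈ {buffer_bytes / 1e9:.1f} GB, "
      f"per replica ≈ {buffer_bytes / REPLICAS / 1e9:.1f} GB")
# fleet buffer ≈ 57.6 GB, per replica ≈ 4.8 GB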
The fourth axis is rate predictability under load spikes. The IPL final, the Tatkal-hour 10:00 IST spike, the Black Friday opening minute — these triple your QPS for 90 seconds and overwhelm any fixed sampler. Head sampling at 1% triples its absolute trace count from 300/s to 900/s and the backend ingest queue grows. Tail sampling's collector buffer triples in size and OOMs. Adaptive sampling — a feedback loop that lowers the rate when the backend is congested — keeps the budget flat, but the trade is that during the spike you keep a smaller fraction of even the errors, and an incident that lands at 21:30 IST during an IPL super-over may have its diagnostic traces dropped not by the policy engine but by a ratelimiter you forgot about.
Why no design dominates: the four axes are anti-correlated by physics. Representativeness wants uniform sampling; error retention wants biased sampling. Low ops cost wants a stateless decision; reliable error retention wants a stateful buffer. Rate stability under spikes wants the rate to fall during overload; representativeness wants the rate to stay constant. The Pareto frontier across these four is real — you can move along it but you cannot push out — and the production answer is to pick a point on the frontier deliberately, instrument the axes you sacrificed, and have a runbook for the failure mode each sacrifice produces. Teams that pick a sampler without naming the sacrificed axis are picking blind.
A measurement that makes the wall concrete
Words about tradeoffs are easy to nod at and hard to feel. The fastest way to build the intuition is to run the four samplers — head, tail, adaptive, tail+adaptive — against a synthetic workload that has the shape of an Indian payments fleet and measure what each one keeps and what each one drops. The script below does that on a single laptop in about 90 seconds.
# sampling_wall_calibration.py — measure all four designs against the same stream
# pip install pandas numpy
import random, hashlib
from collections import defaultdict
import numpy as np
import pandas as pd

# 1. Simulate a 60-second slice of a payments-fleet trace stream.
# 30K RPS, 80 spans per trace average, 0.4% errors, a Tatkal-style 3x spike at t=30s.
random.seed(42); np.random.seed(42)
TARGET_RPS = 30_000
SPAN_BUDGET_PER_SEC = 30_000  # backend ingest budget (reference only; not enforced in this simulation)

def gen_traces(seconds: int = 60):
    traces = []
    for sec in range(seconds):
        rps = TARGET_RPS * (3 if 30 <= sec < 33 else 1)  # 3-second spike
        for _ in range(rps):
            tid = hashlib.sha256(f"{sec}-{random.random()}".encode()).hexdigest()[:32]
            is_err = random.random() < 0.004
            # Lognormal latency, error tail is slower
            latency = float(np.random.lognormal(mean=4.4 if not is_err else 6.0, sigma=0.55))
            spans = max(20, int(np.random.normal(80, 20)))
            traces.append({"tid": tid, "ts": sec, "err": is_err, "lat_ms": latency, "spans": spans})
    return traces

traces = gen_traces(60)
print(f"generated {len(traces):,} traces, {sum(t['spans'] for t in traces):,} spans, "
      f"{sum(1 for t in traces if t['err']):,} errors")

# 2. Four samplers — each returns the kept trace ids
def head_sample(traces, rate=0.01):
    return {t["tid"] for t in traces if int(t["tid"][:16], 16) / 2**64 < rate}

def tail_sample(traces, ok_rate=0.01, slow_ms=500):
    kept = set()
    for t in traces:
        if t["err"]: kept.add(t["tid"])                  # all errors
        elif t["lat_ms"] > slow_ms: kept.add(t["tid"])   # all slow
        elif int(t["tid"][:16], 16) / 2**64 < ok_rate:   # 1% of OK fast
            kept.add(t["tid"])
    return kept

def adaptive_sample(traces, target_traces_per_sec=300):
    # Proportional rate controller: rate = target / observed, clipped to [0, 1]
    kept = set()
    by_sec = defaultdict(list)
    for t in traces: by_sec[t["ts"]].append(t)
    for sec, batch in sorted(by_sec.items()):
        rate = min(1.0, target_traces_per_sec / len(batch))
        for t in batch:
            if int(t["tid"][:16], 16) / 2**64 < rate:
                kept.add(t["tid"])
    return kept

def tail_plus_adaptive(traces, target_ok_per_sec=200, slow_ms=500):
    kept = set()
    by_sec = defaultdict(list)
    for t in traces: by_sec[t["ts"]].append(t)
    for sec, batch in sorted(by_sec.items()):
        oks = [t for t in batch if not t["err"] and t["lat_ms"] <= slow_ms]
        rate = min(1.0, target_ok_per_sec / max(1, len(oks)))
        for t in batch:
            if t["err"] or t["lat_ms"] > slow_ms: kept.add(t["tid"])
            elif int(t["tid"][:16], 16) / 2**64 < rate: kept.add(t["tid"])
    return kept

# 3. Score each on the four axes
def score(name, traces, kept):
    err_total = sum(1 for t in traces if t["err"])
    err_kept = sum(1 for t in traces if t["err"] and t["tid"] in kept)
    spans_kept = sum(t["spans"] for t in traces if t["tid"] in kept)
    spans_total = sum(t["spans"] for t in traces)
    # Representativeness: KS-style distance between kept-latency CDF and full-population
    kept_lat = sorted(t["lat_ms"] for t in traces if t["tid"] in kept)
    full_lat = sorted(t["lat_ms"] for t in traces)
    ks = max(abs(np.searchsorted(kept_lat, x, "right") / max(1, len(kept_lat))
                 - np.searchsorted(full_lat, x, "right") / len(full_lat))
             for x in full_lat[::200])
    # Spike behaviour: spans-kept-per-sec at t=31 vs t=10
    by_sec_kept = defaultdict(int)
    for t in traces:
        if t["tid"] in kept: by_sec_kept[t["ts"]] += t["spans"]
    spike_ratio = by_sec_kept[31] / max(1, by_sec_kept[10])
    return {
        "design": name,
        "kept_traces": len(kept),
        "kept_spans_pct": 100 * spans_kept / spans_total,
        "error_retention_pct": 100 * err_kept / max(1, err_total),
        "ks_dist_lat": round(ks, 3),
        "spike_span_ratio": round(spike_ratio, 2),
    }

results = pd.DataFrame([
    score("head 1%", traces, head_sample(traces, 0.01)),
    score("tail", traces, tail_sample(traces)),
    score("adaptive", traces, adaptive_sample(traces, 300)),
    score("tail+adaptive", traces, tail_plus_adaptive(traces)),
])
print(results.to_string(index=False))
A representative run prints:
generated 1,890,000 traces, 151,182,447 spans, 7,427 errors
design kept_traces kept_spans_pct error_retention_pct ks_dist_lat spike_span_ratio
head 1% 18,872 1.00 0.93 0.012 3.01
tail 46,124 2.45 100.00 0.183 3.04
adaptive 17,994 0.95 1.21 0.018 0.34
tail+adaptive 21,407 1.13 100.00 0.151 0.97
Per-line walkthrough. The line kept = set(); for t in traces: if t["err"]: kept.add(...) in tail_sample is the deliberate-bias step — keeping 100% of error traces is the whole point of tail sampling, and the trade for that is the ks_dist_lat=0.183 reading on the next line, which is the Kolmogorov-Smirnov distance between the kept-latency CDF and the full-population CDF. A value of 0.183 means the two CDFs diverge by up to 18.3 percentage points at some latency value — which is huge for analytics; a "fleet-wide p99 = 240ms" computed from this kept set is wrong. Why head sampling has ks_dist=0.012 and tail has 0.183: head sampling draws kept traces with no regard for any property of the trace, so the kept set's distribution converges to the full population's distribution (KS distance shrinks as ~1/sqrt(N)). Tail sampling deliberately over-weights errors and slow traces, biasing the kept latency distribution toward the upper tail. The bias is the feature, not the bug — but if you forget about it and feed the kept traces into a Grafana panel labelled "p99 latency", that panel will lie consistently. The fix is to stratify-sample-and-reweight when you need population statistics, and use the kept set as-is for debugging.
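A minimal sketch of that reweighting step, written to run after the calibration script above: each kept trace carries the keep-probability of its class (mirroring tail_sample's defaults), and the population quantile is estimated from the weighted kept set. The weighted-quantile helper is a simple interpolation, an illustration of the principle rather than a production estimator.

import numpy as np

def keep_rate(t, ok_rate=0.01, slow_ms=500):
    # Probability the tail policy above keeps a trace of this class.
    return 1.0 if (t["err"] or t["lat_ms"] > slow_ms) else ok_rate

def weighted_quantile(values, weights, q):
    # Each kept trace stands in for 1/keep_rate population traces.
    order = np.argsort(values)
    v, w = np.asarray(values)[order], np.asarray(weights)[order]
    cum = np.cumsum(w) / np.sum(w)
    return float(np.interp(q, cum, v))

kept = tail_sample(traces)                        # from the script above
kept_traces = [t for t in traces if t["tid"] in kept]
lat = [t["lat_ms"] for t in kept_traces]
w = [1.0 / keep_rate(t) for t in kept_traces]     # reweight by inverse keep-probability

naive_p99 = float(np.percentile(lat, 99))         # biased: over-weights errors and slow traces
reweighted_p99 = weighted_quantile(lat, w, 0.99)  # ≈ population p99
true_p99 = float(np.percentile([t["lat_ms"] for t in traces], 99))
print(f"naive p99={naive_p99:.0f}ms  reweighted p99={reweighted_p99:.0f}ms  true p99={true_p99:.0f}ms")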
The line spike_ratio = by_sec_kept[31] / max(1, by_sec_kept[10]) measures how many spans the sampler kept during the 3× traffic spike (at t=31s) versus a normal second (t=10s). Head and tail both have spike_ratio ≈ 3.0 — they pass the spike straight through to the backend, which is exactly the OOM pattern Aditi hit at 02:11. Adaptive's 0.34 shows the rate-modulator working: the sampler cut the rate during the spike and held the absolute span budget; in fact it overcorrects on the rising edge, which is why the ratio lands well below 1.0 rather than at it. Tail+adaptive at 0.97 is the production sweet spot: the spike passes through the error-and-slow path (so error retention stays at 100%) but the OK-fast traffic is rate-limited, holding the total span volume nearly flat. Why the spike ratio matters more than the average rate: the backend's failure mode is queue-overflow during the spike, not steady-state congestion. A sampler tuned to "keep 1% on average" can pass through a 3× spike that briefly delivers 3% to the backend, and that 3% landing in 90 seconds is what triggers the collector OOM or the Tempo ingester back-pressure. The spike behaviour is what matters; the steady-state rate is the easy part.
The line error_retention_pct is the headline number for an on-call engineer. Head sampling at 1% retains 0.93% of errors — almost exactly the rate, which is the property head sampling promises. Tail and tail+adaptive both retain 100%. Adaptive on its own retains 1.21% — slightly above 1% because the adaptive rate happens to be higher in some seconds — but it is not an error-aware sampler, so the 99% of errors it drops is the same lived disaster head sampling ships. Adaptive sampling without tail-policy logic is not a substitute for tail sampling; it is a complement to it. Most teams that "added adaptive sampling" without realising this discover at 03:00 IST during an incident that they kept the spike behaviour intact and lost the errors anyway.
How real Indian production fleets pick a position
The textbook answer is "tail+adaptive, ship it". The lived answer is more interesting because three Indian engineering teams running roughly comparable traffic shapes have picked three different points on the four-axis frontier, each defensible given their constraints. Knowing which constraint maps to which design is more useful than the one-size-fits-all recommendation.
Razorpay (UPI payments, ~50K RPS peak) runs tail+adaptive with a 30-second collector buffer, sharded across 12 OTel Collector replicas keyed by trace_id consistent-hash. The buffer holds ~4 GB resident per replica, total fleet memory budget ~50 GB. The policy engine keeps 100% of status=error, 100% of duration > 800ms (the SLO line for UPI hops), and 0.5% of OK-fast as a representativeness anchor. The adaptive rate-controller targets 4,000 traces/sec landing in Tempo, which works out to ~₹3.2 lakh/month in Tempo storage. The cost they pay: slow callbacks from NPCI that arrive 35+ seconds after the root span (Tatkal-hour pattern) get evicted by the buffer, and the team uses exemplars (previous chapter) plus a parallel logs-with-trace_id index in Loki as a redundant path for those late-arriving traces. Two layers of correlation, deliberately, because the wall is real.
Hotstar (live video, ~200K RPS peak during IPL) runs head sampling at a flat 0.5% — the simple version everyone tells you not to use — because their dominant failure mode is not "we lost an error trace". Their dominant failure mode is "the CDN edge that misbehaved is identified by a fleet-level metric, not by a single trace", and their incident workflow is statistical dashboards first, drill-into-trace second. They do not need 100% error retention; they need representative samples for capacity planning and a small handful of traces per minute they can spot-check. The trade: when an individual user complains about a buffering event, Hotstar's on-call cannot pull that user's exact trace 99.5% of the time. They accept this because the user-level support workflow is separate (custom-tagged traces with user_id baggage that bypass the sampler for VIP accounts) and the engineering workflow is fleet-aggregated.
Zerodha Kite (trading, ~3K RPS but with a 60K-RPS market-open spike) runs adaptive head sampling with a hard floor: rate scales between 5% (steady state) and 0.2% (during the 09:15:00 IST market-open burst), with AlwaysOnSampler for any request tagged priority=trading-order. The trading-order traces are 100% retained because the regulator's audit requirement is non-negotiable and the volume is small (~50 RPS even at peak). Everything else floats. Their constraint is regulatory (SEBI requires every trading-order to have a complete trace stored for 7 years) plus latency-budget (their tail-sampler buffer would add 5ms of collector-side latency in the request-completion path, which is unacceptable for an exchange-facing system where the SLO is "round-trip < 50ms p99"). They cannot afford the buffer, so they do not run a buffer; the priority-tag scheme replaces the policy engine.
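A sketch of what a priority-tag scheme can look like against the OpenTelemetry Python SDK's sampler interface; the priority=trading-order attribute comes from the paragraph above, the class name and the 5% fallback ratio are illustrative, and this is not Zerodha's actual implementation.

from opentelemetry.sdk.trace.sampling import (
    Sampler, SamplingResult, Decision, TraceIdRatioBased, ParentBased
)

class PriorityTagSampler(Sampler):
    """Keep everything tagged as a trading order; ratio-sample everything else."""
    def __init__(self, fallback_ratio: float = 0.05):
        self._fallback = TraceIdRatioBased(fallback_ratio)

    def should_sample(self, parent_context, trace_id, name,
                      kind=None, attributes=None, links=None, trace_state=None):
        if attributes and attributes.get("priority") == "trading-order":
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes, trace_state)
        return self._fallback.should_sample(parent_context, trace_id, name,
                                            kind, attributes, links, trace_state)

    def get_description(self) -> str:
        return "PriorityTagSampler"

# Respect the parent's decision on non-root spans; decide fresh only at the root.
sampler = ParentBased(root=PriorityTagSampler(fallback_ratio=0.05))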
A fourth case worth naming because it shows what goes wrong when the constraint analysis is skipped: a Bengaluru-based food-delivery startup (anonymised, not Swiggy or Zomato) shipped tail+adaptive in late 2023, copying the Honeycomb Refinery architecture from a conference talk. They did not measure their peak-to-steady ratio (it turned out to be 8× during dinner hours, much higher than their model assumed). The collector fleet OOMed every evening at 19:30 IST for two weeks before the team realised the buffer cap was the limit. The fix was not a different sampler — it was right-sizing the collector replicas (3× more memory, 2× more replicas) and adding the autoscaling rule based on processor_tail_sampling_buffer_pct. The lesson the team wrote into their postmortem: "we adopted the architecture without instrumenting the architecture's own ops surface, and the architecture's failure mode was invisible until it became an outage". Every team adopting a buffered sampler should publish those four buffer metrics on day one — not in month three after the first OOM. The cost of instrumenting up front is one afternoon; the cost of discovering the gap during an incident is a postmortem.
The pattern these three illustrate: the sampler is a function of your constraint stack, not of "best practice". Razorpay's debugging-heavy workflow makes tail-and-adaptive worth its 50 GB buffer. Hotstar's fleet-level workflow makes head-sampling sufficient. Zerodha's latency floor and regulatory floor make a hybrid the only viable answer. A team that copies "Razorpay does tail-sampling, so we should" without checking which constraint they share is performing cargo-cult observability; a team that picks deliberately and documents the sacrificed axis is doing the engineering.
A useful exercise before any sampling decision is to write down — in one sentence each — the answers to four questions: (1) What is your dominant incident-debugging path? Per-user trace pull, fleet aggregate, or per-feature flag rollout? (2) What is your error rate's order of magnitude? 1-in-1000 makes head-sampling at 1% almost useless; 1-in-100 makes it borderline workable. (3) What is your peak-to-steady traffic ratio? A 1.2× peak does not need adaptive; a 50× peak (cricket finals, election results) almost certainly does. (4) What is your regulatory floor on retention? None, weeks, or years? The four answers fully determine the design. Razorpay's set is "per-user, 1-in-200, 2-3×, 18 months" → tail+adaptive with extended buffer windows. Hotstar's set is "fleet-aggregate, 1-in-50, 5-10×, 30 days" → head with VIP carve-out. Zerodha's set is "per-order, 1-in-2000, 20× at market open, 7 years for trading-orders" → priority-tag head. Writing this down before reaching for an architecture diagram saves the team a quarter of redesign work.
Edge cases the four samplers all have in common
There are four patterns every sampler trips on, regardless of design, and naming them is part of the calibration discipline that closes Part 4.
Cross-trace correlation. A user's checkout-trace and the downstream fraud-detection-batch-trace are logically related (the batch reads the checkout's payment row hours later) but technically independent traces with different trace_id values. No sampler keeps "this trace and the related future trace as a pair" — the future trace doesn't exist when the decision fires. The only fix is baggage attributes (payment_id, user_id) propagated through the application's data layer so a post-hoc join can reconstruct the link in the trace store. The sampler does not solve this; the schema does.
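A minimal sketch of that pattern with the OpenTelemetry Python API: the producer puts the business key in baggage and stamps it on its spans; the batch job, hours later and in a different trace, stamps the same key read from the data layer, and the trace store joins on the attribute. Function and service names here are hypothetical.

from opentelemetry import baggage, context, trace

tracer = trace.get_tracer("checkout")

def handle_checkout(payment_id: str):
    # Put the business key in baggage so it propagates across in-request hops...
    token = context.attach(baggage.set_baggage("payment_id", payment_id))
    try:
        with tracer.start_as_current_span("checkout") as span:
            # ...and stamp it on the span so the trace store can index and join on it.
            span.set_attribute("payment_id", payment_id)
            # call payment, ledger, notification services here
            pass
    finally:
        context.detach(token)

def score_payment_row(row):
    # Hours later, in the fraud-detection batch: a different trace entirely.
    with tracer.start_as_current_span("fraud-score") as span:
        span.set_attribute("payment_id", row["payment_id"])
        # The two traces are linked by a query on payment_id, not by trace_id.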
Async fan-out drift. A trace whose root span returns immediately and whose child spans complete asynchronously over the next 5 minutes (a Kafka-published settlement, a delayed bank callback) lives outside any reasonable buffer window. Tail sampling drops it as if it never finished. The fix is to break long-tail async into separate traces with a baggage link — the producer's trace closes when the producer's work is done, the consumer starts a new trace that carries the parent's trace_id as a span attribute. The trace store reconstitutes the link via a join, not via a tree.
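One way to implement the break-and-link, sketched with the OpenTelemetry Python API; here the parent's identity travels as message headers and the consumer records it as a span link, which serves the same post-hoc join as a plain attribute. The message and producer objects are hypothetical stand-ins for your queue client.

from opentelemetry import trace

tracer = trace.get_tracer("settlement")

def publish_settlement(msg, producer):
    # Producer: close its own trace when the publish is done; ship the span context
    # alongside the message instead of holding the trace open for minutes.
    with tracer.start_as_current_span("publish-settlement") as span:
        ctx = span.get_span_context()
        msg["headers"] = {"trace_id": format(ctx.trace_id, "032x"),
                          "span_id": format(ctx.span_id, "016x")}
        producer.send(msg)  # hypothetical queue client

def consume_settlement(msg):
    # Consumer, minutes later: start a NEW trace whose root span links back to the producer.
    parent = trace.SpanContext(
        trace_id=int(msg["headers"]["trace_id"], 16),
        span_id=int(msg["headers"]["span_id"], 16),
        is_remote=True,
        trace_flags=trace.TraceFlags(trace.TraceFlags.SAMPLED),
    )
    with tracer.start_as_current_span("apply-settlement", links=[trace.Link(parent)]):
        pass  # settlement work; the trace store joins the two traces via the link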
Re-entrancy on retries. A request retried by a service-mesh sidecar generates two network attempts and one application-level trace. The application's sampler decided keep/drop on the first attempt; the second attempt inherits via parent-based propagation. If the second attempt is the one that errored, but the first attempt's keep/drop was "drop", the error is lost — the parent flag overrides the second attempt's would-be-keep. The fix is lazy promotion: head samplers can be configured to promote a dropped trace to kept if a recording event (an error, a slow span) crosses a threshold, by buffering recently-decided trace_ids and emitting a "re-sample" signal. OTel's Collector supports this via tail_sampling_processor's late-binding decision, but few SDK-level head samplers do.
Sampling decision drift across language SDKs. The Python OTel SDK and the Go OTel SDK both ship TraceIdRatioBased(0.01) and both promise deterministic per-trace_id decisions. They are subtly different — Python uses the lowest 64 bits as a uniform fraction, Go used the highest 64 bits before v1.16, and Java's older SDK used a SHA-1 of the entire 16-byte trace_id. Cross-SDK fleets therefore see traces that are kept-by-Python and dropped-by-Go for the same trace_id, partially capturing the trace tree. The fix is to standardise on the parent_based_trace_id_ratio sampler (which delegates to the parent's flag and only makes a fresh decision at the root) and to verify cross-SDK consistency with a calibration run on identical trace_id fixtures.
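That calibration run can be as small as this sketch: implement each SDK's bit-slicing convention (the two below follow the lowest-64-bit and highest-64-bit conventions the paragraph describes; verify against the SDK versions you actually run) and count split decisions on shared fixtures.

import random

def keep_low64(trace_id_hex: str, rate: float) -> bool:
    # Decision from the LOWEST 64 bits of the 128-bit trace_id (one SDK convention).
    return int(trace_id_hex[16:32], 16) < rate * 2**64

def keep_high64(trace_id_hex: str, rate: float) -> bool:
    # Decision from the HIGHEST 64 bits (another SDK convention).
    return int(trace_id_hex[0:16], 16) < rate * 2**64

random.seed(7)
fixtures = [f"{random.getrandbits(128):032x}" for _ in range(100_000)]
rate = 0.01
disagree = sum(keep_low64(t, rate) != keep_high64(t, rate) for t in fixtures)
print(f"decisions disagree on {disagree} of {len(fixtures)} trace_ids "
      f"({100 * disagree / len(fixtures):.2f}%)")
# Roughly 2 * rate * (1 - rate) of trace_ids get a split decision, i.e. partial trees.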
The "decision was right at the time" problem. A sampler decides to drop a trace at 14:32:01 IST based on the policy then in force; at 14:35 the on-call rotates the policy to "keep all traces from service=ledger-api for the next hour" because an investigation needs them. The dropped trace from 14:32 is gone forever — the policy change is forward-looking only, and there is no "replay the spans I already dropped" path because the spans never reached durable storage. The fix some fleets adopt is two-tier capture: a low-cost "everything for 5 minutes" hot buffer (often Kafka with a 5-minute retention) sits in front of the sampler, and policy changes can drain from the hot buffer for the recent past. This adds another stateful system to operate, but it converts policy changes from "starts now" to "starts now plus a 5-minute lookback", which is what investigators actually need.
Cost attribution across teams sharing a sampler. A multi-team fleet shares one OTel Collector, one Tempo, one tail-sampling policy. Team A is a high-traffic catalog service (10K RPS, 0.001% error rate). Team B is a low-traffic settlements service (50 RPS, 0.5% error rate). Under a fleet-global "keep all errors + 1% of OK", Team A's kept volume is 10K × 0.01 ≈ 100 traces/sec (almost all from the OK-fast policy), Team B's kept volume is 50 × 0.005 + 50 × 0.01 ≈ 0.75 traces/sec. The Tempo bill is dominated by Team A; the operational debugging value is dominated by Team B. Team B is subsidising Team A's traces and getting back a tiny fraction. The fix is per-service policies (Team B at 100% retention because volume is small, Team A at error-only because OK-fast at 10K RPS is the bill driver) and per-team chargeback wired into the bill. Without this, the cost-fairness conversation eventually forces a re-architecture, often during a finance review three quarters in.
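The same arithmetic as a sketch you can extend per team; the traffic and error-rate figures are the paragraph's own, and the policy is the fleet-global one it describes.

# Kept-trace volume per team under a fleet-global "all errors + 1% of OK" policy.
teams = {
    "catalog (Team A)":     {"rps": 10_000, "err_rate": 0.00001},  # 0.001% errors
    "settlements (Team B)": {"rps": 50,     "err_rate": 0.005},    # 0.5% errors
}
OK_RATE = 0.01

kept = {name: t["rps"] * t["err_rate"] + t["rps"] * (1 - t["err_rate"]) * OK_RATE
        for name, t in teams.items()}
total = sum(kept.values())
for name, per_sec in kept.items():
    print(f"{name:22s} kept ≈ {per_sec:7.2f} traces/sec "
          f"({100 * per_sec / total:5.1f}% of the shared backend volume)")
# Team A ≈ 100/sec (≈ 99%); Team B ≈ 0.75/sec (≈ 1%): B subsidises A's volume.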
Sampler hot-reload during an incident. When an investigation needs more traces from a specific service for the next 30 minutes, the operator wants to push a policy change to the collector without a restart. Most tail samplers support this via a config-reload signal (SIGHUP or a control-plane API), but the reload is not atomic — the in-flight buffer contains traces decided under the old policy, and applying the new policy mid-buffer produces a mixed kept-set where the same trace_id might be kept under one policy version and dropped under another (if the trace's spans straddle the reload). The honest answer is that hot-reload is "best effort" and the operator should expect a 1-minute transition window where the policy is mixed. Teams that document this in the runbook avoid the "but my reload was at 14:32 and the trace at 14:33 should be there" confusion that otherwise consumes 20 minutes of an incident.
Common confusions
- "Tail sampling fixes the head-sampling failure mode." It fixes one failure mode (errors lost in the random 99%) and adds two (the buffer's memory cost is real production load, and the bias breaks population statistics). The fix is not free; it relocates the problem to a different layer.
- "Adaptive sampling is the same as tail sampling." No — adaptive modulates the rate at which a sampler fires; tail modulates the evidence on which the sampler fires. An adaptive head sampler dynamically lowers the keep-rate during spikes but is still blind to errors. An adaptive tail sampler also lowers the OK-fast keep-rate during spikes but keeps 100% of errors regardless. The two compose, but they are not substitutes.
- "100% error retention means I never lose an error trace." It means you never lose an error trace inside the buffer window. A trace whose final span arrives after the eviction (slow downstream callback, async settlement) is lost regardless of policy. The policy is necessary; it is not sufficient.
- "My head sampler is deterministic, so all my traces are whole." Determinism per
trace_idensures every service in the path makes the same decision — but only if every service shares the same propagation contract (parent-based sampling reading the W3Ctraceparentflag). A service that re-issues a freshtrace_idmid-path (a malformed proxy, a regenerated request_id) breaks determinism and produces partial trees. - "Sampling at 1% means I lose 99% of useful information." This is the head-sampling intuition; the tail-sampler's 1% is the kept-1% of OK traffic plus 100% of errors plus 100% of slow traces. The "1%" is a dramatic understatement of what tail sampling preserves, which is why the comparison "head vs tail at 1%" is misleading without naming the policy.
- "Exemplar-aligned sampling is automatic." The exemplar attached to a histogram bucket only points to a retained trace if your tail sampler's policy keeps the trace whose
trace_idwas captured in the exemplar reservoir. If the bucket-tail observation came from an OK-fast request that the tail-sampler dropped, the exemplar's trace_id resolves to nothing in Tempo. The alignment is by design (slow buckets correlate with slow-policy keeps), but it is not bulletproof; verify it with a periodic crawler that picks 100 random exemplars and checks how many resolve.
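A sketch of that crawler, assuming a Prometheus server that serves the exemplars API and a Tempo query frontend at its usual trace-by-id path; the URLs, the metric selector, and the exemplar label key are all assumptions to replace with your own.

import random, time
import requests

PROM = "http://prometheus:9090"   # assumption: your Prometheus base URL
TEMPO = "http://tempo:3200"       # assumption: your Tempo query-frontend URL

def exemplar_trace_ids(selector: str, lookback_s: int = 3600):
    """Pull exemplar trace_ids for one histogram series via the Prometheus exemplars API."""
    end = time.time()
    r = requests.get(f"{PROM}/api/v1/query_exemplars",
                     params={"query": selector, "start": end - lookback_s, "end": end})
    r.raise_for_status()
    ids = []
    for series in r.json()["data"]:
        for ex in series.get("exemplars", []):
            labels = ex.get("labels", {})
            tid = labels.get("trace_id") or labels.get("traceID")  # label key varies by instrumentation
            if tid:
                ids.append(tid)
    return ids

def resolves_in_tempo(trace_id: str) -> bool:
    return requests.get(f"{TEMPO}/api/traces/{trace_id}").status_code == 200

ids = exemplar_trace_ids('http_server_duration_seconds_bucket')  # assumed metric name
picked = random.sample(ids, min(100, len(ids)))
ok = sum(resolves_in_tempo(t) for t in picked)
print(f"{ok}/{len(picked)} sampled exemplars resolve to a retained trace")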
Going deeper
The collector buffer and the SDK hot path — two sides of one ops surface
The OTel Collector's tail_sampling_processor is a stateful streaming join, and once it grows past one replica's memory budget it becomes a distributed system unto itself. Production deployments shard by trace_id consistent-hash, which means an upstream load-balancer (or a loadbalancing exporter in front) must route every span of a given trace to the same downstream collector replica. Mis-route one span and the policy engine sees a partial trace, which it will treat as "incomplete" and either evict early or hold for the full window before discovering it never had the children — both of which are wrong outcomes. The sharding layer is operationally non-trivial: a replica restart re-shuffles the keyspace, the in-flight buffer is lost (spans are not durable in the collector by default), and any traces in progress at restart get evicted. Razorpay's runbook for collector restarts is "drain the upstream load-balancer for 60 seconds first", which works because most traces complete within the buffer window. Teams that don't drain see an incident-correlated spike of partial traces that look like real bugs but are actually self-inflicted by the deploy.
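The routing invariant, reduced to a deliberately naive hash-modulo sketch. Real deployments use the loadbalancing exporter's consistent hashing precisely because the modulo version reshuffles almost the whole keyspace on a resize, which is the partial-trace failure mode the paragraph describes.

def replica_for(trace_id_hex: str, n_replicas: int) -> int:
    # All spans of a trace share trace_id, so they all hash to the same replica.
    return int(trace_id_hex[:16], 16) % n_replicas

# The resize problem: adding one replica moves most keys to a new owner,
# so in-flight traces straddle old and new owners until the buffer window drains.
tid = "4bf92f3577b34da6a3ce929d0e0e4736"   # example trace_id from the W3C traceparent spec
print(replica_for(tid, 12), replica_for(tid, 13))  # likely different owners after a resize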
The mirror-image cost is on the application side. OpenTelemetry SDKs aim for a sampler decision in <1µs (a hash, a comparison, a context-var write). Most of the time they hit it. Edge cases break it: a ParentBasedSampler wrapping a network-aware sampler that calls out to a remote service to fetch the rate (some early implementations did this); a sampler that materialises a per-request RNG state (the Python SDK before 1.18 did, costing ~30µs); a sampler that runs a regex over the URL to decide. Anything in the sampler that touches I/O, allocates more than a small object, or holds a lock is a latency bug masquerading as observability. The right design is to compute everything decisionable from the trace_id's bits — which the W3C trace flag is for — and to make rate adjustments out-of-band via a control plane that updates the SDK's in-memory rate every few seconds. The instrumentation cost should sit below the noise floor of the request's own latency; if your sampler shows up on a flamegraph, your sampler is broken. The two ends meet in production runbooks: the collector-side ops surface (buffer memory, eviction rate, shard balance) and the SDK-side ops surface (decision latency, rate-config refresh interval, parent-based propagation correctness) are co-equal — neglect either and the system silently degrades along an axis the dashboard does not yet show.
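A sketch of the shape that paragraph argues for: the per-request decision touches only the trace_id's bits and one cached threshold, and the rate is refreshed out-of-band by a background thread. The control-plane fetch is a stub; the point is what is and is not on the hot path.

import threading, time

class RemoteRatioSampler:
    """Hot path: one slice, one int(), one compare. No I/O per request."""
    def __init__(self, initial_rate: float, refresh_s: float = 5.0):
        self._threshold = int(initial_rate * 2**64)
        threading.Thread(target=self._refresh_loop, args=(refresh_s,), daemon=True).start()

    def should_sample(self, trace_id_hex: str) -> bool:
        # The entire per-request cost of the decision.
        return int(trace_id_hex[16:32], 16) < self._threshold

    def _refresh_loop(self, refresh_s: float):
        while True:
            time.sleep(refresh_s)
            rate = self._fetch_rate_from_control_plane()  # network cost stays off the hot path
            self._threshold = int(rate * 2**64)           # single attribute write

    def _fetch_rate_from_control_plane(self) -> float:
        return 0.01  # stub: replace with your config service or flag system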
Lossless sampling, stratified sampling, and the analytics-fidelity question
Not every fleet has to sample. A 100-RPS internal-tools fleet at ~10 spans/trace produces ~1 GB/day of traces post-compression — well within object-storage budgets. A regulated workload like Zerodha's trading-orders or a healthcare claim flow may be required by audit to retain 100% of traces for years. Lossless tracing is the right answer in both cases, and the engineering shifts from "which fraction do you keep" to "how do you compress and retain forever". Tempo's columnar layout (previous chapter) plus retention to S3 Glacier handles this at ~₹0.001/trace-year for cold storage. The trade is query latency on cold data, which is fine for audit but not for incident debugging — production fleets that combine 7-day hot retention with multi-year cold retention get both, at the price of a tiered storage layer to operate.
When lossless is not viable but you still need population-correct analytics, stratified sampling is the right intermediate. The mechanism: pick a sample rate per trace class (errors=100%, slow=100%, OK-fast=1%), store the per-class rate alongside each trace as a weighted attribute, and reweight by 1/rate at query time to recover unbiased estimates. Tempo and Mimir both have early-stage support for the weight column; Honeycomb's "Refinery" tail-sampler popularised the approach. The cost is that every analytics query must know about the reweighting — forgetting to apply it produces biased numbers, which is worse than a biased sample because the bias is now invisible. The workflow is right when your trace data is the primary substrate for both incident debugging and analytics; for fleets where Prometheus metrics carry the analytics workload and traces are debug-only, the simpler unweighted bias is fine.
The next design space — adaptive policies driven by the trace itself
Most production samplers today decide on static policies (status, latency, attribute equality). The frontier is policies driven by anomaly detection on the trace itself — keep traces whose span structure is unusual, whose dependency graph touched a service it has not touched before, whose latency profile sits in the upper 0.1% of similar traces from the last hour. Honeycomb's BubbleUp and Datadog's "rare events" sampler are early productions of this idea. The technical lift is in computing the "is this trace anomalous?" signal at the collector in <100ms — too slow and the buffer overflows, too fast and the anomaly detection is just a threshold. The Razorpay 2024 retro mentioned an internal experiment with this approach: a sampler that kept all traces touching a service-pair never seen before in the last 24 hours, which caught a misconfigured shadow-traffic deploy on its first request rather than after 50,000 of them had landed. The work is early and the operational maturity is not yet there for most fleets, but it is the direction the next five years of sampling design will travel.
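A toy version of the never-seen-service-pair policy, just to make the mechanism concrete; the span shape is hypothetical, and a real collector-side version would have to bound the edge table's memory and decide in well under 100ms, which this sketch ignores.

import time

class NovelEdgePolicy:
    """Keep any trace containing a caller->callee edge not seen in the last 24 hours."""
    def __init__(self, ttl_s: int = 24 * 3600):
        self._seen = {}   # (caller, callee) -> last-seen unix time
        self._ttl = ttl_s

    def decide(self, spans) -> bool:
        now = time.time()
        novel = False
        for s in spans:   # hypothetical span dicts with caller/callee service names
            edge = (s.get("parent_service"), s["service"])
            last = self._seen.get(edge)
            if last is None or now - last > self._ttl:
                novel = True
            self._seen[edge] = now
        return novel      # True means keep the whole trace

# policy = NovelEdgePolicy()
# keep = policy.decide([{"parent_service": "checkout", "service": "shadow-ledger"}])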
The calibration discipline — what to measure on your own fleet, monthly
The script in this chapter runs against synthetic traffic. The version that earns its keep runs against your fleet's last 24 hours of trace data, exported from Tempo via the search API and replayed through the four samplers offline. The four numbers you want on a monthly dashboard: (1) error retention — of all traces with status=error in your application logs, what fraction is also retained in Tempo? Less than 99.5% means your sampler is leaking errors. (2) buffer eviction rate — processor_tail_sampling_evicted_traces_total / processor_tail_sampling_seen_traces_total — non-zero means traffic is exceeding your buffer's window or memory cap. (3) cross-SDK consistency — for a sample of traces with spans from multiple language SDKs, the fraction where the trace tree is whole versus partially captured. (4) policy-change lookback gap — when policy was last changed, how far back can investigations still see traces? Why these four are the right metrics: each maps directly to one of the four tradeoff axes. Error retention measures the bias-vs-completeness axis; eviction rate measures the buffer-cost axis; cross-SDK consistency measures the propagation-correctness axis; lookback gap measures the spike-and-policy-change axis. A team that watches all four monthly knows when a slow degradation is happening — error retention drifting from 99.8% to 98.2% over a quarter is the early signal that the buffer is becoming undersized, six months before the incident that finally surfaces it. Most fleets do not measure any of these and discover problems only when an incident exposes them; the discipline is what separates "we have tail sampling" from "we have tail sampling that we can defend in a postmortem".
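The first of those four numbers can be computed from two exports you probably already have; a sketch, assuming plain files of one trace_id per line from the log pipeline and from a Tempo search export.

def load_ids(path):
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

log_errors = load_ids("error_trace_ids_from_logs.txt")     # traces your logs say errored (assumed export)
tempo_kept = load_ids("trace_ids_retained_in_tempo.txt")   # everything the sampler kept (assumed export)

retained = len(log_errors & tempo_kept)
retention = 100 * retained / max(1, len(log_errors))
print(f"error retention: {retention:.2f}% ({retained:,} of {len(log_errors):,} error traces retained)")
# Below 99.5% is the chapter's threshold for "your sampler is leaking errors".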
Reproduce this on your laptop
# Reproduce the four-sampler calibration
python3 -m venv .venv && source .venv/bin/activate
pip install pandas numpy
python3 sampling_wall_calibration.py
# Expected (~90s): the four-row table comparing kept_traces, error retention,
# KS distance from the population latency, and spike-time span ratio.
# Vary TARGET_RPS and the spike multiplier to see the buffer-budget pressure
# that real fleets hit during the IPL final or the Tatkal hour.
Where this leads next
Part 4 closes here. The wall is real, the trades are real, and the calibration discipline named in this chapter is the deliverable that makes the rest of the curriculum land. Part 5 (Cardinality) is structurally similar — every cardinality choice trades observability granularity against TSDB cost, and the "you cannot have everything" framing recurs in a different domain. Read the next part with the sampling lens primed: the question is rarely "what is the optimum?" and almost always "which axis am I willing to give up?"
- Cardinality: the master variable — the cardinality chapter applies the same four-axis tradeoff thinking to label design; sampling and cardinality are dual problems on the metrics side and the trace side.
- Trace sampling: head, tail, adaptive — the chapter that introduces the three designs this wall consolidates; revisit it with the calibration script above to verify the numbers on your own fleet.
- Exemplars: linking metrics to traces — the link layer that depends on the sampler keeping the right traces; understanding both is what makes one-click metric-to-trace navigation reliable in production.
- Trace storage at scale: Tempo's columnar approach — the storage-side counterpart; the sampler decides what arrives at Tempo, the columnar layout decides how cheaply it stays there.
The closing thought for Part 4: distributed tracing as a discipline is held together by three pieces — propagation (the trace_id reaches every span), storage (Tempo or its peers retain the spans), and sampling (the right fraction of traces survives). All three have to land for the on-call engineer at 02:11 IST to find the trace that explains the alert. A team that nailed propagation but skimped on sampling has a trace store full of representative noise; a team that nailed sampling but skimped on propagation has gaps in every span tree. Part 5 starts the next layer of the stack — what happens to the metrics side when your label cardinality grows the same way your trace volume did — and the engineering shape will feel familiar because the wall pattern repeats.
There is also a meta-lesson worth carrying out of Part 4 explicitly. Every chapter in this part has been about a specific mechanism — the Dapper paper, the span data model, the W3C wire format, the backends, the SDK abstractions, the three sampler families, the Tempo columnar layout, exemplars. The mechanisms are real and individually correct, and a team can adopt every one of them and still end up with an unreliable trace pipeline because they did not negotiate the wall. Mechanisms compose; tradeoffs do not. The four-axis frame in this chapter is the lens that lets a senior engineer look at a vendor's "we have tail sampling" claim and ask the right second question: "what window?" And the third: "what eviction rate?" And the fourth: "how do you measure error retention?" If those questions land as surprises, the mechanism is partial. If they land as ready answers with measured numbers, the mechanism is engineered. Carrying that habit into Part 5 — where every cardinality decision will look superficially like a label-design choice and will actually be a four-axis tradeoff in disguise — is the durable take-away from Part 4.
A practical next step for any team finishing this part: run the calibration script in this chapter against last week's actual production trace export, plot the four numbers (kept volume, error retention, KS distance, spike-time ratio) for each of the four sampler designs, and pin the chart on the team's observability runbook page. Repeat monthly. The discipline is not in any single decision; it is in noticing the slow drift on each axis before an incident notices it for you.
The wall has a name in this chapter for a reason — Aditi's 02:11 IST hole was real, the previous month's lost-error-trace was real, and the next month's will be too. Naming the wall does not remove it. What naming does is make the next conversation possible: when the team's senior IC says "we should switch to X-sampler", the right reply is no longer "sounds good" but "which axis are we trading? Show me the calibration." That single change in conversation is the difference between observability that survives a Tatkal hour and observability that becomes the postmortem at 04:00 IST. Carry the question, not the answer.
That is the deliverable Part 4 hands to Part 5: not a sampler, but a habit.
References
- OpenTelemetry — Sampling specification — the canonical definition of head, parent-based, and the SDK extension points for tail and adaptive samplers.
- Honeycomb Refinery — tail-sampling design — the production-grade tail sampler that popularised stratified-and-reweighted sampling and dynamic-rate policies; required reading for anyone running tail in anger.
- Charity Majors, Liz Fong-Jones, George Miranda, Observability Engineering (O'Reilly, 2022), Ch. 19 (Sampling: A Necessary Evil) — the modern-era treatment of sampling tradeoffs that this wall chapter draws its four-axis framing from.
- Sigelman et al., Dapper: A Large-Scale Distributed Systems Tracing Infrastructure (Google, 2010) — the original paper that argued head sampling at 0.1–1% is sufficient for capacity planning; rereading it after this chapter reveals the workload assumptions that no longer hold for incident debugging.
- Grafana Tempo — load-balancing for tail sampling — the operational guide for sharding the OTel Collector's tail-sampling buffer across replicas; the consistent-hash routing layer is the part that bites teams in production.
- Trace sampling: head, tail, adaptive — the chapter this wall is consolidating; the per-sampler mechanics are there, the per-fleet tradeoff is here.
- Cindy Sridharan, Distributed Systems Observability (O'Reilly, 2018), Ch. 4 — the foundational treatment of "all observability is a budget"; the framing that sampling is one of three places (alongside cardinality and retention) where the budget bites.
- Ben Sigelman — The Three Pillars With Zero Answers — the polemic that argues sampling is the single most-underserved part of the observability stack; reads as a manifesto for why this wall chapter exists.