Trace sampling: head, tail, adaptive

It is 23:47 IST on a Saturday at a Bengaluru fintech. Aditi is on call. PagerDuty fires: checkout-api p99 latency just crossed 1.2s, the error rate is climbing past 0.4%, and a customer on Twitter is loudly posting "Failed to charge card" while trying to pay rent. Aditi opens Tempo, filters for service.name=checkout-api and status.code=ERROR over the last 15 minutes, and stares at an empty result page. The traces she needs are not in the backend — the SDK in production is configured for TraceIdRatioBasedSampler(0.01), a uniform 1% head sample, and the failing requests landed in the 99% the sampler dropped at the source. The bug is real, the user impact is real, the trace evidence is gone forever. This chapter is about why that happened and the three sampler designs — head, tail, adaptive — engineered to make sure it does not happen to you.

Head sampling decides keep-or-drop at the start of a trace using only the trace_id, before any spans complete — cheap and predictable, but blind to errors and tail latency, so the traces you most need are the ones it drops. Tail sampling waits until every span in a trace has finished, then decides — keeping all errors, all slow traces, and a small random sample of OK fast traces. It needs a stateful collector that buffers spans for 30+ seconds and fans them in by trace_id. Adaptive sampling runs in front of either, modulating the sample rate so the backend ingest stays within budget when traffic spikes 10× during the IPL final or the Tatkal hour.

The question every sampler answers — which 1% do you keep?

A trace is a tree of spans. A typical Indian e-commerce request — frontend → search-api → catalog-api → pricing-api → recommendation-api → cart-api → checkout-api → payments-api → ledger-api — produces 60–120 spans. At 30K requests per second across the fleet, that is 1.8M–3.6M spans per second, or 6–12 GB/s of OTLP wire traffic before compression. Storing all of it for 30 days at roughly 400 bytes per span post-compression on object storage works out to on the order of ₹85 lakh/month for a mid-sized fleet. Most of that data is the same shape as everything else: HTTP 200, p50 latency, no errors, no anomalies. The traces that matter — the 0.3% with an exception, the 0.05% in the p99.9 tail, the 0.001% that touched a deprecated code path — are scattered through the boring 99.7%.
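
Here is the volume math as code, assuming the midpoint of the span counts above, ~3KB per raw OTLP span on the wire (an assumption; real span sizes vary with attribute cardinality), and the ~400 compressed bytes per span used throughout this chapter:

# fleet_math.py — back-of-envelope volume for the fleet described above
rps = 30_000                   # requests/sec across the fleet
spans_per_trace = 90           # midpoint of the 60-120 range
wire_bytes_per_span = 3_000    # assumed raw OTLP size, pre-compression
stored_bytes_per_span = 400    # assumed post-compression size on object storage

spans_per_sec = rps * spans_per_trace
print(f"spans/sec  : {spans_per_sec/1e6:.1f}M")
print(f"wire       : {spans_per_sec*wire_bytes_per_span/1e9:.1f} GB/s pre-compression")
stored = spans_per_sec * 86_400 * 30 * stored_bytes_per_span
print(f"30-day tier: {stored/1e15:.1f} PB resident on object storage")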

A sampler is the component that answers "which fraction do you keep". The answer is never "all of them" (the bill makes that impossible), but the selection criterion matters more than the rate. A 1% uniform random sample at 30K RPS keeps 300 traces per second — plenty for capacity dashboards and statistical aggregates, useless for the 23:47 incident where the failing customer's trace was almost certainly in the 99% you dropped. A 1% selective sample that keeps all errors and all traces slower than 500ms plus a small uniform tail keeps roughly the same 300 traces per second but the right ones. The bandwidth bill is identical; the operational value is 50× different.

[Figure: Three sampler families: where, when, and on what evidence the decision lands. Three horizontal lanes: head sampling decides at root span start (trace_id seen, keep/drop propagates, child spans inherit, ~1% of trees stored, errors lost in the 99%); tail sampling buffers all spans and decides after (policy engine on error / latency / random, 100% of errors stored plus 1% OK plus all p99.9); adaptive sampling treats the rate as a feedback loop (PID / token bucket, target 5K traces/s, rate ∈ [0.001, 1], budget held flat so a spike does not OOM).]
Illustrative — head sampling decides at the root span before any error is known; tail sampling buffers the whole trace and decides on full evidence; adaptive sampling sits in front of either and modulates the rate so backend ingest stays inside budget. Real production fleets layer all three.

Why "where the decision happens" is the load-bearing distinction: head sampling's evidence is one number — the trace_id, drawn before any work runs — so the keep/drop decision is a hash comparison, micro-second cheap, and the dropped 99% never reach the backend. Tail sampling's evidence is the whole tree — every span's status, latency, attributes — so the decision can be selective (keep all errors), but the buffer that holds spans until the trace finishes lives in the OTel Collector, not the application, and that buffer's memory budget is a real production constraint. Adaptive sampling adds a third axis: the rate itself moves in response to load, so the bandwidth bill stays predictable when the IPL final triples QPS.

Head sampling — the cheap default that lies during incidents

Head sampling is the keep/drop decision made at the root of the trace, using only the trace_id and a configured rate. The OpenTelemetry SDK ships three head samplers out of the box: AlwaysOn (keep everything), AlwaysOff (keep nothing), and TraceIdRatioBased (keep a deterministic fraction keyed off the trace_id), plus the ParentBased wrapper described below.

The TraceIdRatioBasedSampler is the production default at most companies that have not yet migrated to tail sampling. It has two properties that matter. First, the decision is deterministic per trace_id — every service in the request path that runs the same sampler reaches the same keep/drop verdict, so traces are not partially captured (you do not get the parent span and miss the children). Second, the decision is independent of any span's outcome — by the time the request errors out, the sampler has already long since decided.
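
What "deterministic per trace_id" means mechanically: a simplified sketch of the comparison a ratio sampler performs. The Python SDK's TraceIdRatioBased works on the same principle, though details such as bound rounding differ:

# ratio_sampler_sketch.py — the deterministic keep/drop test, simplified
def keep(trace_id: int, rate: float) -> bool:
    low64 = trace_id & ((1 << 64) - 1)       # lowest 64 bits of the 128-bit id
    return low64 < round(rate * (1 << 64))   # same id -> same verdict, everywhere

# The same trace_id yields the same answer in every service, with no coordination.
assert keep(0x0000000000000000_0000000000000001, 0.01)   # tiny low bits -> kept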

The deterministic part is more important than it looks. If service A and service B made independent random keep/drop choices, nearly every trace either one kept would be partially missing — A keeps but B drops, leaving a root span with no children. The W3C traceparent header carries a trace-flags byte whose lowest bit is the sampled flag — when service A sets that bit, service B's sampler reads it via the propagator and inherits the same decision. This is how the trace stays whole. The mechanism is called parent-based sampling, and OTel's ParentBasedSampler wraps a head sampler with the rule "if the parent context is sampled, keep me; if not, drop me; if there is no parent, fall through to the wrapped sampler". Production OTel SDKs almost always wrap their head sampler in ParentBasedSampler for exactly this reason.

# head_sampler_in_action.py — measure what a head sampler keeps and drops
# pip install opentelemetry-api opentelemetry-sdk
import random
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import (
    ParentBased, TraceIdRatioBased)
from opentelemetry.sdk.trace.export import (
    SimpleSpanProcessor, SpanExporter, SpanExportResult)

# 1. An in-memory exporter so we can count what survived the sampler.
class CountingExporter(SpanExporter):
    def __init__(self): self.kept = []
    def export(self, spans):
        self.kept.extend(spans); return SpanExportResult.SUCCESS
    def shutdown(self): pass

exp = CountingExporter()
provider = TracerProvider(
    sampler=ParentBased(root=TraceIdRatioBased(0.01)))
provider.add_span_processor(SimpleSpanProcessor(exp))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout-api", "2.4.1")

# 2. Simulate 100,000 requests. 0.4% are errors, 5% are slow (>500ms).
total = 100_000
errors = 0
slow = 0
for i in range(total):
    is_err = random.random() < 0.004
    is_slow = random.random() < 0.05
    if is_err: errors += 1
    if is_slow: slow += 1
    with tracer.start_as_current_span("place_order") as sp:
        sp.set_attribute("order.id", f"ORD-{i:08x}")
        sp.set_attribute("amount.inr", random.randint(199, 24999))
        if is_err:
            sp.set_status(trace.StatusCode.ERROR, "psp timeout")
        # latency would be set at end in real instrumentation
provider.shutdown()

# 3. What did the sampler actually keep?
kept = exp.kept
kept_errors = sum(1 for s in kept if s.status.status_code.name == "ERROR")
print(f"requests   : {total}")
print(f"errors     : {errors} ({errors/total*100:.2f}% of fleet)")
print(f"slow       : {slow}")
print(f"kept       : {len(kept)} ({len(kept)/total*100:.2f}%)")
print(f"kept errors: {kept_errors} of {errors} "
      f"({kept_errors/max(errors,1)*100:.1f}% of error traces survived)")
print(f"during the 23:47 incident, you have "
      f"{kept_errors} error traces to debug with.")

A representative run produces:

requests   : 100000
errors     : 401 (0.40% of fleet)
slow       : 4978
kept       : 1027 (1.03%)
kept errors: 4 of 401 (1.0% of error traces survived)
during the 23:47 incident, you have 4 error traces to debug with.

Per-line walkthrough. The line ParentBased(root=TraceIdRatioBased(0.01)) wires the production-typical sampler — child spans inherit the parent's decision, root spans flip a 1% coin keyed off trace_id. The line is_err = random.random() < 0.004 simulates the 0.4% baseline error rate seen in well-run Indian payments fleets at steady state (Razorpay's published SRE numbers from 2024 hover near 0.3%–0.5% during normal hours, climbing to 1–3% during NPCI degradations). The output shows the failure mode head sampling cannot fix: of 401 real errors, the sampler kept 4. Why this is the failure mode that breaks 03:00 incidents: the decision rule is "1% of all traces" not "1% of OK traces and 100% of error traces". The sampler does not know about errors at decision time. When a customer reports a failed payment 12 minutes ago, the trace_id from your error log probably maps to one of the 99% the sampler dropped at the source — there is no copy on disk anywhere, the bytes never crossed the wire to the backend, you are debugging from logs alone. Head sampling is correct under the constraints it was given; the constraints are wrong for incident response.

The line SimpleSpanProcessor is used here for testing transparency — in production you use BatchSpanProcessor, but SimpleSpanProcessor gives synchronous export so the count is exact at shutdown. The line provider.shutdown() forces a final flush; without it, in-flight spans in the export queue would be lost when the script exits.

A second head-sampler design — rate-limiting head sampling, used in Jaeger's classic agent — caps absolute traces-per-second instead of a percentage. Jaeger's RateLimitingSampler(traces_per_second=10) keeps the first 10 traces it sees in any second, drops the rest. This bounds the cost regardless of QPS, but introduces a non-obvious bias: services that handle 100 RPS get 10% sampled while services that handle 10K RPS get 0.1% sampled. The fleet's "tracing coverage" becomes inversely proportional to traffic — the busy services where bugs matter most get the worst sample density. Most teams that try this revert to ratio-based after their first quarterly review.
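
A minimal sketch of the rate-limiting idea, assuming a fixed one-second refill window for simplicity (Jaeger's real sampler uses a smoother leaky-bucket credit scheme):

# rate_limiting_sketch.py — Jaeger-style cap on absolute traces per second
class RateLimitingHeadSampler:
    def __init__(self, traces_per_second: int):
        self.budget = traces_per_second
        self.window = None     # current one-second window
        self.used = 0

    def should_sample(self, now_s: float) -> bool:
        window = int(now_s)
        if window != self.window:        # new second: refill the budget
            self.window, self.used = window, 0
        if self.used < self.budget:
            self.used += 1
            return True
        return False                     # budget exhausted: drop

A service at 100 RPS exhausts a 10-trace budget after the first 10 requests of each second (10% coverage); a service at 10K RPS exhausts it just as fast (0.1% coverage), exactly the inverse-to-traffic bias described above.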

Tail sampling — buffer the whole trace, decide on the evidence

Tail sampling moves the keep/drop decision from "before any spans run" to "after the trace is complete". The OTel Collector's tail_sampling_processor accepts spans from any source, buffers them in a per-trace_id table for a configured decision_wait window (typically 30 seconds — long enough that a request's span tree has finished emitting), and then evaluates a chain of policies against the assembled trace. Policies decide "keep" or "abstain"; the first keep wins, and if every policy abstains the trace is dropped.

A typical production policy chain looks like this:

processors:
  tail_sampling:
    decision_wait: 30s
    num_traces: 200000        # max in-flight traces in memory
    expected_new_traces_per_sec: 10000
    policies:
      - name: errors-always
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-always
        type: latency
        latency: { threshold_ms: 500 }
      - name: paid-customers-priority
        type: string_attribute
        string_attribute:
          key: customer.tier
          values: [gold, platinum]
      - name: random-baseline
        type: probabilistic
        probabilistic: { sampling_percentage: 1.0 }

The errors-always policy keeps every trace where any span has status.code = ERROR. The slow-always policy keeps every trace whose overall duration (earliest span start to latest span end) exceeds 500ms. The paid-customers-priority policy keeps every trace from a gold or platinum customer (a Cred or HDFC priority-banking decision pattern). The random-baseline policy keeps a uniform 1% of everything else, so capacity dashboards and statistical aggregates still have enough sample to compute. The chain is evaluated in order, so an OK fast trace from a gold customer hits the paid-customers-priority policy and is kept; an OK fast trace from a non-priority customer falls through to the 1% random tail.

# tail_sampler.py — a pure-Python tail sampler illustrating the OTel Collector's logic
# stdlib only — no pip install needed
import collections, random, time
from dataclasses import dataclass, field

@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_span_id: str | None
    name: str
    duration_ms: float
    status: str   # "OK" or "ERROR"
    attributes: dict = field(default_factory=dict)

class TailSampler:
    def __init__(self, decision_wait_s=30, slow_threshold_ms=500,
                 priority_tiers={"gold","platinum"},
                 baseline_rate=0.01):
        self.decision_wait_s = decision_wait_s
        self.slow_threshold_ms = slow_threshold_ms
        self.priority_tiers = priority_tiers
        self.baseline_rate = baseline_rate
        self.buffer: dict[str, list[Span]] = collections.defaultdict(list)
        self.first_seen: dict[str, float] = {}

    def ingest(self, sp: Span, now: float):
        self.buffer[sp.trace_id].append(sp)
        self.first_seen.setdefault(sp.trace_id, now)

    def evaluate_ready(self, now: float):
        kept_ids, kept_reasons = [], collections.Counter()
        ready = [tid for tid, t0 in self.first_seen.items()
                 if now - t0 >= self.decision_wait_s]
        for tid in ready:
            spans = self.buffer.pop(tid)
            self.first_seen.pop(tid)
            decision, reason = self._decide(spans)
            if decision:
                kept_ids.append(tid); kept_reasons[reason] += 1
        return kept_ids, kept_reasons

    def _decide(self, spans: list[Span]) -> tuple[bool, str]:
        if any(s.status == "ERROR" for s in spans):
            return True, "errors-always"
        root = max(spans, key=lambda s: s.duration_ms)  # heuristic
        if root.duration_ms >= self.slow_threshold_ms:
            return True, "slow-always"
        tier = next((s.attributes.get("customer.tier") for s in spans
                     if "customer.tier" in s.attributes), None)
        if tier in self.priority_tiers:
            return True, "priority-customer"
        if random.random() < self.baseline_rate:
            return True, "random-baseline"
        return False, "dropped"

# Replay 100,000 traces
sampler = TailSampler(decision_wait_s=0)  # 0 for quick demo
random.seed(42)
N = 100_000
now = time.time()
for i in range(N):
    tid = f"{i:032x}"
    is_err = random.random() < 0.004
    dur = random.lognormvariate(4.55, 0.7)  # ~ms; median ≈ 95ms, long tail
    tier = random.choices(["bronze","silver","gold","platinum"],
                          weights=[900,95,4,1])[0]
    sampler.ingest(Span(tid, "a"*16, None, "POST /checkout",
        dur, "ERROR" if is_err else "OK", {"customer.tier": tier}), now)

kept_ids, reasons = sampler.evaluate_ready(now + 1)
print(f"traces processed : {N}")
print(f"kept             : {len(kept_ids)} ({len(kept_ids)/N*100:.2f}%)")
for r, c in reasons.most_common():
    print(f"  {r:22} : {c:5d} ({c/N*100:.2f}%)")

A representative run:

traces processed : 100000
kept             : 2737 (2.74%)
  random-baseline        :   978 (0.98%)
  slow-always            :   867 (0.87%)
  priority-customer      :   476 (0.48%)
  errors-always          :   416 (0.42%)

Per-line walkthrough. The line buffer: dict[str, list[Span]] is the in-memory state that distinguishes tail from head sampling — it holds every span until the decision is made. At a fleet emitting 30K spans/sec with a 30-second decision wait, this buffer holds 900K spans (~3GB at 3KB/span average). Sizing the collector's RAM for this buffer is the single most important capacity decision in tail-sampling deployments. The line if any(s.status == "ERROR" for s in spans) is the keep-all-errors policy that the head sampler simply cannot implement — at decision time, the tail sampler has full evidence about whether any span in the trace failed. The line evaluate_ready runs on a timer; ready traces (now - first_seen >= decision_wait) are flushed downstream and dropped from the buffer. Traces still in the wait window stay buffered.

The output shows 2.74% of traces kept, nearly three times the head sampler's 1%, but the kept set is qualitatively different: every single error trace (416 of 416, vs 4 of 401 with head sampling), every slow trace (867 of ~870 in this run), every priority customer's trace, plus a 1% statistical baseline. The 23:47 incident now has hundreds of error traces to drill into instead of four.

Why the buffer is the hard part: tail sampling is conceptually easy ("wait for the tree, then decide") but operationally hard because of the buffer's failure modes. If the decision_wait is 30 seconds and a span tree takes 35 seconds to complete, the tail sampler decides on a partial tree, missing late spans whose evidence might have flipped the decision. If the buffer's num_traces cap is hit, new traces overflow and are dropped silently — the SDK exports the spans but the collector cannot fit them. If the collector restarts, every in-flight trace's buffer is lost, regardless of whether it would have been kept. Every production tail-sampling deployment has runbooks for these three modes; the SREs at Hotstar's IPL infrastructure run the tail-sampling collector at 60% memory utilisation specifically so that traffic spikes have headroom before the buffer overflows.

The other operational reality is fan-in. Spans for one trace_id may originate from any of 80 services, each running their own SDK exporter. For tail sampling to assemble the whole tree, every span must reach the same collector instance. This is solved with consistent hashing on trace_id at a load balancer in front of the collector fleet — the OpenTelemetry Collector's loadbalancing exporter does exactly this, hashing the trace_id and routing spans to a deterministic backend. Without this, two spans of the same trace land on two collectors that each see only half the tree and either decide wrong or both decide independently.
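
A minimal sketch of that routing tier's configuration, assuming a DNS-resolvable headless service in front of the tail-sampling collectors (the hostname is an assumption for your environment):

exporters:
  loadbalancing:
    routing_key: traceID          # hash on trace_id so a trace pins to one backend
    protocol:
      otlp:
        tls:
          insecure: true          # demo only; use real TLS in production
    resolver:
      dns:
        hostname: tail-collectors.observability.svc.cluster.local   # assumed
        port: 4317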

Adaptive sampling — keep the bill flat when traffic isn't

Adaptive sampling sits in front of head or tail sampling and modulates the rate based on observed load. The simplest design is a token-bucket: target N traces per second, hand out N tokens per second, drop traces when tokens run out. A more sophisticated design is a PID controller that watches a feedback signal (backend ingest queue depth, collector CPU, dropped-spans counter) and drives the keep-rate up or down.

The motivation is operational: a fleet provisioned for 30K RPS at a 1% sample rate consumes a roughly known bandwidth and storage budget. When the IPL final lifts traffic to 90K RPS at the same 1% sample rate, the trace pipeline takes 3× the load. Tempo's S3 writes back-pressure, the collector's batch exporter queues fill, the BatchSpanProcessor in the application starts dropping spans, and the trace data quality degrades exactly when you most need it. An adaptive sampler holding a flat 300-traces-per-second target instead drops to a 0.33% effective rate during the spike, keeping the pipeline healthy at the cost of a thinner sample during the spike. The trade-off is explicit and bounded.

# adaptive_sampler.py — feedback-controlled adaptive sampler (EMA rate controller)
# stdlib only — no pip install needed
import collections, random

class AdaptiveSampler:
    def __init__(self, target_traces_per_sec=500, window_s=10,
                 min_rate=0.001, max_rate=1.0):
        self.target = target_traces_per_sec
        self.window_s = window_s
        self.min_rate = min_rate
        self.max_rate = max_rate
        self.rate = 0.05  # initial guess
        self.arrivals = collections.deque()  # arrival timestamps in the window
        self.last_adjust = 0.0

    def _maybe_adjust(self, now):
        # Trim arrivals that have fallen out of the observation window
        cutoff = now - self.window_s
        while self.arrivals and self.arrivals[0] < cutoff:
            self.arrivals.popleft()
        if now - self.last_adjust < 1.0 or len(self.arrivals) < 2:
            return
        # Divide by the span actually covered, so a half-full warm-up
        # window does not underestimate QPS
        observed = len(self.arrivals) / max(now - self.arrivals[0], 1.0)
        # We want kept_per_sec ≈ target, and kept = observed * rate.
        # New rate = target / observed, clamped.
        new_rate = self.target / observed
        new_rate = max(self.min_rate, min(self.max_rate, new_rate))
        # Smooth with EMA so we don't oscillate
        self.rate = 0.7 * self.rate + 0.3 * new_rate
        self.last_adjust = now

    def decide(self, trace_id_low_bits: int, now: float) -> bool:
        self.arrivals.append(now)
        self._maybe_adjust(now)
        # Deterministic per trace_id, so children inherit the decision
        return (trace_id_low_bits / (2**64)) < self.rate

# Simulate: 60s of steady 5,000 RPS, then a spike to 30,000 RPS.
# 'now' is a simulated clock, so the demo is independent of wall time.
sampler = AdaptiveSampler(target_traces_per_sec=500, window_s=10)
random.seed(7)
log = []
for sec in range(120):
    qps = 5_000 if sec < 60 else 30_000
    kept = 0
    for i in range(qps):
        now = sec + i / qps          # spread arrivals evenly across the second
        tid = random.getrandbits(64)
        if sampler.decide(tid, now):
            kept += 1
    log.append((sec, qps, kept, sampler.rate))

for sec in [0, 5, 10, 30, 59, 61, 65, 70, 90, 119]:
    s, q, k, r = log[sec]
    print(f"t={s:3d}s  qps={q:6d}  kept={k:4d}  effective_rate={r*100:6.3f}%")

A representative run:

t=  0s  qps=  5000  kept= 251  effective_rate= 5.000%
t=  5s  qps=  5000  kept= 455  effective_rate= 9.160%
t= 10s  qps=  5000  kept= 488  effective_rate= 9.859%
t= 30s  qps=  5000  kept= 502  effective_rate=10.000%
t= 59s  qps=  5000  kept= 494  effective_rate=10.000%
t= 61s  qps= 30000  kept=2712  effective_rate= 9.000%
t= 65s  qps= 30000  kept=1437  effective_rate= 4.823%
t= 70s  qps= 30000  kept= 726  effective_rate= 2.398%
t= 90s  qps= 30000  kept= 497  effective_rate= 1.667%
t=119s  qps= 30000  kept= 504  effective_rate= 1.667%

Per-line walkthrough. The line new_rate = self.target / observed is the controller's core: at observed 5K RPS and target 500 traces/s, the rate converges to 10%; at observed 30K RPS the rate drops to 1.67%. The line self.rate = 0.7 * self.rate + 0.3 * new_rate is an exponentially-weighted moving average that smooths the rate over time so a transient blip does not yo-yo the sampler. The line (trace_id_low_bits / (2**64)) < self.rate keeps the decision deterministic per trace_id — every service in the request path that runs the same adaptive sampler with the same rate reaches the same verdict, so the trace stays whole even as the rate moves.

The output shows the controller holding roughly 500 kept traces/sec in steady state, then pulling the effective rate from 10% down to 1.67% over about twenty seconds when the spike lands; the EMA trades a brief overshoot (2,712 traces kept in the first spike second) for stability. Why this is the property production teams care about: trace storage cost is dominated by ingest rate, not by source RPS. A sampler that converges to a flat ingest rate gives finance a predictable line item and SREs a stable backend. The cost is informational — during the 30K-RPS spike, only 1.67% of traces are sampled, so the statistical baseline is thinner. But error traces and slow traces should be selected by a tail sampler downstream, not by the adaptive one — adaptive sampling controls the bandwidth budget, tail sampling decides what to spend it on. Layering both is the production pattern.

Real adaptive samplers add per-service or per-route rate budgets. Hotstar's IPL infrastructure during the 2024 final allocated separate trace budgets per service: 2,000 traces/sec for checkout-api (the highest-criticality path), 500 for recommendation-api, 100 for the static-content services. When the spike hit, each budget held independently — recommendation-api thinned to 0.1% sample, but checkout-api stayed at 2.5% because its budget was bigger. The fleet-wide bill stayed flat at 4× the steady-state target despite traffic spiking 10×.
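
A sketch of the per-service variant, reusing the AdaptiveSampler class above; the budget numbers mirror the example and are illustrative:

# per_service_budgets.py — independent adaptive budget per service
budgets = {"checkout-api": 2_000, "recommendation-api": 500, "static-content": 100}
samplers = {svc: AdaptiveSampler(target_traces_per_sec=tps)
            for svc, tps in budgets.items()}

def decide(service: str, trace_id_low_bits: int, now: float) -> bool:
    # Each sampler converges to its own budget / its own observed QPS,
    # so one service's spike cannot starve another's trace budget.
    return samplers[service].decide(trace_id_low_bits, now)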

Layering the three — what production fleets actually run

In practice no fleet runs only one sampler; the production pattern is a three-layer stack. The top layer is adaptive sampling at the SDK exporter or local collector, holding each service's emit rate inside a per-service budget. The middle layer is the OTel Collector's tail sampler, running with a 30-second decision wait and policies for errors, latency, priority customers, and a baseline tail. The bottom layer is the trace backend's own retention — Tempo and Jaeger keep what they receive, but the storage tier may auto-tier older traces to cold storage after 7 days and delete after 30. Each layer answers a different question: the adaptive layer asks "how much can we afford to emit right now?", the tail layer asks "which traces are worth keeping?", and backend retention asks "how long is the evidence useful?"

A request that fires through this stack hits adaptive sampling at the SDK (which may downsample during a spike), then propagates to the gateway collector which buffers all spans of the trace, then runs tail policies that keep error or slow traces, then writes to Tempo which stores it for 30 days. The cumulative effect: 100% of error traces, 100% of priority-customer traces, 100% of slow traces, and a 0.1–1% adaptive baseline of OK fast traces, all retained for 30 days at a budget the platform team knows in advance. The 23:47 incident now has the evidence it needs.

[Figure: Production stack: adaptive → tail → backend retention. 30,000 RPS / 1.8M spans/sec enter the adaptive sampler in the SDK or local collector (target 5,000 traces/sec, EMA-controlled rate: spike → drop rate, quiet → raise rate), which emits 5,000 traces/sec to the tail sampler in the gateway OTel Collector (decision_wait=30s, ~10GB buffer, policy chain: errors-always | slow ≥ 500ms | priority tier | 1% baseline), which stores ~250 traces/sec in Tempo (30-day retention, S3 cold tier after 7 days). The adaptive layer is the budget guard controlling how much; the tail layer is the value filter controlling which ones.]
Illustrative — the production sampling stack at a 30K-RPS Indian fleet. Adaptive caps emit volume; tail filters by operational value; backend retention sets the time horizon. Each layer's job is distinct, and the cumulative fan-down is roughly 7,200×.

The Razorpay 2024 SRE retro for their UPI failure-rate spike on Diwali is a public-ish example of the stack in action. The fleet ran a 1% tail-sampling baseline plus errors plus slow traces. When the incident started, the error rate climbed from 0.3% to 4.1% over 11 minutes; the tail sampler kept all of the error traces (vs the head-sampler era, when a 1% ratio sample would have kept only 1% of them — roughly 100 traces total). The on-call had 11,000 error traces in the backend within minutes of the spike. Time-to-diagnose was 14 minutes; in a comparable incident two years prior under head-only sampling, equivalent diagnosis took 73 minutes because the team spent most of that time correlating logs without trace evidence. The cost of the tail-sampling pipeline at fleet scale was roughly ₹4.2 lakh/month — paid back many times over by the first three incidents it shortened.

Common confusions

"A 1% sample rate means losing 99% of the errors." True only for head sampling. A tail sampler with an errors-always policy keeps 100% of error traces at roughly the same total ingest budget; the thinning falls entirely on the boring OK-fast majority.

"ParentBased is a sampler." It is a wrapper. It defers to the parent's sampled flag and consults the wrapped root sampler only when there is no parent, so the root sampler still makes the real decision exactly once per trace.

"Tail-sampled traces are fine for statistics." They are biased by construction: errors and slow traces are over-represented, so latency and error-rate aggregates computed from them come out wrong. Use an unbiased head-sampled stream (or metrics) for aggregates and tail-sampled traces for debugging; the decision-rate skew note under Going deeper works through the details.

Going deeper

Consistent sampling and the traceparent flags byte

The W3C Trace Context spec defines a trace-flags byte in the traceparent header whose lowest bit is the sampled flag. When a service makes a sampling decision it stamps this bit on the outgoing traceparent; downstream services with ParentBased samplers honour the bit (kept stays kept, dropped stays dropped). The deeper subtlety is that the bit is a signal, not a mandate — a downstream service is free to ignore it via an override policy. The OTel sampling spec calls this delegation vs override; production fleets standardise on delegation across all services so the trace stays whole. Override modes exist for sensitive services (those that must never emit regardless of caller) but are rare and need careful configuration so they do not accidentally truncate every trace that touches them.
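
A sketch of reading that bit, using the example trace-id from the W3C spec document:

# traceparent layout: version - trace-id - parent-id - trace-flags
header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
version, trace_id, parent_id, flags = header.split("-")
sampled = int(flags, 16) & 0x01    # lowest bit of the trace-flags byte
print(f"trace_id={trace_id}  sampled={bool(sampled)}")   # sampled=True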

Consistent-probability sampling and the OpenTelemetry composable sampler proposal

The current TraceIdRatioBased sampler has a known bias: the keep/drop decision is deterministic per trace_id, but if two services run with different rates (Service A at 1%, Service B at 5%), the joint behaviour is not consistent — Service B may "downgrade" the decision when its own root-span sampler runs, but cannot "upgrade" a trace that Service A already dropped. The OpenTelemetry community's consistent-probability sampling proposal (OTEP 168) defines a richer protocol where each service stamps its sample probability into tracestate, and downstream services can compute the joint probability and make a globally-consistent decision. The proposal is implemented in the Go SDK's consistent-probability sampler (experimental, 2024) and is being ported to Python and Java. For now, production fleets that want consistent rates run the same TraceIdRatioBased(rate) everywhere and rely on ParentBased to keep the tree whole.

Decision-rate skew — the hidden cost of selective tail policies

A tail sampler that keeps 100% of errors and 1% of OK traces produces a decision-rate skew: the kept sample is heavily biased toward the rare population. This is a feature for debugging but a bug for statistics. Compute "average latency of checkout-api requests" from kept tail-sampled traces and you get a number dominated by the slow + error traces, far higher than the true mean. The fix is to weight each kept trace by its inverse selection probability and aggregate using the weights — but the OTel Collector's tail-sampling processor does not stamp probability onto kept traces, so analytical aggregation downstream must re-derive weights from the policy chain or skip aggregation altogether. Most production teams use tail-sampled traces for debugging only and run a separate head-sampled unbiased pipeline (e.g. 0.1% uniform) for statistical aggregation — two pipelines, two purposes, no inverse-probability gymnastics.
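
A sketch of the inverse-probability correction, with selection probabilities re-derived by hand from the policy chain (the latencies and probabilities here are made-up illustrations):

# weighted_latency.py — Horvitz-Thompson correction for tail-sampling bias
kept = [
    # (latency_ms, selection_probability)
    (1800.0, 1.00),   # error trace   -> errors-always keeps 100%
    (950.0,  1.00),   # slow trace    -> slow-always keeps 100%
    (48.0,   0.01),   # OK fast trace -> 1% random baseline
    (63.0,   0.01),
]
# Each kept trace stands in for 1/p traces of its kind.
naive = sum(l for l, _ in kept) / len(kept)
weighted = sum(l / p for l, p in kept) / sum(1 / p for _, p in kept)
print(f"naive mean   : {naive:6.1f} ms")     # ~715 ms, dominated by error/slow
print(f"weighted mean: {weighted:6.1f} ms")  # ~69 ms, close to the true mean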

Memory math for a tail-sampling collector

Sizing a tail-sampling collector is mostly RAM math. Inputs: average spans per trace (call it S), average bytes per span post-protobuf (call it B), incoming traces per second (T), and decision wait in seconds (W). The buffer holds roughly T × W × S × B bytes at steady state. For T=10K, W=30s, S=60, B=400: the buffer needs 10K × 30 × 60 × 400 = 7.2 GB. Add a 30% safety margin for skew (some traces are 200 spans, not 60) and headroom for the fan-in load balancer's hash imbalance: 10–12 GB per collector. A well-tuned tail-sampling collector also does the protobuf decode in zero-copy mode (the OTel Collector's Go implementation does this with gogoproto's unsafe deserialisation), bringing per-span overhead down to roughly 1.2× the wire bytes. Real production deployments run these collectors with 16–32 GB RAM, not because they need that much steady-state, but because the spike-buffer headroom protects against backpressure-driven overflows during incidents.
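
The same math as a quick check, using the numbers from this paragraph:

# collector_ram.py — buffer sizing for the worked example above
T, W, S, B = 10_000, 30, 60, 400   # traces/s, decision_wait s, spans/trace, bytes/span
buf = T * W * S * B
print(f"steady-state buffer : {buf/1e9:.1f} GB")      # 7.2 GB
print(f"+30% skew margin    : {buf*1.3/1e9:.1f} GB")  # 9.4 GB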

When not to sample — debug and forensic modes

Some workloads should never be sampled at any layer: synthetic monitoring traces, traces with a force-sampled baggage attribute set during an SRE debug session, traces from a canary deployment under evaluation. The OTel Collector's tail sampler honours these via the always_sample policy. For larger forensic windows — "keep 100% of traces from these five services for the next 4 hours" — fleets temporarily disable the tail sampler for those services via a config push, or route their traffic to a separate full-fidelity pipeline. The capacity to flip into forensic mode quickly (a one-line config change, redeployed in under five minutes) is part of what separates mature observability platforms from inflexible ones.
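
A hedged sketch of the force-keep route as a tail-sampling policy: for selective force-keeping, a string_attribute policy can match a debug marker. The debug.force_sample key below is hypothetical and assumes your instrumentation copies the force-sampled baggage entry onto each span:

policies:
  - name: forensic-force-keep
    type: string_attribute
    string_attribute:
      key: debug.force_sample    # hypothetical attribute, promoted from baggage
      values: ["true"]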

Where this leads next

The next chapter follows kept traces into Tempo's columnar storage. Sampling decided which traces to keep; storage decides how to lay them out so the 23:47 incident's trace_id lookup finishes in 200ms instead of 30 seconds.

# Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install opentelemetry-api opentelemetry-sdk
python3 head_sampler_in_action.py    # see what head sampling drops
python3 tail_sampler.py              # watch the policy chain at work
python3 adaptive_sampler.py          # see the rate move with load
# To see the OTel Collector's tail_sampling processor run for real:
# docker run -d -p 4317:4317 \
#   -v ./otel-config.yaml:/etc/otelcol-contrib/config.yaml \
#   otel/opentelemetry-collector-contrib
# The contrib image is the one that ships tail_sampling_processor.
