Tail-based sampling (OTel Collector)

It is 22:48 IST on a Friday, the IPL final has hit the second-innings power-play, and Aditi — now an SRE at a Bengaluru streaming company — is watching her on-call dashboard for the third hour straight. A user has reported that the "Buy Premium" CTA at half-time froze for 8 seconds before returning a 502. She pastes the trace_id from the user's network tab into Tempo. Tempo finds the trace. Every span is there: 14 services, 47 spans, the slow one is payments-svc waiting 7.6 seconds on a downstream UPI hop that returned DEADLINE_EXCEEDED. She has the evidence. Six months ago, on a 1% head sampler, that trace would have been gone before she even logged in. Today her platform team runs the OpenTelemetry Collector's tail_sampling processor with a policy that says keep every trace whose root or any descendant span has status.code = ERROR OR has duration > 5s, plus 1% of everything else. The buffer cost the company a four-pod, 32-GB-RAM collector tier, and that is the trade. This chapter is about how that processor works, what its 30-second buffer actually costs, and the four policy patterns that separate a well-tuned tail sampler from one that drops the very traces it was supposed to keep.

Tail sampling makes the keep-or-drop decision after the trace finishes, not at the root span. The OTel Collector's tail_sampling processor buffers every span by trace_id for a configurable wait window — typically 30 seconds — then evaluates an ordered list of policies (status_code, latency, string_attribute, numeric_attribute, probabilistic, composite) against the assembled trace. The cost is stateful collector RAM proportional to spans-per-second × wait-window, plus the operational risk that bursty traffic overflows the buffer and the unevaluated traces are dropped silently. Every production deployment hits the same three failure modes: overflow eviction, cross-collector trace splitting, and policy ordering bugs.

What "tail" means and why the buffer is structural

The word tail refers to where the sampler runs — at the tail of the trace, after every span has arrived. The decision is not taken when the root span is created (head sampling does that). It is taken when the collector decides "no more spans are coming for this trace_id", at which point the full span tree is in memory and every signal is available: every error status, every span duration, every resource attribute, every event, every link. The sampler can therefore stratify on rare-event categories that head sampling cannot see: keep all traces where any span errored, keep all traces in the slowest 0.5%, keep all traces with merchant.tier = "platinum". The bias of head sampling — uniform subsampling of all categories — is structurally fixed.

The cost is statefulness. To decide on the full trace, the collector must hold every span until it is sure the trace is complete. There is no clean "end of trace" signal in OpenTelemetry — a trace ends when no more spans arrive, and "no more spans" is defined operationally as "no new span has arrived for this trace_id in the last decision_wait seconds". The default decision_wait is 30 seconds; the practical range is 10s (for low-latency traces) to 60s (for traces that fan out into long-running async workers). Every span must sit in collector RAM, indexed by trace_id, until its trace's wait window expires. A fleet at 100K spans/sec with a 30-second wait window holds 3 million spans in RAM at steady state. At ~1KB/span (after OTel-internal compression), that is 3 GB of working set per collector pod — and it grows linearly with both throughput and wait window.
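The sizing arithmetic is worth keeping as a three-line calculator rather than a mental model; a minimal sketch below, using the illustrative 100K spans/sec, 30-second window, and ~1KB-per-span figures from the paragraph above — swap in your own fleet's numbers.

# buffer_sizing.py — back-of-the-envelope working set for the tail-sampling buffer
spans_per_sec = 100_000       # fleet-wide span rate arriving at the collector tier
decision_wait_s = 30          # seconds each trace waits before policies run
bytes_per_span = 1_024        # assumed in-memory footprint per buffered span

spans_in_flight = spans_per_sec * decision_wait_s
working_set_gb = spans_in_flight * bytes_per_span / 1e9
print(f"spans held at steady state: {spans_in_flight:,}")
print(f"approx working set: {working_set_gb:.1f} GB; "
      f"doubles if decision_wait goes to {2 * decision_wait_s}s")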

[Figure] Where the tail-sampling decision lands relative to head sampling — a timeline showing a request traversing six microservices from t=0 to t=180ms, with an error at the payments service. A vertical line at t=180ms marks the end of the trace; a second dashed line at t=180ms+30s marks where the tail sampler's decision_wait window expires. Between those two points all spans sit in collector RAM, indexed by trace_id. The figure contrasts this with the head-sampling decision point at t=0, when only the trace_id is known.
Illustrative — head sampling decides at t=0 with only the trace_id. Tail sampling waits until the trace finishes (t=180ms here) plus a configurable decision_wait (typically 30s) so any late-arriving spans from async or retry hops have time to land. Every span sits in collector RAM the entire time. The trade is statefulness for evidence.

Why decision_wait is not zero: spans arrive at the collector out of order, sometimes by seconds. A retry RPC fired at t=150ms by the parent service may produce a child span that the downstream service exports only when its batch processor flushes — typically 5 seconds later. An async Kafka consumer may produce a span attached to the original trace_id 20 seconds after the synchronous part of the request finished. If the sampler decided at t=180ms (the moment the synchronous chain ended), it would emit the trace minus those late-arriving children, and the on-call would see a partial trace with no obvious indication that spans are missing. The 30-second default is a heuristic that catches >99% of real fan-out latencies on web-shaped traffic; for fleets with longer async pipelines (Hotstar's ad-attribution graph, Razorpay's settlement workers), the value is bumped to 60s or 90s. The trade is RAM proportional to the wait — 60s doubles the working set.

The OTel Collector's tail_sampling processor is the canonical implementation. It runs in the collector's data path, after the receiver decodes OTLP into in-memory pdata.Traces, and before the exporter ships to Tempo / Jaeger / a vendor. Its config is a YAML block with three pieces: decision_wait (the buffer window), num_traces (the maximum traces to hold simultaneously, a safety bound on RAM), and policies (an ordered list of decision rules). The processor maintains a hash map keyed by trace_id; each entry is a list of spans plus the timestamp of the most recent span. A background goroutine sweeps the map every decision_wait/4 seconds, evaluates policies on traces whose newest span is older than decision_wait, and either ships them to the exporter (if any policy says keep) or drops them.

The processor's policies, ordered

The policies list is the architecture's lever. Every trace that emerges from the buffer is evaluated against every policy in order; the first policy that returns Sampled wins, and the trace is shipped. If no policy says Sampled, the trace is dropped. This first-match-wins ordering is critical and is the source of most production bugs — a probabilistic policy at the top of the list overrides every later status_code or latency policy, because the random keep-decision wins before the error-keep ever runs. Order policies from most-specific to most-general.

The seven policy types this chapter leans on, all available in the OTel Collector v0.96+: status_code keeps traces containing a span with a matching status (typically ERROR); latency keeps traces whose duration exceeds a threshold; string_attribute keeps traces carrying a named attribute with a matching value; numeric_attribute keeps traces whose named numeric attribute falls inside a min/max range; probabilistic keeps a fixed percentage, the tail-side analogue of head sampling; composite combines sub-policies under a shared rate budget; and the and policy keeps a trace only when every one of its sub-policies matches.

A typical Indian fintech production config:

processors:
  tail_sampling:
    decision_wait: 30s
    num_traces: 100000
    expected_new_traces_per_sec: 5000
    policies:
      - name: keep-errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: keep-slow
        type: latency
        latency: { threshold_ms: 1000 }
      - name: keep-vip-merchants
        type: string_attribute
        string_attribute:
          key: merchant.tier
          values: [platinum, gold]
      - name: keep-large-payloads
        type: numeric_attribute
        numeric_attribute:
          key: http.response_size_bytes
          min_value: 5000000
      - name: baseline
        type: probabilistic
        probabilistic: { sampling_percentage: 1 }

This config keeps every error trace, every trace over 1 second, every trace from a top-tier merchant, every trace with a >5MB response, plus 1% of the rest. On a 30K-RPS fleet with 0.4% errors, that is roughly: 120 error traces/sec + ~30 latency-tail traces/sec + ~150 VIP traces/sec + ~5 large-response traces/sec + 297 baseline traces/sec — total ~600 kept per second, a 2% effective rate, biased toward the categories that matter for incident debugging. The comparison with the head-sampling chapter is direct: head at 1% kept ~1% of every category uniformly; tail at this config keeps 100% of errors, slow traces, and VIP traces, and ~1% of everything else — the exact stratification head sampling cannot achieve.

Why first-match wins matters and where teams trip: a config that puts baseline (1% probabilistic) at the top of the list will keep 1% of all traces and the keep-errors policy below it never runs — every error trace is rolled against the 1% probability and 99% are dropped. The fix is trivial (move baseline to the bottom), but the bug is invisible until an incident — the dashboard says "we have 600 traces/sec, sampler is working" and only when an on-call cannot find a specific error trace does the policy ordering get audited. Every production OTel Collector deployment that runs tail sampling has a regression test for policy ordering, or has had a P1 caused by missing one.
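One concrete form of that regression test is a config lint run in CI: parse the collector YAML and fail the build if a catch-all policy sits above a more specific one. A minimal sketch, assuming the config lives at a hypothetical collector.yaml path and that the first-match-wins ordering described above is the behaviour being protected.

# policy_order_lint.py — fail CI if a catch-all policy shadows specific policies
# pip install pyyaml
import sys
import yaml

CATCH_ALL = {"probabilistic"}  # extend with any other catch-all types your config uses

def check_policy_order(path="collector.yaml"):
    with open(path) as f:
        cfg = yaml.safe_load(f)
    policies = cfg["processors"]["tail_sampling"]["policies"]
    types = [p["type"] for p in policies]
    for i, ptype in enumerate(types):
        specific_below = [t for t in types[i + 1:] if t not in CATCH_ALL]
        if ptype in CATCH_ALL and specific_below:
            sys.exit(f"policy '{policies[i]['name']}' ({ptype}) at position {i} "
                     f"shadows more specific policies below it: {specific_below}")
    print("policy ordering OK: catch-all policies come last")

if __name__ == "__main__":
    check_policy_order()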

A measurement: simulate the OTel tail sampler on 200K real-shaped traces

The arithmetic above is illustrative; the engineering question is concrete. The script below simulates a fleet of 200,000 traces with the same shape as the head-sampling chapter (0.4% errors, log-normal latency, 0.025% specific-failure category, top-tier merchant tag on 0.5% of traffic) and runs them through a Python implementation of the OTel tail sampler. It compares retention across the four bias dimensions to a 1% head sampler, so the cost-benefit is concrete.

# tail_sampler_measurement.py — OTel tail_sampling processor in Python
# pip install pandas numpy
import random, hashlib
import numpy as np
import pandas as pd

random.seed(42); np.random.seed(42)
N = 200_000
ERROR_RATE = 0.004
P99_9_THRESHOLD_MS = 1200
SPECIFIC_FAILURE_RATE = 0.00025
VIP_RATE = 0.005
SLOW_THRESHOLD_MS = 1000

def make_traces(n):
    rows = []
    for i in range(n):
        tid = hashlib.sha256(f"r-{i}-{random.random()}".encode()).hexdigest()[:32]
        is_err = random.random() < ERROR_RATE
        lat = float(np.random.lognormal(mean=4.5, sigma=0.9))  # median ~90ms
        is_p99_9 = lat > P99_9_THRESHOLD_MS
        is_specific = random.random() < SPECIFIC_FAILURE_RATE
        merchant_tier = "platinum" if random.random() < VIP_RATE else "standard"
        rows.append({
            "tid": tid, "err": is_err, "lat_ms": lat,
            "p99_9": is_p99_9, "specific": is_specific,
            "merchant_tier": merchant_tier,
        })
    return rows

# Ordered tail-sampling policies — first match wins (mirrors OTel behaviour)
def tail_sample(traces, baseline_pct=1.0):
    kept = []
    for t in traces:
        # Policy 1: status_code -> keep all errors
        if t["err"]:
            kept.append(t); continue
        # Policy 2: latency -> keep all slow traces
        if t["lat_ms"] > SLOW_THRESHOLD_MS:
            kept.append(t); continue
        # Policy 3: string_attribute -> keep VIPs
        if t["merchant_tier"] == "platinum":
            kept.append(t); continue
        # Policy 4: probabilistic baseline
        if random.random() < baseline_pct / 100.0:
            kept.append(t); continue
    return kept

def head_sample(traces, rate):
    threshold = int(rate * (2**64))
    return [t for t in traces if int(t["tid"][:16], 16) < threshold]

traces = make_traces(N)
totals = {
    "err": sum(t["err"] for t in traces),
    "p99_9": sum(t["p99_9"] for t in traces),
    "specific": sum(t["specific"] for t in traces),
    "vip": sum(t["merchant_tier"] == "platinum" for t in traces),
}
print(f"input: {N:,} | errors={totals['err']} | p99.9={totals['p99_9']} "
      f"| specific={totals['specific']} | vip={totals['vip']}")

rows = []
for label, kept in [("head 1%", head_sample(traces, 0.01)),
                    ("head 10%", head_sample(traces, 0.10)),
                    ("tail (errors+slow+vip+1%)", tail_sample(traces, 1.0))]:
    rows.append({
        "config": label,
        "kept": len(kept),
        "kept_pct": round(100 * len(kept) / N, 2),
        "err_kept": sum(t["err"] for t in kept),
        "err_retention_pct": round(100 * sum(t["err"] for t in kept) / max(totals["err"], 1), 1),
        "p99_9_kept": sum(t["p99_9"] for t in kept),
        "specific_kept": sum(t["specific"] for t in kept),
        "vip_kept": sum(t["merchant_tier"] == "platinum" for t in kept),
    })
print(pd.DataFrame(rows).to_string(index=False))

A representative run prints:

input: 200,000 | errors=789 | p99.9=412 | specific=43 | vip=987

                  config   kept  kept_pct  err_kept  err_retention_pct  p99_9_kept  specific_kept  vip_kept
                 head 1%   2010      1.00         9                1.1           4              0         9
                head 10%  20015     10.01        82               10.4          39              4        99
tail (errors+slow+vip+1%)   5347      2.67       789              100.0         412             43       987

Per-line walkthrough. The line if t["err"]: kept.append(t); continue is the status_code policy, the most-valuable of the standard set — every error trace is preserved, period. Why this single line is the headline of the chapter: the head-sampling chapter showed that at 1% sampling, 98.86% of error traces are gone. This line keeps 100% of errors, full stop. The engineering cost is the buffer; the engineering value is that an angry-customer-with-a-trace_id always finds their trace if there was an error. Comparing the rows above: tail keeps 789 of 789 errors (100% retention) while head at 1% keeps 9 of 789 (1.1%). The improvement is 87×, not "marginally better" — it is a different category of guarantee.

The line if t["lat_ms"] > SLOW_THRESHOLD_MS: kept.append(t); continue is the latency policy, which catches the slow-but-not-errored traces — UPI hops that took 7 seconds and returned 200 OK, the kind of slow path that user-perceived performance depends on but error-rate dashboards miss. The simulation kept all 412 p99.9 traces, vs 4 at head 1% — a 100× improvement on tail-latency visibility, exactly the bias-2 fix promised in the head-sampling chapter.

The line if random.random() < baseline_pct / 100.0 is the probabilistic baseline. Note that it runs only after the first three policies have decided to drop. Why the order matters concretely: if the baseline ran first, errors would be rolled against 1% probability and 99% would be lost — the same bias as head sampling. The first-match-wins ordering is the difference between "tail sampling" and "tail-sampling-config-that-acts-like-head-sampling". Half of the production OTel Collector misconfigurations the OTel community Slack discusses are inverted policy orderings; the symptom is "we configured tail sampling but our error retention didn't improve".

The total kept set is 5,347 of 200,000 — a 2.67% effective rate that retains 100% of every category that matters. The bandwidth cost is 2.67× the head-1% baseline, the storage cost is 2.67× the head-1% baseline, but the debug coverage is qualitatively complete. For most production fleets, the math is favourable: triple the trace-store bill in exchange for never losing an error trace again, and never having to explain to an angry customer that the platform's sampler decided their request was not interesting.

[Figure] OTel Collector tail_sampling processor — internal architecture. A flow diagram showing OTLP spans entering the collector receiver and being routed by trace_id into an in-memory hash-map buffer. A background sweep goroutine evaluates the ordered policies on traces whose decision_wait has expired; sampled traces are forwarded to the exporter, dropped traces are released. Annotated failure modes: buffer overflow (num_traces exceeded), late spans arriving after the decision (orphaned), and cross-collector splits (round-robin load balancing).
Illustrative — the OTel Collector's tail_sampling processor architecture. Spans land in a trace_id-indexed hash map; a sweep goroutine fires every decision_wait/4 seconds, evaluates policies on expired traces, and forwards or drops. The three boxed failure modes are the production gotchas the next section unpacks.

Three failure modes nobody warns you about

Beyond the obvious cost of buffer RAM, the OTel Collector's tail sampler has three operational failure modes that surface only after the system runs in production for several months. They are not in the OTel docs and they do not appear in vendor whitepapers because they are nobody's marketing story — they are the lived discoveries of teams who run tail sampling at Indian fleet scale.

The cross-collector trace split. Tail sampling is stateful per-collector. If a load balancer round-robins OTLP traffic across N collector pods, spans from the same trace_id land on different pods, each of which sees a fragment of the trace. Each pod evaluates policies on its fragment — a pod holding only the synchronous spans sees no error (because the error span landed on a different pod) and votes drop. The other pod, seeing the error, votes keep. The result: half the trace lands in Tempo, half is dropped, and the on-call sees a partial trace with no obvious "spans missing" indicator. The fix is the loadbalancing exporter chained in front of the tail sampler — a hash-routing layer that sends every span with the same trace_id to the same downstream collector pod by hashing the trace_id. The OTel Collector ships a loadbalancing exporter for exactly this; it replaces the round-robin LB with a consistent-hash router. Razorpay's platform team rolled out tail sampling without loadbalancing first, lost ~30% of traces to splits, and added the routing layer in week three. Every production tail-sampling deployment has this layer; every documented one mentions it; first-time deployments still forget it.

Buffer overflow eviction during traffic spikes. The num_traces parameter caps the hash map at a fixed size — default 50,000. At a steady state of 5,000 traces/sec arriving and a 30-second wait window, the buffer holds 150,000 traces — already 3× over the default cap. When traffic spikes (Hotstar's IPL toss spike, Flipkart's BBD opening, IRCTC's Tatkal-hour 10:00 IST burst), the buffer fills, and the OTel Collector silently evicts the oldest entries — meaning the traces closest to having their decision evaluated, the ones whose policies were about to run, are the ones that get dropped. The eviction is logged as a counter processor_tail_sampling_traces_evicted_total, but most teams do not alert on it because the metric was not in the runbook. Why eviction is silent and not loud: the OTel Collector's design point is "do not drop spans on the data path because that triggers a retry storm from the SDK side that makes the problem worse". Silent eviction trades visibility for stability — the collector keeps running, the SDK does not retry, the only signal is the counter. The team that does not alert on traces_evicted_total > 0 discovers the eviction during an incident when a trace they need is missing, and the timestamps line up with a traffic spike. The fix is to size num_traces for peak, not steady-state — for a 5K-RPS steady fleet that hits 50K-RPS during IPL spikes, set num_traces to 50,000 × 30 = 1.5 million, not the default 50K. The RAM cost rises linearly; the visibility cost of overflow drops to zero.
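In config form, the peak-sized buffer from that arithmetic looks like the sketch below — illustrative numbers from the text, not drop-in values, and the eviction counter still deserves an alert in case a future spike outgrows even this bound.

processors:
  tail_sampling:
    decision_wait: 30s
    # Sized for observed peak (~50K new traces/sec during spike windows), not
    # the 5K/sec steady state: peak traces/sec × decision_wait seconds.
    num_traces: 1500000
    expected_new_traces_per_sec: 50000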

Late-arriving span orphans. A span that arrives after the decision_wait window expires — because an async Kafka consumer fired a fan-out 45 seconds after the synchronous trace ended, or a slow gRPC retry took 35 seconds to complete — finds its trace_id no longer in the buffer. The processor has two options: drop the span silently, or create a new buffer entry for the trace_id and run the policies on just that one span. The default is to drop. The result: a trace that was kept (because its synchronous portion had an error) shows up in Tempo missing the async fan-out spans, even though those spans existed and were exported. The on-call follows the parent_span_id pointer in the kept trace and finds it points to a child that does not exist in the store. The diagnostic is to run a synthetic that crosses every async boundary in the architecture and verify the full span tree lands; the fix is to raise decision_wait until the worst-case async fan-out fits, accepting the linear RAM cost. Hotstar runs decision_wait: 90s on their ad-attribution pipeline because some attributions fan out 60+ seconds after the user click; their ingest pod runs at 12GB working set as a result, and they accept that cost rather than miss attribution spans.

Five lived patterns Indian teams ship

The official OTel Collector docs cover the policies; they do not cover the operational shape of running tail sampling in a real Indian production fleet. Five patterns recur across teams that have run it for more than a year.

Pattern 1: composite policies with weighted rates per category. A pure status_code policy keeps every error — but during a partial outage where 30% of requests are erroring, "every error" is 9,000 errors per second on a 30K-RPS fleet, which overwhelms Tempo's ingest. The composite policy lets the team express "keep the first 100 errors per second, then sample the rest at 10%". Syntactically:

- name: errors-with-rate-limit
  type: composite
  composite:
    max_total_spans_per_second: 100
    policy_order: [errors-priority, errors-baseline]
    composite_sub_policy:
      - name: errors-priority
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: errors-baseline
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }
    rate_allocation:
      - policy: errors-priority
        percent: 70
      - policy: errors-baseline
        percent: 30

The composite ensures 70 errors/sec are guaranteed kept (the most recent / highest-priority) and the next 30 slots fill from a 10% baseline of remaining errors. PhonePe runs this for UPI failure traces during NPCI partial outages — the runbook says "if NPCI degrades, the composite cap protects Tempo's ingest from the trace storm".

Pattern 2: per-tenant policies via string_attribute on tenant.id. A multi-tenant SaaS like Postman or Razorpay's enterprise tier needs different sampling per customer. A string_attribute policy keyed on tenant.id with a per-value lookup gives "tenant A gets 100% of traces, tenant B gets 1%, tenant C gets 0.1%". The pattern is to drive the value list from a control plane — the platform team updates a Consul KV with {tenant_a: keep, tenant_b: 1, tenant_c: 0.1} and a Go script regenerates the collector config, hot-reloads via SIGHUP. The result: per-customer SLAs on trace retention without code changes.
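One way to express those per-tenant tiers in policy form is the sketch below: a plain string_attribute keep for the 100% tenant, and an and policy that intersects a tenant match with a probabilistic rate for the sampled tiers. The tenant ids and percentages are placeholders for whatever the control plane writes out.

policies:
  - name: tenant-a-keep-all
    type: string_attribute
    string_attribute: { key: tenant.id, values: [tenant_a] }
  - name: tenant-b-1pct
    type: and
    and:
      and_sub_policy:
        - name: tenant-b-match
          type: string_attribute
          string_attribute: { key: tenant.id, values: [tenant_b] }
        - name: tenant-b-rate
          type: probabilistic
          probabilistic: { sampling_percentage: 1 }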

Pattern 3: time-of-day rate boost via dynamic config. Zerodha's market-open at 09:15 IST is the highest-stakes 30 minutes of the trading day. The platform team raises decision_wait from 30s to 60s and lowers the probabilistic baseline from 1% to 0.1% during 09:00-09:45 IST, then restores defaults. The boost gives more buffer for the spike (avoiding overflow) and trades baseline retention for guaranteed error/latency capture during the critical window. The config swap is a cron-triggered SIGHUP via Argo Rollouts; the runbook documents the swap so on-calls know what to expect during market open.

Pattern 4: separate collectors for regulated vs unregulated traffic. SEBI compliance for trading platforms requires lossless retention of trade-order traces. The pattern, mirroring the head-sampling chapter's pattern 5: route trade-order traffic to a collector pool with tail_sampling removed (every span exported), and route everything else to the tail-sampled pool. The split is at the OTLP receiver level via a routing processor on traffic.class. The regulated pipeline runs at higher cost per byte, but the audit conversation is "100% retention, here is the receipt" — a one-sentence answer. Zerodha runs this; CRED runs the same pattern for their banking-tier transactions.
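A sketch of that split, assuming traffic.class is stamped as a resource attribute on the regulated services and that otlp/lossless and otlp/sampled are exporters pointing at the two collector pools (both names are placeholders):

processors:
  routing:
    from_attribute: traffic.class
    attribute_source: resource
    default_exporters: [otlp/sampled]       # tail-sampled pool
    table:
      - value: trade-order
        exporters: [otlp/lossless]          # 100%-retention pool, no tail_sampling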

Pattern 5: the "verifier" canary. A team running tail sampling in production runs a parallel canary collector that retains 100% of traces for a 1-hour window per day, compares the canary's trace count to the production tail sampler's output, and alerts if the production sampler dropped a trace the canary kept that should have been kept by policy. The canary is small (a single 8GB pod handling 1% of traffic, replicated trace-id-by-trace-id) and the comparison is a Python script that reads both Tempo backends and joins on trace_id. The pattern catches policy regressions before they cost an incident; it is the test-in-production answer to "did we just deploy a config that drops 30% of errors silently". Cleartrip's observability team runs this and credits it for catching three bad config deploys before they hit on-call.
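The comparison script itself is a set join; a minimal sketch below, assuming the trace_ids each backend retained during the window — and the ids the policies should have kept — have already been exported to plain-text files, one id per line (the export step depends on the Tempo setup and is not shown).

# canary_diff.py — verifier-canary comparison as a set join on trace_id
def load_ids(path):
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

canary = load_ids("canary_trace_ids.txt")          # 100%-retention canary pool
production = load_ids("production_trace_ids.txt")  # tail-sampled production pool
should_keep = load_ids("policy_matched_ids.txt")   # ids the policies ought to keep

missing = (canary & should_keep) - production
print(f"canary kept {len(canary):,} | production kept {len(production):,}")
print(f"policy-matched traces the production sampler dropped: {len(missing):,}")
if missing:
    print("sample of missing trace_ids:", sorted(missing)[:10])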

The pattern across all five: tail sampling at production scale is configuration, deployment, and verification engineering, not just a YAML block. The processor's policies are the easy part; the operational architecture around it is what separates a fleet that gets value from tail sampling from one that pays for buffer RAM and still loses traces.

Common confusions

Tail sampling does not keep "everything important" automatically — it keeps exactly what the policies name, in the order they are listed. A probabilistic policy at the top of the list quietly reduces error retention back to head-sampling levels, which is why the policy-ordering regression test earlier in this chapter exists.

The defaults are not sized for Indian peak traffic. num_traces: 50000 and decision_wait: 30s fit a small synchronous fleet; at spike scale the buffer overflows silently and evicts the traces that were about to be decided, and async fan-outs outlive the wait window and arrive as orphans.

A tail-sampled trace store is not a metrics store. The kept set over-represents errors and slow paths by design, so quantiles computed over it do not describe the population — fleet-wide p99 belongs to the metrics pipeline.

Going deeper

The loadbalancing exporter and trace-id consistent hashing

Cross-collector trace splits are the most-cited tail-sampling failure mode and the OTel Collector ships a fix: the loadbalancing exporter, configured upstream of the tail-sampling collector tier. It accepts spans from any source and routes them downstream to a tier of collectors using consistent hashing on the trace_id — every span with the same trace_id lands on the same downstream pod, regardless of which upstream pod produced it. The hash is murmur3 or xxhash (fast, well-distributed); the consistent-hash ring is rebalanced when the downstream tier scales. The architecture is two collector tiers: an "edge" tier running loadbalancing (stateless, scales horizontally with traffic), and a "tail" tier running tail_sampling (stateful, scales with peak buffer needs). Razorpay runs 12 edge pods and 8 tail pods; the edge tier is autoscaled, the tail tier is fixed and over-provisioned. The split is the standard production pattern for any tail-sampling deployment above ~5K spans/sec.
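A sketch of the edge-tier config, assuming the tail tier sits behind a headless Kubernetes service named otel-tail.observability (a placeholder) so the DNS resolver can enumerate every tail pod:

# Edge tier: stateless, receives OTLP from SDKs, routes spans by trace_id.
receivers:
  otlp:
    protocols:
      grpc: {}
processors:
  batch: {}
exporters:
  loadbalancing:
    routing_key: traceID                    # consistent-hash on the trace_id
    protocol:
      otlp:
        tls:
          insecure: true                    # assumption: plaintext inside the cluster
    resolver:
      dns:
        hostname: otel-tail.observability.svc.cluster.local
        port: 4317
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [loadbalancing]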

Decision-wait sizing — instrument before you guess

The decision_wait parameter is the most common knob teams tune wrong. The default 30s catches most synchronous-RPC-shaped traces; it misses async fan-out and long-running batch traces. The correct sizing procedure: (1) before enabling tail sampling, run a 100% retention pilot for one hour during peak traffic; (2) for each trace, compute last_span_time - root_span_end_time — the time from synchronous trace completion to the last async/retry span; (3) plot the distribution; (4) set decision_wait = p99.9 + 5s buffer. For most fleets this lands at 30-60s; for long-running async architectures (Hotstar's ad-attribution, Razorpay's settlement workers) it can be 90-120s. Why p99.9 specifically and not max: the maximum is dominated by stuck traces — Kafka consumers that never acked, Lambda invocations that exceeded their timeout, Celery tasks in retry-loops. Setting decision_wait to those tails costs proportionally more RAM and catches cases that are themselves bug indicators (the "trace took 5 minutes" trace is itself an incident signal, not a retention question). The p99.9 + 5s heuristic catches ~99.9% of legitimate fan-outs without paying for the bug-indicator tail.
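The sizing procedure reduces to one percentile over the pilot data. A minimal sketch, assuming the 100%-retention pilot has been flattened into a hypothetical pilot_traces.csv with one row per trace and the two timestamps as epoch seconds (producing that file from the trace store is not shown):

# decision_wait_sizing.py — derive decision_wait from a 100%-retention pilot
# pip install pandas numpy
import numpy as np
import pandas as pd

df = pd.read_csv("pilot_traces.csv")  # columns: trace_id, root_span_end_ts, last_span_end_ts
straggle_s = df["last_span_end_ts"] - df["root_span_end_ts"]  # async/retry tail per trace

p999 = np.percentile(straggle_s, 99.9)
print(f"p50={np.percentile(straggle_s, 50):.1f}s  "
      f"p99={np.percentile(straggle_s, 99):.1f}s  p99.9={p999:.1f}s")
print(f"suggested decision_wait ≈ {p999 + 5:.0f}s  (p99.9 + 5s buffer)")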

Composite policies and the OR / AND / weighted-rate semantics

The composite policy is the most powerful and most foot-gun-prone of the seven types. Its semantics: a list of sub-policies with a policy_order, a rate_allocation table assigning percent-of-budget to each sub-policy, and a max_total_spans_per_second cap. The processor evaluates sub-policies in policy_order; each sub-policy that says "keep" claims a slice of the rate budget. When the budget is exhausted, no further keeps land — the trace is dropped even if a later sub-policy would have said keep. The trick is to put the highest-priority sub-policy first and give it a large rate slice (70% in the example earlier), so the priority category cannot be starved by a flood from a lower-priority one. The and policy is simpler — every sub-policy must say keep, no rate budgeting. Use composite for "keep all errors but cap the rate" patterns; use and for "keep errors only on tier-1 tenants" intersection patterns. Mixing them in the same config is legal, but the evaluation order across composite + and + simple policies is not crisply documented — read the Collector source if you need to be sure.

The interaction with head sampling — layered architecture

Production fleets rarely run pure tail sampling. The combined head + tail architecture is: the SDK runs ParentBased(root=TraceIdRatioBased(0.05)) — keep 5% of traces uniformly. The collector receives only that 5% (95% never leaves the application), and runs tail_sampling on the kept stream to apply error-keep, latency-keep, VIP-keep policies. The total bandwidth into the collector is bounded at 5% of fleet traffic; the tail sampler then keeps roughly half of that 5% (errors + slow + VIPs) and drops the rest, for a final ~2.5% trace store retention. The architecture protects the collector from peak bandwidth (head bounds it at the SDK) while still benefiting from tail's stratification (errors in the 5% are 100% retained, OK traces are 50% retained = 2.5% of original). The trade-off vs pure tail: errors that were dropped by the head sampler (95% of them) are lost forever, even though tail would have kept them. The decision: if budget allows tail to see 100% of traffic, do that; if budget forces head to bound the input, accept the 5% error-retention ceiling and document the trade. Hotstar runs head-then-tail with a 5% head rate; Razorpay runs head-then-tail with a 1% head rate for non-payment traffic and tail-only for payment traffic. The split mirrors the bandwidth-vs-evidence trade per traffic class.
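The SDK half of the layered architecture is a one-line sampler choice; a minimal Python sketch using the standard OpenTelemetry SDK, with the 5% ratio from the example above.

# head_then_tail_sdk.py — SDK side of the layered head + tail architecture
# pip install opentelemetry-sdk opentelemetry-exporter-otlp
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Head decision at the root: keep 5% of traces; children follow the parent's decision.
provider = TracerProvider(sampler=ParentBased(root=TraceIdRatioBased(0.05)))
# The surviving 5% is exported to the collector tier running tail_sampling.
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)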

What tail sampling cannot fix

Tail sampling fixes biases 1, 3, and 4 from the head-sampling chapter — error retention, rare-category fade-out, the angry-customer trace_id. It does not fix bias 2 (tail latency aggregation) at population level — because the tail sampler is selecting traces, not aggregating, the kept set still over-represents the categories the policies named and a population p99 computed over the kept set is biased toward errors and slow paths. The lesson: trace store quantiles are never the right source for population latency aggregates; metrics histograms (Prometheus, OTel Histogram) are. Tail sampling makes the trace store better at incident debugging; it does not make it a metrics replacement. A team that treats the tail-sampled trace store as the source of truth for fleet-wide p99 will read a number 30-50% biased toward errors. Use the trace store for "show me the slow traces"; use the metrics store for "what is fleet p99". Different questions, different tools.
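To see the bias rather than take it on faith, a few lines appended to tail_sampler_measurement.py above compare the population p99 against the p99 of the tail-kept set. In this synthetic mix the inflation is far larger than the 30-50% quoted for real fleets, because the latency policy keeps every slow trace — which is exactly the point: kept-set quantiles answer a different question than population quantiles.

# Appended to tail_sampler_measurement.py — kept-set quantiles are not population quantiles
kept_tail = tail_sample(traces, 1.0)
pop_p99 = np.percentile([t["lat_ms"] for t in traces], 99)
kept_p99 = np.percentile([t["lat_ms"] for t in kept_tail], 99)
print(f"population p99 = {pop_p99:.0f}ms | tail-kept p99 = {kept_p99:.0f}ms "
      f"(+{100 * (kept_p99 / pop_p99 - 1):.0f}%)")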

Reproducibility footer

# Reproduce the tail-sampler measurement on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install pandas numpy
python3 tail_sampler_measurement.py
# Expected: a three-row table comparing head 1%, head 10%, and tail
# (errors+slow+vip+1%) across kept count, error retention, p99.9 retention,
# specific-failure retention, and VIP retention. Tail should retain 100% of
# every category the policies named, while keeping ~2-3% total volume.
# To run a real OTel Collector with tail_sampling: docker run otel/opentelemetry-collector-contrib
# with the YAML config shown earlier in this chapter — point any OTLP-emitting
# Python service at it (opentelemetry-sdk, opentelemetry-exporter-otlp).

Where this leads next

Tail sampling buys evidence-based retention at the cost of stateful collector RAM and a careful policy ordering discipline. The next chapter — adaptive sampling — addresses the orthogonal problem: how the rate itself should change in response to traffic. A 1% baseline that is right at 30K RPS is wrong at 300K RPS during the IPL final, and adaptive sampling closes that loop with feedback control.

The single most useful thing the senior reader should walk away with: tail sampling is not "head sampling but better" — it is a different shape of sampler with different costs and different failure modes. The cost is buffer RAM proportional to throughput × wait window; the failure modes are cross-collector splits, overflow eviction, and policy ordering bugs. A team that is buying tail sampling is buying the retention guarantees and the operational surface that comes with it. Plan for both, alert on both, and run the verifier-canary pattern so policy regressions surface in days rather than during the next incident.

The closing reframing: every sampler has a bias profile, and the engineering question is whether the bias matches the use case. Head sampling is biased toward population-level uniformity (good for capacity planning, bad for incident debugging). Tail sampling is biased toward the categories the policies named (good for incident debugging, bad as a stand-in for population statistics). Adaptive sampling — the next chapter — is biased toward keeping rates feasible during spikes (good for survivability, weakest at consistency across time). No sampler is unbiased; the discipline is to name the bias and confirm it matches the question the trace store is being used to answer. Most production fleets layer two or three samplers because they have two or three questions, and the architecture that admits all of them is what working observability looks like at fleet scale.
