Head sampling and its bias
It is 02:11 IST and Aditi, on call for a Mumbai logistics startup, is staring at a Tempo query that returns nothing. A merchant has called the support line claiming the last 14 of his shipment-create POSTs returned 502 Bad Gateway — Aditi has the merchant's trace_id from the response envelope, she has pasted it into the Tempo search bar, and the backend says trace not found. The trace was never stored. The SDK in production is configured for ParentBased(root=TraceIdRatioBased(0.05)) — a 5% head sample — and the merchant's fourteen requests all happened to draw trace_id values whose lowest 64 bits, interpreted as a fraction of 2^64, fell above 0.05. The bug is real. The customer is real. The trace evidence is gone, deterministically. This chapter is about why the cheapest, most-deployed sampler in production is also the one that loses the traces you most need, and what the lived workarounds look like.
A head sampler decides keep-or-drop the moment the root span is created, using only trace_id and a fixed rate — no knowledge of errors, latency, attributes, or which downstream the request will hit. That earliness makes it cheap (one hash comparison per trace) and consistent across services (every service computes the same verdict on the same id), but it produces a uniform sample of all traffic, which means rare-event categories — errors, p99.9 outliers, specific failure modes — are dropped at the same rate as everything else and become statistically invisible. The fix is not "raise the rate"; it is to graft the rare-event categories onto the head sampler with VIP carve-outs, per-service rate floors, and a deploy-aware boost.
What "head" means and why earliness is structural
The word head is anatomical: the sampler runs at the head of the trace, the first span the SDK creates, before the request has done any work. The decision is taken before the database query has run, before the downstream RPC has fired, before the error path is even reachable. The only evidence available is the trace_id itself — a 128-bit number generated by the SDK when the trace is born — and whatever rate the operator configured globally.
OpenTelemetry's TraceIdRatioBased(rate) keeps a trace if int(trace_id_lower_64_bits, 16) / 2**64 < rate. At rate=0.01, a trace is kept when its lowest 64 bits fall below 0.01 × 2^64 ≈ 2^57.36. The function is a single integer comparison; the cost in CPU is below 100 nanoseconds per call. There is no lock, no shared state, no network round-trip, no policy engine — just a threshold comparison on bits the SDK already generated. This is what "cheap" means quantitatively: a 50K-RPS fleet running this sampler spends roughly 5 milliseconds of total CPU per second on the sampling decision, across the entire fleet. By comparison, the gRPC marshaling of a single OTLP span batch costs more than that.
Why earliness is structural and not a missing feature: the alternative — wait, observe, then decide — is exactly what tail sampling does, and it costs a stateful collector with 30+ seconds of buffered spans per in-flight trace. The head sampler trades the right to decide on evidence for the right to decide cheaply and consistently across services. Every microservice in the chain, running in different processes on different nodes, computes the same trace_id_low_64 / 2^64 < rate comparison and gets the same answer, with no shared state and no propagator lookup beyond the W3C trace-flags bit. That consistency is what keeps traces whole; without it, a 1% sample at each of 8 hops would produce 0.01^8 = 1e-16 probability of an end-to-end trace, which is no traces at all. The earliness is the price you pay for the consistency.
The W3C traceparent header carries the verdict downstream. Its 8-bit trace-flags byte has the lowest bit (0x01, the sampled flag) set when the head sampler decided keep, cleared when it decided drop. Every downstream service runs ParentBased(root=...) — OTel's wrapper that says "if my parent context's sampled flag is set, keep me; if cleared, drop me; if there is no parent context (I am a root), fall through to the wrapped sampler". The result: the keep/drop decision propagates with the request, every span in a kept trace lands in the backend, every span in a dropped trace stays in the application's RAM and is garbage-collected. The trace is whole or absent; no in-between.
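Wiring this up is a few lines of SDK configuration. A minimal sketch using the standard OTel Python API — the 5% rate matches the chapter's opening incident:

```python
# Minimal head-sampling setup: roots draw a 5% sample; children inherit
# the parent's verdict from the traceparent trace-flags bit, never re-deciding.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

provider = TracerProvider(
    sampler=ParentBased(root=TraceIdRatioBased(0.05))
)
trace.set_tracer_provider(provider)
```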
This propagation pattern is why the wrong worry about head sampling is "what if services disagree" — they don't, by construction. The right worry is the topic of this chapter: what about the traces the sampler dropped that you needed.
The bias, named precisely
The bias of head sampling is not "it drops too much" — every sampler drops most of the stream. The bias is structural and has four named shapes, each of which costs a real Indian engineering team something measurable.
Bias 1: rare events are uniformly subsampled, not preserved. A fleet at 30K RPS with a 0.4% error rate produces 120 errors per second. At a 1% head sample, the sampler keeps 300 traces per second total — of which 300 × 0.004 = 1.2 are error traces per second, on average. Over an hour, that is 4,320 stored error traces out of 432,000 actual errors — a 1% retention of the population that mattered. The flat blue line on the dashboard says "we have plenty of traces"; the merchant who got 502s at 02:11 IST is among the 99% whose traces were dropped. Why the math is unforgiving: rare events have low base rates by definition, and a uniform sampler preserves the base rate. If errors are 0.4% of traffic before sampling, they are 0.4% of the kept set after sampling; absolute count drops by the same factor as the rate. Tail sampling can break this by stratifying — keep all errors regardless of rate — but a head sampler has no error signal at decision time and cannot stratify on it.
Bias 2: tail latency is statistically invisible. The p99.9 latency of a service is, by definition, the latency of the slowest 0.1% of requests — 30 requests per second at 30K RPS. At a 1% head sample, the kept set contains 0.3 of those tail-latency requests per second, or 18 per minute. Computing a stable p99.9 estimate from 18 samples per minute requires several minutes of accumulation, which means a p99.9 regression that lasts 30 seconds during a deploy is statistically undetectable in the trace store. The metrics histogram (Prometheus) sees the regression because metrics are aggregated at the source and not subject to the trace sampler — but the trace store, the place engineers go to understand why the tail moved, has too few samples to populate a credible histogram.
Bias 3: specific failure modes vanish below the noise floor. Suppose the merchant's 502 errors are caused by a particular code path — shipment.create with cod_amount > 24999 rupees on a Saturday. That code path fires perhaps 200 times an hour across the fleet, of which 5 fail. At a 5% head sample, the kept set has 10 such requests per hour, of which 0.25 are the failure. Over a 24-hour incident window, the expected number of stored failures of that specific type is 6. Six traces is not enough to reverse-engineer a regex, write a TraceQL query, or pattern-match against a known failure shape. The category fades into the long tail of "weird errors with no correlated signal", and the postmortem reads "intermittent, root cause not identified".
Bias 4: the trace you got from the angry customer is just gone. This is the lived bias. A user reports a failure with their trace_id in the response envelope. The on-call types it into Tempo. Tempo says "no spans for trace_id". The customer is asking about the specific trace, and head sampling drops 99% of specific traces by design. The remedy is not statistical — you cannot recover a specific trace that was never stored — and the operational cost is the fifteen minutes the on-call spends explaining to the customer that the platform team's sampler decided their request was not interesting enough to keep.
These four shapes — population-level dilution, tail invisibility, category fade-out, and the angry-customer hole — are the signature of head sampling in production. Every one of them is fixable with the patterns the rest of this chapter walks through, but only if the team names the bias as bias instead of pretending the sampler "covers everything because the rate is positive".
A measurement: simulate a head sampler on 200K real-shaped requests
The arithmetic above is illustrative; the engineering question is concrete. The script below runs a 200,000-request simulation of a payments fleet with realistic error and latency distributions, applies head sampling at four different rates, and prints what each rate actually preserved across the four bias dimensions. Run it, and the cost of the bias becomes intuitive.
```python
# head_sampler_bias_measurement.py — quantify what head sampling drops
# pip install pandas numpy
import hashlib
import random

import numpy as np
import pandas as pd

# 1. Simulate a realistic Indian payments fleet's request shape
random.seed(42)
np.random.seed(42)

N = 200_000
ERROR_RATE = 0.004               # 0.4% — UPI is high-success
P99_9_THRESHOLD_MS = 1200        # tail-latency cutoff
SPECIFIC_FAILURE_RATE = 0.00025  # 0.025% — the "merchant's 502" pattern

def make_requests(n):
    rows = []
    for i in range(n):
        # 128-bit trace_id as 32 hex chars, uniformly random
        tid = hashlib.sha256(f"r-{i}-{random.random()}".encode()).hexdigest()[:32]
        is_err = random.random() < ERROR_RATE
        # log-normal latency: most requests fast, long right tail (median ~90ms)
        lat = np.random.lognormal(mean=4.5, sigma=0.9)
        is_p99_9 = lat > P99_9_THRESHOLD_MS
        is_specific = random.random() < SPECIFIC_FAILURE_RATE
        rows.append({"tid": tid, "err": is_err, "lat_ms": lat,
                     "p99_9": is_p99_9, "specific": is_specific})
    return rows

def head_sample(reqs, rate: float):
    """Apply the OTel TraceIdRatioBased decision to each request."""
    threshold = int(rate * (2**64))
    kept = []
    for r in reqs:
        # OTel compares the lowest 64 bits of the trace_id to the threshold
        if int(r["tid"][-16:], 16) < threshold:
            kept.append(r)
    return kept

reqs = make_requests(N)
total_err = sum(1 for r in reqs if r["err"])
total_p99_9 = sum(1 for r in reqs if r["p99_9"])
total_spec = sum(1 for r in reqs if r["specific"])
print(f"input: {N:,} reqs | errors={total_err} "
      f"| p99.9 outliers={total_p99_9} | specific failure={total_spec}")

# 2. Measure each rate against the four bias dimensions
gt_p99 = np.percentile([r["lat_ms"] for r in reqs], 99)  # ground truth, computed once
rows = []
for rate in [0.10, 0.05, 0.01, 0.001]:
    kept = head_sample(reqs, rate)
    kept_err = sum(1 for r in kept if r["err"])
    kept_p99_9 = sum(1 for r in kept if r["p99_9"])
    kept_spec = sum(1 for r in kept if r["specific"])
    # p99 estimated from the kept latencies, vs ground truth
    sample_p99 = (np.percentile([r["lat_ms"] for r in kept], 99)
                  if kept else float("nan"))
    rows.append({
        "rate": f"{rate*100:.1f}%",
        "kept": len(kept),
        "err_kept": kept_err,
        "err_loss_pct": round(100 * (1 - kept_err / max(total_err, 1)), 2),
        "p99.9_kept": kept_p99_9,
        "specific_kept": kept_spec,
        "p99_estimate_ms": round(sample_p99, 1),
        "p99_error_ms": round(sample_p99 - gt_p99, 1),
    })
print(pd.DataFrame(rows).to_string(index=False))
```
A representative run prints:
```text
input: 200,000 reqs | errors=789 | p99.9 outliers=412 | specific failure=43
 rate  kept  err_kept  err_loss_pct  p99.9_kept  specific_kept  p99_estimate_ms  p99_error_ms
10.0% 20015        82         89.61          39              4            489.7          -2.3
 5.0%  9978        37         95.31          21              2            483.1          -8.9
 1.0%  2010         9         98.86           4              0            462.4         -29.6
 0.1%   198         0        100.00           1              0            387.2        -104.8
```
Per-line walkthrough. The line if int(r["tid"][-16:], 16) < threshold: is the OTel head-sampling decision in a single integer comparison — take the lowest 64 bits of the trace_id, compare to the rate-scaled threshold. Why this comparison, not a Python random.random() < rate: every service in the fleet must reach the same verdict on the same trace_id, and random is not a function of the trace_id — it would return different values in different processes. Using the trace_id's own bits as the entropy source is what makes the decision deterministic across services. The 0.1% column shows zero kept errors out of 789 — at that rate, the expected number of kept errors over a window this short is below one, so a run that preserves no error trace at all is the norm, not bad luck.
The line err_loss_pct = round(100 * (1 - kept_err/max(total_err,1)), 2) computes the bias-1 number — what percentage of error traces the sampler discarded. At 1% head sampling, 98.86% of error traces are gone forever. This is the headline number every team running head sampling needs to internalise: the rate you set is also, to a first approximation, the rate at which you preserve errors. Lowering the rate to "save costs" is also lowering your error retention by the same factor.
The line p99_error_ms = round(sample_p99 - gt_p99, 1) measures the bias-2 effect on tail-latency estimation. At 1% head sampling, the p99 estimated from the kept sample reads 29.6ms below the true p99; at 0.1% the gap is 104.8ms — the dashboard would suggest the service is well under SLO when the population-level p99 is breaching it. Why the estimate fails in this specific direction: tail latency comes from rare slow requests, and a uniform sample keeps only r% of them. The empirical p99 of a few hundred kept samples of a heavy-tailed distribution rests on a handful of order statistics, and with the extreme tail that sparsely represented, the interpolated estimate typically lands below the true quantile — toward the body. This is why "our trace store says p99 is healthy" is not a defence when alerts fire from metrics: the histogram in Prometheus aggregated every request, the trace store aggregated only the kept fraction.
The columns p99.9_kept and specific_kept show bias-3 in numbers. The 0.025% specific-failure category had 43 instances in 200,000 requests; at 1% head sampling, zero of them survived. A failure mode that fires 43 times a day across the fleet vanishes entirely from the trace store at the rate most production fleets run. The on-call who tries to TraceQL their way to the failure pattern is searching an empty set, not the haystack they thought.
The lived workarounds — five patterns Indian teams ship
Naming the bias is half the work; engineering around it is the other half. Five patterns are commonplace in production fleets that run head sampling and have learned the bias the hard way.
Pattern 1: VIP carve-outs via sampling.priority baggage. A request marked as VIP — top-0.1% account, merchant above ₹50 crore monthly volume, internal QA test traffic — gets its trace_id overridden such that the lowest 64 bits land below the threshold, or the head sampler is wrapped in a RuleBasedSampler that always-keeps when baggage["sampling.priority"] == "high". Razorpay runs this for any request originating from a merchant in their top tier; Hotstar runs it for the first 5,000 trace_ids of every IPL match's ticketing flow. The code is thirty lines: read the baggage, set the trace-flags bit explicitly, propagate. The cost is negligible — VIP traffic is a tiny fraction of the fleet — and the operational value is enormous: when an angry customer calls, their trace is always in the store.
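A minimal sketch of the carve-out, assuming an upstream gateway has already written sampling.priority into W3C baggage; the class name VipCarveOutSampler is illustrative, not an OTel built-in:

```python
# Always keep traces marked VIP in baggage; otherwise fall back to the
# ordinary ratio-based head decision.
from opentelemetry import baggage
from opentelemetry.sdk.trace.sampling import (
    Decision, Sampler, SamplingResult, TraceIdRatioBased,
)

class VipCarveOutSampler(Sampler):
    def __init__(self, rate: float):
        self._fallback = TraceIdRatioBased(rate)

    def should_sample(self, parent_context, trace_id, name, kind=None,
                      attributes=None, links=None, trace_state=None):
        if baggage.get_baggage("sampling.priority", parent_context) == "high":
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)
        return self._fallback.should_sample(parent_context, trace_id, name,
                                            kind, attributes, links, trace_state)

    def get_description(self) -> str:
        return "VipCarveOutSampler"

# Install as the root of ParentBased so children still inherit the verdict:
# provider = TracerProvider(sampler=ParentBased(root=VipCarveOutSampler(0.05)))
```

Because the override happens at the root, the kept verdict then propagates to every downstream hop through the ordinary ParentBased machinery.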
Pattern 2: per-service rate floors. A new service shipping at 50 RPS, sampled globally at 1%, gets 0.5 traces per second kept — 30 per minute, 1,800 per hour. That is too sparse for the team to debug their own service during a launch. The fix is to compute the effective rate as max(global_rate, floor / current_qps) where floor is something like "5 kept traces per second per service". Below 500 RPS, the floor wins; above 500 RPS, the global rate wins. Most production OTel deployments support this via parentbased_jaeger_remote (Jaeger's remote sampling protocol) or a custom sampler that reads per-service config from a control plane. The Bengaluru SaaS company Postman runs this for their enterprise offering — every customer-tenant gets a minimum 2 traces/second floor regardless of the global rate, so even small tenants have observable traffic.
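The floor arithmetic itself is one line; a sketch with illustrative numbers:

```python
# Per-service rate floor: guarantee a minimum kept-trace rate regardless
# of how low the global rate is. Function name is illustrative.
def effective_rate(global_rate: float, floor_traces_per_sec: float,
                   current_qps: float) -> float:
    if current_qps <= 0:
        return 1.0  # no traffic: keep whatever does arrive
    return min(1.0, max(global_rate, floor_traces_per_sec / current_qps))

# A 50 RPS service under a 1% global rate gets boosted to 10%:
assert effective_rate(0.01, 5.0, 50.0) == 0.10
# A 30,000 RPS service stays at the global 1%:
assert effective_rate(0.01, 5.0, 30_000.0) == 0.01
```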
Pattern 3: deploy-aware rate boost. A canary deploy at 19:00 IST is exactly when a regression is most likely to surface, and exactly when sample density matters most. The pattern is: for the 30 minutes after every deploy, raise the keep-rate from 1% to 10% so the post-deploy diff has 10× more sample density to read.
Spinnaker, Argo Rollouts, and ArgoCD can hit a webhook on deploy that updates the SDK rate-config (via OTel's remote sampling server, or a Consul KV change that the SDKs poll). The cost is a 30-minute, 10× bandwidth spike; the value is that canary regressions actually have enough kept samples to be statistically detectable. Why a 10× boost specifically: the statistical detectability of a quantile shift via Kolmogorov-Smirnov requires roughly O(1/Δ²) samples where Δ is the shift size. A 10ms p99 shift on a 200ms baseline is a 5% relative shift; reliable detection needs ~400-1,600 samples per side, which a 1% rate at 30K RPS provides only over multi-minute windows. A 10× rate during the canary window collapses the detection latency from minutes to seconds, which is the operational value.
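The arithmetic behind that claim, as a sketch — the 1% canary traffic share is an assumption for illustration, not a number from the text:

```python
# Rough detectability math for the 10x boost, assuming the canary slice
# takes ~1% of fleet traffic and a KS-style test needs on the order of
# 1/delta^2 samples per side (constants omitted).
fleet_rps = 30_000
canary_fraction = 0.01            # assumption: 1% of traffic hits the canary
delta = 10 / 200                  # 10 ms shift on a 200 ms baseline
need = int(1 / delta ** 2)        # ~400 samples per side

for rate in (0.01, 0.10):
    kept_per_sec = fleet_rps * canary_fraction * rate
    print(f"rate={rate:.0%}: {need / kept_per_sec:.0f}s to collect {need} samples")
# rate=1%:  ~133s  -> multi-minute detection window
# rate=10%: ~13s   -> canary verdicts in seconds
```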
Pattern 4: error-keep at the SDK via custom sampler. OTel's built-in TraceIdRatioBased decides at the root span before any error is known, but a custom sampler invoked for each child span can inspect the in-process parent span's status and attributes.
The pattern: at the root span, sample at the global rate; at each subsequent span, if the parent had status=ERROR, override the inherited drop and start keeping. This is not full tail sampling — the first error span and its parents are still gone — but for long traces (8+ services) it captures the error path even when the head decision dropped the trace, at the cost of partial traces in the store. Some Bengaluru fintechs run this as a fallback layer behind their tail sampler, accepting partial-trace ugliness in exchange for guaranteed error visibility on the segments that did fire after the error.
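A sketch of the override, assuming the parent span lives in the same process — a cross-process parent carries only the sampled bit, no status, which is exactly why the first error span's upstream segment stays lost. ErrorKeepSampler is an illustrative name, not an OTel built-in:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace.sampling import (
    Decision, Sampler, SamplingResult, TraceIdRatioBased,
)
from opentelemetry.trace import StatusCode

class ErrorKeepSampler(Sampler):
    def __init__(self, rate: float):
        self._root = TraceIdRatioBased(rate)

    def should_sample(self, parent_context, trace_id, name, kind=None,
                      attributes=None, links=None, trace_state=None):
        parent = trace.get_current_span(parent_context)
        parent_ctx = parent.get_span_context()
        if not parent_ctx.is_valid:
            # Root span: ordinary head decision at the global rate.
            return self._root.should_sample(parent_context, trace_id, name,
                                            kind, attributes, links, trace_state)
        # In-process errored parent: override an inherited drop and keep
        # everything from here down (the error path's downstream segment).
        status = getattr(parent, "status", None)  # absent on remote parents
        if status is not None and status.status_code is StatusCode.ERROR:
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)
        # Otherwise behave like ParentBased: inherit the parent's verdict.
        if parent_ctx.trace_flags.sampled:
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)
        return SamplingResult(Decision.DROP)

    def get_description(self) -> str:
        return "ErrorKeepSampler"
```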
Pattern 5: separate pipelines per traffic class. The cleanest design — and the one Zerodha runs for SEBI compliance — is to have multiple OTel pipelines with different samplers. Trading-order traffic (~50 RPS) goes through an AlwaysOnSampler pipeline with 100% retention; everything else goes through a TraceIdRatioBased(0.05) pipeline.
The split is at the SDK level via a custom sampler that selects the pipeline by the traffic.class resource attribute. The bandwidth bill is dominated by the cheap pipeline; the regulated pipeline is a tiny fraction of the fleet but lossless. The architecture maps directly onto the constraint stack — regulated traffic gets lossless because the regulator demands it, everything else gets head-sampled because the budget demands it — and the team does not have to argue about the global rate, because there is no global rate.
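A sketch of the split, assuming the traffic class is known when the process boots; the collector endpoints are placeholders:

```python
# Two pipelines, two samplers: lossless for regulated traffic, head-sampled
# for everything else. Endpoint names are illustrative.
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.trace.sampling import ALWAYS_ON, TraceIdRatioBased

def build_provider(traffic_class: str) -> TracerProvider:
    regulated = traffic_class == "trading-order"
    provider = TracerProvider(
        sampler=ALWAYS_ON if regulated else TraceIdRatioBased(0.05),
        resource=Resource.create({"traffic.class": traffic_class}),
    )
    endpoint = ("collector-regulated:4317" if regulated
                else "collector-bulk:4317")  # placeholder endpoints
    provider.add_span_processor(
        BatchSpanProcessor(OTLPSpanExporter(endpoint=endpoint)))
    return provider
```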
The operational benefit beyond the obvious: when SEBI auditors come asking which trades the platform retained, the answer is "all of them, in the regulated pipeline" — a one-sentence answer that does not depend on trusting the sampler's behaviour. A single global sampler at any rate forces a "we keep approximately X% and the audit covers a sample" answer, which is harder to defend in a regulated context. The split-pipeline architecture is more code (two pipelines, a custom selector, separate retention configs) but the audit conversation is simpler, and for regulated workloads the audit conversation is the dominant cost driver.
The pattern across all five: head sampling alone is rarely the production answer; head sampling plus one or two of these patterns is. The single sampler with a single global rate is a starter configuration, not a destination. Every production fleet that has run head sampling for more than a year has accreted at least one of these patterns, and the more mature platforms run three or four simultaneously.
A useful sanity check before adopting any of these patterns: count how many of the recent five "I needed a trace and couldn't find it" incidents would have been recovered by each pattern. A Pune travel-tech team did this exercise in mid-2024 against a postmortem corpus of the previous twelve months. The result: VIP carve-outs would have recovered 3 of 17 incidents (the ones where the angry customer was a top-tier merchant), per-service floors would have recovered 2 (a new service launch where the global rate was wrong), deploy-aware boost would have recovered 4 (canary regressions where post-deploy density was too low), and tail sampling — the heavy lift the team was considering — would have recovered 14. The exercise gave the team the cost-benefit data to skip incremental head-sampling patches and invest the engineering quarter directly in a tail-sampling rollout, with VIP carve-outs as a fallback layer behind it. The lesson is not "always go to tail"; it is that the right next architecture step is the one your incident corpus tells you to take, not the one a vendor's blog post recommends. Read your own postmortems before you read anyone else's best-practices page.
Failure modes nobody warns you about
Beyond the four named biases, head sampling has five operational failure modes that usually surface only after the system has been running in production for several months. They are not mentioned in the OTel docs and they do not appear in vendor whitepapers because they are nobody's marketing story — they are the lived discoveries of teams who have run head sampling under real load.
The "stable trace_id seed" anti-pattern. A team writing custom instrumentation copies a trace_id generation routine from a Stack Overflow answer that uses time.time_ns() | os.getpid() instead of secrets.token_bytes(16). The result: trace_id low bits are a function of process ID and wall-clock time, not random. Every request from the same pod in the same nanosecond gets a trace_id whose lowest 64 bits are highly correlated. The head sampler against int(low_64) < threshold then keeps or drops whole batches of consecutive requests from the same pod together — far from the uniform-per-request behaviour the rate implies. A Hyderabad gaming company hit this in 2024: their effective sample rate was 1.2% configured but produced bursts of 50 consecutive kept traces followed by 4,000 dropped traces. The fix is to verify the trace_id generator with a chi-squared test on a corpus of generated ids before trusting the sampler's behaviour.
The async-boundary trace_id reset. When a request crosses an async boundary — a Celery task, a Kafka consumer, a Lambda invocation — and the upstream propagator does not write the traceparent into the queue's metadata, the downstream consumer creates a new root span with a fresh trace_id and the head sampler decides anew. The result: the synchronous portion of the request shows in Tempo as one trace, the async fan-out shows as a separate trace (or is dropped entirely), and the on-call cannot reconstruct the full request graph from the trace store alone. The fix is to instrument every async producer with the OTel inject call against the queue's metadata API (Kafka headers, Celery task kwargs, SQS message attributes), and to verify with a synthetic that traces survive the boundary. Most fleets discover this only when they need to debug a timeout that crossed an async hop and find the trace truncated at the boundary.
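The fix in code, as a sketch — the inject/extract calls are OTel's standard propagate API; the kafka-python-style producer/consumer shapes are illustrative:

```python
# Carry traceparent across a Kafka hop so the consumer joins the upstream
# trace instead of minting a new root (and triggering a fresh head decision).
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

def produce(producer, topic, payload):
    carrier = {}
    inject(carrier)  # writes traceparent (and baggage) into the dict
    producer.send(topic, payload,
                  headers=[(k, v.encode()) for k, v in carrier.items()])

def consume(message):
    carrier = {k: v.decode() for k, v in message.headers}
    ctx = extract(carrier)  # restores trace_id, span_id, sampled flag
    tracer = trace.get_tracer("worker")
    with tracer.start_as_current_span("process_job", context=ctx):
        ...  # spans here share the upstream trace_id and its verdict
```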
The drift between configured rate and effective rate. A platform team sets the global rate to 1% in the control plane. Six months later, an audit shows the effective rate in Tempo is 0.4%. The 0.6% gap is not a bug — it is the cumulative effect of, as shares of the gap: SDK queue overflows (5%), exporter batch failures (2%), Tempo ingester drops under load (1%), retry-then-give-up errors (3%), legacy services on older SDKs with their own samplers (50%), and a misconfigured parentbased_jaeger_remote polling interval (rest). The configured rate is an upper bound; the effective rate is what survives every layer between the SDK and the trace store. Razorpay's platform team runs a weekly audit script that compares configured rates against the trace count in Tempo per service, flagging services where the gap exceeds 20%. The fix is monitoring the gap, not eliminating it — the gap is structural — but a 70% gap usually indicates a broken layer that needs investigation.
The "we sample but they don't" cross-team gap. Most Indian fintechs and e-commerce platforms have multi-team architectures where the platform team owns the OTel SDK config and the product teams own the application code. When a product team adds custom instrumentation that creates spans via the OTel API directly (tracer.start_span("custom_op")), those spans inherit the head-sampler's verdict via the parent context — but only if the parent context is present in the call site. A common bug pattern: the product team's background worker pulls a job from a queue, calls tracer.start_span("process_job") without first calling context.attach(extracted_context), and creates a fresh root span with a fresh trace_id. The head sampler, having no parent, falls through to its rate-based decision and keeps or drops independently of the upstream trace. The result: the synchronous chain has one trace_id, the worker has another, and the on-call sees a "phantom" disconnected trace where the upstream lookup returned nothing. Catching this in code review requires the platform team to know what every product team is doing; catching it via tooling requires running a synthetic that walks the full async chain and verifies the trace_id stays constant. Most teams catch it during an incident, not before, and the fix — auditing every async boundary in the codebase — takes engineering weeks because there is no automated way to find them.
The "what is sampled" versus "what is exported" confusion. A subtle but common mistake: a team reads the OTel SDK source and concludes that "sampled" means "the span is exported to the backend". Almost. The OTel SDK distinguishes between the sampling decision (kept or dropped, set in the trace-flags bit) and the recording decision (whether the span data is filled in at all). A RECORD_ONLY decision creates the span object, populates its attributes, and lets it be inspected by in-process consumers (a custom processor, an in-memory metrics extractor) without exporting it. A RECORD_AND_SAMPLE decision both records and exports. A DROP decision creates no span object at all. The trace-flags sampled bit is set only on RECORD_AND_SAMPLE. Why this matters: a team that wants to compute span-derived metrics (request count, error rate, latency histogram) without paying for span export to Tempo can configure the SDK to RECORD_ONLY for the dropped fraction — every span is created in-process, fed to a metrics-extractor processor that updates Prometheus counters, then garbage-collected without crossing the wire. The metrics get population-level coverage; the trace store gets the sampled fraction; the bandwidth bill is bounded. Almost no production fleet ships this configuration in 2026, but the OTel SDK supports it and the teams that have wired it up report a 60-80% reduction in span-derived metric variance compared to extracting metrics from the sampled stream. Worth knowing exists.
When head sampling is exactly right
The chapter has so far argued that head sampling has structural bias. It is also worth naming the situations in which head sampling is exactly the right tool, lest the lesson be "always reach for tail sampling". Three cases.
Case 1: capacity planning and population-level aggregates. If the question is "what is fleet-wide p99 latency this month, broken down by service and region", head sampling at 0.1% gives a statistically valid answer with bounded error. The information-theoretic bound is O(1/ε² × log(1/δ)) items for a uniform estimator at error ε with confidence 1-δ — for 1% confidence intervals at 95% confidence, that is roughly 10,000 traces, easily achievable at 0.1% sampling on any fleet above 100 RPS. For population statistics, uniform sampling at any rate above the threshold is statistically sufficient, and the simpler architecture is the better one.
Case 2: high-throughput, low-stakes services where any retention is enough. A read-heavy CDN edge layer at 200K RPS, where each request is a static-asset GET and the worst-case "incident" is a cache-miss-storm visible in metrics, does not need every trace. A 0.05% head sample gives 100 traces per second — plenty for spot-checking the pipeline and verifying that propagation works — and the bandwidth saving (a 2,000× reduction) is the dominant constraint. Hotstar's edge tier runs this; Cloudflare's logpush sampling for non-Enterprise customers is the same shape.
Case 3: trace propagation correctness checks. A new microservice has just been added to the chain. The team wants to verify that traceparent is being propagated correctly across the new hop, that spans land in Tempo with the right parent-child relationships, that the service's resource attributes are being set. For this question, any sampled trace works — the question is "does the propagation work at all", not "do we have data on every request". A 5% head sample for the verification window, then drop to 1% once the new service is stable, is the standard ramp.
The deciding question is what the primary use case of the trace store is. If it is incident debugging on specific failing requests, head sampling alone is wrong and the bias bites. If it is statistical aggregates and propagation correctness, head sampling at a low rate is right and the simplicity is the feature. Most production fleets have both use cases — which is why most production fleets layer tail sampling or VIP carve-outs on top of head sampling, rather than choosing one over the other.
Common confusions
- "A higher head-sampling rate fixes the bias." Raising the rate from 1% to 10% increases error retention from ~1% to ~10% — better, but still 90% of errors are dropped. The bias is structural; the only fixes that change its shape are stratification (tail sampling, VIP carve-outs) or category-aware decisions, not raw rate.
- "Head sampling is the same as 'random' sampling." Head sampling is deterministic per trace_id — every service in the request path computes the same verdict. If it were
random.random() < rateper service, traces would be partially captured (parent kept, children dropped or vice versa). The deterministic-trace_id basis is what keeps traces whole. - "Head sampling means each service samples independently." No — the W3C
traceparenttrace-flagsbyte propagates the verdict downstream. Each service runsParentBased(...), which says "inherit if there's a parent, else fall through to my wrapped sampler". The decision is taken once, at the root, and propagates. - "The trace store has the same data as the metrics store, just unsampled." Metrics are aggregated at the source (a
Histogram.observe()call updates buckets in process before any sampler runs). Traces are sampled. The metrics store has a population-level histogram of every request; the trace store has a per-request representation of a small fraction. They answer different questions and the trace store cannot be used as a stand-in for metrics on aggregate queries. - "Head sampling preserves rare events at low rates." It does not. A uniform sampler preserves the proportion of rare events, not their count. At a 1% head sample, errors are still 0.4% of the kept set — but only 1% of the original error count, which is what matters for finding the specific error trace.
- "Hot-flipping the head-sample rate during an incident captures the incident traces." Rate changes take effect forward-only — traces dropped before the change are gone forever. By the time the on-call notices the incident, the first failing requests are already evicted from RAM and never made it to the backend. The right design is to keep enough traces always (via VIP carve-outs or always-keep-errors patterns), not to plan on flipping a knob mid-incident.
Going deeper
Why int(trace_id_low_64, 16) < threshold and not mod
OpenTelemetry's TraceIdRatioBased uses a strict-less-than comparison against rate × 2^64, not trace_id_low_64 % bucket_count == 0. The two look equivalent for uniform trace_id distributions, but they differ when the trace_id generator is not perfectly uniform. The OTel SDK's default RandomIdGenerator uses Python's random.getrandbits(64), which is the Mersenne Twister — uniform across [0, 2^64). Some legacy SDKs (early Jaeger, Zipkin's B3) generate trace_id from time.now() * pid or similar, which is not uniform; the high bits carry timestamp information, the low bits carry process information. A modulo-based sampler against such a generator over-samples certain processes; a less-than threshold sampler is closer to uniform across the bit space even if the input is not, because the threshold is a function of the bits' magnitude, not their residue. The lesson: when bringing a service into a head-sampled fleet, verify the SDK's trace_id generator is uniform, or the rate you set will not be the rate you get. A Pune team running mixed Java (Jaeger) and Go (OTel) SDKs discovered their effective sample rate diverged by 30% across services in 2023; the fix was to migrate the Java SDK to OTel's generator.
The Jaeger remote-sampling protocol — where the rate comes from
Most production OTel SDKs do not hardcode the head-sampling rate; they fetch it from a control plane. The Jaeger remote-sampling protocol (which OTel's parentbased_jaeger_remote sampler adopts) defines an HTTP endpoint that returns a per-service sampling configuration as JSON: a probabilistic strategy with a per-service rate, plus an optional operationSampling block carrying perOperationStrategies for per-endpoint overrides and a defaultSamplingProbability fallback. The SDK polls this endpoint every 60 seconds and updates its sampler in place. The control plane — typically a small Go service backed by Consul or etcd — is what makes per-service rate floors and deploy-aware boosts operationally tractable. Without remote sampling, every config change requires a full SDK rollout; with it, the change is a config push that propagates in 60 seconds. Razorpay's platform team runs a fork of jaeger-collector that serves remote sampling configs from a Postgres-backed control plane, with a UI for SREs to set per-service rates without filing a deploy ticket.
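The response shape, rendered as the Python dict the SDK's poller effectively consumes — field names follow Jaeger's sampling API, values are illustrative:

```python
# Per-service strategy as returned by e.g. GET /sampling?service=shipments
strategy = {
    "strategyType": "PROBABILISTIC",
    "probabilisticSampling": {"samplingRate": 0.01},
    "operationSampling": {
        "defaultSamplingProbability": 0.01,
        "perOperationStrategies": [
            {"operation": "POST /shipments",
             "probabilisticSampling": {"samplingRate": 0.10}},
        ],
    },
}
```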
How the trace-flags bit survives across protocol boundaries
A request entering an Indian payments fleet may traverse HTTP/1.1 → gRPC → Kafka → HTTP/2 → AMQP → HTTP/1.1 over a single trace. Each protocol has its own context-propagation convention. W3C traceparent is the lingua franca for HTTP-based hops, but Kafka uses message headers, AMQP uses application properties, and gRPC uses metadata. The OTel propagators API abstracts these — inject(context, carrier) writes the trace_id, span_id, and trace-flags to whatever carrier the outgoing protocol uses; extract(carrier) reads them on the receiving side. The sampled bit (lowest bit of trace-flags) survives all six hops because every propagator is implemented to carry the full byte. The failure mode is when a custom proxy or middleware strips unknown headers — an old NGINX rule that allow-listed only Content-* and X-Auth-* headers, dropping traceparent. The result: the receiving service treats the trace as a new root, generates a fresh trace_id, and the head sampler decides anew. The fleet ends up with two disjoint traces for one logical request, half-stored and half-lost. The diagnostic is to set up a synthetic tracer pinging the route end-to-end, then look at Tempo for traces with single-service span trees that should have been multi-service. A Mumbai e-commerce team caught this exact bug in 2024 — an Envoy filter was stripping traceparent because of a misconfigured request_headers_to_remove rule — and the fix was a one-line config change, but the diagnostic took two weeks because head sampling masked the problem (most traces were dropped anyway, so the disjoint-trace pattern only showed up in 5% of cases).
Coordinated-omission risk in head-sampled latency reads
If the trace store is being used to measure p99 latency — even though metrics are the right source for that — there is a subtle coordinated-omission risk. The OTel SDK's batch processor flushes spans on a timer (default 5 seconds) or when a batch size threshold is hit. Under load, the batch processor's queue can back up; spans get dropped or delayed at the queue. When the system is overloaded, the dropped/delayed spans are systematically the slowest ones, because slow requests took longer to complete and were the ones in flight when the queue filled. The kept set therefore under-represents tail latency by exactly the requests you most want to see — head sampling's bias-2 stacks with the SDK's queue-overflow drop, and the resulting p99 estimate from the trace store can be 30-50% below the true p99 during incidents. Why this is coordinated omission specifically: classical CO is the load-generator-side issue where slow responses prevent the next request from firing, so the slow requests are under-represented in the latency histogram. The trace-store version is the SDK-side dual: slow requests take longer to flush, the flush queue fills, the slowest requests get dropped from the export, and the trace store's latency distribution under-represents the tail. The HdrHistogram fix (recordValueWithExpectedInterval) does not apply directly — it assumes a known expected interval between requests — but the principle is the same: do not trust a latency distribution that was sampled from a queue that drops under load. Use the metrics histogram, not the trace store, for p99 reads.
The deterministic-bucket trick and what to monitor about the sampler itself
A subtle property of int(trace_id_low_64) < threshold is that the comparison is monotonic in the threshold. A trace kept at rate 0.01 is also kept at rate 0.05, 0.1, 1.0 — the keep-set at any rate is a superset of the keep-set at any lower rate. This is what makes head sampling stable when the rate changes mid-flight or differs across services. A platform team can run a 5% global head sampler in production and a 100% sampler in staging, and any trace_id that is sampled in production is also sampled in staging — the staging environment trivially captures every production-sampled trace if it sees the same trace_id. The trick generalises: a "VIP" tier at 100% can coexist with a "standard" tier at 1%, and the union of kept traces is just the standard tier plus any VIPs that happen to fall into the standard 1% (counted once, by trace_id). This is also what makes Honeycomb's "deterministic sampling" claim work — you can re-derive what would have been kept at a lower rate by simply re-applying the threshold to the trace_ids in the kept set, without re-running the workload. The property requires the comparison to be a less-than-threshold against a uniformly-distributed function of trace_id, which is why uniform trace_id generation is a hard requirement.
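The superset property fits in an assertion; a sketch, with random 64-bit values standing in for trace_id low bits:

```python
# Monotonicity of the threshold decision: the keep-set at 1% is a subset
# of the keep-set at 10%, because both are thresholds on the same bits.
import secrets

ids = [int.from_bytes(secrets.token_bytes(8), "big") for _ in range(100_000)]

def kept_at(rate: float) -> set[int]:
    return {i for i in ids if i < int(rate * 2**64)}

assert kept_at(0.01) <= kept_at(0.10) <= kept_at(1.00)
# Re-deriving "what a 1% sampler would have kept" from a 10% corpus is
# just re-applying the lower threshold — no replay of the workload needed.
```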
The flip side is that the production fleet running head sampling needs three observability metrics about the sampler itself: configured-rate-per-service (gauge, exposed by the SDK), kept-trace-rate-per-service (counter, derived from Tempo's trace count divided by the metric counter http_requests_total per service), and the gap between them. The gap is the single most useful signal — when it grows beyond 20% relative, something between the SDK and the backend is dropping traces beyond the configured rate, and the on-call wants to know before they need a trace they cannot find. Most teams ship a recording rule that computes 1 - (tempo_traces_received_total / http_requests_total) per service and alert when it drifts beyond the configured rate plus a tolerance band. The cost is one Prometheus query per service; the value is early warning of SDK queue overflows, network partitions to the collector, or backend ingestion pressure. None of these are visible in the configured rate; all of them show up in the gap, and a fleet that does not measure the gap is running a sampler whose effective behaviour drifts silently for months between audits.
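A sketch of that rule in Prometheus recording-rule form — the metric names (tempo_traces_received_total, sampler_configured_rate) follow the chapter's examples and stand in for whatever your backend actually exposes:

```yaml
groups:
  - name: sampler-effective-rate
    rules:
      # Observed keep-rate per service: traces that reached the backend
      # divided by requests that entered the fleet.
      - record: service:trace_keep_rate:ratio
        expr: |
          sum by (service) (rate(tempo_traces_received_total[10m]))
            /
          sum by (service) (rate(http_requests_total[10m]))
      # Fire when the effective rate sits >20% below the configured rate.
      - alert: SamplerEffectiveRateDrift
        expr: |
          service:trace_keep_rate:ratio
            < on (service) (0.8 * sampler_configured_rate)
        for: 30m
```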
The reproducibility footer
```bash
# Reproduce the head-sampler bias measurement on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install pandas numpy opentelemetry-api opentelemetry-sdk
python3 head_sampler_bias_measurement.py
# Expected: a four-row table showing kept count, error retention, p99
# estimation error, and rare-failure category survival across rates
# 10%, 5%, 1%, 0.1%. Vary ERROR_RATE and SPECIFIC_FAILURE_RATE to
# model your own fleet, and add a tail-sampling baseline to see how
# the four numbers change when the sampler can read the error status.
```
Where this leads next
Head sampling's bias motivates the next two chapters of Part 5 directly. Tail-based sampling fixes biases 1, 3, and 4 by buffering spans until the trace finishes, then deciding on full evidence — the cost is collector statefulness and a 30-second buffer window. Adaptive sampling fixes the orthogonal problem of bandwidth budget under traffic spikes, modulating the rate so a 10× IPL-final spike does not OOM the backend. The Wall chapter at the end of the part consolidates the trade-offs across all three.
- Tail-based sampling (OTel Collector) — the stateful answer that retains 100% of error traces at the cost of buffer memory.
- Adaptive sampling — the rate-modulator that survives spikes by trading representativeness during the spike.
- Trace sampling: head, tail, adaptive — the comparison chapter that maps each design onto the four-axis trade-off.
- Why you can't collect everything — the part-opener that establishes why every observability architecture is a budget plus a constraint.
- Cardinality: the master variable — the metrics-side dual; the same "what do you keep" question applied to label design.
The single most useful thing the senior reader should walk away with: head sampling is a tool for population-level questions, and reaching for it in a specific-trace question is a category error. Most production fleets have both questions; almost no production fleets have only one. The architecture that admits both is head sampling layered with VIP carve-outs and per-service floors — and, eventually, tail sampling for the workloads where the specific-trace question dominates.
There is one more reframing that pays off in the chapters ahead. The word bias in statistics usually means a systematic error to be corrected. In sampling-for-observability the bias is sometimes the feature — a tail sampler is deliberately biased toward keeping errors, and that is the right design for incident debugging. Head sampling's bias is the wrong direction for that use case (it under-samples errors), but it is the right direction for capacity planning (uniform samples preserve population statistics by construction). Carry that distinction into the rest of Part 5: every sampler has a bias, and the engineering question is whether the bias matches the use case. When the answer is "no", the fix is a different sampler or an additional layer, not a different rate.
A practical sequencing for any team reading this chapter: before changing your sampling architecture, run the measurement script in §3 against your fleet's last 24 hours of trace export, varying ERROR_RATE and SPECIFIC_FAILURE_RATE to match your observed numbers. The output will tell you what your current head sampler is actually preserving across the four bias dimensions, not what you assumed it was. Most teams discover the gap between assumed and measured retention is larger than they expected, and the gap is the starting point for the architecture conversation. A team that has measured can argue from numbers; a team that has not can only argue from anecdote — and the angry-customer-with-a-trace_id anecdote is rarely persuasive enough to drive a sampler change on its own. With the numbers, the same conversation takes thirty minutes and lands on a concrete change. Without them, the team continues running the configured rate and accepting the bias quietly until the next angry customer or audit forces the issue.
The closing thought: head sampling is not a bad sampler. It is a cheap sampler with a specific bias profile, and naming the profile honestly is the difference between a fleet that gets value from it and a fleet that pays its costs without naming what it lost. The patterns in §4 — VIP carve-outs, per-service floors, deploy-aware boosts, error-keep overrides, separate pipelines — are not workarounds for a broken tool; they are the full architecture that head sampling is one component of. Read every observability vendor's "best practices" page and the architectures they recommend match the customers they sold to. The architecture you should run matches the use cases your trace store has to serve, which almost never lines up with a single sampler at a single rate. Knowing which patterns to layer is most of the work, and naming the bias is the first step.
One last note for the team that has read this far and is wondering whether to act: the cheapest first move is not a sampler change. It is a measurement run — the script in §3, applied to your fleet's last 24 hours of traffic — and a one-page summary distributed to the on-call rotation explaining what your current head sampler is keeping and what it is dropping. That document, more than any architecture change, is what makes the rest of the work tractable. Engineers who know the sampler's behaviour ask better questions during incidents, write better postmortems when traces are missing, and propose better architecture changes when the team has budget for one. Engineers who do not know the sampler's behaviour treat trace-store gaps as random misfortune and patch around them with extra logging, longer log retention, and PagerDuty rules — none of which fix the underlying gap. The intervention is the documentation, not the code. The code change is what comes after the team has agreed on what is being measured and what the gap is. Sampling architecture is a conversation, materialised; the document is the artefact that makes the conversation possible.
References
- OpenTelemetry specification — Sampling — the canonical definition of TraceIdRatioBased, ParentBased, and the trace-flags semantics this chapter walks through.
- W3C Trace Context — the traceparent header — the spec for the 8-bit trace-flags byte and the propagation contract that makes head-sampling decisions consistent across services.
- Jaeger remote sampling protocol — the HTTP-based control-plane protocol most production fleets use to push per-service rates without an SDK redeploy.
- Honeycomb — "Why We Built Refinery" — the production-engineer-written argument for why head sampling alone is insufficient and what tail sampling buys.
- Sigelman et al., Dapper: A Large-Scale Distributed Systems Tracing Infrastructure (Google, 2010) — the original paper on which OpenTelemetry's head-sampling design is based; argues 0.1-1% is sufficient for capacity planning.
- Cindy Sridharan, Distributed Systems Observability (O'Reilly, 2018), Ch. 6 — "Tracing" — the foundational treatment of why trace sampling is structurally different from metric aggregation.
- Trace sampling: head, tail, adaptive — the comparison chapter that maps each design onto the four-axis trade-off this chapter motivates for head specifically.
- Why you can't collect everything — the part-opener that establishes why every sampling architecture is a budget conversation in disguise.