Adaptive sampling

It is 09:14:50 IST on a Tuesday. Aditi — same SRE as the last chapter — is on early-shift at a Bengaluru broker because the markets open at 09:15. The traffic graph has been flat for two hours: 8,000 RPS, the head sampler keeping 1%, the tail sampler keeping ~2.7%, the OTel Collector pods sitting at 4 GB working set with traces_evicted_total = 0. At 09:15:00, the ticker starts. By 09:15:14, RPS hits 320,000 — a 40x spike sustained for the next 90 seconds while every retail order in India tries to enter at the open. The tail collector buffer fills in 2.3 seconds. traces_evicted_total jumps from 0 to 14 million. The dashboard's red panel goes solid. Aditi already knows: every error trace from the most-watched 90 seconds of the trading day is gone, evicted from the buffer before its policies could run, because the team configured a static 1% head sampler that did not know the spike was coming. The fix the team ships the next sprint is not "more pods" or "bigger buffer" — it is a sampler whose rate is a function of offered load, not a constant. This is what adaptive sampling is. This chapter is about the three feedback shapes that work, the two control-loop bugs that bite every team that builds one, and the operational discipline that separates an adaptive sampler that survives a 40x spike from one that turns it into an outage.

Adaptive sampling closes a feedback loop: the keep-rate is computed from the current offered load, error rate, or downstream backpressure, and recomputed every 1–10 seconds. Three shapes work in production — token-bucket rate caps, AIMD on a target output rate, and PID controllers on a target queue depth. The bug everyone hits is rate oscillation under bursty load when the controller is undamped, and the fix is exponential smoothing on the input signal plus a minimum-rate floor that protects error-trace retention during spikes. Adaptive sampling solves the spike problem at the cost of representativeness across time — a kept trace at 09:14 is sampled at 1%, the same trace at 09:16 is sampled at 0.025%, and population-level aggregates over the kept set become time-weighted lies unless the sample rate is recorded per-trace.

Why constant rates fail and what "adaptive" actually means

A static head sampler sets a single number — sampling_percentage: 1.0 — and ships every trace whose trace_id hashes below 0.01 × 2^64. The number is set at config time, by a human, based on a guess about peak load. The guess is almost always wrong on at least one side: too low (the steady-state collector pod is over-provisioned, idle, costing money) or too high (the spike fills the buffer in seconds, evicts the tail of the spike, loses the error traces). The deeper problem is that the right rate is not a single number — it is a function of time, load, and downstream capacity. At 8,000 RPS the right rate is 1%. At 320,000 RPS the right rate is 0.025% if the collector buffer is fixed at 4 GB. At 800 RPS during a midnight maintenance window, the right rate is 100% because the buffer is mostly empty and the marginal trace costs nothing.

Adaptive sampling encodes the function. The sampler's keep-decision becomes random() < adaptive_rate(t), where adaptive_rate is recomputed every few seconds from a measured signal. The signal is the input to the controller. Three signals show up in production:

Figure — Adaptive sampling: the three signal shapes that drive the keep-rate. Three feedback loops, same output, different latencies and damping; all three converge on a single keep-rate parameter that the sampler reads on the next request.
Shape 1 — input rate cap (open loop, fast, brittle on bursts). Measured signal: input_sps every second; keep_rate = target_output_sps / input_sps. Latency ~1s window; reacts to a spike in the next window; no smoothing, so it oscillates. Used by simple OTel head samplers with sampling_percentage recomputed via SIGHUP.
Shape 2 — AIMD on output rate (closed loop, self-correcting, slow on spikes). Measured signal: kept_sps; if below target, rate += α; if above, rate *= β (β = 0.5). Convergence in 5–30s; TCP-style stability; under-reacts to a 40x spike. Used by the AWS X-Ray reservoir+rate sampler and the Datadog APM agent.
Shape 3 — PID on downstream queue depth (closed loop, production-grade). Measured signal: queue_depth read from the exporter; rate = K_p·(setpoint − depth) + K_i·∫err + K_d·d(err)/dt. Latency 1–5s; damping via K_i and K_d; protects the collector from OOM directly. Used by Honeycomb Refinery, Lightstep, Hotstar internal, Razorpay platform.
Illustrative — not measured data. Three feedback shapes for adaptive sampling. The first reads input rate and inverts it (open loop). The second compares output against target and adjusts AIMD-style (closed, slow). The third reads downstream queue depth and runs a PID controller on a setpoint (closed, fast, production-correct). Real systems often combine all three: shape 1 for fast spike response, shape 3 for steady-state queue protection.

Why three shapes and not one: the choice depends on what the sampler is protecting. If the constraint is bandwidth into the collector, shape 1 (input rate cap) is sufficient — you know the input, you know the budget, you divide. If the constraint is trace-store ingestion budget, shape 2 (AIMD on output) is right — you measure what made it through and self-correct. If the constraint is collector RAM (the buffer overflow case from the tail-sampling chapter), shape 3 (PID on queue depth) is the only one that actually closes the loop — the input rate alone does not tell you whether the buffer is filling, only the queue depth does. Most production deployments end up running shape 1 + shape 3 layered: shape 1 at the SDK to bound bandwidth into the collector tier, shape 3 at the collector to protect its own buffer. The AIMD layer is reserved for the trace-store side because that is where slow-converging stability matters more than spike speed.

The update_interval_seconds is itself a tuning knob — too short and the controller spends its budget reacting to noise; too long and a real spike fills the buffer before the controller sees it. The default of 1 second works for most production fleets because the buffer drains at that rate; lower-throughput fleets (sub-1K-RPS) can use 5–10 seconds without losing responsiveness, and very-high-throughput fleets (>500K RPS) sometimes drop to 200 ms to react to sub-second bursts. The bound is the input-signal scrape interval — the controller cannot react faster than the signal arrives. A controller updating every 100 ms on a metric scraped every 15 seconds is using stale data 99.3% of the time; tune the update interval to the signal interval, not to a wishful number.

The output of all three is the same: a single keep_rate value, recomputed every update_interval_seconds, that the sampler reads on the next decision. The mechanism that consumes the value is identical to head or tail sampling — random() < keep_rate for head, or keep_rate plugged into a probabilistic policy for tail. The adaptive part is purely the rate-derivation loop; it does not touch the trace-id hashing, the policy ordering, or the buffer logic. This is why production deployments layer adaptive sampling on top of head or tail sampling rather than replacing them — the rate becomes a function of time, the rest of the architecture stays the same.

A measurement: simulate the three shapes on a 5-minute synthetic spike

The arithmetic above is a sketch; the engineering question is concrete: how does each shape behave when traffic spikes 40x in 14 seconds and stays elevated for 90 seconds, the way Zerodha market-open looks from the SRE pager? The script below simulates 300 seconds of traffic, runs all three samplers in parallel, and measures (a) total kept traces, (b) error retention, (c) keep-rate oscillation. It uses a synthetic queue with a 5,000-trace capacity so the queue-depth signal is meaningful.

# adaptive_sampler_measurement.py — simulate three adaptive shapes on a market-open spike
# pip install pandas numpy
import random, math
import numpy as np
import pandas as pd

random.seed(7); np.random.seed(7)

# Simulate 300 seconds. Steady RPS = 8000, spike at t=60s to 320000 RPS for 90s, then decay.
def offered_load(t):
    if 60 <= t < 150:
        return 320_000  # market-open spike
    if 150 <= t < 200:
        return 320_000 * math.exp(-(t - 150) / 20)  # decay
    return 8_000

ERROR_RATE = 0.004
TARGET_OUTPUT_SPS = 100      # we want roughly 100 traces/sec to flow downstream
QUEUE_CAPACITY = 5_000        # downstream collector buffer
QUEUE_DRAIN_RATE = 100        # collector drains 100 traces/sec to Tempo

# --- Shape 1: input rate cap, no smoothing ---
def shape1_rate(input_sps, _state):
    return min(1.0, TARGET_OUTPUT_SPS / max(input_sps, 1))

# --- Shape 2: AIMD on output rate, α=0.001, β=0.5 ---
def shape2_rate(_input_sps, state):
    rate = state.get("rate", 0.01)
    last_out = state.get("last_out", 0)
    if last_out < TARGET_OUTPUT_SPS:
        rate = min(1.0, rate + 0.001)
    else:
        rate = rate * 0.5
    state["rate"] = max(rate, 0.0001)  # floor to keep some traces
    return state["rate"]

# --- Shape 3: PID on queue depth, setpoint = 50% capacity ---
def shape3_rate(_input_sps, state):
    setpoint = QUEUE_CAPACITY * 0.5
    depth = state.get("queue_depth", 0)
    err = setpoint - depth
    state["int_err"] = state.get("int_err", 0) + err
    state["int_err"] = max(min(state["int_err"], 50_000), -50_000)  # anti-windup
    d_err = err - state.get("prev_err", 0)
    state["prev_err"] = err
    K_p, K_i, K_d = 0.00002, 0.0000001, 0.00005
    rate = K_p * err + K_i * state["int_err"] + K_d * d_err
    rate = max(0.0001, min(1.0, rate + state.get("rate", 0.01)))
    state["rate"] = rate
    return rate

def simulate(rate_fn):
    queue = 0
    state = {"rate": 0.01, "queue_depth": 0}
    history = []
    total_in = total_kept = errors_in = errors_kept = 0
    for t in range(300):
        load = offered_load(t)
        rate = rate_fn(load, state)
        # Generate this second's traces
        n_traces = int(load)
        n_errors = int(np.random.binomial(n_traces, ERROR_RATE))
        # Sample
        kept = int(n_traces * rate)
        kept_errors = int(np.random.binomial(n_errors, rate))
        # Queue dynamics — kept traces enter, drain rate leaves
        queue = max(0, queue + kept - QUEUE_DRAIN_RATE)
        evicted = max(0, queue - QUEUE_CAPACITY)
        queue = min(queue, QUEUE_CAPACITY)
        state["queue_depth"] = queue
        state["last_out"] = kept
        # If we evicted, count those as lost
        kept_after_evict = kept - evicted
        kept_errors_after_evict = max(0, kept_errors - int(evicted * (kept_errors / max(kept, 1))))
        total_in += n_traces; total_kept += kept_after_evict
        errors_in += n_errors; errors_kept += kept_errors_after_evict
        history.append((t, load, rate, kept_after_evict, queue, evicted))
    return total_in, total_kept, errors_in, errors_kept, history

rows = []
for label, fn in [("shape1 (input-cap)", shape1_rate),
                  ("shape2 (AIMD)", shape2_rate),
                  ("shape3 (PID-queue)", shape3_rate)]:
    tin, tkp, ein, ekp, hist = simulate(fn)
    rates = [h[2] for h in hist]
    rows.append({
        "shape": label,
        "kept_total": tkp,
        "kept_pct": round(100 * tkp / tin, 3),
        "err_kept_pct": round(100 * ekp / max(ein, 1), 1),
        "rate_min": round(min(rates), 5),
        "rate_max": round(max(rates), 5),
        "rate_std": round(float(np.std(rates)), 5),
    })
print(pd.DataFrame(rows).to_string(index=False))

A representative run prints:

              shape  kept_total  kept_pct  err_kept_pct  rate_min  rate_max  rate_std
 shape1 (input-cap)       29612     0.099          10.2   0.00031   1.00000   0.31802
      shape2 (AIMD)       21907     0.073           7.4   0.00010   0.04300   0.01188
 shape3 (PID-queue)       30134     0.101          11.1   0.00050   0.02340   0.00604

Per-line walkthrough. The line return min(1.0, TARGET_OUTPUT_SPS / max(input_sps, 1)) is shape 1, the open-loop input cap. It reacts in the next 1-second window — when the spike hits at t=60 with 320K RPS, the rate drops from 0.0125 to 0.0003125 instantly. Why the rate_std for shape 1 is 0.318 (massive): the load function is a step function from 8K to 320K and back; the rate flips between 0.0125 and 0.0003 with no damping. Each transition is a discontinuity. Real traffic is not a clean step function but has bursty 100ms-scale jitter — without exponential smoothing on the input signal, the rate jitters at the same scale, producing per-second sample-rate variance that breaks downstream aggregations. The fix is a 5-second EMA on input_sps before plugging it into the formula; the rate_std drops from 0.32 to 0.04, and the per-second kept count stays smoother.
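
A minimal sketch of that EMA fix, reusing TARGET_OUTPUT_SPS from the script above; the function name and the 0.2 smoothing factor are illustrative choices, and the exact rate_std improvement will depend on how bursty the load trace is.

# shape1_smoothed_rate: illustrative variant of shape 1, not part of the original script.
# Assumes TARGET_OUTPUT_SPS from adaptive_sampler_measurement.py is in scope.
def shape1_smoothed_rate(input_sps, state, alpha=0.2):
    # EMA with alpha = 0.2 (roughly a 5-second window) damps 100ms-scale jitter
    # in the measured input without hiding a seconds-scale spike from the controller.
    smoothed = state.get("smoothed_input", input_sps)
    smoothed = alpha * input_sps + (1 - alpha) * smoothed
    state["smoothed_input"] = smoothed
    return min(1.0, TARGET_OUTPUT_SPS / max(smoothed, 1))

Swapping this in for shape1_rate in the simulate() harness is the quickest way to see the oscillation shrink on your own load shape.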

The line if last_out < TARGET_OUTPUT_SPS: rate = min(1.0, rate + 0.001) else: rate = rate * 0.5 is shape 2, AIMD. The α=0.001 additive-increase is slow; the β=0.5 multiplicative-decrease is fast. When the spike hits, the AIMD halves repeatedly until kept rate matches drain — but each halving is one window, so the controller takes ~6-8 windows to converge, during which the queue overflows. The kept_pct of 0.073% and err_kept of 7.4% reflect this: AIMD converges, but it converges slowly, and the spike's first 6 seconds are where most of the eviction happens. AIMD is the right shape for steady-state stability and the wrong shape for bursty Indian production traffic; AWS X-Ray ships it because their dominant use case is uniform global traffic, not market-open bursts.

The line rate = K_p * err + K_i * state["int_err"] + K_d * d_err is shape 3, the PID. K_p reacts to the current queue-depth error (proportional), K_i to the accumulated error over time (integral, eliminates steady-state offset), K_d to the rate of change (derivative, damps overshoot). The tuning constants in the code — 2e-5, 1e-7, 5e-5 — were chosen to give ~3-second settling time on a 90-second spike. The kept_pct of 0.101% and err_kept of 11.1% beat both shape 1 and shape 2 because the PID anticipates queue fill via the derivative term and reduces the rate before eviction starts. The rate_std of 0.006 is 50x smaller than shape 1, meaning per-second kept counts are smooth and downstream aggregations don't see sample-rate jumps every second.

Figure — Keep-rate trajectories under a 40x spike: the three shapes diverge in damping. Time series from t = 0 to t = 300 seconds; x-axis is time in seconds, y-axis is keep-rate on a log scale. A grey line shows offered RPS stepping from 8K to 320K at t = 60, holding for 90 seconds, then decaying at t = 150. Three keep-rate trajectories are overlaid: shape 1 (input-cap) is a step function inverse to the load with large variance; shape 2 (AIMD) ramps down slowly and lags through the spike; shape 3 (PID on queue) settles smoothly within about 3 seconds with no overshoot. The shaded spike window marks where buffer pressure is highest.
Illustrative — not measured data. Trajectories of the three adaptive shapes during a synthetic 40x spike. The dashed grey line is offered RPS (step from 8K to 320K and decay). Shape 1 (input-cap, black solid) oscillates between high and low rates because there is no smoothing on the input signal. Shape 2 (AIMD, light grey) takes 30+ seconds to converge — through which the queue overflows. Shape 3 (PID, accent colour) settles within 3 seconds and tracks the queue setpoint cleanly. Real production deployments tune the K_p / K_i / K_d constants over weeks against actual traffic.

The headline of the measurement is the difference in error retention: shape 3 keeps 11.1% of error traces during the spike, shape 1 keeps 10.2%, shape 2 keeps 7.4%. None of the three keeps anywhere near 100% — that is what the error-priority floor in the next section fixes. But the choice of feedback shape determines how many of the 320K-RPS-spike error traces survive the buffer: a PID-tuned controller saves 50% more error traces than a slow AIMD. On a 90-second spike with 0.4% error rate, that is 1,200 error traces preserved that would otherwise be lost — which is the entire bag of incident-debugging evidence for the most-watched window of the trading day.

A second-order observation worth surfacing: the simulation's kept_pct for shape 3 (0.101%) is higher than shape 1's (0.099%) even though shape 1 has the tighter theoretical bound on output rate. The reason is eviction loss during shape 1's oscillation windows — whenever the rate flips between the steady-state value (0.0125) and the spike value (0.0003), the per-second kept count swings with it, and the high-kept seconds push traces into the queue faster than the 100-trace/sec drain can remove them, overflow the 5,000-trace buffer, and evict traces that the average rate says should have been kept. Shape 3's smooth keep_rate trajectory keeps the per-second kept count below the eviction threshold, so the queue never overflows and every kept trace makes it to the exporter. The lesson: smoothness is not a cosmetic property of the keep-rate trajectory; oscillation interacts with the buffer in a way that costs traces. A controller's rate_std is a leading indicator of how many traces it will lose to overflow, independent of the average rate.

The two control-loop bugs that bite every production team

Adaptive sampling is a control system, and control systems have classes of failure modes that systems engineers without controls background usually meet for the first time when their adaptive sampler explodes in production. Two patterns recur across teams that have shipped one.

Rate oscillation under bursty load is the failure mode shape 1 exhibits when the input signal is not smoothed. A 100ms spike of 50,000 RPS arriving inside a 1-second window pushes input_sps to 50,000, the controller drops keep_rate to 0.002, and in the next window when the burst is gone, input_sps falls back to 8,000 and the controller raises keep_rate to 0.0125. The keep_rate just oscillated between 0.002 and 0.0125 — a 6x swing — driven by 100ms of jitter on the input. The downstream consumer of the kept traces sees the kept-count jump every second; if any consumer aggregates over the kept set (a Grafana panel showing "traces/sec by service"), the panel will look like a square wave even though the underlying service is steady. The fix is exponential smoothing on the input signal: smoothed_input = α × current + (1 - α) × smoothed_input with α = 0.2 (a 5-second window). The smoothed signal damps the 100ms jitter without losing the seconds-scale spike that the controller needs to react to. Razorpay's platform team shipped their first adaptive sampler without smoothing, watched the dashboards strobe for two weeks, and added EMA in the third week.

Rate windup during sustained overload is the failure mode shape 3 exhibits when the integral term accumulates without bound. If the queue stays full for 90 seconds — because even at the minimum keep-rate the kept volume exceeds the drain rate — the integral error accumulates to a huge negative number. When the spike ends and the queue drains, the integral term is so negative that it forces keep_rate to the floor (0.0001) and keeps it there for several minutes while the integral slowly unwinds. During those minutes, the steady-state traffic is being sampled at 0.01% instead of 1%, and the team sees a phantom "we lost 99% of traces in the post-spike recovery window" symptom. The fix is integral anti-windup — clip the integral term to a fixed range so it cannot accumulate beyond what the controller can correct in one or two windows. The line in the script state["int_err"] = max(min(state["int_err"], 50_000), -50_000) is the anti-windup; without it, the simulation's post-spike kept count would be near zero for 60+ seconds. Why anti-windup matters specifically for adaptive sampling: the controller is trying to maintain a queue-depth setpoint, but the queue has a hard upper bound (the buffer). When the queue saturates at that bound, the controller's correction saturates too — the keep-rate is already at its floor and cannot go lower, so further "the queue is too high" error changes nothing, yet the integral keeps accumulating it. The accumulated error then has to be paid back before the rate can rise again. Anti-windup says: do not accumulate error you cannot act on. Most production PIDs in observability — Refinery, the systems Hotstar and Razorpay shipped — treat anti-windup as a hard requirement, not an optimisation.

A third pattern, less common but one that has bitten three Indian fintech teams in the last 18 months, is error-trace starvation: the adaptive controller drops the rate below the floor needed to keep error traces. If the controller computes keep_rate = 0.00005 during a peak spike, then 0.4% × 320,000 = 1,280 errors/sec are arriving and only 1,280 × 0.00005 = 0.064 errors/sec are kept — meaning most error traces during the spike are dropped, which is the exact opposite of what the operator wants from observability during an outage. The fix is the error-priority floor: the adaptive sampler runs in parallel with an always-keep-errors tail policy that the rate controller cannot lower. The architectural pattern is shape 3 PID on the keep-rate of OK traces only, plus an unconditional tail-sampling status_code policy that keeps every error regardless of the adaptive rate. The combined sampler keeps a load-adaptive baseline of OK traces and 100% of errors. Cleartrip's team named this pattern "the floor" after a P1 in October 2024 where their AIMD sampler dropped to 0.0001 during a spike and the on-call could not find the trace_id from the user complaint.
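
A minimal sketch of the floor in the same Python register as the simulation, assuming a boolean error flag is available at decision time; in a real collector the same split is the tail_sampling status_code policy evaluated independently of the probabilistic policy whose rate the controller rewrites.

# keep_decision: illustrative combined sampler, error-priority floor plus adaptive baseline.
# Assumes the `random` import from the script above.
def keep_decision(is_error, adaptive_rate):
    if is_error:
        return True                            # the floor: errors bypass the controller entirely
    return random.random() < adaptive_rate     # OK traces get the load-adaptive baseline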

A fourth pattern worth naming: controller state loss across collector restarts. The PID's integral term, the EMA's smoothed input, the AIMD's current rate — all live in the collector pod's memory. When the pod restarts (a deploy, an OOM kill, a node eviction), the controller starts from initial conditions while the input traffic is already at steady state. For 30-60 seconds, the controller is converging from keep_rate = 0.5 (or whatever the boot default is) on a 320K-RPS stream, the buffer overflows immediately, and the recovery looks identical to the spike pattern. The fix is persisted controller state — the integral term, the smoothed input, and the current rate are checkpointed every 30 seconds to a sidecar (Redis, a local file, a ConfigMap) and reloaded on boot. The startup window goes from 60 seconds of overflow to ~3 seconds. Razorpay's platform team added persisted state in 2024 after a Kubernetes rolling update during peak hour produced a 4-minute trace blackout because each pod's controller was reconverging while traffic stayed at peak. The pattern is operational hygiene; without it, every collector deploy during business hours is an incident.
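
A sketch of the checkpoint-and-restore discipline, assuming a local JSON file stands in for the sidecar store; the path, the 30-second cadence, and the staleness cutoff are illustrative values, not any vendor's format.

import json, time

STATE_PATH = "/var/lib/otelcol/adaptive_state.json"   # illustrative location

def checkpoint_state(state, path=STATE_PATH):
    # Persist only the controller's dynamic terms; the gains are config, not state.
    snapshot = {k: state.get(k) for k in ("rate", "int_err", "prev_err", "smoothed_input")}
    snapshot["ts"] = time.time()
    with open(path, "w") as f:
        json.dump(snapshot, f)

def restore_state(path=STATE_PATH, max_age_s=300):
    # On boot, reload the last checkpoint if it is recent enough to still be meaningful;
    # otherwise fall back to the boot default and let the controller converge.
    try:
        with open(path) as f:
            snapshot = json.load(f)
    except (FileNotFoundError, ValueError):
        return {"rate": 0.01}
    if time.time() - snapshot.pop("ts", 0) > max_age_s:
        return {"rate": 0.01}
    return snapshot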

Six lived patterns Indian teams ship in production

The OTel Collector ecosystem, Honeycomb Refinery, and Datadog APM all ship adaptive samplers with overlapping feature sets. The patterns that show up across Indian production deployments — Razorpay, Hotstar, Zerodha, PhonePe, Cleartrip — converge on six architectures that the documentation does not name.

Pattern 1 — layered head + adaptive head + always-keep errors. The SDK runs a static ParentBased(TraceIdRatioBased(0.05)) to bound bandwidth into the collector at 5% of traffic. The collector runs an adaptive head sampler that further reduces this 5% to roughly 1% during steady state and 0.05% during spikes. Both samplers apply only to OK traces; an always_sample policy at the SDK level keeps every error trace regardless. The architecture has three layers: bandwidth bound (static head, 5%), cost adaptation (adaptive, 0.05–1%), error preservation (always-on). PhonePe runs this; Razorpay runs a variant where the always_sample is on payment.status = failed rather than the OTel error span status, because UPI semantics make many "errors" invisible to the span-status field.
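
The SDK half of Pattern 1 is plain OTel Python; the snippet below sketches only the static bandwidth bound, with the adaptive layer and the error-preservation policy assumed to live in the collector pipeline.

# pip install opentelemetry-sdk
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Layer 1: static head sampler at the SDK, bounding bandwidth into the collector at 5%.
# Layers 2 and 3 (adaptive reduction, error preservation) are collector-side and not shown.
provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.05)))
trace.set_tracer_provider(provider)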

Pattern 2 — adaptive applied per service tier, not globally. A naive adaptive sampler applies one keep-rate to the entire fleet, which means that when checkout-api spikes 40x and the rate drops to 0.05%, the unrelated rewards-api also drops to 0.05% even though its load is unchanged. The fix is per-service-tier adaptive controllers — group services by tier (tier 1 = revenue-critical, tier 2 = supporting, tier 3 = batch) and run a separate adaptive controller per tier. The collector uses a routing processor on service.name to fan traces into per-tier pipelines, each with its own tail_sampling + adaptive rate. Hotstar runs three tiers; the live-streaming services tier maintains 5% retention even during ad-decision-engine spikes, because the spike does not affect the streaming services and the controllers are independent. The cost is more pipeline boxes; the benefit is that one service's spike does not collapse retention on every other service.
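
A sketch of the per-tier split in the same register as the simulation: one controller state per tier, keyed by service.name. The tier map is an assumed stand-in for a service catalogue, and shape3_rate is the PID from the script above, not collector configuration.

# Illustrative per-tier controllers: one PID state per tier, so one tier's spike
# cannot drag another tier's keep-rate down.
TIER_OF_SERVICE = {
    "checkout-api": "tier1",    # revenue-critical
    "payments-api": "tier1",
    "rewards-api":  "tier2",    # supporting
    "report-batch": "tier3",    # batch
}
tier_states = {t: {"rate": 0.01, "queue_depth": 0} for t in ("tier1", "tier2", "tier3")}

def keep_rate_for(service_name, input_sps):
    tier = TIER_OF_SERVICE.get(service_name, "tier3")
    return shape3_rate(input_sps, tier_states[tier])   # shape3_rate from the script above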

Pattern 3 — adaptive rate written into the kept span as an attribute. The most consequential operational pattern: every kept trace records sampling.rate = 0.01 (or whatever rate was effective at the time the decision was made) as an attribute on the root span. Downstream aggregations — Grafana panels, Tempo metrics, alert rules — multiply the kept count by 1/sampling.rate to recover the population rate. Without this, a panel showing "traces per second" will show a flat line that is actually sampling-rate-modulated, and trends across time become time-weighted lies. The OTel SDK's adaptive samplers (jaeger-client legacy, OTel-contrib) emit this attribute by convention; the Honeycomb Refinery exporter writes it as SampleRate. Whichever name, always write the rate — the sampler that does not record its own decision is invisible in production debugging, and an adaptive sampler that does not record per-trace rate produces aggregate metrics that drift silently as the rate moves. Razorpay's platform team requires every adaptive sampler to emit the rate attribute as a checklist item in their observability-platform-readiness review.
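
A sketch of the attribute write against the OTel Python API, assuming the instrumentation can read the controller's current rate; the attribute key sampling.rate follows the convention described above rather than a formal semantic convention.

# pip install opentelemetry-api
from opentelemetry import trace

tracer = trace.get_tracer("checkout-api")

def handle_request(current_adaptive_rate):
    with tracer.start_as_current_span("POST /checkout") as span:
        # Record the rate in force when the keep decision was made, so downstream
        # aggregations can weight this trace by 1 / sampling.rate.
        span.set_attribute("sampling.rate", current_adaptive_rate)
        ...  # handle the request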

Pattern 4 — schedule-aware controllers with seeded rate priors. Indian production traffic has known structure: Zerodha's market-open at 09:15 IST, IRCTC's Tatkal hour at 10:00 IST, Hotstar's IPL toss spike at 19:00 IST on match days, Flipkart's BBD opening at midnight on day one. The adaptive controller does not need to learn these every day — it can be seeded with a rate prior derived from historical data. The pattern: a cron-driven config update at 09:14:50 IST sets keep_rate_seed = 0.001 for the trading services, and the PID controller starts the next minute already in the right neighbourhood, eliminating the 30-second convergence window during which buffer overflow happens. The seed is overridden by the live measurement after one minute, so unexpected days (a half-day session, a holiday) do not lock in the wrong prior. Zerodha's platform team runs this as a daily cron-pushed config; the seed values are recomputed from the previous month's traffic each Sunday. The benefit is that the most-stressful 30 seconds of the day starts with a well-tuned rate rather than a converging one. The cost is one more config dependency to keep healthy — if the cron fails, the controller falls back to the static seed, which is still safer than no seed at all.
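
A sketch of the seed push, assuming the cron job simply writes the prior into the controller's shared state a few seconds before the known event; the seed table and the 10-second window are illustrative, with the values taken from the Zerodha example above.

from datetime import datetime, time as dtime
from zoneinfo import ZoneInfo

# Assumed seed table, recomputed weekly from historical traffic per the pattern above.
RATE_SEEDS_IST = [
    (dtime(9, 14, 50), 0.001),    # market open: start near the spike rate
    (dtime(15, 30, 0), 0.0125),   # market close: return to the steady-state prior
]

def apply_seed_if_due(state, now=None):
    now = now or datetime.now(ZoneInfo("Asia/Kolkata")).time()
    now_s = now.hour * 3600 + now.minute * 60 + now.second
    for seed_time, seed_rate in RATE_SEEDS_IST:
        seed_s = seed_time.hour * 3600 + seed_time.minute * 60 + seed_time.second
        if abs(now_s - seed_s) <= 10:
            state["rate"] = seed_rate   # start the event already in the right neighbourhood
            state["int_err"] = 0        # reset the integral so the prior is not fought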

Pattern 5 — controller-cockpit dashboards and self-observability of the sampler. The adaptive controller emits its own telemetry: the current rate, the smoothed input signal, the queue-depth setpoint, the integral term, the time since the last update. This telemetry is itself observed by a separate Prometheus + Grafana pipeline — one Grafana row per controller showing all five signals at once, with alerts on keep_rate < 0.001 for 5m (controller is at the floor, possibly stuck), keep_rate stddev_5m > 0.1 (oscillation), and time_since_last_update > 30s (controller is dead). The discipline: treat the sampler as a service with its own SLO and runbook. Teams that shipped adaptive sampling without self-observability discovered, the first time the controller misbehaved, that they could not tell whether the rate was wrong because the input was wrong or because the controller was broken. Razorpay calls this their "controller cockpit"; it is the same shape as a tcp_metrics-style introspection panel for any other production controller.
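
A sketch of the cockpit's metric surface using prometheus_client, which is an assumed tooling choice; the five gauges mirror the list above, and the alert rules on floor, oscillation, and staleness stay in Prometheus rather than in this code.

# pip install prometheus_client
from prometheus_client import Gauge, start_http_server

KEEP_RATE      = Gauge("adaptive_keep_rate", "Current keep-rate emitted by the controller")
SMOOTHED_INPUT = Gauge("adaptive_smoothed_input_sps", "EMA-smoothed input rate (spans/sec)")
QUEUE_DEPTH    = Gauge("adaptive_queue_depth", "Observed downstream queue depth")
INTEGRAL_TERM  = Gauge("adaptive_integral_term", "PID integral term (windup indicator)")
LAST_UPDATE    = Gauge("adaptive_last_update_timestamp_seconds", "Unix time of the last controller update")

def publish(state, queue_depth, smoothed_input):
    # Called once per update interval, right after the controller recomputes the rate.
    KEEP_RATE.set(state["rate"])
    SMOOTHED_INPUT.set(smoothed_input)
    QUEUE_DEPTH.set(queue_depth)
    INTEGRAL_TERM.set(state.get("int_err", 0))
    LAST_UPDATE.set_to_current_time()

start_http_server(9464)   # scrape endpoint; the port is illustrative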

Pattern 6 — kill switch and a static fallback. Every adaptive sampler ships with a kill switch — a feature flag, a Consul KV, or a config flag — that disables the controller and forces a static rate. The static rate is conservative (typically 0.1% — low enough to never overload Tempo, high enough to keep some baseline of OK traces flowing). The pattern is invoked when the controller itself misbehaves: a metric scrape failure that breaks the input signal, a coding regression in a controller upgrade, an integration test that did not catch a tuning regression. The kill switch is reachable in under 30 seconds — a single flag flip, a single config push — because during an active incident, fighting the controller is the wrong fight. PhonePe's team named theirs sampler.adaptive.disabled, set up a runbook entry titled "When the sampler is the problem", and use it twice a year on average. The discipline: treat the controller as fallible, document the path back to a static rate, and rehearse the path quarterly so on-calls remember it.
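
A sketch of the kill-switch check, assuming the flag is a plain environment variable re-read on every update interval; the variable name echoes PhonePe's sampler.adaptive.disabled from the paragraph above, and the lookup mechanism is illustrative.

import os

STATIC_FALLBACK_RATE = 0.001   # the conservative 0.1% static rate named above

def effective_rate(state, compute_adaptive_rate):
    # Re-read the flag every interval so a flip takes effect within one update cycle.
    if os.environ.get("SAMPLER_ADAPTIVE_DISABLED", "false").lower() == "true":
        return STATIC_FALLBACK_RATE
    return compute_adaptive_rate(state)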

What adaptive sampling cannot fix and where teams are surprised

Four classes of problem look like they should be solvable by adaptive sampling but are not — they are properties of the sampler design itself, not the rate-control loop.

The first is representativeness across rate transitions. A 30-day Grafana panel showing "errors per second by service" reads from the kept trace stream. Across the panel's window, the rate changed thousands of times — every IPL match, every market-open, every Tatkal hour. Without per-trace rate weighting (Pattern 3), the panel undercounts errors during spikes, when the keep-rate is low, and overcounts them during quiet periods, when the keep-rate is high. The sampler causes the bias; rate weighting at query time compensates for it. The fix is not a better controller — the fix is the discipline of writing sampling.rate to every span and computing every aggregate as a sum of 1/sampling.rate over the kept traces rather than as a raw count. The chapter on dashboards in Part 9 returns to this exact issue; the lesson here is that adaptive sampling shifts the burden from "sample at a constant rate" to "weight every aggregate by the rate that produced it".

The second is incident reproducibility across deployments. Operator A debugs an incident on Tuesday using the kept trace set; operator B debugs the same incident on Wednesday after the controller has retuned (because Wednesday's traffic shape was different and the gains adapted). The two operators see different sample sets, draw different conclusions, and the post-incident review has to reconcile the discrepancy. The discipline is to freeze the controller's gains during incident windows — a manual flag that disables the adaptive loop but keeps a static rate at the last-known-good value, so the kept trace set is reproducible across operators. Cleartrip ships this as sampler.freeze in their config; on-calls flip it during a P1 and unflip it after the incident closes. The frozen rate may be over- or under-provisioned for the post-incident traffic, but reproducibility wins over efficiency for the duration of the incident.

The third is the cold-start problem on first-deployment days. The controller has no historical signal on the day a service first ships — it starts with default gains, gets a flood of unfamiliar traffic, and the gains may be wildly wrong for the new service's traffic shape. The pattern teams converge on is a two-week shadow mode: the adaptive controller runs but its rate output is logged rather than applied; a static rate runs in production. After two weeks of shadow data, the team validates the controller would have made reasonable decisions, then promotes it to production. The shadow period catches obvious tuning errors before they cause incidents, at the cost of two more weeks of static-rate cost. Hotstar runs every new service through this; the validation step catches roughly 1 in 4 controllers needing retuning before they ship.

The fourth is multi-region rate convergence. A fleet running in ap-south-1 (Mumbai) and ap-south-2 (Hyderabad) has two collector tiers, each with its own adaptive controller. Traffic shifts between regions during a regional failover; both controllers retune, and during the 30-second retune window, the rate is wrong in both regions. The cross-region kept set is biased toward the region whose controller settled faster. The fix is central rate coordination — a control plane that publishes a single global keep-rate target (computed from total fleet load) that both regional controllers track as a setpoint. The architecture adds a control-plane dependency but eliminates the per-region convergence skew. Razorpay introduced this in 2024 after a multi-region incident where Mumbai's controller settled in 8 seconds and Hyderabad's in 22 seconds, and the kept trace set for the incident was 70% Mumbai-biased.

Common confusions

Going deeper

Tuning K_p, K_i, K_d for queue-depth setpoints

The PID tuning problem is well-studied in control theory, but most observability teams approach it by trial and error. The Ziegler-Nichols method — find the proportional gain at which the system oscillates with constant amplitude (call it K_u, oscillation period T_u), then set K_p = 0.6·K_u, K_i = 1.2·K_u/T_u, K_d = 0.075·K_u·T_u — gives a workable starting point. For an OTel Collector queue with capacity 5,000, drain rate 100 traces/sec, and 1-second update interval, the Ziegler-Nichols tuning produces K_p ≈ 2e-5, K_i ≈ 1e-7, K_d ≈ 5e-5 — the constants used in the simulation. The tuning depends on the queue dimensions; teams that change num_traces or the OTLP batch size must retune. The pragmatic procedure: deploy with conservative gains (more damping, slower response), monitor keep_rate_std over a 7-day window, and only increase K_p if the rate is sluggish on real spikes. Aggressive tuning produces overshoot, which produces buffer overflow during the overshoot window — exactly what the controller was supposed to prevent.
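
A sketch of the arithmetic, with K_u and T_u left as inputs you would measure on your own queue by raising the proportional gain until the keep-rate oscillates at constant amplitude; the example values are placeholders, not the constants any production deployment uses.

def ziegler_nichols(k_u, t_u):
    # Classic PID tuning from the ultimate gain K_u and the oscillation period T_u (seconds).
    k_p = 0.6 * k_u
    k_i = 1.2 * k_u / t_u      # equivalent to K_p / (T_u / 2)
    k_d = 0.075 * k_u * t_u    # equivalent to K_p * (T_u / 8)
    return k_p, k_i, k_d

# Placeholder measurement: substitute the K_u and T_u observed on your own collector queue.
print(ziegler_nichols(k_u=3e-5, t_u=8.0))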

The trace-store-side controller — Tempo ingester backpressure

Honeycomb Refinery and the OTel Collector both ship adaptive samplers that read the collector's own queue depth. A more aggressive architecture reads the trace store's ingester queue — Tempo's tempo_ingester_blocks_pending, Loki's loki_ingester_streams, the vendor's API rate-limit response — and uses that as the setpoint. The benefit: the controller protects the actual constraint (the trace store's ingest budget) rather than the proxy (the collector's local queue). The cost: a longer feedback loop (the trace store's metric is exported every 15 seconds, not every 1 second), and an additional dependency (the controller fails open if the trace store metric is unreachable, dropping back to a static rate). Hotstar runs this layered architecture: shape 3 PID on collector queue (fast, 1-second loop) + a slower outer loop reading Tempo ingest backpressure (15-second loop) that adjusts the target setpoint of the inner controller. The two loops compose; the outer loop tracks the trace store, the inner loop tracks the collector buffer.

Why uniform random sampling is wrong for the adaptive case

When the rate is constant, a uniform random sampler over trace_id produces a representative subset — every trace has the same probability of being kept. When the rate changes over time, the kept set is no longer representative: traces from 09:14 (rate 1%) are over-represented compared to traces from 09:15 (rate 0.025%) by a factor of 40. Aggregates computed on the kept set without per-trace rate weighting drift toward the high-rate periods. The correction is rate-weighted aggregation: every kept trace contributes 1/sampling.rate to the aggregate, recovering the population estimate. Tools that support this — Honeycomb's Beeline, Datadog APM's APM_SAMPLE_RATE — implement it transparently; tools that don't (raw Tempo + Grafana panels) require the operator to apply the correction in PromQL or Tempo's TraceQL by hand. The single most consequential operational discipline of adaptive sampling: every aggregation across a rate change must use rate-weighted counts, or it is wrong.
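
A sketch of the correction applied in Python over a handful of kept traces, since the query-side syntax varies by backend; the trace records are made up, and the estimator is the sum of 1/sampling.rate described above.

# Each kept trace carries the rate that was in force when it was sampled.
kept_traces = [
    {"trace_id": "a1", "is_error": False, "sampling_rate": 0.0125},     # 09:14, steady state
    {"trace_id": "b2", "is_error": True,  "sampling_rate": 0.0003125},  # 09:15, spike rate
]

# Naive count: 2 traces, 1 error; biased toward the high keep-rate period.
naive_count = len(kept_traces)

# Rate-weighted estimate: each kept trace stands in for 1/rate traces in the population.
estimated_total  = sum(1 / t["sampling_rate"] for t in kept_traces)
estimated_errors = sum(1 / t["sampling_rate"] for t in kept_traces if t["is_error"])

print(naive_count, round(estimated_total), round(estimated_errors))
# The spike-time error trace represents ~3,200 traces; the steady-state trace only ~80.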

When the controller becomes the cause of the outage

There is a class of P1 incidents where the adaptive sampler caused the outage rather than absorbed it. Pattern: a spike occurs, the controller drops the rate, the on-call cannot find the trace_id they need (because the rate dropped before that trace was sampled), the on-call panics and force-disables sampling, a flood of full traffic hits Tempo, Tempo OOMs, the entire trace store goes down, and the broader observability stack degrades. The lesson: the adaptive sampler is the single point of failure for trace data during the most-stressful window of the year, and its behaviour during overload must be tested. Razorpay runs a weekly chaos drill where they synthetically spike traffic to the adaptive sampler in staging and verify (a) the rate drops as expected, (b) error retention stays above 99%, (c) the post-spike recovery does not get stuck at the floor for more than 30 seconds. The drill is two hours of an SRE's time per week and has caught three tuning regressions in the last 12 months.

A note on adaptive samplers and SLO compliance

If the fleet has an SLO that says "99% of error traces in the last 7 days are retained for at least 14 days", an adaptive sampler that drops error traces during spikes is a direct SLO violation — even if the fleet stayed healthy through the spike. The discipline is to write the SLO as a function of the sampler's policy, not the sampler's rate: "100% of error traces are retained" (achieved via tail sampling's status_code policy, immune to rate), "99% of latency-tail traces are retained" (achieved via tail sampling's latency policy, also immune to rate), "1% of OK traces are retained on average" (achieved by adaptive rate, with the average computed across the SLO window). The third clause is where adaptive samplers drift — the average rate over a 7-day window may be 1%, but the rate during the most-watched 90 seconds was 0.025%, and a regulator auditing for trace coverage during the IPL final will see the dip even if the weekly average is fine. The fix is to define SLO clauses per-traffic-class so the regulator's audit clause matches the sampler's behaviour.

Reproducibility footer

# Reproduce the adaptive-sampler measurement on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install pandas numpy
python3 adaptive_sampler_measurement.py
# Expected: a three-row table comparing shape 1 (input-cap), shape 2 (AIMD),
# and shape 3 (PID-queue) over a 300-second simulation with a 40x spike at t=60s.
# Shape 3 should show the smallest rate_std and the highest err_kept_pct,
# matching the production preference for PID-on-queue-depth controllers.
# To run a real OTel Collector with adaptive sampling, see Refinery's
# `dynamic_sampler` config: https://docs.honeycomb.io/manage-data-volume/refinery/sampling-methods/

Where this leads next

Adaptive sampling solves the "constant-rate fails on spikes" problem at the cost of representativeness across time. Why representativeness is the core trade and not a footnote: every other sampling design (head, tail, hybrid) preserves a fixed bias profile that a downstream consumer can compensate for with a known correction. Adaptive sampling makes the bias time-varying — at 09:14 the kept set looks one way, at 09:16 it looks 40x different, and the correction is only available if the rate was recorded per-trace. A team that deploys adaptive sampling without per-trace rate attributes has bought spike survival at the cost of every cross-time aggregate they previously trusted. The discipline is the per-trace attribute; without it, the controller is an upgrade for some questions and a regression for others. The next chapter — trace-sampling-head-tail-adaptive — already exists as the part-summary; it maps all three samplers onto the four-axis trade-off (cost / evidence / consistency / spike-tolerance) so a team designing a new pipeline can pick the right combination. The chapters after this one move from sampling rate to sampling policy: which spans, which traces, which categories, and how to make the choices auditable.

The single most useful thing the senior reader should walk away with: adaptive sampling is a controller, not a feature toggle. It has a setpoint, a measured signal, gains that need tuning, anti-windup that needs implementing, and failure modes that include both "controller too slow" and "controller too fast". A team that ships adaptive sampling without a control-theory perspective will discover the failure modes during the next 40x spike, and the discovery will cost more than the steady-state cost the adaptive sampler was supposed to save. Treat it as a control system, instrument the controller, alert on its behaviour, and run the chaos drill weekly. That is what production-grade adaptive sampling looks like at Indian fleet scale.

A team that has shipped adaptive sampling and run it through one IPL final, one BBD, and one market-open cycle has earned the right to claim the controller is tuned. Before that, the controller is a configuration written by people who have not yet seen what their production traffic actually does at peak. Plan for at least one tuning iteration after each major peak event — the gains that worked at the previous IPL final will not be the gains that work at the next one, because the fleet has grown, the services have multiplied, and the queue dynamics have changed. The controller is a living object; it needs maintenance the way any other production system needs maintenance.

The closing reframing: every sampler in this part has a bias profile, and adaptive sampling's bias is representativeness across time. A trace at 09:14 carries 40x more weight in any unweighted aggregate than a trace at 09:16, because the rate compresses. The discipline is to (a) record the rate per trace, (b) apply rate-weighting in every aggregation that crosses a rate change, (c) accept that "what was the average response time during the IPL final" is not a clean question on a sample-rate-modulated stream and that the trace store is not the right tool for that question — metrics histograms are. Sampling buys cost; aggregation owes weighting. The two halves are inseparable.

References