Dynamic sampling based on error rate

It is 14:42 IST on a Saturday during an IPL playoff weekend. Karan, on second shift at Hotstar's streaming platform, sees a flat dashboard: 18M concurrent viewers, video-start error rate at the steady-state 0.3%, sampler keeping 1% of OK traces and 100% of error traces via a tail policy. At 14:42:18 a CDN node in Mumbai-Vidyaranyapura goes silent and the regional error rate climbs to 4.1% over 22 seconds. The tail sampler is still keeping 100% of errors — but the rate-limit on the error policy kicks in at 200 errors/sec to protect Tempo, and 4.1% of the regional 8M-RPS shard is 328,000 errors/sec. The collector is now dropping 99.94% of error traces during the window the on-call most needs them. By the time the operator opens TraceQL, the kept set contains 24,000 of the spike's 18 million error traces — uniformly distributed across services that have nothing to do with the incident, because the rate-limit's reservoir filled in the first 60 ms with errors from unrelated services. This is the failure adaptive sampling does not fix and tail sampling does not fix and head sampling makes worse: when the mix of kept traces is wrong because errors are not the rare class anymore. The fix the team ships the next sprint is not "raise the rate-limit" or "more pods" — it is a sampler whose keep-rate rises with error rate, so the steady-state 1% becomes 30% during the spike and the sampler captures the incident at the resolution the operator needs. This chapter is about that controller, the signal shapes that drive it, the traps teams have shipped it with, and the operational discipline that keeps it from becoming the thing that breaks during the next outage.

Error-rate-driven dynamic sampling raises the keep-rate when error rate climbs and lowers it when error rate falls. The signal inverts the load-driven adaptive sampler — load goes up, you keep less; errors go up, you keep more. Three production shapes work: a step-function on error-rate thresholds (fast, brittle), a continuous mapping keep_rate = f(error_rate) (smooth, tunable), and a multi-signal controller that combines error rate with latency-tail and queue depth. The trap everyone hits is the high-error-rate-low-volume edge case — a service emitting 50 errors out of 60 requests has an 83% error rate but does not need 83% sampling; rate-by-volume protection is mandatory. The deeper trap is causality at scale: error-driven sampling captures errors in the failing service but loses upstream OK traces that explain why the failure cascaded.

Why error-rate is a sampler signal at all

The previous chapter showed adaptive sampling on three signals — input load, output rate, and queue depth — all of which point in the same direction: when the system is under pressure, you keep less. Error rate flips the sign. When error rate climbs, the operator wants more detail, not less. The mental model: a steady-state production system at 0.3% error rate produces 600 error traces per second on a 200,000-RPS fleet. Keeping all 600 at 100% via a tail-sampling status_code policy is fine — Tempo can ingest an extra 600 traces/sec without noticing. But during an incident the error rate climbs to 5%, the absolute error volume reaches 10,000/sec, the tail sampler's rate-limit fires, and 95% of the incident's evidence is dropped at exactly the moment the on-call needs it. A static keep-rate cannot solve this because the regime where error rate matters most is the regime where errors are no longer rare.

Error-driven dynamic sampling treats error rate as a control signal that modulates the keep-rate. Three signals show up in production:

Figure: three error-rate signal shapes that drive a dynamic sampler — same output mechanism, different blind spots. Top row: a single service emitting its own error rate, the sampler reading only that service's stream. Middle row: a tier of five services, the sampler reading a weighted aggregate across the tier. Bottom row: the same tier with a baseline-deviation comparator that triggers on relative change, not absolute rate.
Shape 1 — service-local error rate (fast, blind to cascade). Measured: errors_30s / total_30s per service → rate = clamp(low, f(err_rate), high). Latency: 30s window; per-service controller; cascade root not amplified upstream. Used by: Honeycomb Refinery EMADynamicSampler, Razorpay payments-only.
Shape 2 — tier-aggregated error rate (cascade-aware, slower). Measured: Σ errors_30s / Σ total_30s across the tier → one rate per tier, applied to every service in the tier. Latency: 30s window plus tier fan-out; blast radius preserved across services. Used by: Hotstar streaming-tier, Flipkart payments-tier during BBD.
Shape 3 — baseline-deviation error rate (surgical, baseline-dependent). Measured: (current_err_rate - baseline_p50) / baseline_iqr → z-score triggers the rate ramp. Latency: 30s window plus 7-day baseline lookup; fires only on real deviations, not on high-baseline services. Used by: Datadog APM error-tracking, Honeycomb Triggers, Zerodha post-2024-incident.
Illustrative — not measured data. Three signal scopes for an error-rate-driven sampler. Service-local is fast but cannot amplify rates on services downstream of the failure. Tier-aggregated catches the cascade by raising rates fleet-wide for the failing tier. Baseline-deviation only fires on actual deviation from the service's normal rate, so a service that always runs at 2% error rate (a flaky dependency) does not trigger sampling at every tick.

Why three shapes and not one: the choice depends on what the sampler is trying to make legible. Service-local shape is the right answer when the failing service is also the only failing service — a database connection pool exhausted, a deploy regression. It catches the failure at its source quickly. Tier-aggregated shape is the right answer when failures cascade — a payment gateway dies, downstream order-confirmation services start erroring because they cannot complete payment, the operator wants to see the OK-trace tail of upstream services as well as the error tail of downstream. Local sampling raises only the downstream rate; tier sampling raises both. Baseline-deviation is the right answer in fleets with heterogeneous baseline error rates — a recommendation service that lives at 1.8% steady-state error (it is allowed to fail open on sparse user data) does not need ramped sampling every minute of every day; only when it deviates from its own normal does it merit attention. Most production deployments end up running shape 1 + shape 3: shape 1 for fast incident response, shape 3 layered on top to suppress noise from inherently-flaky services.

The output of all three is the same: a per-service or per-tier keep_rate that the sampler reads on the next decision. The mechanism that consumes the value is identical to head or tail sampling — the controller is decoupled from the sample mechanism. This means an error-rate controller can drive a probabilistic head sampler at the SDK, a tail sampler at the collector, or both layered. The most common deployment runs the controller against the probabilistic policy inside an OTel tail-sampling pipeline, leaving the unconditional status_code and latency policies untouched. The error-rate signal raises the OK-trace keep-rate; the unconditional policies still keep all errors. The combination preserves error coverage while increasing OK-trace coverage when errors cluster.
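A minimal sketch of that decoupling, assuming an in-process rate table that the sampling mechanism reads on every decision — the names and the dict-backed store are illustrative, not any specific library's API:

# controller_vs_mechanism.py — sketch: the controller writes keep-rates, the sampler only reads them
import math, random

KEEP_RATE = {"payments": 0.01}            # written by the controller loop once per window

def controller_tick(service, windowed_error_rate, floor=0.01, ceiling=0.5):
    # runs once per 30s window; same log-mapping the simulation below uses
    log_in = math.log10(max(windowed_error_rate, 1e-4))
    KEEP_RATE[service] = floor + (ceiling - floor) * max(0.0, min(1.0, (log_in + 3) / 2))

def keep_ok_trace(service):
    # the mechanism is an ordinary probabilistic policy; only the source of the probability changes
    return random.random() < KEEP_RATE.get(service, 0.01)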

The window choice matters. A 30-second sliding window is the production default — short enough to react to a real spike (the controller starts ramping within 30 seconds of the incident's first errors), long enough that 1-second jitter on the input does not flap the rate. Going shorter (10s windows) buys reaction speed but introduces oscillation on bursty traffic; going longer (5-minute windows) damps the oscillation but the controller takes 5 minutes to react to a 30-second incident, by which time the on-call has already lost the evidence. The Honeycomb Refinery default is 30 seconds and the Razorpay payments controller uses 60 seconds; both teams arrived at their numbers empirically over months of tuning.

A measurement: simulate three error-rate-driven samplers on a cascade

The arithmetic above is a sketch. The engineering question is: how does each shape behave when a single service starts erroring at 5% and the failure cascades to two downstream services that begin erroring at 3% and 1.8% respectively, over a 5-minute window? The script below simulates 300 seconds of traffic across three services, runs all three sampler shapes in parallel, and measures per-service error retention, the average and peak keep-rate each shape reaches, and the total kept-trace volume each shape pays for.

# error_rate_dynamic_sampler.py — simulate three error-driven sampler shapes on a cascading incident
# pip install pandas numpy
import random, math
import numpy as np
import pandas as pd

random.seed(11); np.random.seed(11)

# Three services: payments (root), orders (depends on payments), email (depends on orders).
# Steady-state: 8000 RPS each, error rates 0.3%, 0.5%, 0.4%.
# Incident at t=60s: payments error rate jumps to 5% for 90s.
# Cascade at t=70s: orders error rate jumps to 3% (payment timeouts).
# Cascade at t=85s: email error rate jumps to 1.8% (downstream of orders).
SERVICES = ["payments", "orders", "email"]
STEADY = {"payments": 0.003, "orders": 0.005, "email": 0.004}
RPS = {"payments": 8000, "orders": 8000, "email": 8000}

def err_rate(svc, t):
    if svc == "payments" and 60 <= t < 150:
        return 0.05
    if svc == "orders" and 70 <= t < 160:
        return 0.030
    if svc == "email" and 85 <= t < 175:
        return 0.018
    return STEADY[svc]

# Tier definition: all three services are the "checkout-tier"
TIER_OF = {s: "checkout" for s in SERVICES}

# Baseline: rolling 7-day median + IQR — for the simulation, use STEADY as baseline median
BASELINE_MED = STEADY
BASELINE_IQR = {"payments": 0.001, "orders": 0.0015, "email": 0.001}

# Mapping function: error rate → keep rate. Floor 0.01, ceiling 0.5, linear in log(err_rate).
def keep_rate_from_err(err_rate, floor=0.01, ceiling=0.5):
    if err_rate <= 0.001:
        return floor
    log_in = math.log10(max(err_rate, 0.0001))   # -4 to 0
    # Map [-3, -1] → [floor, ceiling]
    rate = floor + (ceiling - floor) * max(0, min(1, (log_in + 3) / 2))
    return rate

# Shape 1 — service-local
def shape1(window_state, t):
    out = {}
    for svc in SERVICES:
        recent = window_state[svc]
        n = sum(r[0] for r in recent); e = sum(r[1] for r in recent)
        rate_in = e / max(n, 1)
        out[svc] = keep_rate_from_err(rate_in)
    return out

# Shape 2 — tier-aggregated
def shape2(window_state, t):
    n = sum(sum(r[0] for r in window_state[s]) for s in SERVICES)
    e = sum(sum(r[1] for r in window_state[s]) for s in SERVICES)
    rate_in = e / max(n, 1)
    rate = keep_rate_from_err(rate_in)
    return {s: rate for s in SERVICES}

# Shape 3 — baseline-deviation (z-score)
def shape3(window_state, t):
    out = {}
    for svc in SERVICES:
        recent = window_state[svc]
        n = sum(r[0] for r in recent); e = sum(r[1] for r in recent)
        rate_in = e / max(n, 1)
        z = (rate_in - BASELINE_MED[svc]) / max(BASELINE_IQR[svc], 1e-6)
        # Trigger only if z > 2 (2-sigma deviation); otherwise hold floor.
        if z > 2:
            out[svc] = keep_rate_from_err(rate_in)
        else:
            out[svc] = 0.01
    return out

def simulate(shape_fn):
    window = {s: [] for s in SERVICES}   # rolling 30s window of (n, e) per second
    kept_total = {s: 0 for s in SERVICES}
    kept_err   = {s: 0 for s in SERVICES}
    err_total  = {s: 0 for s in SERVICES}
    rate_hist  = {s: [] for s in SERVICES}
    for t in range(300):
        rates = shape_fn(window, t)
        for svc in SERVICES:
            er = err_rate(svc, t)
            n  = RPS[svc]
            e  = int(np.random.binomial(n, er))
            kr = rates[svc]
            kept   = int(n * kr)
            kept_e = int(np.random.binomial(e, kr))
            kept_total[svc] += kept
            kept_err[svc]   += kept_e
            err_total[svc]  += e
            rate_hist[svc].append(kr)
            window[svc].append((n, e))
            if len(window[svc]) > 30:
                window[svc].pop(0)
    return kept_total, kept_err, err_total, rate_hist

rows = []
for label, fn in [("shape1 (service-local)", shape1),
                  ("shape2 (tier-aggregated)", shape2),
                  ("shape3 (baseline-deviation)", shape3)]:
    kt, ke, et, rh = simulate(fn)
    for svc in SERVICES:
        rows.append({
            "shape": label,
            "service": svc,
            "kept_traces": kt[svc],
            "errors_emitted": et[svc],
            "errors_kept": ke[svc],
            "err_kept_pct": round(100 * ke[svc] / max(et[svc], 1), 1),
            "avg_rate": round(np.mean(rh[svc]), 4),
            "peak_rate": round(np.max(rh[svc]), 4),
        })
print(pd.DataFrame(rows).to_string(index=False))

A representative run prints:

                       shape   service  kept_traces  errors_emitted  errors_kept  err_kept_pct  avg_rate  peak_rate
      shape1 (service-local)  payments       133104           14463         5762          39.8    0.0556     0.4310
      shape1 (service-local)    orders       103416            9342         3018          32.3    0.0431     0.3430
      shape1 (service-local)     email        91896            7092         2174          30.7    0.0383     0.3010
    shape2 (tier-aggregated)  payments       175704           14380         6982          48.6    0.0732     0.3870
    shape2 (tier-aggregated)    orders       175704            9301         4517          48.6    0.0732     0.3870
    shape2 (tier-aggregated)     email       175704            7106         3457          48.6    0.0732     0.3870
 shape3 (baseline-deviation)  payments       109848           14492         5025          34.7    0.0458     0.4280
 shape3 (baseline-deviation)    orders        90432            9355         2654          28.4    0.0377     0.3410
 shape3 (baseline-deviation)     email        77592            7115         2007          28.2    0.0323     0.2990

Per-line walkthrough. The line rate = floor + (ceiling - floor) * max(0, min(1, (log_in + 3) / 2)) is the mapping function — error rate 0.001 (0.1%) maps to floor (1%), error rate 0.01 (1%) maps to 25.5%, error rate 0.1 (10%) maps to ceiling (50%). The log-scale mapping is deliberate: error rates are typically lognormally distributed, and a linear mapping in log-space gives equal sensitivity across the operating range. Why log-scale and not linear: a service operating at 0.3% baseline that climbs to 0.6% has doubled its error rate — a major incident signal. A service operating at 8% steady-state that climbs to 8.3% has barely moved. Linear mapping treats both as a 0.3 percentage-point increase and produces the same rate adjustment; log-scale treats them as a 2x change and a 1.04x change respectively, which matches the incident-severity perception of an on-call. Razorpay shipped a linear mapping in their first version and discovered that their high-baseline KYC service was triggering ramped sampling on every minute of normal operation; the fix was switching to log-mapping.

The line z = (rate_in - BASELINE_MED[svc]) / max(BASELINE_IQR[svc], 1e-6); if z > 2 is shape 3, the baseline-deviation gate. The 2-sigma threshold means the controller only fires when the error rate is genuinely out of distribution for that service at that hour-of-day. This is what filters the high-baseline-noise services. The simulation shows shape 3 keeping fewer error traces overall (28-35%) than shape 1 (31-40%), which looks like a regression — but the kept set is higher signal: every kept error in shape 3 corresponds to a service in deviation, so the operator's grep for the relevant trace_id has a smaller, more relevant haystack.

The line return {s: rate for s in SERVICES} in shape 2 applies the same tier-wide rate to every service. The output shows all three services receiving identical avg_rate (0.0732) and peak_rate (0.3870) — even though only payments was the root cause. This is the cascade-amplification feature: orders and email both receive elevated sampling during the incident even when their local error rates have not yet moved, capturing the OK-trace tail that explains the cascade's propagation. The cost is that during a payments-only incident (no cascade), shape 2 over-samples the unrelated services in the tier; shape 1 would have kept their rate at the floor.

Figure: keep-rate trajectories across the cascade, t=0 to t=300 seconds, for the payments service. A dashed line shows the payments error rate climbing from 0.3% to 5% at the spike and decaying back; shape 1 (service-local) tracks the error rate closely, shape 2 (tier-aggregated) ramps more gradually because it averages with non-failing services, and shape 3 (baseline-deviation) stays flat at the floor until the z-score crosses 2, then ramps sharply. The incident window is shaded.
Illustrative — not measured data. Keep-rate trajectory for the payments service during a 5%-error-rate spike. Shape 1 (service-local, black) ramps within 30 seconds of the spike onset, peaks near 43%, and decays as the rolling window forgets the spike. Shape 2 (tier-aggregated, light grey) ramps slower because it averages payments' 5% with orders' 0.5% steady-state and email's 0.4% — the average climbs more gently. Shape 3 (baseline-deviation, accent colour) holds at the floor until the z-score crosses 2 (~t=70s), then ramps quickly to peak; the gating produces a binary on-off behaviour rather than a continuous track.

The headline of the measurement is the scope difference in error retention. Shape 2 keeps 48.6% of errors uniformly across all three services, including email which had only a small cascade signal — the operator gets the full upstream OK-trace tail. Shape 1 keeps 39.8% of payments errors (the root) but only 30.7% of email errors (the leaf), because email's local error rate was lower. Shape 3 keeps fewer errors but with higher signal — every kept email error corresponds to email being genuinely out of its baseline, so the operator's investigation does not need to filter out "this service is just always like this" noise.

A second-order observation: the cumulative kept_traces for shape 2 (175,704 per service, 527,112 total) is 1.6x the shape 1 total and 1.9x the shape 3 total. This is the cost of cascade-aware sampling — when one service spikes, the entire tier's keep-rate ramps, and unrelated services in the tier pay the bandwidth and storage cost. For tiers with 3-5 services this is acceptable; for tiers with 50+ services it can blow the trace store budget. The fix is tier scoping — define tiers narrowly enough that every service in the tier shares incident exposure, not just organisational locality. Hotstar's streaming-tier has exactly the services that participate in a video-start request; their content-discovery tier is separate, even though both are owned by the same team. The boundary is request-path co-membership, not org-chart co-membership.

The four traps every team falls into when shipping this

Error-rate-driven sampling has a class of failure modes that catches teams who have only shipped load-driven adaptive sampling. Four traps recur across teams that have shipped an error-driven sampler.

The high-error-rate-low-volume trap is the failure that bites every team in the first month. A degraded service emits 50 errors out of 60 requests — 83% error rate, but absolute volume is 60 traces/sec. The error-rate mapping produces keep_rate = 0.5 (the ceiling), the controller raises sampling to 50%, and the kept stream from this service is 30 traces/sec. Fine — until the next bad deploy makes 100 services emit 80% error rates simultaneously. The sampler raises rates to 50% across all 100 services, the per-service kept stream is 30 traces/sec, but 100 × 30 = 3000 traces/sec is what the trace store has to ingest, and the trace store falls over. The fix is rate-by-volume protection: cap the absolute rate at min(target_rate, max_kept_per_sec / current_volume) so the sampler cannot push more than max_kept_per_sec traces through regardless of error rate. Without this, the error-rate sampler is the cause of the trace-store outage that destroys observability during the actual incident.
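A sketch of that protection, assuming a fleet-wide traces-per-second budget derived from the trace store's ingest limit; the constant is illustrative:

# rate_by_volume_cap.py — cap the error-driven keep-rate at an absolute traces/sec budget
MAX_KEPT_PER_SEC = 1000                    # assumed trace-store ingest budget, fleet-wide (illustrative)

def capped_keep_rate(target_rate, fleet_volume_per_sec, budget=MAX_KEPT_PER_SEC):
    # target_rate comes from the error-rate mapping; the volume cap wins whenever it is lower
    if fleet_volume_per_sec <= 0:
        return target_rate
    return min(target_rate, budget / fleet_volume_per_sec)

# 100 degraded services at 60 RPS each, mapping says keep 50%: uncapped that is 3,000 kept traces/sec;
# the cap lowers the effective rate to 1000/6000 ≈ 0.17 so the kept stream stays inside the budget.
print(round(capped_keep_rate(0.5, 100 * 60), 3))   # 0.167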

The flapping-error-rate trap comes next. A service doing 2 RPS during off-peak emits roughly one error per minute. Each rolling 30-second window holds about 60 requests, so the window error rate flips between 0% (no error in the window) and 1.7% (one error in the window) every 30 seconds. The controller flips between the floor and roughly 30%. The kept stream strobes — under one trace per window at the floor, close to twenty when the controller fires — and downstream aggregations see a sample-rate-modulated signal. The fix is smoothing on the error-rate signal: an EMA or a Bayesian estimator with a Beta prior. The Beta prior is the more theoretically clean answer — rate_estimate = (errors + α) / (total + α + β), where (α, β) are pseudo-counts representing prior belief. With α=0.5, β=99.5 (representing "I expect about a 0.5% baseline"), a service emitting 1 error in 60 requests has estimated rate 1.5 / 160 = 0.94%, not the raw 1/60 = 1.67%, and the controller's response is correspondingly damped. Cred shipped flapping behaviour in their first error-driven sampler in 2024 and switched to a Beta prior in the second iteration after their dashboards strobed every 30 seconds.

The missing-upstream-context trap is the deepest of the four. The error-rate controller correctly raises the keep-rate for the failing service. But the trace's parent span — the upstream service that called the failing one — was sampled by a separate controller (the upstream service's own keep-rate, which is at floor because the upstream service is not erroring). The kept trace has the error span at the leaf but the parent spans were probabilistically dropped by the upstream sampler. The operator opens TraceQL, finds the error span, and sees no parent context — the trace looks like an orphaned error with no upstream causality. The fix is trace-id-aware coordinated sampling: when the leaf service decides to keep an error trace, every upstream service in the same trace must keep its span too. The OTel SDK's ParentBased sampler is the standard mechanism — child services inherit the parent's sampling decision via the W3C traceparent header's sampled flag. But error-driven dynamic sampling reverses this: the leaf service decides, not the parent. The architectural pattern is head-coordinated keep + tail-error-promote: the head sampler at the entry point keeps trace_ids based on a consistent hash, and the tail sampler at the collector promotes error traces from the dropped pool when error rate climbs. The promotion rewrites the kept set after-the-fact, but the upstream spans are still missing — they were never sent. The deeper fix is always sample all spans, decide at the collector — which is what tail sampling already does — and the error-rate controller modulates only the probabilistic baseline policy inside the tail sampler, never head sampling. Razorpay learned this the hard way in 2024: their first error-driven sampler ran at the SDK and produced orphaned error traces during incidents; the post-incident review forced them to move the controller to the collector tier and accept the higher steady-state bandwidth cost.

The baseline-poisoning trap is the fourth and the one that bites teams over months, not minutes. The baseline-deviation shape (shape 3) computes "normal" error rate from the rolling 7-day median. After a major incident, the 7-day window includes the incident's elevated error rate. The baseline median climbs, the IQR widens, and the next incident at the same scale produces a smaller z-score and the controller fails to fire. Each successive incident raises the baseline further. After 3-4 incidents in a quarter, the controller has been "trained" to consider 5% error rate normal and only fires on 12% spikes. The fix is incident-window exclusion: when the controller fires (z > 2), exclude that window from the baseline calculation. Datadog's APM error-tracking implements this; in-house implementations often skip it and discover the regression months later when the controller stops firing on the incidents that originally inspired it. The discipline is to write the baseline math with an explicit "exclude incident windows" clause and a UI for the on-call to mark a window as "incident, do not include in baseline" so manual corrections survive in the persistent baseline.
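A sketch of the exclusion clause, assuming the baseline store records one (error-rate, controller-fired) pair per window; a production version would persist seven days of these and age them out:

# baseline_exclusion.py — rolling baseline that skips windows the controller flagged as incidents
import numpy as np

class Baseline:
    def __init__(self):
        self.windows = []                          # (err_rate, was_incident) per 30s window

    def record(self, err_rate, controller_fired):
        self.windows.append((err_rate, controller_fired))

    def median_iqr(self):
        clean = [r for r, incident in self.windows if not incident]   # the exclusion clause
        if len(clean) < 10:
            return None, None                      # not enough clean history: hold the previous baseline
        q1, med, q3 = np.percentile(clean, [25, 50, 75])
        return med, max(q3 - q1, 1e-6)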

A fifth trap, less common but worth naming: clock-skew-induced error-rate inversion. The controller computes error rate as errors_30s / total_30s, but the errors are timestamped at the point of failure (the leaf) while the totals are timestamped at the entry point (the head). On a fleet with a few seconds of clock skew across pods, the rolling window aligns errors and totals against different time bases; during a spike, errors from the spike window land in one bucket while the corresponding totals land in the next, and the computed error rate transiently exceeds 100% (more errors than totals in the same window). The controller's mapping function clamps to ceiling, but the diagnostic dashboard shows error_rate = 1.4 and on-call panic ensues. The fix is NTP discipline + window-edge buffering — pad the rolling window's edges by the maximum expected clock skew so spike events land in a single bucket regardless of skew. Hotstar's controller uses 35-second windows for a 5-second skew margin precisely because their edge fleet has measured 3-second 99th-percentile NTP drift on busy days.

Five lived patterns Indian teams ship in production

The OTel Collector ecosystem, Honeycomb Refinery, and Datadog APM all ship error-rate-driven samplers with overlapping feature sets. The patterns that show up across Indian production deployments — Razorpay, Hotstar, Zerodha, Flipkart, PhonePe — converge on five architectures that the documentation rarely names.

Pattern 1 — error-rate ramp under a hard rate-limit ceiling. The controller raises the keep-rate when error rate climbs, but the ceiling is enforced at a fixed traces-per-second rather than a percentage. The architectural pattern: keep_rate = min(error_rate_target_rate, max_traces_per_sec / current_volume). During the IPL final cascade described at the chapter open, the percentage-based target says "keep 50%" but the volume-based ceiling says "keep no more than 5,000 traces/sec total"; the effective rate is the smaller. Hotstar runs this exact architecture; the rate-limit is computed from Tempo's ingest budget rather than the collector's local queue, so the ceiling tracks the actual constraint.

Pattern 2 — error-rate-driven baseline plus always-keep error spans. The controller modulates only the keep-rate of OK traces. Error spans are kept by an unconditional tail-sampling status_code policy that the controller cannot lower. The combination: when error rate climbs, the OK-trace keep-rate climbs (capturing more upstream context), and the error-trace keep-rate stays at 100% (capturing the errors themselves, modulo the rate-limit ceiling from Pattern 1). The architectural reason: separating the two policies makes each policy's behaviour auditable. An on-call investigating "why are there fewer error traces during the incident than I expected" can check whether the always-keep policy fired (it did) and whether the rate-limit ceiling cut into errors (it might have); the answer is unambiguous. PhonePe runs this pattern; their debugging runbook for trace-coverage anomalies starts with "did the always-keep policy hit its rate limit during the window".
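A minimal sketch of the policy split, assuming the decision runs at the collector where the trace's error status is already known; the error-budget check stands in for Pattern 1's rate-limit ceiling:

# pattern2_split.py — errors kept by an unconditional policy, OK traces by the controller-driven one
import random

def decide(trace_has_error, ok_keep_rate, error_budget_remaining):
    if trace_has_error:
        # unconditional status_code-style policy; only the Pattern 1 ceiling can cut into it
        return error_budget_remaining > 0
    # controller-driven probabilistic policy; the only rate the error-rate controller ever moves
    return random.random() < ok_keep_rate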

Pattern 3 — per-service baseline tracking with hour-of-day bucketing. The baseline-deviation shape (shape 3) is sensitive to the chosen baseline. Indian production traffic has known structure: the payments service at 02:00 IST has different error characteristics than the same service at 19:00 IST during peak. A single 7-day median baseline averages across the entire day and produces a baseline that is wrong for both. The fix is hour-of-day baseline bucketing — the baseline is computed per (service, hour-of-day) tuple, so the controller compares the current 19:00 IST error rate against the historical 19:00 IST error rate, not the 24-hour average. Zerodha implemented this in 2024 after their controller failed to fire during the second hour of market open (because the first hour's elevated baseline pulled the median up); the per-hour bucketing fixed it. The cost is 24x more baseline storage; the benefit is that the controller fires correctly for time-of-day-specific incidents.
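A sketch of the bucketing, assuming the store keeps a rolling list of window error rates per (service, hour-of-day) pair; the structure and the minimum-history threshold are illustrative:

# hour_bucketed_baseline.py — per-(service, hour-of-day) baseline lookup for shape 3
from collections import defaultdict
import numpy as np

class HourBucketedBaseline:
    def __init__(self):
        self.buckets = defaultdict(list)           # (service, hour) -> recent window error rates

    def record(self, service, hour, err_rate):
        self.buckets[(service, hour)].append(err_rate)

    def zscore(self, service, hour, current_rate):
        hist = self.buckets[(service, hour)]
        if len(hist) < 10:
            return 0.0                             # too little history for this hour: never fire
        q1, med, q3 = np.percentile(hist, [25, 50, 75])
        return (current_rate - med) / max(q3 - q1, 1e-6)

# the 19:00 IST window is compared against historical 19:00 IST windows, never the 24-hour average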

Pattern 4 — controller telemetry as a first-class metric. The error-rate controller emits its own telemetry: the current per-service keep-rate, the smoothed error rate, the z-score (for shape 3), the time since the last update, and an explicit controller_state enum ({steady, ramping, ceiling, floor}). This telemetry is itself observed by a separate Prometheus + Grafana pipeline — one Grafana row per service-controller pair showing all five signals at once, with alerts on keep_rate < 0.005 for 10m AND service.error_rate > baseline + 2*iqr (controller is missing a real spike), keep_rate stddev_5m > 0.2 (oscillation), and controller_state = ceiling for 5m (controller is at hard rate-limit, on-call must decide whether to raise the ceiling or accept the loss). Razorpay calls this their "error-sampler cockpit"; the alerts have caught two production regressions in the last 12 months — once when a Beta prior was misconfigured and the controller stayed flat through a real spike, once when the ceiling was set too low and an entire 30-minute incident was bottlenecked at the rate-limit.
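A sketch of the cockpit's emit side, assuming the Python prometheus_client; the metric names are illustrative, not a shared convention:

# controller_telemetry.py — expose the controller's own state for the Grafana cockpit
# pip install prometheus-client
from prometheus_client import Gauge, Enum, start_http_server

KEEP_RATE = Gauge("sampler_keep_rate", "current per-service keep-rate", ["service"])
ERR_RATE  = Gauge("sampler_smoothed_error_rate", "smoothed error-rate signal", ["service"])
ZSCORE    = Gauge("sampler_error_rate_zscore", "baseline-deviation z-score", ["service"])
STATE     = Enum("sampler_controller_state", "controller state", ["service"],
                 states=["steady", "ramping", "ceiling", "floor"])

def publish(service, keep_rate, err_rate, z, state):
    KEEP_RATE.labels(service=service).set(keep_rate)
    ERR_RATE.labels(service=service).set(err_rate)
    ZSCORE.labels(service=service).set(z)
    STATE.labels(service=service).state(state)

if __name__ == "__main__":
    start_http_server(9464)                        # scrape target for the cockpit dashboards
    publish("payments", 0.30, 0.041, 6.2, "ramping")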

Pattern 5 — combined error-rate + latency-tail dynamic sampling. Some incidents are not error-rate spikes — they are p99 latency degradations with no error-rate change. A 2-second-to-12-second p99 climb on the checkout-api during BBD is invisible to an error-rate controller. The fix is to layer the error-rate controller with a parallel p99-deviation controller, both modulating the same keep-rate via a max(rate_from_errors, rate_from_p99) combiner. The p99-deviation controller fires on (p99_current - p99_baseline) / p99_baseline > 1.5 and ramps the keep-rate identically to the error-rate version. Flipkart runs this combined controller for BBD — the p99-deviation channel catches the slowdowns that don't generate timeouts, the error-rate channel catches the cascade after timeouts start firing. The combined kept stream covers both incident classes. The cost is a second controller to tune and monitor; the benefit is no incident class is invisible to the sampler.

A sixth pattern worth a paragraph: deploy-aware controller pause. After every deploy, the error rate transiently spikes — the new code's first 30 seconds always look anomalous, even when the deploy is healthy. The error-rate controller fires, ramps the keep-rate, and the trace store gets flooded with traces from the deploy's warm-up. The fix is deploy-window suppression: the controller subscribes to a deploy event stream (from Spinnaker, Argo CD, or the internal CD platform) and pauses for 60 seconds after every deploy event. The controller keeps the keep-rate at floor during the suppression window; if a real incident lands during the window, an out-of-band alert (the regular SLO burn-rate alert) catches it independently. Cleartrip ships this; their controller emits a controller_paused_for_deploy metric so the on-call can tell during an incident whether the controller is suppressed or just at floor. Without this pattern, every routine deploy looks like a partial outage to the trace store.

What error-rate-driven sampling cannot fix and where teams are surprised

Three classes of problem look like they should be solvable by error-rate-driven sampling but are not — they are properties of the design itself, not the controller.

The first is first-error-trace-of-incident lag. The controller fires on the rolling 30-second window's error rate, which means the first 30 seconds of an incident are sampled at the steady-state rate. The very first error trace of the cascade — the one that explains how the incident started — has only a floor chance of being kept. By the time the controller has ramped to 30%, the cascade is 30 seconds old and the root-cause trace is one of the 99% that were dropped at floor. The fix is pre-incident always-on sampling for "interesting" patterns: error traces with specific status codes (5xx but not 4xx), traces with specific error messages, traces from specific high-value endpoints (/checkout, /pay). These are kept at 100% regardless of controller state. The controller modulates the OK-trace volume; the always-on policy preserves the canonical incident-onset traces. The discipline is to maintain the always-on policy list as a living document; teams that ship error-rate samplers without an always-on backbone discover, on the first major incident, that the root-cause trace is missing.

The second is cross-region cascade visibility. A fleet running in ap-south-1 (Mumbai) and ap-south-2 (Hyderabad) has two collector tiers, each with its own error-rate controller. An incident that starts in Mumbai (where 70% of traffic is) cascades to Hyderabad through cross-region replication. Mumbai's controller fires; Hyderabad's controller has not yet seen its local error rate climb (because the cross-region traffic is small). The Hyderabad collector samples normal-rate during the cross-region cascade, and the post-incident analysis cannot reconstruct the Mumbai-to-Hyderabad propagation path because the kept traces were sampled by the wrong controller. The fix is global error-rate aggregation with regional rate application — the controllers exchange error-rate signals via a control-plane aggregator, and any region's controller can be triggered by another region's error rate. The cost is the control-plane dependency; the benefit is cross-region cascade visibility.

The third is chunked-payload-driven false error rates. Some services emit "chunked transfer encoding" responses where a partial-success response carries an error in a later chunk. The HTTP status code is 200 OK, but the response body's last chunk contains {"errors": [...]}. The standard tail sampler's status_code policy keeps these as OK traces, and the error-rate controller does not see them as errors at all — the spans show status 200, the controller's signal stays steady, and during a partial-cascade the controller does not fire even though 30% of responses contain in-body errors. The fix is semantic span attribute marking — every service that can emit in-body errors must mark the span with error.in_body=true and error.in_body.count=N, and the controller's signal computation must include these as errors. This is service-team-side work, not platform-team-side, and it is the single most commonly skipped piece of the error-driven sampling architecture. Razorpay's UPI service emits in-body errors for partial multi-leg-payment failures; the platform team's runbook for new services explicitly checks whether the service can emit in-body errors and adds the attribute marking before the controller is enabled.
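A sketch of the marking, using the standard OpenTelemetry Python span API (SDK and exporter setup omitted); the error.in_body attribute names follow the paragraph's convention, and the handler itself is hypothetical:

# in_body_error_marking.py — mark 200-OK responses that carry errors in the body
# pip install opentelemetry-api
from opentelemetry import trace

tracer = trace.get_tracer("payments")

def handle_multi_leg_payment(legs):
    with tracer.start_as_current_span("upi.multi_leg_payment") as span:
        failed = [leg for leg in legs if not leg.get("ok")]
        if failed:
            # HTTP status stays 200; the controller's signal computation must count these as errors
            span.set_attribute("error.in_body", True)
            span.set_attribute("error.in_body.count", len(failed))
        return {"status": 200, "failed_legs": [leg["id"] for leg in failed]}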

Common confusions

Going deeper

EMA vs Beta prior smoothing for low-volume services

The flapping-error-rate trap (above) requires smoothing the error-rate signal before plugging it into the controller. Two methods compete: exponential moving average (EMA) and Bayesian Beta-prior smoothing. EMA: smoothed = α × current + (1-α) × smoothed, with α typically 0.2-0.3 for a 5-second-equivalent window. Beta prior: rate = (errors + α) / (total + α + β) with α/(α+β) set to the prior expected error rate and α + β set to the strength of the prior (typically 100, equivalent to "I have 100 prior observations of typical error rate"). EMA is simpler but does not account for sample size — a single observation with n=10000 is treated the same as a single observation with n=10. Beta prior naturally damps low-volume samples toward the prior, which is exactly what the flapping case needs. Honeycomb Refinery's EMADynamicSampler uses EMA for compatibility with their existing infrastructure; in-house implementations at Razorpay and Cred use Beta priors after empirically discovering that EMA does not damp the low-volume case enough. The Beta prior is mathematically the more correct answer and the more code to implement; the choice depends on team familiarity with Bayesian methods.
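A sketch of the two estimators side by side on the flapping example from earlier; the constants mirror the prose:

# smoothing_compare.py — EMA vs Beta-prior smoothing for a low-volume window
def ema(prev, current, alpha=0.25):
    # weighs the current window the same whether it held 10 requests or 10,000
    return alpha * current + (1 - alpha) * prev

def beta_prior_rate(errors, total, alpha=0.5, beta=99.5):
    # prior mean alpha/(alpha+beta) = 0.5%; prior strength alpha+beta = 100 pseudo-observations
    return (errors + alpha) / (total + alpha + beta)

# the off-peak service: 60 requests per window, one error in this window
print(round(1 / 60, 4))                      # raw window rate: 0.0167 — the controller flaps on this
print(round(beta_prior_rate(1, 60), 4))      # Beta prior:      0.0094 — damped toward the prior
print(round(beta_prior_rate(0, 60), 4))      # empty window:    0.0031 — not a hard zero either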

The combined error+latency controller and how to combine signals

The Pattern 5 combined controller takes keep_rate = max(rate_from_errors, rate_from_p99). The max combiner is the simplest correct choice — both signals can independently demand higher sampling; neither can demand lower. Using min would be wrong (an error-only spike would be sampled at the latency-driven floor). Using mean would be wrong (a clean error spike would be diluted by the latency channel's floor). Using a weighted sum requires choosing weights, which couples the two channels in a way that is hard to reason about during an incident. The max combiner has the property that each channel can be tuned independently — the error-rate mapping function and the p99-deviation mapping function are separate code paths with separate tunings, and the combiner just picks the larger output. Flipkart's BBD controller uses max; Cleartrip experimented with weighted sum and reverted to max after the weights drifted out of tune.
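A sketch of the combiner, reusing the simulation's error-rate mapping; the p99 channel's trigger and ramp thresholds are illustrative:

# combined_controller.py — error-rate and p99-deviation channels combined with max()
import math

def rate_from_errors(err_rate, floor=0.01, ceiling=0.5):
    log_in = math.log10(max(err_rate, 1e-4))
    return floor + (ceiling - floor) * max(0.0, min(1.0, (log_in + 3) / 2))

def rate_from_p99(p99_current, p99_baseline, floor=0.01, ceiling=0.5):
    # fires when (p99 - baseline) / baseline > 1.5, ramps linearly up to a 5x degradation
    deviation = (p99_current - p99_baseline) / max(p99_baseline, 1e-6)
    if deviation <= 1.5:
        return floor
    return min(ceiling, floor + (ceiling - floor) * (deviation - 1.5) / 3.5)

def combined_rate(err_rate, p99_current, p99_baseline):
    # max: either channel can demand more sampling, neither can demand less
    return max(rate_from_errors(err_rate), rate_from_p99(p99_current, p99_baseline))

# checkout-api: p99 climbs 2s -> 12s with no error-rate change; the latency channel carries the rate
print(round(combined_rate(0.003, 12.0, 2.0), 3))   # 0.5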

Why the controller must be on the collector tier, not the SDK

The missing-upstream-context trap (above) forces the controller onto the collector tier. An SDK-side error-rate controller cannot see the trace until the trace is complete; it has only the local span's error status. The leaf service decides to keep the trace, but the upstream services have already transmitted (or dropped) their spans based on the head-sampling decision made before any error happened. The kept trace is orphaned. The fix is to always emit all spans from all services, batch them at the collector, and let the collector's tail sampler — which sees the full trace tree — make the keep decision based on the trace's error status and the controller's current keep-rate. The architectural cost is the collector's bandwidth (every span on the wire, not just the sampled ones); the benefit is correct cascade visibility. Razorpay shipped SDK-side error-driven sampling first, hit the orphaned-trace problem during their first major incident, and migrated the controller to the collector tier. The migration took six weeks and required upgrading every service's OTel SDK to emit spans unconditionally; teams that build the architecture this way from day one save the migration.

Reservoir sampling for the error-rate ceiling

Pattern 1's hard rate-limit ceiling can be implemented as a reservoir sampler rather than a probabilistic drop. The reservoir holds a uniform random sample of N error traces drawn from the stream seen so far, with new arrivals replacing existing slots at decreasing probability via Vitter's reservoir-sampling algorithm. The controller's effect is to set N — when error rate climbs, N grows from the steady-state 200 to a peak of 5,000. Because the sample is uniform over the whole stream rather than first-come-first-kept, the kept set is chronologically representative: every minute of the incident is equally represented, and the reservoir cannot fill up in the first 60 ms the way the chapter-opening rate limit did. By contrast, a probabilistic ceiling keeps roughly 5% of every minute, which is also time-uniform but makes no count guarantee — the incident's edges (the first error-onset, the last error-resolution), where error volume is low, may contribute no kept traces at all. AWS X-Ray's adaptive sampling uses reservoir-style sampling for the same reason; in-house controllers at Razorpay and Hotstar use reservoir sampling for the error-rate ceiling specifically.
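A sketch of the reservoir using the classic Algorithm R replacement rule; the resize hook is where the controller sets N, and resizing mid-stream is an approximation of strict uniformity:

# error_reservoir.py — uniform reservoir over the error-trace stream; the controller sets the capacity
import random

class ErrorReservoir:
    def __init__(self, capacity=200):
        self.capacity = capacity                   # steady-state 200; controller grows this toward 5,000
        self.seen = 0
        self.kept = []

    def offer(self, trace_id):
        self.seen += 1
        if len(self.kept) < self.capacity:
            self.kept.append(trace_id)
            return True
        j = random.randrange(self.seen)            # Algorithm R: replace with probability capacity/seen
        if j < self.capacity:
            self.kept[j] = trace_id
            return True
        return False

    def resize(self, new_capacity):
        self.capacity = new_capacity
        self.kept = self.kept[:new_capacity]       # shrink trims; growth fills from the live stream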

The "controller state at incident start" problem and warm starts

When an incident starts, the controller is in steady state — keep-rate at floor, baseline at the historical median, integrator (if any) at zero. The controller takes 30 seconds to ramp to ceiling. During those 30 seconds, the kept stream is undersampled relative to the incident's information content. The fix is warm starts triggered by external events: when an SLO burn-rate alert fires, when a deploy happens, when an external monitoring system flags a regional issue, the controller pre-ramps to a medium keep-rate (say 10%) for 60 seconds, then resumes its normal control loop. The warm start does not commit to ceiling — the controller's normal loop will lower the rate if no real spike materialises — but it covers the 30-second ramp-up gap. Hotstar's controller is warm-started by the SLO burn-rate alert system; on the first 30 seconds of every burn-rate alert, the trace stream is sampled at 10% rather than 1%, giving the on-call usable evidence at t=0 of the alert.
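A sketch of the warm-start hook, assuming the external events arrive as callbacks from the alerting or CD system; the warm rate and duration are illustrative:

# warm_start.py — pre-ramp the keep-rate when an external event fires, then hand back to the controller
import time

WARM_RATE, WARM_SECONDS = 0.10, 60
_warm_until = 0.0

def on_external_event(event_type):
    # SLO burn-rate alert, deploy event, or a regional-issue flag from external monitoring
    global _warm_until
    _warm_until = time.time() + WARM_SECONDS

def effective_rate(controller_rate):
    # covers the controller's 30-second ramp-up gap; the normal loop takes over afterwards
    if time.time() < _warm_until:
        return max(controller_rate, WARM_RATE)
    return controller_rate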

Reproducibility footer

# Reproduce the error-rate-driven sampler measurement on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install pandas numpy
python3 error_rate_dynamic_sampler.py
# Expected: a nine-row table showing kept_traces, errors_emitted, errors_kept,
# err_kept_pct, avg_rate, peak_rate for each (shape, service) pair across the
# 300-second simulation. Shape 2 (tier-aggregated) should keep the highest err_kept_pct
# uniformly across services; shape 3 (baseline-deviation) should keep the fewest
# but with higher per-trace signal. Shape 1 (service-local) should track per-service
# error rate closely. To run a real OTel Collector with error-rate-driven sampling,
# see the Honeycomb Refinery EMADynamicSampler:
# https://docs.honeycomb.io/manage-data-volume/refinery/sampling-methods/#emadynamicsampler

Where this leads next

Error-rate-driven dynamic sampling solves the "errors get rate-limited at the moment they matter most" problem at the cost of two new failure modes: rate-limit-driven trace-store outage during high-error incidents, and missing upstream context when the controller runs at the SDK. Why the upstream-context failure is the deeper of the two: rate-limit-driven outages are an operational tuning problem — you raise the rate-limit, you size the trace store, you ride out the cost. Missing upstream context is an architectural problem — once the upstream span has been dropped by head sampling, no controller at the collector can recover it. The trace is permanently incomplete. The discipline is "always emit all spans from all services, decide at the collector"; teams that ship error-driven sampling without paying this architectural cost discover, the first time they need to debug a real cascade, that they have built a sampler that captures error spans cleanly and explains nothing about how the cascade started. The next chapter — cardinality-the-master-variable — pivots from the trace pillar to the metrics pillar, where the same "what do you keep" question is solved with a different toolkit: label-design discipline, cardinality budgets, and the recording-rule pattern. The chapters after this one move from sampling rate to sampling policy: which spans, which traces, which categories, and how to make those choices auditable.

The single most useful thing the senior reader should walk away with: error-rate-driven sampling is the opposite of adaptive sampling — when load spikes, adaptive sampling lowers the rate; when errors spike, error-driven sampling raises it. The two controllers can fight each other if both are deployed naively (a load spike is also typically an error-rate spike, so adaptive wants to drop and error-driven wants to raise). The production architecture composes them: the load-driven controller proposes the keep-rate the pipeline can afford, the error-driven controller demands a higher rate when errors cluster, and the effective rate is the larger of the two. The reader who deploys only one will discover, the first time their service has a load spike that triggers a cascade, that they need both.

A team that has shipped error-rate-driven sampling and run it through one major P1 incident has earned the right to claim the controller is tuned. Before that, the controller is a configuration that has not yet seen what your real production incident looks like — the cascade pattern, the time-of-day effect, the cross-region propagation. Plan for at least one tuning iteration after each major incident. The gains that worked in the previous incident are evidence; they are not yet a tuning. The controller is a living object, and post-incident review's first question is "did the sampler give us the evidence we needed". If the answer is no, the next iteration of the controller is part of the post-incident action items.

The closing reframing: every sampler in this part has a bias profile, and error-rate-driven sampling's bias is incident-time signal preservation paid for with bursty incident-time spend. During steady state the sampler runs at floor and costs little; during incidents it ramps to 10x the steady-state cost for 60-300 seconds. The bias is acceptable because the cost is bounded (incidents are rare) and the benefit is concentrated (incidents are exactly when the on-call needs evidence). Sampling buys cost reduction; aggregation owes weighting; error-driven sampling buys signal at the moment the system most needs it. The three trades — cost, weighting, signal — are inseparable in the design of any production sampler.
