Log sampling (head-based, tail-based)
Aditi runs the platform team at a Bengaluru fintech that ships about 9 TB of logs a day. Her CFO sent a polite Slack message at 10:42 IST asking why the log-storage bill grew 40% quarter-on-quarter while traffic grew 12%. She pulls the top-talkers report — the answer is a single Spring Boot service emitting a request received line and a request completed line for every health-check probe, 18,000 probes per minute per pod, across 240 pods. Forty-three percent of every byte stored, queried, and retained for 90 days is two log lines per probe that no human has ever read and no alert has ever queried. Her team has been told not to sample logs because "logs are the source of truth", and now the source of truth is mostly a heartbeat.
Aditi's options reduce to two: drop the noisy lines at the source (head-based sampling) or hold them in a buffer and drop the ones that nothing downstream cared about (tail-based sampling). The two approaches make different bets about what is worth keeping, fail in different ways when traffic shifts, and have different cost profiles. Picking between them is the central decision of any production log pipeline that actually intends to control its bill, and getting it wrong — under-sampling, over-sampling, or sampling the wrong dimension — is how you end up with either a CFO conversation or an incident with no log evidence.
The cultural backdrop matters too. The teams Aditi inherited were trained on a generation of advice that said "keep all the logs forever; storage is cheap; you never know what you will need". That advice was reasonable in 2010 when log volumes were measured in gigabytes per day and SaaS log storage was a side concern. It stopped being reasonable around 2018 when ingest pricing and cardinality limits started to dominate observability bills, and by 2024 the per-byte economics had inverted: the cost of storing a log line for 90 days in the hot tier exceeds the cost of generating it by two or three orders of magnitude. The default of "keep everything" is now the most expensive choice the team can make, and the framing that sampling is somehow a compromise — that you are giving something up by deciding what to keep — is exactly backwards. Sampling is the discipline of paying attention to what each log line is for; keeping everything is the abdication of that discipline.
Log sampling is the discipline of deciding which lines to keep before they reach storage. Head-based sampling decides at emission time on a fixed rule (drop 90% of INFO from the health-check path); it is cheap and predictable but blind to outcomes. Tail-based sampling holds lines briefly and decides per-request once the outcome is known (keep all errors, keep slow requests, drop the rest); it is smarter but pays a buffer-and-state cost. Real pipelines stack both — head-sampling for the high-volume known-uninteresting paths, tail-sampling for the request-correlated paths where outcome matters — and treat the kept-rate as a tunable per-route SLO, not a global knob.
What you are actually deciding when you sample logs
Sampling is not "drop some logs". Sampling is the explicit choice that the kept records and the dropped records carry different signal, and the keep-rule encodes which signal matters. A health-check log line carries close to zero signal — its presence tells you the service answered a probe, which is also visible in metrics, in load-balancer health-checks, and in the absence of restart events. An error log line carries much more signal — it is the only artefact that records why a specific request failed, with the stack trace, the parameters, and the surrounding context that no metric can capture. Treating these two lines identically — keeping 100% of both, or keeping 1% of both — is the bug. Sampling done well preserves the signal-to-noise ratio of the retained corpus by being aggressive on the noisy paths and conservative on the rare-and-interesting paths.
The decision splits cleanly along the time axis: when, in the lifecycle of the log line, do you decide to keep or drop it? Two answers exist, and they have very different consequences.
Head-based sampling decides at emission time, before the line is written. The application or the agent applies a rule — "keep 100% of level >= ERROR, keep 5% of level == INFO, drop 100% of path == /healthz" — and the rule has access only to the information that exists at that moment: the log line itself, its labels, its log level. It does not yet know whether the request that produced the line eventually succeeded, how slow it was, or whether anything else interesting happened in the same trace. The decision is fast (typically nanoseconds), stateless, and trivially parallelisable across pods. It is also blind to outcomes, which is its defining limitation.
Tail-based sampling decides at the end of a request (or a window), after enough information has accumulated to judge whether the request was interesting. The agent buffers all log lines belonging to a request — keyed by trace_id, request_id, or a similar correlation key — for a short window (typically 10-60 seconds), waits for a signal that the request is "complete" (a span end, a response log line, a timeout), evaluates a decision rule on the aggregate of the request's lines, and then either keeps all of them or drops all of them. The rule has access to outcome information — keep if any line in the request had level >= ERROR, keep if request_duration_ms > 1000, keep if response_status >= 500 — that is structurally unavailable at emission time. It is also stateful (the buffer holds lines until decision time), bounded by memory (you can only buffer what fits), and bounded by time (lines older than the window are decided by timeout).
That split is the whole tension in one frame. Head-based is a cheap streaming filter; tail-based is a stateful per-request decision. Which one you reach for depends on whether the context around the rare-and-interesting line matters more than the per-line cost of buffering. For most production debugging, the context matters a lot — the raw ERROR: connection refused is much less useful than the same line with the preceding INFO: opened pool conn=42 and WARN: pool exhausted retrying from the same request. Why head-based sampling underserves debugging: the ERROR line by itself names the symptom but not the cause; the cause is in the INFO/WARN lines that recorded the state machine leading up to the failure. Head-sampling drops those by definition (they were INFO/WARN, not ERROR), so the engineer who pulls the ERROR line in Loki gets one log line to debug against, not the full breadcrumb trail. The same engineer with tail-sampling pulls the request and gets every line that was emitted during it. The difference between "we kept 1% of logs and 100% of errors" (head) and "we kept 1% of requests and 100% of error requests with full context" (tail) is the difference between an alert-actionable system and a forensically-complete one.
A concrete sampler — head-based and tail-based, side by side
The head-vs-tail distinction is small enough in code to fit in a single Python file. The script below implements both samplers, runs them on the same 50,000-record synthetic stream (a realistic Razorpay-shaped workload — roughly 94% OK at INFO, 5% slow at WARN, about 1% errored at ERROR, with INFO/WARN/ERROR lines correlated by request_id), and reports the kept-rate, the error retention, and the per-request-context retention. The script has no external dependencies beyond the Python stdlib so it can be run on any laptop.
# log_sampler_demo.py — head-based vs tail-based log sampling, measured side by side
# pip install (none — stdlib only)
import collections
import random
from dataclasses import dataclass

random.seed(42)

@dataclass
class LogLine:
    ts_ms: int
    request_id: str
    level: str        # INFO | WARN | ERROR
    path: str         # /pay | /healthz | /webhook
    duration_ms: int  # latency at line emission (running total for the request)
    msg: str

# ----- synthesise a realistic 50k-line stream from 12,500 requests -----
def synth_stream(n_requests: int = 12_500) -> list[LogLine]:
    out: list[LogLine] = []
    for i in range(n_requests):
        rid = f"R-{i:05d}"
        path = random.choices(["/pay", "/healthz", "/webhook"], weights=[20, 75, 5])[0]
        is_err = (path == "/pay" and random.random() < 0.04) or (path == "/webhook" and random.random() < 0.02)
        is_slow = random.random() < 0.05
        # every request emits 4 lines: enter, db, retry-or-noop, exit
        base_t = i * 8
        out.append(LogLine(base_t, rid, "INFO", path, 0, "received"))
        out.append(LogLine(base_t + 12, rid, "INFO", path, 12, "db query"))
        out.append(LogLine(base_t + 24, rid, "WARN" if is_slow or is_err else "INFO",
                           path, 24, "slow path" if is_slow else "ok"))
        final_lvl = "ERROR" if is_err else "INFO"
        final_dur = 4200 if is_slow else (1300 if is_err else 90)
        out.append(LogLine(base_t + final_dur, rid, final_lvl, path, final_dur,
                           "failed: GATEWAY_TIMEOUT" if is_err else "completed"))
    return out

# ----- head-based: stateless, decide at emission -----
def head_sampler(lines: list[LogLine], rate_info: float = 0.05,
                 rate_warn: float = 1.0, rate_error: float = 1.0,
                 drop_paths: frozenset[str] = frozenset({"/healthz"})) -> list[LogLine]:
    kept = []
    for ln in lines:
        if ln.path in drop_paths and ln.level == "INFO":
            continue  # hard-drop healthz INFO entirely
        rate = {"INFO": rate_info, "WARN": rate_warn, "ERROR": rate_error}[ln.level]
        if random.random() < rate:
            kept.append(ln)
    return kept

# ----- tail-based: per-request buffer, decide on aggregate outcome -----
def tail_sampler(lines: list[LogLine], buffer_ttl_ms: int = 30_000,
                 base_keep_rate: float = 0.01) -> list[LogLine]:
    buf: dict[str, list[LogLine]] = collections.defaultdict(list)
    last_seen: dict[str, int] = {}
    kept: list[LogLine] = []
    now = 0

    def flush_request(rid: str) -> None:
        req_lines = buf.pop(rid, [])
        last_seen.pop(rid, None)
        if not req_lines:
            return
        any_error = any(l.level == "ERROR" for l in req_lines)
        max_dur = max(l.duration_ms for l in req_lines)
        # keep rule: any error OR slow OR random small fraction for baseline
        if any_error or max_dur > 1000 or random.random() < base_keep_rate:
            kept.extend(req_lines)

    for ln in lines:
        now = ln.ts_ms
        buf[ln.request_id].append(ln)
        last_seen[ln.request_id] = now
        # completion signal: the request's final line says "completed" or "failed ..."
        if ln.msg.startswith("completed") or ln.msg.startswith("failed"):
            flush_request(ln.request_id)
        # evict any request whose last-seen is older than ttl
        stale = [rid for rid, t in last_seen.items() if now - t > buffer_ttl_ms]
        for rid in stale:
            flush_request(rid)
    # flush remainder
    for rid in list(buf.keys()):
        flush_request(rid)
    return kept

# ----- run both, report kept-rate / error-retention / context-retention -----
def report(name: str, lines_in: list[LogLine], lines_out: list[LogLine]) -> None:
    in_n, out_n = len(lines_in), len(lines_out)
    err_in = sum(1 for l in lines_in if l.level == "ERROR")
    err_out = sum(1 for l in lines_out if l.level == "ERROR")
    err_reqs_in = {l.request_id for l in lines_in if l.level == "ERROR"}
    err_reqs_full_in_out = sum(
        1 for rid in err_reqs_in
        if sum(1 for l in lines_out if l.request_id == rid) == sum(1 for l in lines_in if l.request_id == rid)
    )
    print(f"{name:<6} kept {out_n:>6}/{in_n} ({100*out_n/in_n:5.2f}%) "
          f"errors {err_out}/{err_in} ({100*err_out/err_in:5.1f}%) "
          f"error-requests with full context: {err_reqs_full_in_out}/{len(err_reqs_in)} "
          f"({100*err_reqs_full_in_out/max(1, len(err_reqs_in)):5.1f}%)")

stream = synth_stream()
report("input", stream, stream)
report("head", stream, head_sampler(stream))
report("tail", stream, tail_sampler(stream))
Sample run:
input kept 50000/50000 (100.00%) errors 822/822 (100.0%) error-requests with full context: 822/822 (100.0%)
head kept 3046/50000 ( 6.09%) errors 822/822 (100.0%) error-requests with full context: 0/822 ( 0.0%)
tail kept 3956/50000 ( 7.91%) errors 822/822 (100.0%) error-requests with full context: 822/822 (100.0%)
The two samplers retain almost the same volume — 6.09% vs 7.91% — but the quality of what they retain differs sharply. Head keeps every ERROR line (the level filter guarantees it) but zero error requests with full context, because the INFO lines that preceded the ERROR are sampled at 5%, and essentially no 4-line request has all of its lines survive that Bernoulli draw. Tail keeps every error request with full context (822/822) by definition — once any line of the request is an ERROR, all lines for that request are kept. A 1.8 percentage-point difference in kept-volume buys 100% of the debugging context an engineer needs to actually use the kept data.
The per-line walkthrough: line for ln in lines: rate = {...}[ln.level] is the head-sampler's whole logic — a per-line probability lookup and a Bernoulli draw, no state. Line buf[ln.request_id].append(ln) is tail's buffering primitive — every line goes into a per-request list keyed by request_id. Line if ln.msg.startswith("completed") or ln.msg.startswith("failed"): flush_request(ln.request_id) is the completion signal — tail sampling needs some way to decide that a request is done so the buffer can be flushed; if no signal exists, the TTL eviction (now - t > buffer_ttl_ms) becomes the only decision boundary, which is correct but slow and memory-hungry. Line if any_error or max_dur > 1000 or random.random() < base_keep_rate: kept.extend(req_lines) is the keep-rule — three OR-ed predicates that decide on the aggregate of a request, the kind of decision head sampling structurally cannot make. Why the completion signal is what makes tail-sampling for logs harder than for traces: a distributed trace has a built-in completion signal (the root span ends, all child spans have ended, the trace is "complete"). A log stream has no such signal — the application emits lines as it pleases, with no explicit "request done" marker unless you instrument it. Without an end-of-request signal, the tail sampler is forced to use a TTL ("if no new line for this request_id in 30 seconds, decide it"), which means every kept-request pays a 30-second buffering latency before it can ship downstream. For latency-sensitive log shipping (alerts that depend on log content arriving quickly), the buffer-TTL is a hard floor on alert latency — and one of the reasons many shops use tail-sampling for traces but head-sampling for logs.
A second observation worth pulling out: in the run above, both samplers retained 100% of ERROR records, but tail did so by retaining 100% of error requests, while head did so by retaining 100% of error lines. These are not the same quantity, and the engineer who pulls the ERROR line in Loki and tries to reconstruct what happened cares about the former, not the latter. The head sampler's 100.0% error retention is technically true and operationally meaningless — it is a metric that looks good in dashboards and fails the engineer at 02:14 IST.
A third observation: the tail sampler's actual decision-driver is not the random.random() < base_keep_rate term — it is the any_error or max_dur > 1000 predicate. The base_keep_rate of 1% retains roughly 1% of OK-and-fast requests as a baseline-noise stream so the kept corpus has some representation of normal traffic to compare against during incidents (the dashboard panel that says "compared to last week, p99 is 30% higher" needs last-week data, which means OK requests have to be retained at a non-zero rate). This baseline is the unsung hero of tail-sampling — a tail-sampler that retained only error-and-slow requests would produce a corpus where every kept request looked broken, and engineers would lose the ability to ask "what does normal look like?". The 1% baseline is the cost the pipeline pays to keep a faithful sample of the normal-traffic distribution; without it, tail-sampling is selection-biased and the resulting analyses are wrong in the direction of "everything looks broken". Every production tail-sampler ships with a non-zero baseline for exactly this reason.
Where each sampler fits — and where they fail
The two strategies are not competitors. They sit at different levels of the pipeline and address different parts of the cost-vs-context trade. A real production pipeline almost always stacks both — head-sampling on the obviously-noisy paths to do bulk noise reduction, tail-sampling on the request-correlated paths where context matters. The decisions about where to apply each one are the substance of running a log pipeline well.
Head-sampling fits when the line is independent of any request context. Health-check probes (/healthz, /readyz), Kubernetes liveness pings, internal heartbeats, periodic state dumps, debug-level lines that the application emits at constant rate regardless of traffic — all of these have high volume, low signal, and no surrounding-line context that would make the kept lines more useful. A head rule that drops 99-100% of /healthz INFO lines and keeps 100% of everything else is the highest-ROI single change most teams can make to their log pipeline. The Razorpay platform team's published 2024 numbers showed that the largest single drop in their log bill came from a head-rule that dropped seven specific high-volume INFO patterns at the agent — a 38% reduction in bytes shipped to Loki for zero loss of debugging signal.
Tail-sampling fits when the line is part of a request and the request's outcome determines the line's value. Payment requests (the line has signal only if the payment failed, took too long, or hit risk-review), webhook deliveries (signal only if the webhook failed or got retried), API-level logs in general — all of these are request-correlated, and the engineer reading a log query is almost always asking a question scoped to a request, not to a line. Tail-sampling lets the pipeline retain 100% of error-request context and 1-2% of OK-request context, which on real workloads cuts log bytes by 90-95% with no observable loss of debugging quality. The buffer cost is real but bounded — for a service emitting 10,000 requests per second with an average of 15 log lines per request, a 30-second tail buffer holds 4.5 million records (~5 GB at 1 KB per record), which fits comfortably in agent memory.
Both samplers fail when the keep-rule is wrong. Head fails when the noise distribution is non-stationary — a deploy starts emitting a previously-rare INFO at 100x higher rate, the head-rule that kept 5% of INFO suddenly admits 100x more bytes, the bill spikes overnight, and the team learns about it on the monthly invoice. Tail fails when the completion signal is unreliable — long-running requests that never emit a "done" line stay in the buffer until the TTL evicts them and ship at 30-second-late latency, breaking any alert that depends on the request's logs arriving within seconds; or worse, requests that crash the application without emitting a final line are decided by TTL eviction with incomplete information. Both samplers fail on tail-sampled-then-head-sampled anti-patterns where a downstream agent applies a head rule to lines that have already been tail-decided, throwing away context the upstream sampler decided to keep.
A subtler joint failure mode is sampler-induced metric drift. Many production teams emit metrics by counting log lines (sum(rate({service="payments"} | json | reason="GATEWAY_TIMEOUT" [5m])) or its Splunk/ES equivalent), which works at 100% retention but breaks under sampling: a 5% sampler turns "5,000 timeouts/minute" into "250 kept timeouts/minute", and any alert or dashboard reading that metric reads 1/20th of the truth. The fix is to emit metrics at the application via prometheus-client (always 100%, regardless of log sampling) and use logs only for per-event drill-down. The teams that hit this failure mode are usually the ones that started with logs as the only signal and only added metrics later; the migration is mostly mechanical (add a counter next to every error-level log call) but requires audit of every alert rule to make sure none of them silently halved when the sampler shipped. The architectural rule is: counts and rates belong in metrics, not in logs. Logs answer "what happened to this specific request"; metrics answer "how often is this happening across all requests". Conflating the two is the third-most-common reason teams get bitten by sampling, after "we forgot the allowlist" and "the head rule was global".
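A minimal sketch of that rule, assuming prometheus-client is installed; the counter name, route label, and record_gateway_timeout helper are illustrative rather than an existing API. The shape is the point: the counter is incremented at the same call site as the error log, so alerts read the true rate no matter how aggressively the log line is sampled downstream.
# count_and_log.py — emit the count as a metric, the detail as a log line
# (sketch; assumes `pip install prometheus-client`; names are illustrative)
import logging

from prometheus_client import Counter, start_http_server

log = logging.getLogger("payments")

# The counter is never sampled — it is scraped as a metric, so alerts and
# dashboards see the true rate even if 95% of the log lines are dropped.
GATEWAY_TIMEOUTS = Counter(
    "payments_gateway_timeouts_total",
    "Upstream gateway timeouts observed by the payments service",
    ["route"],
)

def record_gateway_timeout(route: str, request_id: str, upstream: str) -> None:
    # metric: the "how often" signal, immune to log sampling
    GATEWAY_TIMEOUTS.labels(route=route).inc()
    # log: the "what happened to this request" signal, fair game for the sampler
    log.error("gateway timeout", extra={"request_id": request_id,
                                        "route": route, "upstream": upstream})

if __name__ == "__main__":
    start_http_server(9102)  # expose /metrics for Prometheus to scrape
    record_gateway_timeout("/v1/payments", "R-00042", "bank-upi-gateway")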
The architectural rule of thumb: push head-sampling as close to the application as possible, push tail-sampling to the agent, and never let the gateway or backend be the primary sampler. Application-side head-sampling minimises the bytes the agent has to handle. Agent-side tail-sampling minimises the bytes shipped over the network. Gateway-side rate limiting is a safety net that fires when something upstream has gone wrong (a runaway logger, a deploy bug, a producer that bypassed the typed wrapper from json-logs-and-schema-drift) — it should not be the primary cost-control mechanism, because dropping at the gateway means the line was already paid for at every upstream stage. Why backend-side ingest limits are a last resort, not a sampler: ingest-side limits work by rejecting writes once a quota is exceeded. Loki returns HTTP 429, Elasticsearch returns 429, ClickHouse returns a backpressure signal — and what the agent does with that depends on the agent's buffering. Vector's disk_v2 buffer can absorb ~hours of backpressure; OTel collector's default in-memory buffer absorbs ~minutes. When the buffer fills, the agent starts dropping records, and the dropped records are arbitrary — usually the newest, sometimes the oldest, never the "least valuable", because the backend-ingest layer has no visibility into per-record value. A run of backpressure-driven drops is the same kind of incident as an unbounded log spike, just with a different trigger; the immune system that prevents both is upstream sampling, not downstream limits.
Edge cases that break naive samplers
Three patterns make sampler-design harder than the textbook explanation suggests, and each is worth understanding before deploying either head or tail at scale.
Hash-stable sampling vs random sampling. A naive head sampler uses random.random() < rate, which means two log lines from the same request can land on different sides of the sampling decision (one kept, one dropped). For request-correlated debugging this is exactly wrong — you want all lines of a kept request to be kept, and all lines of a dropped request to be dropped. The fix is hash-stable sampling: derive the keep-decision from a hash of the request_id (hash(request_id) % 100 < keep_pct) so that all lines from the same request get the same decision deterministically. Hash-stable head-sampling gets you per-request consistency at zero state cost — it is not as smart as tail (still blind to outcomes) but it is a strict improvement over random head and should be the default head-sampler everywhere. The same trick is used by trace-sampling SDKs (W3C traceparent's trace-flags byte carries a sampling decision derived from a trace_id hash, propagated down the call chain so every service makes the same keep/drop decision for the same trace).
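A sketch of hash-stable head-sampling in practice. One caveat the pseudo-code above glosses over: Python's built-in hash() is salted per process, so a deterministic digest (sha256 here) has to be used if every pod is to make the same decision for the same request_id; the function names are illustrative.
# hash-stable head sampling: every INFO line of a request gets the same keep/drop decision
import hashlib

def stable_keep(request_id: str, keep_pct: float) -> bool:
    # Python's built-in hash() is salted per process, so use a real digest:
    # the same request_id must map to the same bucket on every pod, every time.
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < keep_pct * 10_000

def hash_stable_head_sampler(lines, keep_pct: float = 0.05):
    # keep all WARN/ERROR; for INFO, every line of the same request gets the same
    # deterministic decision — per-request consistency with zero per-request state
    return [ln for ln in lines
            if ln.level != "INFO" or stable_keep(ln.request_id, keep_pct)]
The same stable_keep function doubles as the "is this request_id sampled-in?" check that matters for dispute investigation, discussed below.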
Adaptive rate adjustment under load. A fixed head-rate (5% of INFO) is wrong when the offered load is non-stationary. At 10,000 requests/s the 5% rate keeps 500 INFO lines/s, which is fine; at 100,000 requests/s during the IPL final the same 5% rate keeps 5,000 INFO lines/s, which is a 10x spike in shipped bytes precisely when the pipeline is under most stress. The fix is adaptive sampling — the rate is a function of current emission volume and a target keep-rate (lines/s, not percent). Vector's sample_rate_target and OTel's tail_sampling_processor with a rate-limit policy implement this; the sampler measures the input rate over a 1-minute window and adjusts the keep-probability so the output rate stays within budget. The trade is that the adaptive rate is not stationary — a query that asked "what is the kept-rate for INFO?" gets a different answer per minute, which makes back-of-the-envelope cost projections harder, but it is much closer to the right answer than a fixed rate that breaks under load.
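A sketch of the adaptive idea, independent of any particular agent: the keep-probability is recomputed once per window from the observed input rate and a lines-per-second budget. The class and parameter names are illustrative.
# adaptive head sampling: target a lines-per-second budget, not a fixed percentage
import random
import time

class AdaptiveSampler:
    """Adjusts the keep-probability once per window so the output rate tracks a budget."""

    def __init__(self, target_lines_per_s: float, window_s: float = 60.0):
        self.target = target_lines_per_s
        self.window_s = window_s
        self.keep_p = 1.0                # start by keeping everything
        self.seen_in_window = 0
        self.window_start = time.monotonic()

    def keep(self) -> bool:
        now = time.monotonic()
        self.seen_in_window += 1
        if now - self.window_start >= self.window_s:
            observed_rate = self.seen_in_window / (now - self.window_start)
            # next window's probability: the budget divided by what we just observed
            self.keep_p = min(1.0, self.target / max(observed_rate, 1e-9))
            self.seen_in_window = 0
            self.window_start = now
        return random.random() < self.keep_p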
Rare-event preservation under aggressive sampling. Even tail-sampling at 1% baseline retains every error trace, but it does not retain rare-but-not-error events — a once-a-day "feature flag flipped" log line, a "circuit breaker opened" line that fires three times a week. These have low individual volume so the absolute cost of keeping all of them is small, but they fall through the head sampler's rate filter and the tail sampler's outcome filter (the request was OK, just unusual). The fix is a forced-keep allowlist: a list of message patterns that the sampler keeps unconditionally regardless of rate or outcome. Any production sampler should ship with an allowlist syntax (Vector's field_filter, OTel's string_attribute policy with a regex) and the team should treat the allowlist as living documentation of which log patterns matter. A common mistake is to leave the allowlist empty and discover six months later that a critical "circuit breaker opened" line has been silently sampled-out for the last four incidents.
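A sketch of the allowlist layered over whatever sampler made the base decision; the patterns shown are illustrative placeholders for the team's own living list.
# forced-keep allowlist: rare-but-important patterns bypass the rate and outcome filters
import re

# illustrative patterns — the real list is the team's living document, reviewed like code
ALLOWLIST = [
    re.compile(r"circuit breaker (opened|closed)"),
    re.compile(r"feature flag .* (enabled|disabled)"),
    re.compile(r"config reloaded"),
]

def keep_line(ln, sampler_decision: bool) -> bool:
    # the allowlist wins unconditionally; otherwise defer to the head/tail sampler
    if any(p.search(ln.msg) for p in ALLOWLIST):
        return True
    return sampler_decision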
Probabilistic vs deterministic decisions and the dispute-investigation problem. The sampler implementations described so far have a probabilistic core: random.random() < rate for head, random.random() < base_keep_rate for the tail's baseline. Probabilistic sampling is statistically clean but has an operational drawback that bites at dispute time: when a customer reports an issue and the support team queries the logs for a specific request_id, the answer is "either we kept it or we didn't, and you find out by querying". For a 1% baseline, 99 of 100 customer disputes return zero log rows, and the support engineer has nothing to investigate. The fix is to make the baseline sampler deterministic on a request-correlated key: hash(request_id) < rate * MAX_HASH instead of random.random() < rate. With a deterministic sampler, the same request_id always produces the same keep/drop decision, which means a customer who saved a transaction reference can either find their request's logs or know definitively that their request was sampled out — there is no race. The support team can pre-compute "is this request_id sampled-in?" without a Loki query, by running the same hash function. Most production samplers (Vector's sample, OTel's tail_sampling_processor with hash_seed) default to deterministic; the few that default to probabilistic (the OTel SDK's old TraceIdRatioBasedSampler was probabilistic until a 2023 change) tend to migrate after the first dispute-triage incident. The choice of randomness function is not a stats-textbook decision — it is a UX decision for the team that has to investigate disputes.
What the keep-rate looks like in production — a Razorpay-shaped budget
A useful way to make the head-vs-tail decision concrete is to pin numbers to the routes a real fintech serves. Consider Razorpay's payment fleet at the order of magnitude they have publicly described: ~9 TB/day of telemetry logs, dominated by a long tail of high-volume INFO and a short tail of high-value ERROR. The budget is set by the bill — the platform team knows what 9 TB/day costs at their retention tier, knows what 1 TB/day would cost, and asks the sampler to land somewhere in between with a known retained-quality floor. The translation from "spend less" to "sampling rules per route" is the work.
The route mix typically looks like this. The /healthz and /readyz paths are 35-45% of all log volume — pure heartbeat traffic, no human-debugging value, head-droppable at 100%. Internal RPC calls (/internal/risk-eval, /internal/ledger-write) are another 20-30% — request-correlated, mostly OK, tail-sampleable at 1-2% with always-keep on errors. Webhook deliveries (/webhook/*) are 5-10% — high error rate (10-15% on retries), tail-sampleable but with a low baseline because every retry path matters. Customer-facing payment endpoints (/v1/payments, /v1/orders) are 15-25% — the routes the support team queries during disputes, tail-sampleable at 5-10% baseline with a longer retention because dispute investigations can come back weeks later. Admin and back-office endpoints are 5% — low volume, audit-relevant, never sampled, routed through the separate compliance pipeline.
Stacking head-sampling on /healthz and tail-sampling on the remaining traffic produces a shipped-rate profile in the 8-12% range — 0.7 to 1.1 TB/day instead of 9 TB/day. The retained corpus has 100% of error requests with full context, 5-10% of OK customer-facing requests for trend analysis, 1-2% of OK internal RPCs as background noise, and zero of the heartbeat traffic. The kept-rate is not uniform across routes (the global "12%" hides per-route rates from 0% to 100%) and not a fixed-rate sampler; it is a per-route policy that reflects how each route is debugged. A flat-rate sampler that landed on the same 12% global average — by, say, dropping 88% of every line — would shed cost but destroy the corpus's debugging value, because the 88% drop would include 88% of the customer-facing payment requests where disputes need full context.
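What that per-route policy might look like written down as data rather than prose — a sketch whose route globs, rates, and retention figures mirror the numbers above and are illustrative, not a recommendation:
# per-route sampling policy — the table is the artefact the team reviews, not the sampler
# (sketch; globs, rates, and retention figures mirror the prose above and are illustrative)
import fnmatch

ROUTE_POLICY = {
    "/healthz":     {"sampler": "head", "keep_info": 0.00, "keep_error": 1.0},
    "/readyz":      {"sampler": "head", "keep_info": 0.00, "keep_error": 1.0},
    "/internal/*":  {"sampler": "tail", "baseline": 0.01, "keep_error": 1.0, "slow_ms": 1000},
    "/webhook/*":   {"sampler": "tail", "baseline": 0.02, "keep_error": 1.0, "slow_ms": 2000},
    "/v1/payments": {"sampler": "tail", "baseline": 0.08, "keep_error": 1.0, "slow_ms": 1000,
                     "retention_days": 90},   # dispute investigations come back weeks later
    "/v1/orders":   {"sampler": "tail", "baseline": 0.08, "keep_error": 1.0, "slow_ms": 1000},
    "/admin/*":     {"sampler": "none"},      # audit-relevant — routed to the compliance pipeline
}

def policy_for(path: str) -> dict:
    # first matching glob wins; unknown routes default to keep-everything, because the
    # safe failure mode for a brand-new endpoint is "too expensive", not "missing evidence"
    for pattern, policy in ROUTE_POLICY.items():
        if fnmatch.fnmatch(path, pattern):
            return policy
    return {"sampler": "none"}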
Why per-route configuration is the real work, not the sampler implementation: a sampler is a 100-line Python script (the demo above is 50 lines of actual logic). The sampler itself is rarely where teams get stuck. The hard part is the per-route policy — discovering which routes are noisy heartbeats, which are debug-relevant, which are dispute-relevant; setting a budget for each; and revisiting the budget every quarter as the route mix changes. New endpoints arrive, old endpoints are deprecated, traffic shifts between routes, and the per-route table needs to be a living document with an owner and a review cadence. Teams that treat sampling as "set once, forget" find their kept-corpus rotting over time — old high-value routes get sampled at the legacy rate after they grow 10x, new routes default to "keep everything" because nobody added them to the table. The sampler is the easy part; the policy table is the discipline.
The operational rule of thumb that emerges from teams who have run sampled log pipelines for a year or more: the kept-rate per route is a tunable SLO, not a fixed knob. The platform team sets a global budget (e.g., "ship no more than 1.2 TB/day to Loki"), the sampler reports per-route kept-rates as metrics, and a weekly review checks that the rates are landing where the budget allocates them. When a route's kept-rate drifts (a deploy 3x'd its INFO volume; a new feature added a verbose debug line that nobody catalogued), the review surfaces it and the per-route rule gets updated. This is the same discipline as managing any other capacity-bound resource — request quotas, cardinality budgets, alert thresholds — and the tooling to support it is identical (rate metrics, multi-window alerts, an owner per route).
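A sketch of the instrumentation that makes that weekly review possible, assuming prometheus-client; the metric names are illustrative. The kept-rate per route then becomes an ordinary ratio query that the review (or a multi-window drift alert) reads.
# kept-rate as a per-route SLO: instrument the sampler, alert on drift
# (sketch; assumes `pip install prometheus-client`; metric names are illustrative)
from prometheus_client import Counter

LINES_SEEN = Counter("log_sampler_lines_seen_total", "Lines offered to the sampler", ["route"])
LINES_KEPT = Counter("log_sampler_lines_kept_total", "Lines the sampler kept", ["route"])

def observe(ln, kept: bool) -> None:
    LINES_SEEN.labels(route=ln.path).inc()
    if kept:
        LINES_KEPT.labels(route=ln.path).inc()

# The review query is then a plain ratio, e.g. in PromQL:
#   sum by (route) (rate(log_sampler_lines_kept_total[1h]))
#     / sum by (route) (rate(log_sampler_lines_seen_total[1h]))
# compared against the budget column in the per-route policy table.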
A specific anti-pattern worth naming: the all-INFO sampler at the root of the tree. New teams setting up sampling for the first time often reach for "drop 95% of INFO globally" as a one-line rule because it is easy to reason about and easy to ship. The rule cuts log volume satisfyingly fast — typical workloads see a 60-70% drop in shipped bytes from this single change — but produces a corpus that is biased in a way that is invisible until the first incident. INFO lines from low-volume high-value routes (admin actions, rare configuration changes, business-process state transitions) are dropped at the same 95% rate as INFO lines from /healthz, even though the per-line value is hundreds of times higher. The dispute support team queries an admin-action log six weeks later and finds nothing because their 1-in-20 admin action was sampled out by a global rule designed for heartbeat traffic. The fix is to never apply head-sampling globally — apply it per-route or per-pattern, with the noisiest paths called out by name. The cost of writing the per-route rules is a few hours; the cost of a flat-rate sampler that hides admin-action evidence is one regulatory escalation the first time it bites.
Common confusions
- "Logs and traces should be sampled with the same rules." Logs and traces have different completion signals and different per-record values. A trace has a built-in end-of-request signal (root span ends), so tail-sampling traces is cheap. A log line has no such signal — the sampler has to either guess (look for "completed"/"failed" message patterns) or use a TTL, both of which are slower and noisier than the trace's explicit end. Logs also carry more per-record context than spans (a log line is typically 500-2000 bytes; a span is 100-500), so the sampler's drop-cost is higher per record. A pipeline that uses identical rules for both is over-paying on one and under-sampling on the other.
- "Tail-sampling adds 30 seconds of latency to every log." It adds up to 30 seconds of latency in the worst case, when the request is long-running and never emits a completion line. For typical short requests (the 99% case in payments and webhook workloads) the tail decision fires within milliseconds of the request's final line, so kept logs ship within tens of milliseconds of emission. Latency-sensitive use cases (alerts that depend on log content) should use head-sampling for the alert-relevant lines and tail-sampling for everything else, not avoid tail-sampling globally.
- "100% error retention means I have full debugging context." Only if the sampler keeps the request, not just the error line. Head-sampling at 100%-of-ERROR gives you the error line and nothing else from the same request; the INFO/WARN context that explains why the error happened is gone. Tail-sampling at "any error in request → keep request" gives you the full breadcrumb. Engineers who say "we keep all errors" are usually describing head-style retention and are about to be surprised the next time they pull an error log to debug it.
- "Sampling is incompatible with audit/compliance requirements." Audit logs (transactions, payments-as-recorded, regulatory events) should not be sampled — they are different from telemetry logs and should flow through a separate pipeline with separate retention and separate guarantees. Conflating the two is a common architectural mistake; the fix is to route audit-relevant log streams (typically distinguished by a
pipeline=auditlabel or a separate logger) through an unsampled path. The vast majority of telemetry logs (debug, info, warn, even most error logs that are duplicates of metric-visible incidents) are sample-safe; the small fraction that has compliance value is identifiable and worth its own pipeline. - "If I set my keep-rate to 5%, I will keep 5% of bytes." Almost never. Log volume is heavily skewed — a few percent of patterns produce most of the bytes, and a handful of error patterns produce a few percent of bytes but most of the value. A 5% Bernoulli sampler over a workload with this shape produces a kept-volume that varies 2-3x over the day depending on which patterns happen to fire. Real samplers have to be configured per-pattern (not just per-level) to hit a target volume — the keep-rate is a guideline, not a guarantee, until you have measured how each rule interacts with your actual log distribution.
- "Head-sampling is good enough for cost; tail-sampling is for fancy debugging." Head-sampling alone usually gets you 60-75% of the cost reduction at the price of debugging-context loss; tail-sampling layered on top usually gets the remaining 20-35% with full context preserved. For most production fintechs, the cost of running both samplers is much smaller than the cost of running neither and paying the storage bill, so the comparison is not "head vs tail" but "do nothing vs do head vs do head + tail". Real teams converge on the third option after their first quarterly bill review.
Going deeper
The completion-signal problem and how OpenTelemetry's log-trace correlation helps
Tail-sampling for logs needs a way to know that a request is finished before the buffer can be flushed. The cleanest signal — better than message-pattern matching, better than TTL — is a trace-completion event. If every log line carries a trace_id and the agent also receives the corresponding distributed trace (with its known root-span end), the agent can flush the log buffer for a request as soon as the trace's root span ends. This is exactly the architecture OpenTelemetry's log-trace correlation enables: the OTel SDK injects trace_id into every log record (via loguru filters, Python logging's extra=, or an OTel LogHandler), the OTel Collector receives both the trace and the logs, and a tail-sampling processor that operates on the joint stream can make decisions on traces and propagate them to the correlated logs. The OTel tail_sampling_processor v1.20+ supports this via the decision_wait and decision_cache settings, and the major collector vendors (Honeycomb's Refinery, Grafana Agent's tail processor, Splunk OTel) ship variants of the same algorithm. The win is that log tail-sampling no longer pays a 30-second TTL — it pays a milliseconds-after-trace-end latency, which makes it usable for alerting paths. The cost is that you have to actually have distributed tracing rolled out with trace_id injection at every log site, which is a non-trivial integration effort but pays back in this and many other ways.
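A sketch of the producer side of that correlation, assuming the opentelemetry-api/-sdk packages: a stdlib logging filter stamps the current trace_id onto every record, which is the key the downstream tail-sampler joins on. The JSON formatter layout is illustrative.
# inject trace_id into every stdlib log record so logs can be tail-decided with the trace
# (sketch; assumes `pip install opentelemetry-sdk`; the formatter fields are illustrative)
import logging

from opentelemetry import trace

class TraceIdFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        ctx = trace.get_current_span().get_span_context()
        # 32-hex-char trace id, or "-" when the line is emitted outside any span
        record.trace_id = format(ctx.trace_id, "032x") if ctx.is_valid else "-"
        return True  # never suppress the record here; sampling happens downstream

handler = logging.StreamHandler()
handler.addFilter(TraceIdFilter())
handler.setFormatter(logging.Formatter('{"ts":"%(asctime)s","level":"%(levelname)s",'
                                       '"trace_id":"%(trace_id)s","msg":"%(message)s"}'))
logging.getLogger().addHandler(handler)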
Decision-rate skew under load — the IPL-final pathology
A subtle pathology of tail-sampling is that the decision rate — the rate at which the sampler observes completed requests and applies the keep-rule — depends on offered load. At low load, the buffer fills slowly, decisions fire promptly, kept-rate is steady. At high load (IPL final, BBD spike, IRCTC Tatkal at 10:00 IST), the buffer fills faster than decisions can fire, the eviction-by-TTL path activates, and requests that are still in-flight at TTL get a "no completion signal" decision instead of an "outcome" decision. The keep-rule has to be configured to handle TTL-decided requests sensibly — a common pattern is to keep TTL-decided requests at a higher rate (because "still running 30s in" is itself a signal of slowness), but a naive rule that drops TTL-decided requests will silently lose long-running requests during spikes, which is the opposite of what you want. The fix is to instrument the sampler with a per-decision-source metric (tail_sampler_decisions{source="completion"|"ttl"}) and alert on the ratio jumping. Hotstar's published 2024 IPL postmortem mentioned exactly this pathology: their tail-sampler's TTL-decision rate climbed from 2% to 48% during the final, and a third of slow-request traces were silently dropped because their TTL rule treated "no signal" as "OK". The fix was a one-line config change (treat_ttl_as_slow: true) but it took an incident to find.
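A sketch of what that looks like applied to the demo's flush_request: the decision source is counted, and TTL-evicted requests are treated as slow rather than as OK. The counter and the treat_ttl_as_slow flag mirror the prose and are illustrative.
# per-decision-source accounting for the tail sampler in the demo above
# (sketch; a standalone variant of flush_request; in production export decision_counts as a metric)
import collections
import random

decision_counts = collections.Counter()   # {"completion": n, "ttl": m}

def flush_request(rid, buf, kept, base_keep_rate=0.01, source="completion", treat_ttl_as_slow=True):
    req_lines = buf.pop(rid, [])
    if not req_lines:
        return
    decision_counts[source] += 1          # alert when ttl/completion ratio jumps
    any_error = any(l.level == "ERROR" for l in req_lines)
    slow = max(l.duration_ms for l in req_lines) > 1000
    # a request evicted by TTL had no completion signal — "still running 30s in" is
    # itself a slowness signal, so do not treat "no signal" as "OK"
    if source == "ttl" and treat_ttl_as_slow:
        slow = True
    if any_error or slow or random.random() < base_keep_rate:
        kept.extend(req_lines)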
Sampling and coordinated omission, applied to logs
The latency literature warns about coordinated omission — the tendency of load generators to skip requests that would have been issued during a slow period, biasing the resulting latency histogram toward the fast tail. The same effect exists for logs: a sampler that decides at emission time and a sender that backs off under pressure produce a kept-corpus that systematically underrepresents slow-period log lines. If the agent uses disk_v2 buffering and the buffer overflows during a spike, the records dropped are typically the newest (LIFO drop) or oldest (FIFO drop) but never the "least valuable", and the resulting corpus has a hole in time precisely where the engineer wants the most data. The defensive pattern is to use always-keep-on-pressure sampling — when the agent's buffer crosses a high-water mark, the sampler temporarily upshifts to "keep 100% of WARN/ERROR" rather than downshifts to "drop more INFO", because the value of slow-period records is much higher than fast-period records. This is the log-pipeline analog of wrk2's constant-throughput design and the latency-and-tail sampling discipline applies the same logic to logs. Most off-the-shelf log agents do not do this by default — Vector and Fluentd both drop newest-first under pressure, which is the cheap default, not the right default — so the team has to configure it explicitly or live with the bias.
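A sketch of pressure-aware upshifting; the high-water mark and rates are illustrative, and in a real agent buffer_fill would come from the buffer's own fill gauge rather than a function argument.
# always-keep-on-pressure: under buffer pressure, upshift severity retention instead of
# dropping blindly (sketch; the high-water mark and rates are illustrative)
import random

def pressure_aware_keep(ln, buffer_fill: float, rate_info: float = 0.05) -> bool:
    """buffer_fill is the agent buffer's fill ratio, 0.0 (empty) to 1.0 (full)."""
    if buffer_fill > 0.8:
        # pressure usually means the system is misbehaving — exactly when WARN/ERROR
        # are most valuable, so keep all of them and shed INFO harder instead
        if ln.level in ("WARN", "ERROR"):
            return True
        return random.random() < rate_info * 0.1   # shed INFO 10x harder under pressure
    # normal operation: the ordinary severity-based head rule
    if ln.level in ("WARN", "ERROR"):
        return True
    return random.random() < rate_info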
Tail-sampling state cost and the K-of-N approximation
A naive tail-sampler buffers every line of every in-flight request. For a service with R requests-per-second, average L lines-per-request, and B bytes-per-line, the steady-state buffer cost is R × L × B × TTL bytes. At R=10,000, L=15, B=1KB, TTL=30s, the buffer is 4.5 GB. For very-high-volume services (Hotstar's IPL frontend, Razorpay's UPI-acquirer fleet), the buffer becomes the dominant memory cost of the agent and starts pushing it past pod-memory budgets. Two patterns exist for cutting this cost. First, sketch-based sampling: instead of buffering full records, the agent buffers a fingerprint (the request_id and a count of lines so far) and decides whether to re-fetch the full lines from a short-retention buffer (a Kafka topic, a local SQLite) at decision time. This converts the in-memory buffer to a sketch + lookup against a cheaper store. Second, K-of-N approximate sampling: instead of "keep the full request if any line is an error", "keep the request if at least K of the first N lines are error/warn", with K and N chosen so the approximation matches outcome-based decisions at 95-99% fidelity. K-of-N gives up some precision (a request that errors only on its final line might be missed if the first N lines are clean) for a fixed-size-per-request buffer cost (N × B bytes regardless of request length). Both patterns are tactical optimisations that matter at scale and don't matter at small scale; the small-scale answer is "just buffer everything for 30 seconds" and it is fine until ~10 GB of in-memory state.
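A sketch of the K-of-N idea with bounded per-request state; K, N, and the keep rule are illustrative and would be tuned against replayed production traffic until the decisions track the full-buffer sampler at the fidelity the team needs.
# K-of-N approximate tail sampling: bounded per-request state instead of full buffering
# (sketch; K and N are illustrative)
def k_of_n_decisions(lines, k: int = 1, n: int = 4):
    """Decide keep/drop per request from at most the first n lines' severities."""
    window: dict[str, list[str]] = {}             # request_id -> first n severities only
    decided: dict[str, bool] = {}
    for ln in lines:
        if ln.request_id in decided:
            continue                               # already decided; later lines cost nothing
        w = window.setdefault(ln.request_id, [])
        if len(w) < n:
            w.append(ln.level)
        bad = sum(1 for lvl in w if lvl in ("WARN", "ERROR"))
        if bad >= k:
            decided[ln.request_id] = True          # keep — no need to see the rest
            window.pop(ln.request_id, None)
        elif len(w) == n:
            decided[ln.request_id] = False         # drop — the first n lines looked clean
            window.pop(ln.request_id, None)
    # requests still undecided at stream end have seen fewer than n lines and fewer than
    # k bad ones; default them to keep, the conservative side of the approximation
    for rid in window:
        decided[rid] = True
    return decided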
When to not sample — the irreducible-volume floor
Some log streams should not be sampled at any rate. Audit/compliance logs (every payment transaction, every login, every privileged action) need 100% retention with cryptographic chaining for non-repudiation; sampling them is illegal in most jurisdictions for fintech workloads. Security logs (auth failures, suspicious patterns flagged by WAF/IDS) need 100% retention because the frequency of rare events is itself the signal — sampling smooths the distribution and hides attack patterns. Billing-relevant logs (anything that touches amount_paise and could result in a customer dispute) need 100% retention for the duration of the dispute window, typically 6-12 months. The discipline is to identify these streams separately from telemetry logs, route them through a dedicated pipeline with its own retention and its own SLA, and make sure no head/tail sampler in the telemetry pipeline ever touches them. The architectural mistake of running everything through one sampler — and discovering during a regulatory audit that some compliance log was kept at 5% — is why mature fintechs run two parallel log pipelines. The cost of the second pipeline is much smaller than the cost of explaining to RBI why an auditable record was discarded.
The boundary between "telemetry log" and "compliance log" is rarely as clean as the architecture diagrams suggest. A payment_completed line is telemetry from the platform team's perspective and compliance from the finance team's perspective; a user_logged_in line is telemetry from SRE's perspective and security-relevant from infosec's perspective. The pattern that scales is to tag every emission with its retention class at the producer (pipeline=audit, pipeline=security, pipeline=telemetry) and route on that tag at the agent. The telemetry sampler operates only on records with pipeline=telemetry; the audit pipeline sees everything tagged pipeline=audit at 100% retention with cryptographic chaining; the security pipeline gets pipeline=security at 100% with longer hot-tier retention. The tag is the enforcement primitive — without it, the architecture's "we don't sample compliance logs" claim depends on every developer remembering which logger to use, and developers reliably forget. With it, the routing is mechanical and the auditor's question "show me how compliance logs cannot reach the sampler" has a one-line answer (the agent's routing config rejects the tag).
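A sketch of that routing split as it would look at the agent, using the pipeline tag from the prose; the sink names are illustrative. The telemetry sampler only ever sees records the router sends it.
# route on the retention-class tag before any sampler sees the record
# (sketch; the `pipeline` field mirrors the prose; sink names are illustrative)
def route(record: dict) -> str:
    pipeline = record.get("pipeline", "telemetry")   # untagged records default to telemetry
    if pipeline == "audit":
        return "audit_sink"        # 100% retention, cryptographic chaining, no sampler
    if pipeline == "security":
        return "security_sink"     # 100% retention, longer hot-tier retention, no sampler
    return "telemetry_sampler"     # the only path where head/tail sampling is allowed

# The auditor's one-line answer to "how do compliance logs avoid the sampler?" is this
# routing function (or its Vector / OTel-collector equivalent) — not developer discipline.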
# Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
# (no extra packages — stdlib only)
python3 log_sampler_demo.py
# Expected: head sampler keeps ~6% of lines and 0% of error-requests with full
# context; tail sampler keeps ~8% of lines and 100% of error-requests with full
# context. The 1.8 percentage-point difference in kept-volume buys complete
# debugging context on every error path.
Where this leads next
- JSON logs and schema drift — the previous chapter, on the producer-side contract that makes structured logs queryable in the first place. Sampling is the cost-side counterpart: schema discipline keeps the kept records useful, sampling decides which records are worth keeping.
- Cardinality: the master variable — sampling and cardinality control are the two main levers on log pipeline cost; both work by deciding what to keep, just on different axes (records vs label-value combinations). The frameworks for setting per-route rates and detecting drift in kept-rate apply identically.
- Wall: logs are the oldest pillar and the most abused — the broader argument that the default of "keep everything forever" is the central log-pipeline pathology. This chapter is the practical answer to "okay, what do we keep instead?"; the next chapters in Part 3 cover LogQL, retention tiers, and the routing rules that make a tiered sampler-aware pipeline shippable.
- Structured vs unstructured logging — the prerequisite that makes per-route and per-pattern sampling rules expressible. Unstructured logs cannot be sampled by route or by reason because the sampler cannot parse them; the sampler-friendliness of structured logs is one of the unmodelled benefits of the structured-logging discipline.
The next chapters in this section move from "what to keep" to "how to query what you kept" — LogQL's grammar, the label-vs-content split, and the latency-and-cost characteristics of different query shapes. Sampling decisions interact with query shapes in subtle ways: a tail-sampled corpus with full per-request context lets | json | request_id="R-739" return everything you need; a head-sampled corpus with the same query returns one ERROR line and a frustrated engineer. The query language is the consumer the sampler is writing for.
The chapters after that move into Part 4's retention-tier design — how the kept-corpus is laid out across hot, warm, and cold storage, how queries are routed across tiers, and how the sampler's keep-rate interacts with retention duration to produce the bill the platform team actually sees. Sampling is the "what to keep" decision; retention is the "for how long to keep what we kept" decision; together they are the two main control surfaces on log-pipeline cost. Treating them as one decision (a single global "keep 5% for 30 days" rule) is the configuration mistake that produces bills nobody can explain; treating them as two separate decisions with separate per-route policies is the discipline that produces a pipeline whose cost shape matches the team's actual debugging needs.
The implicit message running through the whole part: a log pipeline is a system that is being designed, not a default that is being accepted. Every line that is shipped, kept, queried, and retained represents a choice — about value, about cost, about the corpus's shape. The teams that run their pipelines well make those choices explicitly, name them, measure them, and revisit them; the teams whose bills surprise them have made the choices implicitly by accepting the defaults of whatever shipped with the application framework. Sampling is the most visible of those choices because it is the one with the largest immediate cost lever, but the deeper discipline is making every part of the pipeline a thing that the team owns rather than a thing that the team inherited.
References
- OpenTelemetry — Tail Sampling Processor — the canonical implementation of trace-and-log tail-sampling at the collector level; documents the policies (probabilistic, latency, status_code, string_attribute) and the decision_wait knob.
- Honeycomb — Refinery: Sampling at Honeycomb's Scale — production tail-sampler used at large scale; explains the "dynamic sampler" rate-target algorithm and the trade between buffer size and decision latency.
- Vector — Sample transform — head-sampler with hash-stable keys (key_field); the closest off-the-shelf primitive for application-side and agent-side head-sampling.
- Cindy Sridharan, Distributed Systems Observability (O'Reilly, 2018), Ch. 4 — the foundational chapter on sampling for traces and logs; introduces the head-vs-tail distinction and the per-route-rate framework this chapter follows.
- Tene, "How NOT to Measure Latency" — the coordinated-omission talk; the underlying argument extends to log sampling under buffer pressure, which is the always-keep-on-pressure pattern in §"Going deeper".
- Charity Majors, Liz Fong-Jones, George Miranda, Observability Engineering (O'Reilly, 2022), Ch. 17 — the wide-events-and-sampling chapter; argues for tail-sampling as the default and head-sampling as the exception, with the full operational checklist.
- Grafana — Loki retention and sampling — Loki's perspective on retention vs sampling vs cardinality, the three knobs that determine the bill.
- JSON logs and schema drift — internal chapter on the producer-side contract that the kept records depend on.