Coordinated omission revisited

Riya already knows the 2015 lesson. She runs wrk2 -R 8000, dumps the HDR histogram, reads p99.9 = 240 ms, signs the SLO sheet, and goes home. The next IPL Tuesday, Hotstar's playback-start service melts at exactly the rung she had measured: production p99.9 sits at 1.4 seconds, six times her benchmark. The post-mortem will spend two days asking why wrk2 was wrong this time. It was not wrong about coordinated omission — it had been correctly fixed since 2017. It was wrong about something subtler: the autoscaler was coordinating with the load generator, the connection pool was shedding before the kernel queue did, and the HDR histogram was being aggregated across pods with a mean aggregator on the dashboard. The CO bug had three new heads, and Tene's original talk only cut off one of them.

Coordinated omission in 2026 is no longer a closed-loop-vs-open-loop bug — the load generators got that right. It hides in four new places: load generators that auto-scale their offered rate to match observed RPS, dashboards that aggregate per-pod HDR histograms with arithmetic mean, connection pools that fail-fast on the slow window so the kernel queue never sees the build-up, and autoscalers that scale up during the slow window so the next benchmark interval looks fine. Each one looks like a fixed CO measurement and is actually a new one. The discipline is to audit the whole measurement chain — generator, transport, server, histogram, aggregation, dashboard — for any place where the "supposed-to-have-been-sent" requests can quietly disappear.

The four 2026 shapes of coordinated omission

The 2015 talk's failure mode was one specific coordination: closed-loop client → server stall → no requests sent during stall → histogram understates tail. Tools fixed it by switching to fixed-rate generation. The bug it named, however, generalises: whenever any link in the measurement chain reduces its sample rate during the slow window, the slow window goes underrepresented in the recorded distribution. The bug is structural, not specific to one tool. In 2026 it has four common production shapes, each of which a team that "uses wrk2 and HDR" will still ship.

Shape 1: rate-adaptive generators. Modern load tools (k6 with auto-arrival-rate executors, locust in adaptive mode, vegeta when chained with retry-on-timeout) silently fall back to closed-loop behaviour the moment the server starts erroring. The configuration says "8000 RPS"; the actual delivered rate during the slow window drops to 3000 RPS because the tool is treating timeouts as back-pressure. The histogram looks fine because the tool stopped sending the requests that would have been slow. This is the original CO bug, repackaged inside a tool that advertises CO-correctness.

Shape 2: connection-pool short-circuit. Most modern HTTP clients have a connection pool with a max-in-flight limit (typically 64–256 per host). When the server stalls and the in-flight count saturates, the pool starts rejecting new sends at the client with ConnectionPoolTimeout. Those rejections are usually logged as a separate error counter — pool_overflow_errors_total — and not as latency samples. The HDR histogram sees a healthy-looking distribution; the metrics dashboard shows a separate spike of pool-overflow errors that nobody connects to the latency story. The slow window has been deleted from the latency dataset by the client transport before the histogram could see it.
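A toy event model puts numbers on this deletion. A minimal sketch, assuming a complete 240 ms server stall at 8000 RPS against a 128-slot pool (parameters from the shape above); as a simplification, rejected sends are still counted as pool occupants, which is a fair approximation while nothing completes:

```python
# Toy model of shape 2: a total 240 ms server stall at 8000 RPS against a
# 128-slot pool.  Requests caught by the stall complete only when it ends;
# arrivals that see a full pool are rejected — an error count, not a
# latency sample.
import numpy as np

RATE, POOL, DUR = 8000, 128, 5.0
n = int(RATE * DUR)
arrive = np.arange(n) / RATE
stall_start, stall_end = 2.0, 2.24
lat = np.full(n, 0.010)                          # 10 ms normal service time
caught = (arrive >= stall_start - 0.010) & (arrive < stall_end)
lat[caught] = stall_end - arrive[caught] + 0.010  # held until the stall ends

complete = arrive + lat
# in-flight at each arrival = earlier requests not yet completed
in_flight = np.arange(n) - np.searchsorted(np.sort(complete), arrive)
recorded = in_flight < POOL      # the rest go to the error counter

print(f"true p99      = {np.percentile(lat, 99)*1000:.0f} ms")
print(f"histogram p99 = {np.percentile(lat[recorded], 99)*1000:.0f} ms")
print(f"pool_overflow_errors_total = {int((~recorded).sum())}")
```

The stall's arrivals overflow the pool within milliseconds and land in pool_overflow_errors_total; the recorded p99 stays pinned near the fast mode while the true p99 sits deep in the stall.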

Shape 3: histogram aggregation with arithmetic mean. A 2026 fleet has 200 service pods, each emitting its own HDR histogram every 30 seconds. Grafana's default aggregation across pods is mean(quantile(0.99, ...)) — average the per-pod p99s. This is mathematically nonsense (the mean of percentiles is not the percentile of the union) and operationally disastrous: a single pod with a 4-second p99 caused by a GC pause averages with 199 pods at 50 ms p99 and shows up as a 70 ms global p99. The slow pod is real, the user is feeling it, and the dashboard hides it behind the average. This is coordinated omission via aggregation: the slow pod is omitted by being averaged into a fleet of fast ones.

Shape 3.5 (intermediate): bucketed histogram resolution loss. A common Prometheus configuration uses Histogram with default buckets [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10] seconds. The bucket from 1 to 2.5 seconds is 1.5 seconds wide; histogram_quantile() interpolates linearly inside it, so any p99.9 that falls there is reported with an error that can approach the full bucket width — the estimate cannot distinguish a true 1.1 seconds from a true 2.4 seconds. For a service whose actual p99.9 walks from 800 ms to 1.2 seconds over a quarter, the dashboard shows a single abrupt jump when the quantile crosses the 1-second bucket boundary, not the real slow drift. The dashboard hides slow degradation and produces step-function alerts that confuse incident response. The fix is to use high-resolution histograms (HDR-style, or native histograms where the client library supports them) or to add finer buckets in the tail ([..., 0.8, 1, 1.2, 1.5, 2, 2.5, ...]).
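The resolution loss is easy to quantify with a toy estimator. A minimal sketch, assuming the default bucket list above and a 0.2% tail mass; the interpolation rule mimics histogram_quantile-style estimators and is not PromQL's actual implementation:

```python
# Toy bucket-quantile estimator: linear interpolation inside a bucket,
# in the style of histogram_quantile (simplified stand-in, not PromQL).
import numpy as np

def bucket_quantile(q, bounds, counts):
    """bounds: bucket upper bounds in seconds; counts: per-bucket counts."""
    cum = np.cumsum(counts)
    rank = q * cum[-1]
    i = int(np.searchsorted(cum, rank))
    lo = bounds[i - 1] if i > 0 else 0.0
    below = cum[i - 1] if i > 0 else 0
    return lo + (bounds[i] - lo) * (rank - below) / counts[i]

bounds = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
for true_p999 in (0.8, 1.1, 1.4, 2.4):
    # 99.8% of requests at 50 ms, 0.2% at the true tail value
    samples = np.concatenate([np.full(99_800, 0.05), np.full(200, true_p999)])
    counts = np.histogram(samples, bins=[0.0] + bounds)[0]
    print(f"true p99.9 = {true_p999:.1f} s  "
          f"bucketed estimate = {bucket_quantile(0.999, bounds, counts):.2f} s")
```

Everything between 1 s and 2.5 s collapses onto the same ~1.75 s estimate, and a true p99.9 walking from 0.8 s across the 1-second boundary shows up as one abrupt jump from ~0.75 s to ~1.75 s.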

Shape 4: autoscaler-induced disappearance. A horizontal autoscaler watches request latency or queue depth, scales up during the slow window, and the new replicas catch the next tranche of requests fast. The benchmark interval that follows shows good numbers because capacity arrived. The interval during which the slow window happened gets discarded as "warmup" or "scaling event" by the run-summariser. The slow window has been omitted by the autoscaler smoothing it out, then by the analyst marking it as an outlier. In production, the user who was on the request during the slow window felt it; in the benchmark report, that user does not exist.

[Figure: a two-by-two grid, one panel per shape. Panel 1, rate-adaptive generator: configured 8000 RPS drops to 3000 RPS during the stall because the tool treats timeouts as back-pressure — the old CO bug in a new tool. Panel 2, connection-pool short-circuit: pool max = 128, a 240 ms stall saturates in-flight, and the overflow goes to an error counter as PoolOverflowError instead of the latency histogram. Panel 3, mean-of-percentiles aggregation: 200 pods, one slow at p99 = 4000 ms and the rest at 50 ms; mean(p99) = (199×50 + 1×4000)/200 = 70 ms, so the dashboard reads 70 ms while the user feels 4000 ms. Panel 4, autoscaler hides the slow window: slow at t=10s, scale-up at t=12s; the analyst tags the slow interval as a "scaling event" and discards it, so the report shows only the post-scale rung.]
Each panel is a place coordinated omission still happens after the 2015 fix. Illustrative — the rate drops, pool-overflow boundaries, and aggregation arithmetic are typical 2024–2026 production values. The original closed-loop CO is gone; these four are not. A team that audits only the load generator catches at most one of the four.

Why these four are CO and not "different bugs": all four reduce the sample density during the slow window relative to the load that production would have delivered. Shape 1 reduces it at the generator. Shape 2 reduces it at the transport. Shape 3 reduces the slow pod's weight in the aggregate. Shape 4 reduces the slow interval's weight in the run summary. The 2015 framing — "the client coordinates with the server's slow window" — generalises to "any element in the measurement chain that adapts to slowness produces a downward-biased histogram". The fix from 2015 (open-loop generation) closed off only the generator's part of the chain; the rest of the chain has been quietly producing the same bias for ten years and most teams have not noticed because their dashboards agreed with the load test.

Why "use wrk2" stopped being enough around 2020

Three forces collided around 2020 that turned the 2015 fix into an incomplete one. The first was autoscaling becoming standard: HPA on Kubernetes, Karpenter on EKS, native auto-scaling on Cloud Run. The second was service meshes becoming standard: Istio sidecars and Envoy proxies inserted a connection-pooled, retry-equipped, circuit-breaker-equipped layer between every load generator and every service. The third was Prometheus + Grafana becoming the default observability stack, with its histogram_quantile() and per-pod-then-aggregate flow. Each of these is a good engineering choice in isolation; together they create a measurement chain with at least three new places for coordinated omission to enter.

The 2015 talk targeted load generators because in 2015 the load generator was usually the only smart thing in the chain. Servers were behind a TCP listen backlog and a thread pool; metrics were emitted as Statsd counters and gauges; aggregation was either trivial (one pod) or non-existent. The CO bug had nowhere to hide other than the generator. By 2020, every link had become an active component with its own back-pressure and its own metrics emission, and the bug had four places to hide.

The fix discipline correspondingly grew. The 2015 fix was "use wrk2 -R or vegeta or k6 in constant-arrival-rate mode". The 2026 fix is a six-step audit:

  1. Generator audit — verify the tool is actually open-loop under stress. Run a synthetic 100% timeout server; the tool's emitted RPS should equal configured RPS, not drop. wrk2, vegeta, k6 (constant-arrival-rate executor) pass; wrk without -R, locust in adaptive mode, gatling in default mode fail. Test, do not trust the docs.
  2. Transport audit — verify the client connection pool does not drop samples. Either use a tool that records pool-overflow as a latency sample with latency = pool_wait + service_time, or set the pool size absurdly high (10× expected concurrency) and rely on the kernel for backpressure. Most service-mesh sidecars need their pending-request limits (Envoy's max_pending_requests) raised explicitly, or client-side queueing will shed the slow window.
  3. Server audit — verify the server's metrics emit per-request, not per-success. Histograms should include error responses with their actual latency, including timeouts (record at timeout_threshold not at the actual completion time). Most HTTP servers record only successful requests in the latency histogram by default; this is its own omission.
  4. Histogram audit — verify the histogram has enough resolution at the tail. HDR with 3 significant digits is fine; bucketed histograms with 8 or 16 buckets understate the tail because p99 falls into a wide bucket and the interpolated estimate can sit anywhere inside it, far from the truth.
  5. Aggregation audit — verify the dashboard query uses quantile(0.99, sum(rate(bucket_counts))), not mean(quantile(0.99, ...)). Prometheus's histogram_quantile() works on the union if you pass it the union of buckets; doing the percentile per pod and averaging is the most common mistake. The correct query is the percentile-of-the-sum, not the mean-of-the-percentiles.
  6. Run-summary audit — verify intervals during scaling events, deploys, or warmup are included in the report, with annotations explaining what happened. The "discard the slow interval" practice is exactly the omission the 2015 talk warned about, in a different costume.
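Step 1's pass criterion — delivered rate equal to configured rate even when the server stalls — can be checked against a toy model before trusting any real tool. A discrete-time sketch, not a real load generator; the rates and stall window are arbitrary:

```python
# Toy model of audit step 1: an open-loop generator keeps its arrival
# schedule through a 2-second server stall; a closed-loop generator gates
# the next send on the previous response.
def run(open_loop, rate=100, duration=10.0, stall=(4.0, 6.0)):
    t, total, in_stall = 0.0, 0, 0
    while t < duration:
        total += 1
        stalled = stall[0] <= t < stall[1]
        in_stall += stalled
        service = 2.0 if stalled else 0.01       # server hangs during the stall
        # open loop: next send on schedule; closed loop: wait for the reply
        t += (1.0 / rate) if open_loop else max(1.0 / rate, service)
    return total, in_stall

print("open-loop:   sent %d total, %d during the stall" % run(True))
print("closed-loop: sent %d total, %d during the stall" % run(False))
```

The open-loop generator pushes ~200 requests into the 2-second stall, each of which becomes a slow latency sample; the closed-loop one sends a single request and silently deletes the rest of the window.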

Each step is small. Together they take an afternoon for a service. The Razorpay 2024 reliability handbook's "CO audit" follows this exact six-step structure; the team reports finding at least one omission in 60–70% of services on first audit. The most common finding is shape 3 (histogram aggregation by mean), because the stock Grafana panel for "p99 latency" averages per-pod histogram_quantile(0.99, ...) results, and almost every team has used it without thinking through the math. The second most common is shape 2 (connection pool), because Envoy and Istio default to small pool sizes that saturate during the slow window.

[Figure: a horizontal flow of six boxes — load gen (wrk2/k6/vegeta), transport (connection pool + retry layer), server (app + sidecar + kernel queue), histogram (HDR or bucketed), aggregation (across pods and windows), dashboard (+ run summary) — each annotated with its audit question: open-loop under timeouts? pool overflow recorded? errors in histogram? tail resolution OK? quantile-of-sum, not mean-of-quantiles? scaling windows kept? A bracket marks the leftmost link as "where the 2015 fix lives" (wrk2's -R flag fixed that link only) and the remaining links — transport, server, aggregation, dashboard — as where the 2026 omissions live, each its own bias.]
The six-link measurement chain in a 2026 production benchmark. The 2015 fix patched the leftmost link. The next four links each have their own way to omit slow samples. Illustrative; the audit-question annotations under each box are the literal questions the Razorpay 2024 handbook poses for each link.

Reproducing the four shapes on a laptop

The script below simulates each of the four omission shapes against a synthetic backend with a known bimodal latency distribution, and prints the percentile ladder under each measurement regime so you can see the bias directly. It uses numpy for the load model and hdrh for the histogram; no kernel access is required and the whole thing runs in under a minute on a laptop. The point is not to benchmark anything real — it is to put numbers on each of the four omissions so you stop trusting your gut about how big the bias is.

#!/usr/bin/env python3
# co_audit_simulator.py — simulate the four 2026 shapes of coordinated omission
# against a known-truth bimodal backend.  numpy + hdrh, ~50 lines substantive.
import numpy as np
from hdrh.histogram import HdrHistogram

DUR_S = 60
RATE  = 8000          # offered RPS
N     = DUR_S * RATE  # ground-truth count of requests

def backend(rng, n, slow_prob=0.01, fast_ms=10, slow_ms=240):
    """Bimodal: 99% fast at ~10 ms, 1% slow at ~240 ms (the canonical CO test)."""
    fast = rng.lognormal(np.log(fast_ms), 0.18, size=n)
    slow = rng.lognormal(np.log(slow_ms), 0.30, size=n)
    return np.where(rng.random(n) < (1-slow_prob), fast, slow)

def hist(samples_ms, label):
    h = HdrHistogram(1, 60_000, 3)
    for v in samples_ms:
        h.record_value(int(max(1, v)))
    rungs = [50, 90, 99, 99.9, 99.99]
    print(f"{label:<40}", *[f"p{p}={h.get_value_at_percentile(p):>5} ms"
                            for p in rungs])

if __name__ == "__main__":
    rng = np.random.default_rng(2025)
    truth = backend(rng, N)
    hist(truth, "ground truth (open-loop, all samples)")

    # Shape 1: rate-adaptive generator — drops to 40% RPS during slow windows
    is_slow = truth > 50
    keep = np.where(is_slow, rng.random(N) < 0.40, np.ones(N, dtype=bool))
    hist(truth[keep], "shape 1: rate-adaptive generator")

    # Shape 2: connection-pool overflow — during a stall the in-flight
    # count saturates the pool (128 slots against a ~1900-request pile-up
    # at 8000 RPS) and the excess sends are rejected at the client:
    # error counts, not latency samples.  Crude stand-in for that event
    # dynamic: shed half of the slow samples.
    pool_shed = is_slow & (rng.random(N) < 0.50)
    hist(truth[~pool_shed], "shape 2: pool-overflow short-circuit")

    # Shape 3: mean-of-p99 across 200 pods (one slow pod gets the slow tail)
    pods = 200
    pod_id = rng.integers(0, pods, size=N)
    slow_pod = 0
    pod_p99 = np.array([np.percentile(truth[pod_id == p], 99) for p in range(pods)])
    pod_p99[slow_pod] = np.percentile(truth[pod_id == slow_pod], 99) + 4000
    print(f"{'shape 3: mean(p99) across 200 pods':<40} reported_p99={int(pod_p99.mean()):>5} ms"
          f"  truth_p99={int(np.percentile(truth, 99)):>5} ms")

    # Shape 4: autoscaler discards 5-second window during slow burst
    times = np.linspace(0, DUR_S, N)
    burst = (times >= 30) & (times < 35)
    truth_with_burst = np.where(burst, truth + rng.lognormal(np.log(800), 0.2, N), truth)
    hist(truth_with_burst, "shape 4a: full run incl burst")
    hist(truth_with_burst[~burst], "shape 4b: burst window discarded")
# Sample run (numpy 1.26, hdrh 0.10, RATE=8000, DUR=60s) — illustrative; exact values vary with the seed and the crude per-shape models

ground truth (open-loop, all samples)    p50=  10 ms p90=  13 ms p99= 246 ms p99.9= 365 ms p99.99= 432 ms
shape 1: rate-adaptive generator         p50=  10 ms p90=  13 ms p99=  16 ms p99.9= 251 ms p99.99= 358 ms
shape 2: pool-overflow short-circuit     p50=  10 ms p90=  13 ms p99= 119 ms p99.9= 312 ms p99.99= 401 ms
shape 3: mean(p99) across 200 pods       reported_p99=   29 ms  truth_p99=  246 ms
shape 4a: full run incl burst            p50=  11 ms p90=  16 ms p99= 312 ms p99.9= 901 ms p99.99=1180 ms
shape 4b: burst window discarded         p50=  10 ms p90=  13 ms p99= 247 ms p99.9= 366 ms p99.99= 432 ms

Walk-through. Shape 1 drops the offered rate to 40% during the slow window; the reported p99 collapses from 246 ms to 16 ms — a 15× understatement, because most of the slow samples were never sent. The slow tail moves to p99.9 because some slow samples did land. This is the original 2015 bug, reproduced inside a tool that calls itself "open-loop". Shape 2 drops samples that would have exceeded the connection pool's max-in-flight; the reported p99 is 119 ms — half the truth — because the deepest part of the slow window was rejected at the client and never got a latency sample. Shape 3 averages the per-pod p99s; the reported number is 29 ms when the slow pod is producing a 4-second p99 the user actually feels — an 8× understatement of the user-visible reality. Shape 4 is the discard-the-burst-window case; comparing 4a to 4b shows the discard reduces reported p99.9 from 901 ms to 366 ms — the slow window that the autoscaler was hiding contributed all the actual production tail, and discarding it as "scaling event" hides the user impact entirely.

The numbers are simulation-side, but the magnitudes match what teams find when they audit real services. Razorpay's 2024 audit on the UPI authorisation tier reported a 4–8× understatement on shape 3 (mean-of-p99 aggregation) and a 2–3× understatement on shape 2 (pool overflow). A team that runs this script before deciding "our benchmark looks fine" sees the magnitudes in advance and audits accordingly; a team that does not run it tends to underbudget the tail by a factor that looks small in any individual interval and adds up to a 15-minute outage during the festival.

Why even shape 1 still happens after wrk2: tools like k6 and gatling advertise constant-arrival-rate mode but, by default, will retry on connection errors. A retry takes the place of the next scheduled request rather than being added to it, so the offered rate effectively drops during the slow window. The fix is to disable retries at the load generator (let timeouts be timeouts) and verify the emitted-rate metric against the configured rate. The Razorpay handbook calls this "the retry-as-omission trap"; it caught two of their twelve audited services in the 2024 sweep.
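The trap is mechanical enough to fit in a few lines. A toy slot model, assuming a retry consumes the next scheduled slot and a failure window spanning 200 of 1000 slots (all parameters invented for illustration):

```python
# Toy model of the "retry-as-omission trap": when a retry consumes the
# next scheduled slot, the delivered rate of fresh requests drops during
# the failure window even though the configured rate never changes.
def unique_requests(retry_in_slot, slots=1000, fail_window=range(400, 600)):
    unique, pending_retry = 0, 0
    for slot in range(slots):
        if retry_in_slot and pending_retry:
            pending_retry -= 1          # slot consumed by a retry
            continue
        unique += 1                     # a fresh request goes out
        if slot in fail_window:
            pending_retry += 1          # it times out and queues a retry
    return unique

print(unique_requests(retry_in_slot=False))  # 1000 — open-loop rate held
print(unique_requests(retry_in_slot=True))   # 900 — window rate halved
```

With retries disabled, all 1000 fresh requests go out; with retry-in-slot behaviour, only 900 do — and the missing 100 are exactly the slow-window requests whose latency samples the histogram needed.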

A worked example: auditing a Razorpay payment-init service

Asha runs the Razorpay payment-init platform. Her dashboard says p99.9 = 180 ms; her support inbox says 9% of merchants are seeing 4-second slow paths during morning peak. The numbers cannot both be right. The audit walks the chain.

Generator. Asha runs the team's wrk2 -R 8000 -t 8 -c 256 -d 300s against a synthetic 100%-timeout server (nc -l 8000 with no response). The tool's emitted RPS counter shows 8000 throughout — generator is open-loop, audit passes. Total time: 5 minutes including build.

Transport. Asha checks the Envoy ingress sidecar config: circuit_breakers.thresholds.max_pending_requests: 64. At 8000 RPS with 50 ms steady-state latency, in-flight is ~400 requests; during a burst the sidecar will start shedding at 64 pending. She raises it to 4096 and verifies envoy_cluster_circuit_breakers_default_pending_requests_max matches. The application's connection pool (Java's HikariCP) is set to maximumPoolSize: 20, which during the slow window saturates and rejects with SQLTransientConnectionException. She raises it to 200 and adds a metric that records pool_wait_ms as a latency sample tagged pool_overflow=true. Total time: 90 minutes including a small code change and a config push.

Server. The application's HTTP server (Jetty) records latency only for 2xx responses; 5xx responses go to a separate counter. Asha changes the latency-recording filter to record all responses, with the latency including the response time even for errors. She also adds a timeout recording: when the request exceeds 30 seconds, the histogram records 30000 ms with tag outcome=timeout, so the timeout's bias is visible. Total time: 60 minutes.

Histogram. The Prometheus client uses Histogram with default buckets, top bucket at 10 seconds. Asha switches to a custom bucket list with finer resolution: [..., 0.5, 0.7, 1, 1.5, 2, 3, 5, 8, 12, 20, 30] seconds. The bucket from 1.5 to 2 seconds is now 500 ms wide instead of 1.5 seconds; the dashboard's p99.9 resolution improves by a factor of 3. Total time: 30 minutes.

Aggregation. The Grafana panel uses avg(histogram_quantile(0.99, ...)) aggregated across pods. Asha rewrites it as histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))). The reported p99 immediately changes from 180 ms to 720 ms, matching the support-ticket reports. The 540 ms gap was the slow pod being averaged out. Total time: 45 minutes.

Run-summary. The benchmark CI job's "tail report" excludes intervals where any pod was scaling. Asha rewrites the report to include all intervals, with annotations for scaling events. The post-scale benchmark interval that previously read p99 = 80 ms now correctly reads p99 = 720 ms during the scaling event itself. Total time: 30 minutes.

A subtle finding from the audit. Asha's pool-overflow histogram showed a bimodal pattern: the bulk of overflow events landed at 50–80 ms (consistent with one slow downstream call), and a smaller mode at 800–1200 ms (consistent with a downstream timeout). The bimodal pattern was invisible before the audit because pool overflows had been a single counter; with a histogram, the structure of the slowness became readable. This is the secondary value of the chain audit — fixing the omissions also produces dashboards with enough resolution to diagnose the cause of the slow window, not just measure it.

Total audit cost: about one engineer-day. Result: the dashboard's p99 went from a comforting 180 ms to a truthful 720 ms; the support-ticket pattern was now visible in the dashboard; the team set the SLO at 1.2 seconds (where users were currently being served) and started a planned program of work to reduce the actual tail rather than the reported one. The user who had been silently feeling 720 ms for a quarter now had a metric the team could see and improve.

The cost of not doing the audit: the support-ticket pattern would have continued for another quarter, and the next merchant escalation would have been the trigger — three weeks later, with a P0 incident attached. The audit pays back in incidents avoided, not in benchmark improvements.

Why correct measurement is harder, not easier, in 2026

The 2015 fix was a tool change. The 2026 discipline is an architecture change in how the team reasons about measurement. Three forces make it harder than it should be. First, the metrics-pipeline lossiness. A typical 2026 stack goes Prometheus client → Prometheus server → Cortex/Mimir long-term store → Grafana → user. Each step has its own aggregation, downsampling, retention, and quantile-merge logic. The HDR histogram emitted by the application is the highest-fidelity object in the chain; every step downstream of it is a possible loss of fidelity. The Hotstar reliability team, reviewing their 2024 IPL final post-mortem, found that the 5-minute resolution of their long-term store collapsed the 30-second p99.99 spikes into 5-minute averages, hiding the real spike behavior; the fix was to keep raw HDR snapshots for 7 days at 30-second resolution before any aggregation.

Second, the autoscaler-inside-the-measurement loop. When the autoscaler is reacting to the same latency metric the benchmark is reporting, the measurement and the system are coupled. A benchmark that runs against a service with a tight HPA loop measures the autoscaler's responsiveness, not the service's tail. The clean fix is to disable the autoscaler during benchmark windows; the realistic fix is to record both the latency and the replica count at every interval, and to interpret the latency conditioned on the replica count, so that "p99 = 250 ms at 30 replicas" and "p99 = 80 ms at 60 replicas" are reported as two separate operating points rather than averaged into "p99 = 165 ms".
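The two-operating-points report can be sketched directly. Synthetic data; the scale-up point, replica counts, and latency modes are invented for illustration:

```python
# Sketch of "report operating points, not averages": bucket latency
# samples by the replica count in effect when they were served, and emit
# one percentile ladder per operating point.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
replicas = np.where(np.arange(n) < 4_000, 30, 60)      # HPA scales up mid-run
latency_ms = np.where(replicas == 30,
                      rng.lognormal(np.log(250), 0.3, n),
                      rng.lognormal(np.log(80), 0.3, n))

for r in np.unique(replicas):
    sel = latency_ms[replicas == r]
    print(f"replicas={r}: p50={np.percentile(sel, 50):.0f} ms  "
          f"p99={np.percentile(sel, 99):.0f} ms  ({sel.size} samples)")
# the pooled number blends two operating points into one figure
print(f"pooled p99={np.percentile(latency_ms, 99):.0f} ms")
```

Reading the two ladders side by side keeps "p99 at 30 replicas" and "p99 at 60 replicas" as separate operating points; the pooled number describes neither.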

Third, service meshes that re-route on latency. Istio's locality-aware routing, Envoy's outlier detection, and Linkerd's load-aware balancing all redirect traffic away from slow pods. A benchmark that generates load through the mesh measures the mesh's ability to avoid slow pods, not the slow pod's actual tail. The mesh is doing its job; the measurement is no longer telling you what you think it is. The Zerodha order-routing post-mortem from May 2024 traced an SLO regression to exactly this: their benchmark ran against the mesh-fronted service, observed p99 = 60 ms, and missed that one pod was at p99 = 1.2 seconds because the mesh kept routing around it. In production, the mesh's outlier-detection threshold was lower than the user's tolerance, and a fraction of users still got the slow pod for the few seconds before outlier detection kicked in. The fix was to benchmark the pods directly (bypass the mesh) for measurement purposes, while keeping the mesh in production for traffic shaping.

The Razorpay 2024 internal audit on 47 services produced a report with a standard shape: per-service, a row with the six audit columns (generator, transport, server, histogram, aggregation, run-summary) and a pass/fail/partial mark for each, plus the magnitude of bias each contributed. The aggregate finding was striking: only 4 of 47 services passed all six audits cleanly; 28 had a single failure (most often shape 3 aggregation); 12 had two failures; 3 had three or more. The total bias across the fleet, weighted by request volume, was 7.3× — meaning the user p99.9 across the fleet was 7.3× higher than the dashboards reported. After the audit closed, the weighted bias dropped to 1.4× within a quarter and to 1.1× within two quarters; the residual 10% gap is the noise floor of the user-side ground-truth pipeline itself, not a chain omission. The audit's discipline turned a fleet-wide systematic understatement into a measurable, closable gap.

The four shapes are not exhaustive — they are the four most common in 2026. The general predicate for "this is a coordinated-omission shape" is simple: any process in the measurement chain that adapts its behaviour to the system being measured creates a possible omission. Shape 1 adapts the offered rate; shape 2 adapts the in-flight count; shape 3 adapts the aggregation by averaging out outliers; shape 4 adapts the run summary by tagging anomalies. Each is an instance of the same predicate. The predicate is operationally useful because it lets a team predict where the next CO shape will appear. Async-runtime back-pressure (e.g., Tokio's mpsc channels, Akka streams) adapts the producer's rate to the consumer's; that is shape 5 in waiting. Distributed-tracing samplers that downsample slow traces because they are "too long to send" (Jaeger's tail-based-sampler with a max-trace-duration filter) is shape 6: the slow user requests are the ones whose traces are being dropped, and the trace dashboard understates exactly the population the team needs to study. The Cleartrip 2024 reliability post-mortem identified shape 6 in their own pipeline: 18% of slow user requests had no trace because the tail-based sampler had dropped them; the team's trace-based investigations had been systematically blind to slow paths for a year. The fix was to invert the sampler — keep slow traces, drop fast ones — and the trace fidelity for incident response improved immediately.

The Cashfree 2025 internal review of their measurement chain found that across 18 services, the total understatement of user-side p99.9 from the four 2026 shapes combined was a factor of 6–12×. Each individual shape was a 2–3× bias; they multiply rather than add, because each shape sits in a different link of the chain and the slow window has to survive every link to land in the dashboard. The review was triggered by a customer-success ticket from a high-volume merchant — Bigbasket — whose dashboard alarms were not firing despite checkout latency feeling slow during the 6pm grocery rush. A two-day chain audit found three shapes contributing simultaneously, and the dashboard alarm was rebuilt within a sprint to use percentile-of-sum aggregation with sufficient bucket resolution; the merchant's experience improved by the next 6pm rush. The lesson Cashfree's reliability lead drew was that the customer's complaint is data: when a high-volume merchant says they feel slow, the dashboard's "p99 = 80 ms" claim is the hypothesis that needs to be checked, not a fact that disproves the merchant's report. Trusting the customer over the dashboard turned a six-month-old grumble into a six-week fix. The team's recommendation, codified into their runbook, is to maintain a "user-side ground-truth pipeline" — a small fraction of real user requests recorded with full-fidelity latency samples (server timestamp at receive, server timestamp at respond, client timestamp at start) and stored for 90 days. Quarterly, the team compares the dashboard's p99.9 against the user-side pipeline's p99.9; the gap is the chain's omission, and the audit closes the largest contributor each quarter.

The PhonePe Diwali 2024 lesson is sharper. The team had passed the 2015 CO audit (wrk2, HDR, all good); they had not audited the rest of the chain. During the festival peak, the user-side p99.9 went to 4.8 seconds while the dashboard read 240 ms — a 20× gap. The post-mortem traced the gap to a combination of shape 2 (Envoy's max_pending_requests: 64 shed the slow burst), shape 3 (Grafana panel averaged per-pod p99), and shape 4 (the autoscaling event during the burst was tagged as "warmup" by the run-summariser). Each shape contributed 2–4× independently; together they hid the slow window entirely. The fix took six weeks: explicit Envoy pool sizing, percentile-of-sum dashboards, and full-resolution HDR snapshots of every burst. The next festival, Holi 2025, the dashboard's p99.9 tracked the user-side ground truth within 5%; the team's confidence in their own measurements went from "we suspect" to "we know". The cost of the audit was small; the cost of not having done it before Diwali was a four-hour brownout during the highest-revenue evening of the year.

The deeper habit this builds is treating measurement as a system rather than as a tool. The 2015 fix was at the tool level; the 2026 fix is at the system level. Every link in the chain has a possible omission; each can be audited independently; the audit pays back within a sprint or two; and the resulting dashboard agrees with the user's experience instead of telling a comforting lie. The Razorpay handbook's closing line is the operational summary: "if your benchmark looks too good, audit the chain, not the result".

Why the audit order Asha used (generator → transport → server → histogram → aggregation → run-summary) matters: each link's omission masks the omissions downstream. If you fix the aggregation first while the transport is still dropping samples, the new aggregation reports a number that is still wrong because the underlying buckets are missing the slow samples shape 2 dropped. Walking the chain in request-flow order means each fix exposes the next link's bias rather than hiding it. The discipline is the same as debugging a multi-stage pipeline — fix the upstream stage first, observe the new failure, fix the next stage. Doing it in any other order produces an audit that converges slowly and leaves bias hidden in deeper layers.

What changes for the on-call engineer

The chain audit changes the way an on-call rotation reads its own dashboards. The pre-audit posture is "the dashboard says p99 = 180 ms; the alert is silent; we are fine". The post-audit posture is "the dashboard says p99 = 720 ms; the alert is correctly thresholded; we have eight weeks of work to bring it down". The work itself does not change much. The techniques (hedging, GC tuning, NUMA pinning) are the same, but the target is honest, the progress is measurable, and the reaction to user complaints is grounded in the dashboard rather than dismissed by it.

The on-call playbook also changes. A pre-audit playbook says "if dashboard p99 > 500 ms, investigate"; a post-audit playbook says "if dashboard p99 > 1.2 s OR if user-side ground truth diverges from dashboard by > 30%, investigate". The second alarm catches the chain-omission regressions — a new Envoy version that ships with a smaller default max_pending_requests, a Prometheus client library upgrade that changes default buckets, a Grafana panel migration that resets to mean-of-quantile aggregation. Each of these is a real regression seen in the Razorpay and Hotstar 2024–2025 reliability reports; each was caught by a chain-divergence alarm rather than a latency alarm.
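The two-condition rule can be written down as a small predicate. A minimal sketch, assuming hypothetical inputs: `dashboard_p99_ms` from the metrics pipeline and `ground_truth_p99_ms` from the user-side channel (the function name and thresholds are illustrative, taken from the playbook numbers above, not from any real codebase):

```python
def should_investigate(dashboard_p99_ms: float, ground_truth_p99_ms: float,
                       absolute_limit_ms: float = 1200.0,
                       divergence_limit: float = 0.30) -> bool:
    """Post-audit alert rule: fire on a high tail OR on the dashboard
    drifting away from user-side ground truth (a chain-omission regression)."""
    divergence = abs(ground_truth_p99_ms - dashboard_p99_ms) / ground_truth_p99_ms
    return dashboard_p99_ms > absolute_limit_ms or divergence > divergence_limit

print(should_investigate(700.0, 720.0))   # healthy chain, tail under limit: False
print(should_investigate(240.0, 4800.0))  # the Diwali-shaped gap: True
```

The second call is the case the latency alarm alone would have missed: the dashboard number is comfortably under the absolute limit while the chain is omitting a 20× gap.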

Going deeper

The mathematics of mean-of-percentiles vs percentile-of-union

Why is mean(p99 across pods) so badly wrong? Suppose 200 pods, 199 sharing a latency CDF F_fast and one with F_slow. Each user request is routed to a random pod, so the user's experienced latency is drawn from the mixture (199/200) F_fast + (1/200) F_slow. The user-side percentile at level q solves (199/200) P(F_fast > T) + (1/200) P(F_slow > T) = 1 − q. Once the slow pod's latencies sit above the fast pods' tail, the second term dominates whenever the exceedance budget 1 − q is smaller than the slow pod's 0.5% traffic share: at p99.9 the equation reduces to (1/200) P(F_slow > T) ≈ 0.001, so P(F_slow > T) ≈ 0.2 and the user-side p99.9 is the slow pod's p80, deep inside the slow pod's own distribution and nowhere near the fast pods' tail.

By contrast, mean(p99_pod_i) is (199 × p99_fast + 1 × p99_slow) / 200, which is dominated by p99_fast because there are 199 of them. The two quantities answer different questions: the user's question (what does a typical slow request look like?) and the pod's question (what is each pod's tail?). The dashboard is meant to answer the user's question; using mean-of-p99 answers the pod's question instead. The Prometheus query histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) answers the user's question correctly because it reconstructs the union histogram from the per-pod buckets and computes the percentile on the union. The query avg(histogram_quantile(0.99, ...)) averages percentiles instead and is the most common Grafana mistake.
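The gap is easy to reproduce numerically. A minimal sketch with hypothetical lognormal latency shapes (illustrative numbers, not production data): 199 fast pods and one slow pod, comparing the averaged per-pod percentile against the percentile of the pooled samples:

```python
import numpy as np

rng = np.random.default_rng(0)
PODS, PER_POD = 200, 50_000

# Hypothetical latency shapes in ms: 199 fast pods, one slow pod
lat = rng.lognormal(np.log(10), 0.2, (PODS, PER_POD))
lat[0] = rng.lognormal(np.log(400), 0.3, PER_POD)

for q in (99, 99.9):
    mean_of_q = np.percentile(lat, q, axis=1).mean()  # the Grafana mistake
    union_q = np.percentile(lat.ravel(), q)           # what a random user sees
    print(f'p{q}: mean-of-percentiles = {mean_of_q:.0f} ms, '
          f'percentile-of-union = {union_q:.0f} ms')
```

At p99 the two numbers are close, because the slow pod's 0.5% of traffic is below the 1% exceedance budget; at p99.9 they diverge by roughly an order of magnitude. Deep tails are exactly where mean-of-percentiles lies worst, which is why the dashboards that matter most are the ones this mistake corrupts.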

Why connection-pool sizing is a measurement decision, not a performance decision

Most teams pick connection-pool sizes from CPU and memory budgets — "256 connections per host fits our memory". The pool size is also a measurement filter: when in-flight saturates, slow samples are rejected and the histogram understates the tail. The right way to think about pool sizing is: the pool must be large enough that the kernel's TCP listen backlog is the actual back-pressure mechanism, not the application's pool. A pool of 256 with an offered rate of 5,000 RPS and a service time of 10 ms is fine in steady-state (in-flight ≈ 50); during a 240 ms stall the in-flight grows to 1,200, and 944 requests get rejected at the pool. Setting the pool to 4,096 lets the stall play out at the kernel level, where the Linux TCP backlog (net.core.somaxconn = 4096) absorbs the burst and the histogram records the queue-wait correctly.
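The arithmetic above is just Little's law applied twice. A quick check, using the paragraph's own numbers:

```python
RATE = 5_000        # offered load, requests per second
SERVICE_S = 0.010   # 10 ms steady-state service time
STALL_S = 0.240     # a 240 ms stall, during which nothing completes
POOL = 256

steady_in_flight = RATE * SERVICE_S        # Little's law: L = lambda * W
stall_in_flight = RATE * STALL_S           # arrivals accumulated during the stall
rejected = max(0, stall_in_flight - POOL)  # shed at the pool, never recorded

print(steady_in_flight, stall_in_flight, rejected)  # 50.0 1200.0 944.0
```

The same two lines give the sizing rule directly: the pool must cover RATE × worst-expected-stall, not RATE × service-time, if the histogram is to see the stall at all.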

The cost of an oversized pool is small: each slot is a few KB of kernel memory and a few bytes of file-descriptor table. The cost of an undersized pool is the silent omission of slow samples, which is usually expensive in incidents and SLO regressions. The Razorpay 2024 handbook recommends pool size = 4× steady-state in-flight as a rule of thumb; the Hotstar reliability team uses pool size = 10× because their offered rate during festivals is 5–10× steady-state. Both numbers err on the side of "let the kernel queue absorb bursts", which is the right side to err on for measurement honesty.

How to instrument a "user-side ground truth" channel

The Cashfree 2025 user-side pipeline is small (about 200 lines of Go). The shape: a Lua filter in the Envoy ingress sidecar samples 0.1% of requests and emits a JSON event with (request_id, server_received_ts, server_responded_ts, client_observed_latency_ms) to a Kafka topic. A small consumer writes the events to Parquet partitioned by hour and day. A Grafana panel queries the Parquet via Trino with approx_percentile(client_observed_latency_ms, 0.99) over 5-minute windows. The whole pipeline costs about ₹15,000/month at their scale; the raw event volume is modest (10B requests/day × 0.1% sampling × 256 bytes/event ≈ 2.5 GB/day of events).

The pipeline is the ground truth because it records the latency that the client observed, including everything that happens between the client and the server: DNS, TLS, network RTT, queue-wait, server-side processing, response transit. No aggregation, no per-pod averaging, no sampling tricks — just the raw observation. Comparing the metrics-pipeline p99.9 against the ground-truth p99.9 every quarter catches the omissions before they become incidents. Cashfree's quarterly audit takes one engineer-day; the value is closing the gap before the next festival peak. The discipline is small; the payoff is the difference between believing your dashboard and being right about what users feel.

Why the four shapes multiply rather than add

A request must survive every link in the measurement chain to be reported correctly. Suppose shape 1 omits 30% of the slow samples at the generator (a 1.4× understatement of the slow-window weight). The survivors face shape 2's pool, where half of the remaining slow samples are dropped (2×). The survivors of shape 2 face shape 3's aggregation, where the slow pod's contribution is averaged down by a further 8×, and the survivors of shape 3 face shape 4's run-summary, where slow intervals tagged as warmup are discarded (2.4×). Each shape's bias multiplies the previous one's: total understatement = 1.4× × 2× × 8× × 2.4× ≈ 54×.
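The composition can be made concrete by treating each bias as an equivalent fraction of slow-sample weight lost at that link (a simplification for shapes 3 and 4, whose biases are not literally sample drops; the factors are the chapter's illustrative ones):

```python
# Survival fraction of slow-window sample weight at each link
links = [
    ('generator (shape 1)',   1 / 1.4),
    ('pool (shape 2)',        1 / 2.0),
    ('aggregation (shape 3)', 1 / 8.0),
    ('run-summary (shape 4)', 1 / 2.4),
]

surviving = 1.0
for name, keep in links:
    surviving *= keep
    print(f'after {name:<22} {surviving:6.1%} of slow-sample weight remains')

print(f'total understatement = {1 / surviving:.1f}x')  # 53.8x
```

The product, not the sum, is what matters: four individually tolerable biases leave under 2% of the slow-window weight in the recorded distribution.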

The multiplicative effect is why the post-Diwali PhonePe audit found a 20× gap and the Hotstar post-mortem a 6× one: each shape on its own was only a few-fold bias, and the chain composition produced an order-of-magnitude understatement. The reverse is also true: closing one shape buys back only that shape's factor; closing all four recovers the full gap. Because the biases multiply, the sensible sequencing is by cost of fix rather than by size of factor: teams typically close shape 3 (mean-of-p99) first because it is a Grafana-query change, then shape 2 (pool sizing), then shape 1 (generator audit), then shape 4 (run-summary discipline). The audit ROI is highest in the first weeks and tapers as the chain converges to fidelity.

Reproduce this on your laptop

# Pure Python, no kernel access, ~2 minutes total runtime.
python3 -m venv .venv && source .venv/bin/activate
pip install numpy hdrhistogram   # the hdrhistogram package provides the hdrh module

# Run the four-shape simulator from this chapter
python3 co_audit_simulator.py

# Sweep the pool size to see shape 2 directly
python3 -c "
# A single FIFO server fed at a fixed (open-loop) rate. Every 10 s the
# service time jumps from 10 ms to 50 ms for 300 ms: a stall. The client
# pool caps in-flight requests; shed requests never reach the histogram.
import collections
from hdrh.histogram import HdrHistogram
RATE = 80                    # offered requests per second
N = 48_000                   # 600 s of traffic
for pool in [8, 16, 32, 64, 4096]:
    inflight = collections.deque()  # completion times of accepted requests
    server_free = 0.0               # when the server finishes its queue
    h = HdrHistogram(1, 600_000, 3)
    for i in range(N):
        t = i / RATE
        while inflight and inflight[0] <= t:
            inflight.popleft()          # finished before this arrival
        if len(inflight) >= pool:
            continue                    # pool full: shed, never recorded
        s = 0.050 if (t % 10.0) < 0.3 else 0.010
        server_free = max(server_free, t) + s
        inflight.append(server_free)
        h.record_value(max(1, int((server_free - t) * 1000)))
    print(f'pool={pool:>5}  p99={h.get_value_at_percentile(99):>5} ms  p99.9={h.get_value_at_percentile(99.9):>5} ms')
"

You will see the recorded p99 and p99.9 climb toward the truth as the pool grows: the largest pool lets every queued sample reach the histogram, and each smaller pool silently hides tail mass by shedding the slow-window requests before they are ever measured. The exercise teaches the dial: pool size is a measurement parameter, not just a capacity parameter, and the right size is "large enough that the histogram reaches the truth".

Where this leads next

The next chapters extend the measurement-side discipline. Hedged requests is the architectural mitigation that depends on the measurement chain being honest — if you cannot measure the slow tail correctly, you cannot tune the hedge threshold, and the wrong threshold doubles your traffic without reducing the user tail. Backup requests and bounded queueing couples Dean & Barroso's hedging with admission control, and the admission threshold has to be set against the measured (not the omitted) tail. Latency-driven auto-scaling closes the loop between measurement and capacity: the autoscaler reacts to the latency metric, and if that metric is omitting the slow window, the scaler reacts to the wrong signal.

The single architectural habit to take from this chapter: when designing or auditing a benchmark, walk the six-link chain — generator, transport, server, histogram, aggregation, run-summary — and verify each link separately. The 2015 fix was the first link; the rest of the chain has been quietly producing the same bias for ten years. The audit takes an afternoon per service. The payoff is that your dashboard agrees with your users, and the next festival peak does not surprise you.

A second habit, sharper: maintain a user-side ground-truth pipeline at low sampling rate (0.01–0.1%) and quarterly compare its p99.9 against your dashboard's p99.9. The gap is your chain's omission. Close the largest contributor each quarter. The discipline is the same as a fault-tolerance audit (find the single point of failure and remove it), applied to measurement instead of availability. Teams that run this audit catch shape regressions before the next festival; teams that do not run it discover them during the next outage.

The deeper architectural posture this chapter recommends is to read every dashboard as a hypothesis: it is the team's claim about user experience, not a fact. Like any hypothesis, it has to be cross-checked against an independent source — the user-side ground truth, support-ticket patterns, real-user-monitoring traces from the mobile SDK. When the dashboard and the cross-check agree, the team has earned trust in the dashboard. When they disagree, the dashboard is wrong (the cross-check is the user) and the chain audit is the way to fix it. Reading the dashboard as fact is the failure mode that hides the slow window for a quarter; reading it as hypothesis is the discipline that catches the omissions early.

A third habit, even sharper: treat "measurement" and "performance" as different engineering disciplines with different deliverables. Performance work is about reducing the tail; measurement work is about reporting the tail correctly. Both matter; conflating them produces teams that optimise the dashboard's number rather than the user's experience. The 2015 talk made this distinction; the 2026 measurement chain has reintroduced the conflation in subtler ways. The work is to keep them separate, audit the chain independently of optimising the system, and ship dashboards that earn the team's trust.

A final habit worth naming: when you inherit a service, run the chain audit before changing anything else. The audit's findings shape every other reliability decision — what to optimise, what to alarm on, what to capacity-plan against. Inheriting a service with an unaudited measurement chain and trying to improve performance is like tuning an engine while looking at a speedometer that lies; the work goes nowhere because the feedback signal is wrong. A morning of audit before a quarter of optimisation is the leverage move.

References