Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Coordinated omission and HDR histograms

Riya runs a wrk -c 100 -t 4 -d 60s against the PaisaBridge payment-init service the night before Black Friday. The histogram says p99 = 12 ms; she signs off. Twenty-four hours later her on-call phone rings — p99 in production is 410 ms during the spike, the SLO is 200 ms, and the post-mortem will spend three days asking why the load test was wrong. It was not wrong about what it measured. It measured the wrong thing: the latencies of the requests wrk actually sent. The latencies of the 6,400 requests it should have sent during a 240 ms GC pause — the requests that, in production, are queueing in the kernel TCP backlog and pushing real users into the tail — never made it into the histogram. The benchmark omitted them in coordination with the server's slow window. The fix is not a faster client; the fix is a different measurement contract.

Closed-loop benchmark tools wait for each response before sending the next request. When the server stalls, the client stops sending — so the slow window has no samples and the histogram lies cleanly downward, often by 30–100×. The fix has two parts: drive load at a fixed rate independent of responses (wrk2 -R, vegeta, k6), and store latencies in an HDR histogram so the p99.9 you read tomorrow is the p99.9 the run actually saw, not a mean lossy-aggregated by your dashboard.

What `wrk` actually measures and why p99 = 12 ms is a lie

Closed-loop load generators — wrk (without -R), ab, siege, httperf in default mode, and any homemade for i in range(N): r = requests.get(...) script — operate as a fixed-size pool of virtual users. Each user holds a connection, sends a request, waits for the response, records the latency, and only then sends the next one. With 100 connections and a service responding in 1 ms each, the tool sends 100 × 1000 = 100,000 RPS. With the same 100 connections and a service responding in 100 ms each, the tool sends 1,000 RPS. The offered load follows the service — that is exactly the property real users do not have.

A real user — a PaisaBridge merchant's customer tapping "Pay" on a checkout page — does not wait politely while the payment service does a 240 ms GC pause and only then walk into the next room to start their transaction. The next user is already typing their OTP. The next 5,000 users have already tapped Pay. They all queue up in the TCP listen backlog of the load balancer. When the server resumes, those queued requests hit the application all at once, each having waited 240 ms in the queue plus however long their own service time will be. Their experienced latency is queue_wait + service_time, not service_time. A closed-loop benchmark sees only service_time because it suppressed its own queue-building.

Illustrative — not measured data. The closed-loop client books one slow sample for the stall. The open-loop client books one for every scheduled tick during it; each tick records the queue time it accumulated waiting for the server to resume. The tail mass — what production looks like — only appears in the open-loop run.

Why the bias is downward by a large factor: imagine a 60-second benchmark, an offered rate of 10,000 RPS, and one 240 ms stall. The "should-have-been-sent" count during the stall is 0.240 × 10,000 = 2,400 requests, each with a recorded latency of at least 0–240 ms of queue wait plus service time. A closed-loop tool with 100 connections sends only 100 requests during that 240 ms window (one per connection, the one that was in flight) — and the 99 of those that aren't the unlucky one finish in 1 ms. The histogram has 99 fast samples and 1 slow sample for that window, when the real production traffic would have produced ~2,400 mostly-slow samples. The histogram's 99th percentile is computed against 60 × 100,000 ≈ 6,000,000 total samples, so 1 slow sample doesn't move it — p99 stays where the body is. The right histogram should have ~2,400 slow samples per stall, of which the slowest are the p99 you would see in production.

The phrase "coordinated omission" is Gil Tene's, from his 2015 talk "How NOT to Measure Latency". The "coordination" is between the client's send schedule and the server's response schedule: when the server slows, the client also slows, and the slow events are omitted from the measurement record. This is not a small bias — it is structural and one-directional. Every closed-loop measurement tool understates the tail. The size of the understatement is roughly the ratio of the stall duration to the inter-arrival time at the offered rate; for any service where stalls are tens of milliseconds (GC, lock contention, cache miss waterfalls) and arrivals are sub-millisecond, the ratio is 50–500×. A "p99 = 12 ms" report from wrk against a service whose real p99 in production is 410 ms is a typical instance, not an extreme case.

The fix is to send requests on a fixed schedule — one request every 1/R seconds regardless of whether the previous response has come back — and to record the latency of each request as response_time - intended_send_time, not response_time - actual_send_time. The first part is what wrk2's -R flag does, what vegeta does by default, what k6 does in its constant-arrival-rate executor. The second part is what HDR histograms with record_corrected_value() do — they spread one observed slow sample across all the samples that were "supposed to" land during the slow window.

Open-loop load with `wrk2`, parsed by HDR histograms in Python

The right way to run a latency benchmark is wrk2 (or vegeta / k6) at a fixed offered rate, dumping the HDR histogram, and reading it from Python with the hdrh package. The script below is the calibration harness Riya now runs before any release; it benchmarks her local FastAPI service at a constant 5,000 RPS and prints the tail percentiles from the corrected histogram.

# co_calibration.py — coordinated-omission-aware latency benchmark.
# Drives constant-rate load with wrk2, dumps the HdrHistogram, reads
# the percentile ladder in Python via the `hdrh` package.

import subprocess, re, sys, time, base64
from hdrh.histogram import HdrHistogram
from hdrh.log import HistogramLogReader

URL          = "http://127.0.0.1:8000/charge"
RATE_RPS     = 5000           # offered rate held constant
DURATION_SEC = 60
THREADS      = 4
CONNECTIONS  = 200            # must >= rate * worst_case_latency / threads
HDR_LOG_PATH = "/tmp/run.hdrhist"

def run_wrk2() -> str:
    cmd = [
        "wrk2",
        f"-t{THREADS}", f"-c{CONNECTIONS}", f"-d{DURATION_SEC}s",
        f"-R{RATE_RPS}", "--latency",
        "-s", "/usr/local/share/wrk2/scripts/co_log.lua",  # writes hdr log
        URL,
    ]
    print("running:", " ".join(cmd), flush=True)
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return out.stdout + "\n" + out.stderr

def parse_text_summary(text: str) -> dict:
    # wrk2 prints a percentile ladder we cross-check against the hdr log.
    pcts = {}
    for line in text.splitlines():
        m = re.match(r"\s*(\d+\.?\d*)%\s+([\d\.]+)(us|ms|s)", line.strip())
        if m:
            v, unit = float(m.group(2)), m.group(3)
            us = {"us": v, "ms": v * 1e3, "s": v * 1e6}[unit]
            pcts[float(m.group(1))] = us
    return pcts

def read_hdr_log(path: str) -> HdrHistogram:
    h = HdrHistogram(1, 60_000_000, 3)   # 1us..60s, 3 sig figs
    reader = HistogramLogReader(path, h)
    while True:
        chunk = reader.get_next_interval_histogram()
        if chunk is None: break
        h.add(chunk)
    return h

if __name__ == "__main__":
    text = run_wrk2()
    text_pcts = parse_text_summary(text)
    h = read_hdr_log(HDR_LOG_PATH)
    print(f"\nTotal samples: {h.get_total_count():,}")
    for p in (50, 75, 90, 99, 99.9, 99.99, 99.999):
        v_us = h.get_value_at_percentile(p)
        print(f"  p{p:<6}  {v_us/1000:8.2f} ms   "
              f"(wrk2 text: {text_pcts.get(p, '—')} us)")
    print(f"  max     {h.get_max_value()/1000:8.2f} ms")

# Sample run against a FastAPI /charge endpoint with a 240 ms GC pause
# every 30 seconds:

running: wrk2 -t4 -c200 -d60s -R5000 --latency -s /usr/local/share/wrk2/scripts/co_log.lua http://127.0.0.1:8000/charge

Total samples: 299,418
  p50         1.84 ms   (wrk2 text: 1840 us)
  p75         2.61 ms   (wrk2 text: 2610 us)
  p90         3.92 ms   (wrk2 text: 3920 us)
  p99        18.40 ms   (wrk2 text: 18400 us)
  p99.9     186.40 ms   (wrk2 text: 186400 us)
  p99.99    231.20 ms   (wrk2 text: 231200 us)
  p99.999   238.40 ms   (wrk2 text: 238400 us)
  max       240.32 ms

Walk through the four lines that decide whether the run is honest. f"-R{RATE_RPS}" is the open-loop switch — it tells wrk2 to send 5000 requests per second on a fixed schedule, regardless of when responses arrive. Without -R, wrk2 falls back to closed-loop and the histogram becomes the lie from the previous section. f"-c{CONNECTIONS}" must be sized so connections × threads ≥ rate × worst_case_latency_seconds; with rate 5000 and a 240 ms tail you need ≥ 1200 in-flight slots, so 4×200 = 800 is borderline and 4×400 would be safer for a longer-tailed service. Undersizing connections silently re-introduces coordinated omission inside wrk2 itself, because the tool runs out of slots and starts waiting. HdrHistogram(1, 60_000_000, 3) is the storage layout: tracks values from 1 µs to 60 s with 3 significant figures of precision, in roughly 8 KB of memory regardless of how many samples you record. h.get_value_at_percentile(99.999) is the reason HDR histograms exist — for an exact answer at the 99.999th percentile you would need 100,000+ samples per percentile bucket and a distribution that stays put across runs; HDR's logarithmic bucketing gives you the answer in 8 KB and gives it the same way every time.

The two output features that matter: p99 = 18 ms is small (one stall in 30 seconds, plus the body of the distribution), but p99.9 = 186 ms and p99.99 = 231 ms are where the stall lives. A closed-loop tool would have reported p99 = 18 ms and p99.9 = 30 ms or so — the one slow sample per stall, lost in the body. The open-loop run produces ~1,200 slow samples per stall (5000 × 0.240), and they pile into the tail in a way that maps onto what production looks like. The shape of the tail — flat from p99.9 to max — is the signature of a pause-style stall; a contention-style stall would produce a smoothly rising tail, and a Poissonian arrival on a healthy server would produce a thin tail you could fit to a Pareto. Reading the shape of the tail, not just the percentile values, is what the HDR histogram makes possible.

Why HDR histograms compress so well: they store counts in logarithmically-spaced buckets, where each "magnitude" (decade) has the same number of equal-width sub-buckets. With 3 significant figures, a decade has ~2,048 sub-buckets; covering 1 µs to 60 s spans ~7.8 decades, so total buckets ~16,000, each a 4-byte counter — about 64 KB at most, often 8–16 KB after compression. A naive linear histogram with 1 µs resolution from 0 to 60 s would need 60 million buckets. The logarithmic spacing matches what you actually want: you want 1 µs precision when measuring 1 µs latencies and 1 ms precision when measuring 1 s latencies, not the same absolute precision across the whole range. That property is also why you can combine HDR histograms across machines, time-windows, or test runs by simply summing the bucket counts — addition commutes — without losing fidelity.

What HDR histograms do that mean / stddev / quantile-sketches do not

HDR histograms are not the only latency-recording structure, but they are the only one with all three of the properties you need for production: (a) bounded memory independent of sample count, (b) lossless aggregation across machines and time windows, (c) fixed-precision percentiles at any percentile including the deep tail.

The four common alternatives all fail one of these three.

Mean and standard deviation. Two numbers. They tell you nothing about the tail. A distribution with mean 5 ms and stddev 8 ms could have p99 = 45 ms or p99 = 4500 ms; the same mean and stddev fit both. Distributions of latency are heavy-tailed and skewed; the central limit theorem does not give you usable error bars on the tail from the mean and stddev. Any monitoring system that reports avg(latency) and stddev(latency) for service-level latency is reporting a number that does not answer the question users feel.

Sampled raw histograms (e.g. t-digest with low compression). Store every sample, or a sampled subset. Memory grows with sample count or sampling rate; aggregating two sampled histograms double-counts or under-counts unless they were sampled with care. Quantile estimates are unbiased only if the sampling is uniform across the latency distribution, which is hard to guarantee under heavy-tailed loads.

T-digest / KLL sketches. Modern probabilistic sketches with bounded memory and tunable error. Better than mean/stddev. The accuracy at the tail is configurable but always weaker than HDR — t-digest's error scales with q × (1-q), which is good at the median (q=0.5) and worst at the extremes (q=0.999). HDR's error is bounded by the sub-bucket width, which is constant in the log domain, so the relative error is the same at p50 and p99.999. For latency benchmarking where the answer you care about is at the extremes, HDR is the right primitive; for streaming aggregations in OLAP where the median matters, t-digest is more popular.

Top-N slowest samples. Keep the 100 slowest requests. Useful for debugging (the slowest 100 are likely diagnostic) but useless for percentile estimation — the 99th percentile of a million samples is not in the top 100 unless the distribution is pathological.

Why aggregation properties matter more than they sound: a production fleet has hundreds of replicas across multiple regions, each emitting latency observations. The platform must answer "what is the p99 across the fleet for the past 24 hours?" — and the only honest way to answer is to merge the per-replica per-minute records into one structure and read the percentile from the merge. HDR histograms merge by adding bucket counts (commutative, associative, lossless). Means and stddevs merge correctly only if you also store sample counts, and even then they don't tell you the tail. Per-host pre-computed percentiles cannot merge at all — there is no mathematical operation that takes two p99 values and produces the joint p99. The platform team's choice of latency primitive is therefore a choice about which questions the dashboards can answer truthfully six months from now.

Illustrative — not measured data. Below p99 the two tools agree. Above p99 the closed-loop tool's curve stays flat (it has no slow samples to plot); the open-loop tool's curve rises sharply because the queue time during the stall accumulated correctly. The 30× ratio between the curves at p99.9 is typical of a service with one stall per 30 seconds at 5,000 RPS.

A practical rule that follows from this: never average two p99s. "The p99 across our 12 replicas is 8 ms" is meaningless if it was computed by averaging twelve per-replica p99s; the right answer is to merge the twelve HDR histograms (which is just adding bucket counts) and then read the p99 of the merged distribution. The merge is associative and lossless; the average of percentiles is neither. Most observability stacks (Datadog, New Relic, vendored Prometheus front-ends) compute percentiles per-host and then average them across hosts, producing a number that is systematically lower than the true cross-host percentile. If your dashboard quotes a p99, ask the platform team how it is aggregated; if the answer is "we average the per-host p99s", the number is a lie of the same family as coordinated omission, just at the aggregation layer instead of the measurement layer.

Demonstrating coordinated omission in 80 lines of Python

The phenomenon is small enough to demonstrate on a laptop. The script below stands up a minimal HTTP server that injects a 240 ms stall every 30 seconds, then drives it with two clients — a closed-loop one and an open-loop one — and prints both their HDR-histogram tails so the divergence is unambiguous.

# co_demo.py — minimal demo of coordinated omission on localhost.
# Run: python3 co_demo.py
# Requires: pip install httpx hdrh

import asyncio, time, random, threading, http.server, socketserver, signal
from hdrh.histogram import HdrHistogram
import httpx

PORT       = 8765
TOTAL_SEC  = 30
RATE_RPS   = 2000
STALL_AT   = 15.0   # inject one 240 ms stall in the middle of the run
STALL_MS   = 240
START_T    = time.perf_counter()

class StallingHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        elapsed = time.perf_counter() - START_T
        if abs(elapsed - STALL_AT) < 0.001:           # one tick of stall
            time.sleep(STALL_MS / 1000)
        self.send_response(200); self.end_headers(); self.wfile.write(b"ok")
    def log_message(self, *a): pass

def serve():
    with socketserver.ThreadingTCPServer(("127.0.0.1", PORT), StallingHandler) as s:
        s.serve_forever()

async def closed_loop(hist: HdrHistogram, n_conns: int):
    async with httpx.AsyncClient(http2=False) as cli:
        async def worker():
            while time.perf_counter() - START_T < TOTAL_SEC:
                t0 = time.perf_counter_ns()
                await cli.get(f"http://127.0.0.1:{PORT}/")
                hist.record_value((time.perf_counter_ns() - t0) // 1000)
        await asyncio.gather(*[worker() for _ in range(n_conns)])

async def open_loop(hist: HdrHistogram, rate_rps: int):
    interval = 1.0 / rate_rps
    async with httpx.AsyncClient(http2=False) as cli:
        async def fire(intended_us: int):
            t_intended = START_T + intended_us / 1e6
            t_actual   = time.perf_counter()
            await cli.get(f"http://127.0.0.1:{PORT}/")
            t_end = time.perf_counter()
            corrected_us = int((t_end - t_intended) * 1e6)
            hist.record_value(corrected_us)            # corrected for CO
        i = 0
        while time.perf_counter() - START_T < TOTAL_SEC:
            asyncio.create_task(fire(i * int(interval * 1e6)))
            i += 1
            await asyncio.sleep(interval)

def report(name: str, h: HdrHistogram):
    print(f"\n{name}: {h.get_total_count():,} samples")
    for p in (50, 90, 99, 99.9, 99.99):
        print(f"  p{p:<5} {h.get_value_at_percentile(p)/1000:7.2f} ms")
    print(f"  max    {h.get_max_value()/1000:7.2f} ms")

if __name__ == "__main__":
    threading.Thread(target=serve, daemon=True).start()
    time.sleep(0.5)
    h_closed = HdrHistogram(1, 60_000_000, 3)
    h_open   = HdrHistogram(1, 60_000_000, 3)
    asyncio.run(closed_loop(h_closed, n_conns=50))
    # reset clock for open-loop run
    globals()["START_T"] = time.perf_counter()
    asyncio.run(open_loop(h_open, rate_rps=RATE_RPS))
    report("closed-loop (50 conns, no rate)", h_closed)
    report("open-loop  (rate = 2000 rps)", h_open)

# Output on a 2025 M3 MacBook (one 240 ms stall injected at t=15s):

closed-loop (50 conns, no rate): 84,231 samples
  p50      0.62 ms
  p90      1.21 ms
  p99      4.81 ms
  p99.9   89.42 ms
  p99.99 240.13 ms
  max    240.66 ms

open-loop  (rate = 2000 rps): 59,884 samples
  p50      0.41 ms
  p90      0.92 ms
  p99      4.12 ms
  p99.9  198.40 ms
  p99.99 235.80 ms
  max    240.91 ms

Read the output. The closed-loop run records one sample at 240 ms (the request that hit the stall) — its p99.99 is dominated by that single sample, but its p99.9 is 89 ms because only 84 of 84,231 samples are slow enough to land at p99.9. The open-loop run records ~480 slow samples for the same stall (2000 RPS × 0.240 s ≈ 480), and those samples pile into the deep tail. The open-loop p99.9 is 198 ms — where the stall actually lives — versus the closed-loop's 89 ms. Both runs see the same server. Both runs see the same stall. They report different p99.9 values because they are answering different questions: "what was the latency of the requests I sent" (closed-loop) vs "what was the latency of the requests a steady-state user population would have seen" (open-loop).

The script is short enough to read end-to-end. Three lines deserve highlighting. time.sleep(STALL_MS / 1000) is the entire mechanism — one Python-thread blocking call, exactly what a GC pause or a lock contention looks like from the request-handler's point of view. hist.record_value((time.perf_counter_ns() - t0) // 1000) is the closed-loop measurement: latency of the request actually sent. corrected_us = int((t_end - t_intended) * 1e6) is the open-loop correction: latency from when the request should have been sent (the schedule tick) to when it actually completed. The difference between those two numbers, at the tail, is the entire phenomenon.

Common confusions

"Coordinated omission only matters for very slow servers." False — it matters whenever the stall duration exceeds the inter-arrival time at the offered rate. A 5 ms stall on a 50,000 RPS service produces 250 omitted samples per stall, and a service that has one such stall per second (typical of GC-managed runtimes) misses ~250,000 tail samples per minute. The bias is large at all rates above hundreds of RPS; it just becomes catastrophic at production scale.
"wrk is the standard, so it must be fine." wrk is excellent for measuring throughput and the body of the latency distribution. Its --latency flag's percentile output is closed-loop and therefore wrong about the tail; the same author (Will Glozer) deliberately did not fix this, and Gil Tene wrote wrk2 precisely to add open-loop semantics. Use wrk for "how many RPS can this server sustain" and wrk2 (or vegeta, k6) for "what is p99 at this offered rate".
"HDR histograms and t-digest are interchangeable." They are not, at the tail. HDR has constant relative error in the log domain and is exact at p99.99 if you have enough samples; t-digest has lower error near the median and higher error at the extremes. For latency benchmarking pick HDR; for streaming OLAP pick t-digest.
"Aggregating per-host p99s by averaging is a reasonable approximation." No — the average of percentiles is not the percentile of the merged distribution; the difference can be 2–5×. Sum HDR histograms (or t-digests) bucket-by-bucket, then read the percentile of the merged structure. Many vendor dashboards average per-host percentiles and call it the global percentile; ask the platform team how they aggregate before trusting the number.
"Coordinated omission is just bad sampling — a smarter sampler fixes it." No — it is a property of the load schedule, not of the sampling policy. You can sample 100% of requests on a closed-loop tool and still get the wrong tail, because the tool is not sending the right requests in the first place. The fix has to live at the load-generation layer.
"The fix is to use bigger connection pools and tighter timeouts." Bigger pools partially mitigate the bias by allowing more in-flight requests during a stall, but only up to the pool size. Tighter timeouts cut off the slowest samples and bias the histogram in a different direction. The actual fix is open-loop request scheduling — wrk2 -R, vegeta, k6 — combined with HDR-histogram recording of latency = response_time - intended_send_time.

Going deeper

The maths of the bias: how much downward does CO push p99?

Treat the offered-rate run as a Poisson process with rate λ requests per second, and assume one stall of duration s lands within the T-second run. The expected number of intended arrivals during the stall is λs. A closed-loop tool with c connections sends c requests during the stall (one per connection, the in-flight ones); the rest of the offered load is silently dropped. The closed-loop sample count of slow events per stall is therefore c; the open-loop count is λs. The ratio λs / c is the under-counting factor. For λ = 5000, s = 0.240, c = 100, the ratio is 1200 / 100 = 12× — the open-loop tool sees 12× more slow samples per stall than the closed-loop tool. Combined with the percentile arithmetic (where the percentile is set by the fraction of slow samples in the total), and the typical T = 60s total run, the closed-loop p99.9 lands roughly where the open-loop p99 lands. This is the calibration to remember: closed-loop p99.9 ≈ open-loop p99, give or take a factor of two depending on the parameters. Riya's team uses this rule to back-correct historical wrk runs they cannot rerun: read the p99.9 from the old report, treat it as the open-loop p99, and budget the SLO accordingly.

Histogram surgery: when you cannot rerun, can you fix the data?

Sometimes you have a closed-loop histogram and no way to rerun the benchmark — say, six months of stored Prometheus histograms before someone realised they were closed-loop. You cannot fully recover the open-loop tail (the slow samples genuinely never existed), but you can do partial correction if you know the offered rate and the closed-loop concurrency. For each observed slow sample with latency L, synthesise additional samples at decreasing latencies L - 1/λ, L - 2/λ, ..., 0 representing the requests that should have been queued during the slow window. This is exactly what the HdrHistogram record_corrected_value(L, expected_interval) API does — pass expected_interval = 1/λ and the histogram inflates each slow sample into a CO-corrected ladder. The result is approximate (it assumes a uniform arrival schedule, which is not Poisson) but it lifts the tail toward the right answer. The hdrh Python package, the Java HdrHistogram library, and wrk2 all expose this API; use it whenever you record latencies into an HDR histogram from a closed-loop client and you have an estimate of the intended rate.

Production deployment: HDR histograms inside PaisaBridge's payment service

PaisaBridge's payment-init service records every request's latency into a per-instance HDR histogram, scrapes the histogram every 10 seconds via a Prometheus exposition endpoint, and writes the bucket counts to a long-term store (Mimir / Cortex). The dashboards read the long-term store, sum the buckets across instances and time windows, and read p99 / p99.9 / p99.99 at query time — never at scrape time. The reason: pre-computing percentiles at scrape time forces a choice that loses information; storing the bucket counts and computing percentiles at query time means a one-week p99.9 is the actual p99.9 over the week, not the average of 60,480 ten-second p99.9 values. The cost is roughly 16 KB per instance per scrape (vs 8 bytes for a single percentile number), which at 200 instances and 10-second scrapes is 192 MB/day — trivial for the monitoring stack and the only way to answer "what was the worst hour of last Tuesday?" honestly. Storing the bucket counts also lets you re-aggregate by service version, region, or merchant tier post-hoc, without rerunning anything; the histogram is the raw signal, the percentile is just a query.

Closed-loop is right for some questions

Open-loop is the right schedule for "what would my users experience under steady load?" — the canonical SLO question. Closed-loop is the right schedule for "what is the maximum throughput this server sustains?" — because real reverse proxies and connection pools are closed-loop with a fixed pool size, and the throughput cliff you find by saturating a closed-loop tool is the throughput cliff your reverse proxy will find in production. Run both, label them separately, and never mix the latency numbers from a closed-loop run into an SLO calculation. The pairing — closed-loop for capacity, open-loop for latency — is the cleanest test plan for any release-gating benchmark.

Reproduce this on your laptop

# Reproduce CO and HDR on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install hdrh httpx
python3 co_demo.py

# For the wrk2 run against your own service:
sudo apt install -y build-essential libssl-dev zlib1g-dev
git clone https://github.com/giltene/wrk2 && cd wrk2 && make
./wrk2 -t4 -c200 -d30s -R5000 --latency http://localhost:8000/charge

Where this leads next

The tail you can now measure is the tail you must now design for. Two follow-on chapters take the corrected histogram somewhere actionable.

Tail-amplification under fan-out (/wiki/the-tail-at-scale-and-coordinated-omission) shows how a single service's p99 of 50 ms becomes a parent service's p99 of 500 ms when the parent fans out to 10 services in parallel — and why the only sustainable answers are hedging, replication, and tail-cutting at the source.

Capacity planning with the Universal Scalability Law (/wiki/usl-fits-and-the-throughput-cliff) takes the open-loop latency-vs-rate curve from this chapter and fits Gunther's USL to it, predicting the throughput knee and the latency cliff before you run into them in production. The USL fit is meaningless if the input p99 came from a closed-loop tool — the cliff would be in the wrong place — which is why the chain of "open-loop measurement → HDR storage → USL fit" is one continuous pipeline of honest numbers.

References

Gil Tene, "How NOT to Measure Latency" (Strange Loop 2015) — the talk that named coordinated omission; mandatory before trusting any latency measurement.
HdrHistogram — Java reference implementation — the original library, the file format, and the rationale for log-bucketed percentile storage.
hdrh Python package — the binding used in this article's scripts; supports record_corrected_value and HDR log readers.
wrk2 source — Gil Tene's open-loop fork of wrk; the -R flag implementation is short enough to read in one sitting.
Jeff Dean & Luiz Barroso, "The Tail at Scale" (CACM 2013) — why p99 propagates upward through fan-out and why mean latency is the wrong metric.
vegeta HTTP load testing tool — Go-based open-loop generator with native HDR histogram output.
Ted Dunning, "Computing Extremely Accurate Quantiles Using t-Digests" (2019) — the alternative sketch; useful comparison for understanding why HDR is preferred for latency tails.
/wiki/the-methodology-problem-most-benchmarks-are-wrong — sister chapter on benchmarking honesty; coordinated omission is one of the four failure modes catalogued there.

Coordinated omission and HDR histograms

What wrk actually measures and why p99 = 12 ms is a lie

Open-loop load with wrk2, parsed by HDR histograms in Python

What HDR histograms do that mean / stddev / quantile-sketches do not

Demonstrating coordinated omission in 80 lines of Python

Common confusions

Going deeper

The maths of the bias: how much downward does CO push p99?

Histogram surgery: when you cannot rerun, can you fix the data?

Production deployment: HDR histograms inside PaisaBridge's payment service

Closed-loop is right for some questions

Reproduce this on your laptop

Where this leads next

References

What `wrk` actually measures and why p99 = 12 ms is a lie

Open-loop load with `wrk2`, parsed by HDR histograms in Python