Load testing: wrk, k6, Gatling
Karan runs a load test against Razorpay's payments API the night before the Diwali freeze. He uses wrk because it is what every blog post recommends, runs wrk -t12 -c400 -d60s https://api.payments.local/charge, and the output reads Requests/sec: 48,920, Latency p99: 11.4 ms. Capacity report says the fleet is good for 4× headroom. Diwali night the fleet melts at 18,200 RPS with p99 climbing to 2.1 seconds. The autopsy finds the wrk run was lying by 3× because of coordinated omission — when the server slowed down, wrk paused too, and never recorded the responses that would have been late. The tool reported the latency it observed, not the latency a real user would see. This chapter is about the three load-test tools every Indian backend team will encounter, what each one measures honestly, what each one lies about, and how to read the output without walking into Karan's morning.
A load test reports two things: the throughput your service achieved, and the latency distribution the test observed. Most production-quality load tests get the first number right and the second number wrong, because closed-loop tools (wrk, ab, raw JMeter) suffer from coordinated omission — when the server slows, the load generator slows with it, and the slowest responses never enter the histogram. wrk2, k6, and Gatling solve this by running an open-loop generator that fires requests at a target rate regardless of server response time, and by recording latency in a CO-corrected HdrHistogram. Pick the right tool for the right shape: wrk2 for raw HTTP throughput with the lowest per-request generator overhead, k6 for scripted user journeys with realistic ramp profiles, Gatling for sustained scenario tests with detailed per-request recording. Never trust a p99 number from a tool that does not say "open-loop, constant-arrival-rate, HdrHistogram" somewhere in its docs.
Closed-loop vs open-loop — the structural choice that decides the lie
The single most important property of a load testing tool is whether it generates load closed-loop or open-loop, and most engineers using these tools have never seen the distinction stated explicitly. The choice is structural, not a configuration knob, and it determines whether the tool can produce honest tail-latency numbers.
A closed-loop generator has N concurrent virtual users. Each user fires a request, waits for the response, and then fires the next one. If the server slows from 5 ms to 500 ms per response, each virtual user sends 100× fewer requests per second. The offered load drops in lock-step with the response time. The tool sees fewer slow responses than a real production workload would generate, because the tool backed off when the server slowed. This is coordinated omission — the tool and the server have coordinated to omit measurements of the worst responses. wrk (without -R), ab, the default JMeter thread-group model, and any "concurrent users" benchmarking tool work this way.
An open-loop generator fires requests at a configured rate (say, 10,000 RPS) regardless of how fast the server responds. If the server slows from 5 ms to 500 ms, the generator keeps firing 10,000 RPS into it, and the in-flight request count balloons. The tool measures the full response time of every request, including the queue wait at the server, which is exactly what a real user experiences when the server is overloaded. wrk2, k6, Gatling, vegeta, and locust (with the constant_pacing wait_time helper) all support this mode. Tools like wrk2 were built specifically because Gil Tene's 2015 talk "How NOT to Measure Latency" documented coordinated omission as the dominant source of lying in load testing.
Why the closed-loop tool is structurally lying, not just under-reporting: the tool measures latency = response_receive_time - request_send_time. When the server slows, the tool sends fewer requests, so the histogram has fewer samples, and the worst-percentile bucket has fewer observations of slow responses. A real user does not back off when the server slows — they keep clicking. So the "real" latency distribution has many more entries in the slow buckets than the tool's distribution does. The lie is not "wrk reports a slightly low p99"; the lie is "wrk reports the latency of a workload that does not exist outside the test harness". Coordinated omission can hide 90%+ of the real tail at moderate overload, and 99%+ at severe overload.
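To see the mechanism concretely, the toy simulation below drives the same stalling server both ways and prints what each methodology records. This is a sketch, not a real load test: the single-server model, the 5 ms service time, and the 1-second stall are invented purely for illustration.

```python
# co_simulation.py — toy model: a server that takes 5 ms per request but
# freezes for 1 s partway through the run. Closed-loop sends the next request
# only after the previous response; open-loop sends on a fixed schedule.
STALL_START, STALL_END = 10.0, 11.0   # the server is frozen during [10 s, 11 s)
SERVICE_S, DURATION = 0.005, 20.0

def completion(start):
    """When a request reaching the server at `start` finishes."""
    if STALL_START <= start < STALL_END:
        start = STALL_END             # work resumes only after the stall
    return start + SERVICE_S

def closed_loop():
    lat, t = [], 0.0
    while t < DURATION:
        done = completion(t)
        lat.append(done - t)          # one slow sample for the whole stall
        t = done                      # generator waits: coordinated omission
    return lat

def open_loop(rps=150):
    lat, free_at, t = [], 0.0, 0.0
    while t < DURATION:
        start = max(t, free_at)       # queue behind earlier requests
        free_at = completion(start)
        lat.append(free_at - t)       # latency includes the queue wait
        t += 1.0 / rps                # constant arrival rate, no matter what
    return lat

def pctl(xs, q):
    xs = sorted(xs)
    return xs[int(q / 100 * (len(xs) - 1))]

for name, lat in (("closed-loop", closed_loop()), ("open-loop", open_loop())):
    print(f"{name:>11}: n={len(lat):4d}  p99={pctl(lat, 99) * 1000:7.1f} ms"
          f"  max={max(lat) * 1000:7.1f} ms")
```

The closed-loop run records the stall as a single slow sample among thousands of fast ones, so its p99 stays at 5 ms; the open-loop run records every request that queued behind the stall, and its p99 lands near the full second. That is the same asymmetry as the corrected vs uncorrected tables in the wrk2 run below.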
A runnable open-loop load test driven from Python
The right way to run a load test is from a Python driver that invokes the load tool, parses its output, and reports the CO-corrected percentiles alongside the uncorrected ones. The script below uses wrk2 because it is the smallest open-loop generator that prints a usable CO-corrected percentile table; the driver parses both tables from wrk2's stdout and prints the percentile ladder plus the throughput shortfall. (The hdrh Python package enters later, when merging histogram dumps from multiple generator hosts.) The same driver works with k6 (substitute the subprocess invocation) and Gatling (parse simulation.log instead).
```python
# loadtest_driver.py — open-loop load test of an HTTP endpoint with CO correction.
# Uses wrk2 as the generator (constant-arrival-rate) and parses the corrected
# and uncorrected percentile tables that wrk2 prints with -L / --u_latency.
# Run: python3 loadtest_driver.py https://api.payments.local/charge 5000 60
import re
import subprocess
import sys
import time

URL = sys.argv[1] if len(sys.argv) > 1 else "http://localhost:8080/health"
TARGET_RPS = int(sys.argv[2]) if len(sys.argv) > 2 else 1000
DURATION_S = int(sys.argv[3]) if len(sys.argv) > 3 else 30
THREADS = 8         # OS threads in wrk2 — should match cores on the load generator
CONNECTIONS = 256   # open TCP connections; size for target_rps × p99_seconds
SLO_P99_MS = 200.0  # what your service is committed to


def run_wrk2(url, rps, duration_s, threads, connections):
    """Invoke wrk2 in constant-arrival-rate mode with HdrHistogram output."""
    cmd = ["wrk2", "-t", str(threads), "-c", str(connections),
           "-d", f"{duration_s}s", "-R", str(rps),
           "-L",             # detailed latency stats with HdrHistogram
           "--u_latency",    # uncorrected latency too, so you can see the lie
           url]
    print(f"$ {' '.join(cmd)}")
    started = time.time()
    out = subprocess.run(cmd, capture_output=True, text=True,
                         timeout=duration_s + 60)
    return out.stdout, out.stderr, time.time() - started


def to_ms(val, unit):
    """Convert a wrk2 latency value (s, ms, us, m) to milliseconds."""
    return val * {"s": 1000.0, "ms": 1.0, "us": 0.001, "m": 60_000.0}.get(unit, 1.0)


def parse_wrk2_percentiles(stdout):
    """Extract the corrected and uncorrected percentile tables from wrk2 stdout.

    With --u_latency, wrk2 prints the CO-corrected table first and the
    uncorrected table second; switch targets when the uncorrected header appears.
    """
    corrected, uncorrected = {}, {}
    target = corrected
    for line in stdout.splitlines():
        if "Uncorrected Latency" in line:
            target = uncorrected
            continue
        m = re.match(r"\s*([\d.]+)%\s+([\d.]+)(\w+)", line)
        if m:
            pct, val, unit = float(m.group(1)), float(m.group(2)), m.group(3)
            target[pct] = to_ms(val, unit)
    return corrected, uncorrected


stdout, stderr, elapsed = run_wrk2(URL, TARGET_RPS, DURATION_S, THREADS, CONNECTIONS)
print(stdout[-1800:])  # last bit of output — has the percentile tables

# Extract Requests/sec achieved vs target.
m = re.search(r"Requests/sec:\s+([\d.]+)", stdout)
achieved_rps = float(m.group(1)) if m else 0.0
corrected, uncorrected = parse_wrk2_percentiles(stdout)
corrected_p99 = corrected.get(99.0)
uncorrected_p99 = uncorrected.get(99.0)

print("\n=== SUMMARY ===")
print(f"Target RPS: {TARGET_RPS}")
print(f"Achieved RPS: {achieved_rps:.0f} (gap = throughput shortfall)")
print(f"p99 (corrected): {corrected_p99} ms")
print(f"p99 (uncorrected): {uncorrected_p99} ms")
if corrected_p99 and uncorrected_p99:
    print(f"Coordinated omission factor: {corrected_p99 / uncorrected_p99:.1f}x")
print(f"SLO target p99: {SLO_P99_MS} ms")
print(f"VERDICT: {'BREACH' if (corrected_p99 or 1e9) > SLO_P99_MS else 'within SLO'}")
```
Sample run against a backend deliberately overloaded by setting TARGET_RPS above the service's measured knee:
```
$ wrk2 -t 8 -c 256 -d 60s -R 5000 -L --u_latency https://api.payments.local/charge
Running 1m test @ https://api.payments.local/charge
  8 threads and 256 connections
  Thread Stats   Avg      Stdev     Max
    Latency    142.30ms  220.10ms    2.18s
    Req/Sec    612.43     78.21    812
  Latency Distribution (HdrHistogram - Recorded Latency)
   50.000%   12.20ms
   75.000%   28.40ms
   90.000%  180.20ms
   99.000%  780.40ms
   99.900%    1.62s
   99.990%    2.10s
  Latency Distribution (HdrHistogram - Uncorrected Latency)
   50.000%    8.10ms
   75.000%   11.30ms
   90.000%   18.20ms
   99.000%   42.10ms
   99.900%   88.40ms
   99.990%  124.20ms
  295834 requests in 1.00m, 142.21MB read
Requests/sec:   4930.57
Transfer/sec:      2.37MB

=== SUMMARY ===
Target RPS: 5000
Achieved RPS: 4931 (gap = throughput shortfall)
p99 (corrected): 780.4 ms
p99 (uncorrected): 42.1 ms
Coordinated omission factor: 18.5x
SLO target p99: 200.0 ms
VERDICT: BREACH
```
Walking the key lines. -R 5000 is the load-bearing flag: it tells wrk2 to fire 5000 requests/second open-loop, regardless of how fast the server responds. Without -R, wrk2 reverts to closed-loop mode and the histogram becomes a lie. -L turns on the HdrHistogram-corrected percentile output, which is the only honest column in the result table. --u_latency prints the uncorrected histogram alongside, so you can see the size of the lie — in the sample above, the uncorrected p99 is 42 ms while the corrected p99 is 780 ms, an 18.5× understatement that would mislead any capacity report. Achieved RPS: 4931 vs Target RPS: 5000 is the throughput-shortfall signal: when the achieved rate falls below the target, wrk2 was unable to send fast enough because the server's slow responses backed pressure into the connection pool. A 5–10% shortfall is normal noise; a 30%+ shortfall means the test exceeded the server's capacity to even accept connections, and the latency numbers should be interpreted as "service rejecting load" rather than "service serving load slowly". Coordinated omission factor: 18.5x is the multiplier between the lie a closed-loop tool would tell and the truth this open-loop tool measured — display this number in every load-test report you publish, because it forces the reader to confront the fact that closed-loop numbers are not even approximately right.
Why HdrHistogram is required (not just any histogram): a naive linear-bin histogram with 1ms buckets covering 0–10s requires 10000 buckets and loses precision below 1ms. A naive log-bin histogram covers the range cheaply but loses precision to the bucket width at high values. HdrHistogram uses a hybrid scheme — log-spaced buckets sub-divided into linear sub-buckets — that gives configurable precision (e.g. 3 significant figures across 1µs–60s) with bounded memory (~16KB). It also correctly handles the merging of multiple histograms (essential for distributed load generation across many machines) and supports the record_corrected_value(value, expected_interval) API that adds back the missing samples coordinated omission would have dropped. Every honest latency tool in production — Tene's wrk2, k6, Gatling, Cassandra, Kafka, Envoy, Linkerd — uses HdrHistogram or an equivalent.
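The correction API is worth seeing in action. The sketch below feeds the same stream, 9,999 fast responses plus one 2-second stall, into a raw histogram and a CO-corrected one. It assumes the hdrh package (pip install hdrh) and latencies recorded in microseconds; the numbers are illustrative.

```python
# hdr_co_demo.py — HdrHistogram's coordinated-omission correction, in miniature.
from hdrh.histogram import HdrHistogram

EXPECTED_INTERVAL_US = 1000   # open-loop plan: one request every 1 ms

raw = HdrHistogram(1, 60_000_000, 3)         # track 1 µs .. 60 s, 3 sig. figures
corrected = HdrHistogram(1, 60_000_000, 3)

# 9,999 fast responses (500 µs) and one 2-second stall.
for v in [500] * 9_999 + [2_000_000]:
    raw.record_value(v)
    # record_corrected_value back-fills the samples a constant-rate sender
    # would have issued (and seen delayed) during the stall.
    corrected.record_corrected_value(v, EXPECTED_INTERVAL_US)

for name, h in (("raw", raw), ("CO-corrected", corrected)):
    print(f"{name:>13}: n={h.get_total_count():6d}  "
          f"p99={h.get_value_at_percentile(99.0) / 1000:8.1f} ms  "
          f"max={h.get_max_value() / 1000:8.1f} ms")
```

The corrected histogram gains roughly 2,000 back-filled samples spanning the stall, which is exactly the population a closed-loop generator silently drops; the raw p99 stays at 0.5 ms while the corrected p99 lands in the seconds.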
The three tools in practice — wrk2, k6, Gatling
The three tools cover three different needs, and choosing the wrong one for the job produces tests that pass but fail to predict production. The summary below comes from Razorpay's load-testing playbook (revised after the 2024 Diwali postmortem), Zerodha Kite's pre-market open testing, and Hotstar's IPL pre-season capacity drills.
wrk2 is the right tool when the question is raw HTTP throughput at a single endpoint. It is a tiny C program with a Lua scripting hook for request customisation, a constant-arrival-rate generator, and HdrHistogram output. It can saturate a 100 Gbps NIC from a single host with 50 µs of overhead per request — by far the leanest generator. The downside is that the Lua scripting model is awkward for multi-step user journeys (login → fetch profile → place order → check status); the moment you need state across requests, the Lua code becomes a liability. Use wrk2 when you are testing a single API endpoint's capacity ceiling, when you need the highest possible request rate from each load-generator host, or when you are reproducing a benchmark from a paper that quotes wrk2 numbers.
k6 is the right tool when the test is a scripted user journey with realistic ramp profiles. It is a Go binary with a JavaScript scripting layer (running on Goja, a JS interpreter embedded in Go), an open-loop scenario engine, and detailed thresholds-based pass/fail. The JavaScript scripting model makes multi-step journeys readable: http.post(...), check(response, {...}), sleep(0.5), all inside a default function that runs per virtual user. The scenarios system (constant-arrival-rate, ramping-arrival-rate, per-vu-iterations, shared-iterations) lets you express realistic load shapes — the IRCTC Tatkal pattern is ramping-arrival-rate with three stages ({duration: '1m', target: 3000}, {duration: '10s', target: 90000}, {duration: '90s', target: 90000}). Use k6 when you are testing a multi-step user flow, when you need to model realistic ramp shapes, or when the test needs to integrate with CI/CD via the threshold-based exit codes.
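As a concrete sketch of the k6 side, a Python CI driver can write the JS scenario to a temp file, run it, and read back the exported summary. Hedged assumptions: k6 is on the PATH, the flag and option names match a recent k6 release, and the burst is scaled down from the IRCTC shape so a laptop can generate it.

```python
# k6_driver.py — drive a k6 ramping-arrival-rate journey from Python.
import json
import subprocess
import tempfile

K6_SCRIPT = """
import http from 'k6/http';
import { check } from 'k6';

export const options = {
  scenarios: {
    tatkal_burst: {
      executor: 'ramping-arrival-rate',
      startRate: 100, timeUnit: '1s', preAllocatedVUs: 2000,
      stages: [
        { duration: '1m',  target: 3000 },  // baseline
        { duration: '10s', target: 9000 },  // burst (Tatkal shape, scaled down)
        { duration: '90s', target: 9000 },  // sustained peak
      ],
    },
  },
  summaryTrendStats: ['avg', 'p(95)', 'p(99)', 'max'],
  thresholds: { http_req_duration: ['p(99)<200'] },
};

export default function () {
  const res = http.get(__ENV.TARGET_URL);
  check(res, { 'status 200': (r) => r.status === 200 });
}
"""

with tempfile.NamedTemporaryFile("w", suffix=".js", delete=False) as f:
    f.write(K6_SCRIPT)
    script = f.name

proc = subprocess.run(["k6", "run", "--summary-export", "summary.json",
                       "--env", "TARGET_URL=http://localhost:8080/", script])
p99 = json.load(open("summary.json"))["metrics"]["http_req_duration"]["p(99)"]
print(f"p99 = {p99:.1f} ms; exit code {proc.returncode} (non-zero = threshold breach)")
```

The threshold-to-exit-code mapping is what makes the CI integration trivial; the continuous-testing section below builds on exactly this.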
Gatling is the right tool when the test is a long sustained scenario with detailed per-request recording for postmortem analysis. It is a Scala-based tool with a DSL that reads almost like English (scenario("checkout").exec(http("login").post("/login")).pause(1).exec(...)), an open-loop injection profile (constantUsersPerSec, rampUsersPerSec, atOnceUsers), and an HTML report generator that produces clickable percentile-vs-time plots, per-request error breakdowns, and per-step latency contributions. The downside is JVM startup overhead (~5 seconds, irrelevant for long tests but noisy for CI), and the Scala-heavy build (sbt or Maven) adds friction for teams without JVM expertise. Use Gatling for the 2-hour Diwali sustained-peak test, the 4-hour Big Billion Days simulation, or any test where the postmortem requires a rich HTML report with drill-down per-step latency contributions.
| Tool | Generator language | Open-loop? | HdrHistogram | Best for | Per-host RPS ceiling |
|---|---|---|---|---|---|
| wrk2 | C + Lua | Yes (-R) | Yes (-L) | single-endpoint capacity | ~1.5M RPS |
| k6 | Go + JavaScript | Yes (scenarios) | Yes (built-in) | scripted user journeys | ~80K RPS |
| Gatling | Scala DSL | Yes (injection) | Yes (built-in) | sustained scenarios + reports | ~50K RPS |
| wrk (no -R) | C + Lua | No (closed-loop) | No | nothing — never use for SLO claims | ~2M RPS (lying) |
| ab (Apache Bench) | C | No (closed-loop) | No | smoke tests only, never SLOs | ~50K RPS (lying) |
| vegeta | Go | Yes (-rate) | Yes (-output hdr) | quick ad-hoc constant-rate tests | ~200K RPS |
| locust | Python | Yes (constant_pacing) | No (raw histogram) | quick scripting in Python shops | ~10K RPS / worker |
Why per-host RPS ceilings matter for distributed load generation: a single load-generator host has finite TCP socket slots (~64K ephemeral ports per source IP), finite kernel send/receive buffer memory, and finite NIC packet-rate (typically 1.5–3M packets/sec for a 25 Gbps NIC). To generate 500K RPS sustained, you need 4–10 generator hosts running in parallel, each binding a different source IP (or different ephemeral port range) to avoid TIME_WAIT exhaustion. Razorpay's pre-Diwali load test uses 16 c6i.4xlarge generator hosts, each at ~30K RPS, totalling ~480K RPS — close to peak Diwali traffic but still 4× below the per-host ceiling because the headroom prevents the generator itself from becoming the bottleneck and confounding the measurement.
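The arithmetic behind a fleet-sizing decision fits in a few lines. The sketch below is a back-of-envelope calculator; the per-host ceiling is an assumed measured value, so substitute your own.

```python
# fleet_sizing.py — back-of-envelope load-generator fleet sizing.
TARGET_RPS       = 500_000   # fleet-wide offered load
PER_HOST_CEILING = 120_000   # assumption: measured max RPS of one generator host
HEADROOM         = 4         # run each host at 1/4 of its ceiling (see text)
WORST_P99_S      = 0.5       # slowest response the test must absorb in flight

hosts = -(-TARGET_RPS * HEADROOM // PER_HOST_CEILING)   # ceiling division
per_host_rps = TARGET_RPS / hosts
# Little's law (L = lambda x W): in-flight requests each host must hold.
in_flight = per_host_rps * WORST_P99_S
print(f"{hosts} hosts at {per_host_rps:,.0f} RPS each; "
      f"connection pool >= {in_flight:,.0f} per host")
```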
Designing a load test that predicts production — five rules
The right load test does not just produce a number; it produces a number that predicts what will happen in production. The five rules below are what separates a test that catches the Diwali-night regression from a test that gives the team false confidence.
Rule 1: Match the production load shape, not just the magnitude. A 60-second flat-rate test at 50K RPS does not predict what happens during a Tatkal-style burst (3K RPS baseline → 90K RPS over 90 seconds → 3K RPS baseline). The burst shape exhausts connection pools, triggers cold JIT compilation paths, and cascades into autoscaler decisions that the flat-rate test never exercises. Use k6's ramping-arrival-rate or Gatling's rampUsersPerSec to reproduce the actual ingest curve from your last production peak. Razorpay's Diwali drill replays the previous year's per-second offered-load curve at 1.3× scale; the replay catches every regression that a flat-rate test misses.
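A sketch of the replay idea, assuming (hypothetically) a CSV export of last year's curve with one epoch_second,rps row per second; adapt the loader to whatever your metrics store actually emits.

```python
# curve_to_stages.py — turn a recorded per-second RPS curve into k6 stages.
import csv
import json
import sys

SCALE, BUCKET_S = 1.3, 60   # replay at 1.3x, one k6 stage per recorded minute

rps = [float(row[1]) for row in csv.reader(open(sys.argv[1]))]
stages = [{"duration": f"{len(chunk)}s", "target": int(max(chunk) * SCALE)}
          for chunk in (rps[i:i + BUCKET_S] for i in range(0, len(rps), BUCKET_S))]
print(json.dumps({"stages": stages}, indent=2))   # paste into the k6 scenario
```

Taking the per-minute max rather than the mean keeps the replay from smoothing away the very bursts the test exists to reproduce.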
Rule 2: Test from outside the same network, not from the same host. A load test running on the same Kubernetes cluster as the service-under-test bypasses the load balancer, the WAF, the rate limiter, the TLS termination, and the API gateway. Each of those is a potential bottleneck in production, and each contributes to the latency distribution a real user sees. Run the load generator from a separate VPC (ideally a different region — ap-southeast-1 generators against ap-south-1 services), and verify that the test traffic traverses the same path as production user traffic. The 30 ms of network round-trip that this adds is part of the SLO; the test should measure it.
Rule 3: Warm the system before measuring. A cold JVM has not yet JIT-compiled the hot paths. A cold connection pool has not yet established its TLS handshakes. A cold DNS cache will resolve every request from scratch. The first 30–60 seconds of any load test produce latency numbers that no production user ever experiences. Discard the warmup window explicitly — k6 supports discardResponseBodies: true and tags: { phase: 'warmup' }; Gatling supports nothingFor(60.seconds) followed by the real injection. Reporting "p99 = 180 ms" for a test that included its own cold-start is the second-most-common form of load-test lying after coordinated omission.
Rule 4: Drive load until you find the cliff, not just until you hit your target. A test that confirms "the service handles 50K RPS at p99 = 80 ms" tells you nothing about what happens at 60K, 70K, or 90K. The valuable test is the one that ramps load from 10K to 200K over 30 minutes and finds the exact RPS at which p99 crosses SLO — that is the operational headroom number that capacity planning needs. k6's ramping-arrival-rate with stages: [{duration: '30m', target: 200000}] produces the latency-vs-load curve directly, and the SLO breach point reads off the chart. Razorpay's pre-Diwali drill always includes a "find the cliff" stage; Hotstar runs it weekly during IPL season.
Rule 5: Always report achieved RPS alongside latency. A test that targets 50K RPS but only achieves 32K RPS is reporting latency for a workload at 32K, not 50K. The reader who sees only "p99 = 80 ms at 50K RPS" walks away thinking the service handles 50K. The reader who sees "p99 = 80 ms at 50K RPS target / 32K RPS achieved" knows the service collapsed at 32K. Make the gap unmissable. The driver script in the previous section explicitly prints both numbers and a "throughput shortfall" signal — copy that pattern into every load-test report your team produces.
| Rule | What it catches | Tool feature to use |
|---|---|---|
| 1: Match load shape | burst-only failure modes (Tatkal, IPL toss) | k6 ramping-arrival-rate, Gatling rampUsersPerSec |
| 2: Test from outside | LB / WAF / gateway bottlenecks | separate-VPC generator hosts |
| 3: Warm before measuring | cold-JIT, cold-pool, cold-DNS noise | k6 phase tags, Gatling nothingFor |
| 4: Find the cliff | operational headroom number | ramping load past expected peak |
| 5: Report achieved RPS | hidden throughput shortfall | wrk2's Requests/sec, k6's iteration_duration |
Edge cases that break load tests
Three edge cases produce load-test results that pass but do not predict production. Each surfaces in real Indian production drills, each is invisible until you look for it, and each invalidates the SLO claims a team made from the test.
Connection reuse vs production fan-out. A load test from 16 generator hosts establishes 16 × 256 = 4096 long-lived TCP connections to the service. Real production traffic from millions of mobile clients arrives over a much wider distribution of connections — many of which do TLS handshake on every request because the client app does not maintain a connection pool. The test's per-request CPU cost on the server is CPU(handler); the production per-request cost is CPU(handler) + CPU(TLS handshake), and the TLS handshake on a fresh ECDHE-RSA negotiation costs 1–4 ms of CPU plus 1–2 round-trips. A test with high connection reuse can underestimate server CPU by 30–60% relative to production. Fix: configure the load generator to cycle connections (k6's noConnectionReuse: true, wrk2's Lua wrk.headers["Connection"] = "close" per request) for a realistic fraction of the test, calibrated to the connection-reuse rate observed in production.
Test data correlation that production does not have. A test that POSTs the same payload 50,000 times benefits from CPU caches, query plan caches, and database row caches that production never gets. A test that queries the same user_id repeatedly hits a Redis cache row that production users do not share. Production traffic is uniformly distributed across millions of users / orders / payment IDs, with low temporal locality. Fix: the load generator must read its test inputs from a large-cardinality dataset (millions of distinct payloads, IDs, headers) and either round-robin or randomly sample. Razorpay's pre-Diwali test uses a 50-million-row CSV of synthetic payment payloads; without it, the test underestimates database load by 4–8×.
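A minimal generator for such a dataset follows; the field names and value ranges are invented for illustration. k6 can stream the result through a SharedArray, and wrk2 through a Lua request hook.

```python
# gen_payloads.py — emit a high-cardinality CSV of synthetic payment payloads.
import csv
import random
import string
import sys

N = int(sys.argv[1]) if len(sys.argv) > 1 else 1_000_000

writer = csv.writer(sys.stdout)
writer.writerow(["payment_id", "user_id", "amount_paise", "vpa"])
for i in range(N):
    writer.writerow([
        f"pay_{i:012d}",                                    # unique per row
        random.randrange(50_000_000),                       # ~50M distinct users
        random.randrange(100, 500_000),                     # Rs 1 .. Rs 5,000
        "".join(random.choices(string.ascii_lowercase, k=8)) + "@upi",
    ])
```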
Server-side connection limits that cap the test before the cliff. A service's listening socket has a SOMAXCONN (often 4096) and a per-process file descriptor limit (often 1024 by default, 64K with ulimit -n). The load test ramping past these limits gets connection refused errors, not slow responses, and the test wraps up with a deceptively low p99 ("most requests were fast"). The latency distribution in this regime is the latency of the requests that got through — a heavily filtered subset that excludes everything the OS dropped at the kernel level. Fix: monitor server-side ss -s | grep TCP: (especially the synrecv count and the accept queue overflow counter from nstat -s | grep ListenOverflows) during the test, and treat any non-zero overflow as a sign that the test exceeded the server's accept-queue capacity rather than the service's processing capacity.
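A watcher along these lines can run beside the test and flag the regime change. It is Linux-only and relies on the iproute2 nstat tool and the kernel's TcpExtListenOverflows counter; treat the flag spelling as an assumption to verify against your nstat version.

```python
# accept_queue_watch.py — flag kernel accept-queue overflows during a test.
import re
import subprocess
import time

def listen_overflows():
    # -a absolute counters, -z include zeros, -s don't update nstat's history
    out = subprocess.run(["nstat", "-a", "-z", "-s", "TcpExtListenOverflows"],
                         capture_output=True, text=True).stdout
    m = re.search(r"TcpExtListenOverflows\s+(\d+)", out)
    return int(m.group(1)) if m else 0

base = listen_overflows()
while True:
    time.sleep(5)
    delta = listen_overflows() - base
    verdict = "ok" if delta == 0 else f"OVERFLOWING (+{delta}), results suspect"
    print(f"{time.strftime('%H:%M:%S')} accept-queue overflows since start: {delta} {verdict}")
```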
Why these edge cases share a structural shape: each one is a measurement infrastructure artefact that is invisible in the test output but visible in the operational dashboard of a real production system. The discipline is to treat the load test as a system whose own performance must be characterised — what is the test's connection-reuse rate, what is its data-cardinality, what is its accept-queue behaviour — and to compare each of these to the corresponding production characteristic. A load test whose infrastructure characteristics differ from production by more than 20% on any axis is structurally incapable of predicting production behaviour, no matter how careful the latency measurement is.
Common confusions
- "
wrkandwrk2are basically the same tool." They share authorship but are structurally different.wrkis a closed-loop generator with raw output;wrk2iswrkplus the-Rflag for constant-arrival-rate (open-loop) plus-Lfor HdrHistogram-corrected latency. Tene wrotewrk2specifically to fix the coordinated-omission bug inwrk. Usingwrkfor SLO-relevant measurements is the single most common mistake in Indian load-testing practice — half the load-test reports in production wikis citewrknumbers and confidently claim p99 = 12 ms when the real number is 200 ms+. - "More virtual users means more load." With closed-loop tools yes, with open-loop tools no. In open-loop mode, the generator fires at a configured RPS regardless of VU count; the VU count just sizes the connection pool. Setting VU = 10000 in k6 with
arrival-rate = 1000 RPSdoes not produce 10000 RPS — it produces 1000 RPS with 10000 connections available to absorb in-flight requests. Confusing the two is how teams report "we tested with 10000 users" when they actually tested with 1000 RPS. - "Higher concurrency in the test means tougher test." No — a test at high concurrency but low RPS is testing your connection-handling ceiling, not your throughput ceiling. The right axis is RPS at the target latency distribution, not concurrency. Concurrency is a consequence of RPS × latency by Little's Law (
L = λ × W); pinning concurrency and varying it does not vary the load shape coherently. - "A load test that passes once means the SLO is safe." A 5-minute test at 2× peak does not catch the failure modes that surface only after 30 minutes (slow connection-pool leaks, JVM old-gen growth, file-descriptor leaks, log shipper buffer overflow). Always include at least one duration test that runs for the full real peak duration (2 hours for Diwali, 4 hours for BBD opening), even if the magnitude is lower than peak.
- "Coordinated omission only matters at high overload." It matters at any server slowdown, including normal load when individual requests occasionally take 50× longer than the median. A single 500 ms response in a test stream of 5 ms responses produces a 100× CO factor on the few requests around it. The honest p99 on a moderately loaded service is typically 1.5–4× higher than the closed-loop tool reports — small enough to slip past review, large enough to invalidate the capacity report.
- "Gatling reports are more accurate because they're prettier." The HTML report quality has nothing to do with the latency-measurement methodology. Gatling, k6, and wrk2 all produce comparable accuracy when configured open-loop with HdrHistogram. The Gatling report is more useful for postmortem drill-down, but a wrk2 run with
-Land a custom Python parser produces the same underlying numbers.
Going deeper
Distributed load generation — when one host cannot saturate the service
Beyond about 200K RPS, a single load-generator host runs out of TCP source ports, kernel buffer memory, or NIC packet-rate. The standard solution is to run N identical generator hosts in parallel and aggregate their results. The aggregation is non-trivial: you must merge the per-host HdrHistograms (sum of counts per bucket, then recompute percentiles), not average the per-host p99 values — averaging percentiles is mathematically meaningless. k6 supports this natively via k6 Cloud or with Grafana's xk6-output-prometheus-remote extension. Gatling supports it via Gatling Enterprise. For wrk2, the standard approach is to write each host's HdrHistogram to disk (-W flag in some wrk2 forks) and merge them in Python with hdrh.histogram.HdrHistogram.add(). Razorpay's pre-Diwali test uses 16 hosts with a Python aggregator script that produces a single merged percentile table at the end — the script is about 60 lines, mostly subprocess orchestration.
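A sketch of the merge step follows. It assumes each host wrote its histogram using hdrh's base64 encode() to a hostN.hdr file; adapt the loader to whatever your wrk2 fork actually emits.

```python
# merge_histograms.py — fleet-wide percentiles from per-host HdrHistograms.
import glob
from hdrh.histogram import HdrHistogram

merged = HdrHistogram(1, 60_000_000, 3)       # 1 µs .. 60 s, 3 sig. figures
for path in sorted(glob.glob("host*.hdr")):
    encoded = open(path).read().strip()
    merged.add(HdrHistogram.decode(encoded))  # sums counts bucket by bucket

# Percentiles are recomputed from the merged buckets; never average p99s.
for pct in (50.0, 90.0, 99.0, 99.9, 99.99):
    print(f"p{pct:<5} {merged.get_value_at_percentile(pct) / 1000:10.2f} ms")
print(f"total samples: {merged.get_total_count()}")
```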
vegeta — the underrated alternative
vegeta (Go, by Tomás Senart) is an open-loop constant-rate generator with HdrHistogram output, much simpler to script than wrk2 (no Lua) and faster to set up than k6 (no JavaScript runtime). It reads a list of HTTP requests from stdin (one per line, METHOD URL), fires them at the rate set by -rate, and produces a binary .bin log that vegeta report parses into percentiles or vegeta plot renders as a latency-vs-time chart. For ad-hoc tests where you need a quick honest p99 number against a few endpoints, vegeta is faster to deploy than k6 and more honest than wrk. The reason it is less popular than k6 is that it lacks the scripted-journey scenario engine — every request must be self-contained, which precludes the multi-step user flows k6 expresses cleanly.
The "load model" — what your test is actually simulating
A load test simulates a load model: a set of user types, their action distributions, their think times, and their session durations. Most teams do not write down the model explicitly, which is why their tests do not predict production. The discipline is to derive the model from production observability — typical user types (returning customer, new signup, browsing-only), action mix per type (45% cart-add, 30% search, 25% checkout for the returning-customer type), think time distribution (exponential with median 8 seconds), session duration (lognormal, median 12 minutes, p99 45 minutes) — and encode it in the load script. Flipkart's BBD drill builds this model from the previous year's session-replay logs and replays it at 1.4× scale; the drill catches scenario-specific regressions that a uniform-RPS test never sees because uniform RPS does not match how real users actually use the service.
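Writing the model down as data makes the gap visible. A sketch, using the illustrative distributions and mixes from the paragraph above:

```python
# load_model.py — an explicit load model, encoded as data and sampled.
import math
import random

USER_TYPES = {  # type: (population weight, action mix)
    "returning": (0.55, {"cart_add": 0.45, "search": 0.30, "checkout": 0.25}),
    "new_signup": (0.25, {"signup": 0.50, "search": 0.35, "cart_add": 0.15}),
    "browser": (0.20, {"search": 0.70, "product_view": 0.30}),
}

def sample_session():
    names = list(USER_TYPES)
    utype = random.choices(names, weights=[USER_TYPES[n][0] for n in names])[0]
    mix = USER_TYPES[utype][1]
    # Session duration: lognormal with median 12 min (mu = ln(720 s)).
    duration_s = random.lognormvariate(math.log(720), 0.9)
    actions, t = [], 0.0
    while t < duration_s:
        actions.append(random.choices(list(mix), weights=list(mix.values()))[0])
        t += random.expovariate(math.log(2) / 8.0)   # think time, median 8 s
    return utype, actions

utype, actions = sample_session()
print(utype, f"{len(actions)} actions:", actions[:6], "...")
```

Each sampled session becomes one virtual-user script; the load generator replays thousands of them concurrently instead of a uniform RPS stream.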
Continuous load testing — running the test in CI
A load test that runs once a quarter is operationally useless for catching regressions. The right cadence is continuous: every PR that touches a hot path runs a 60-second k6 test against a staging environment, and the CI pipeline fails the PR if p99 regresses by more than 10% from the previous baseline. k6's threshold system (thresholds: { http_req_duration: ['p(99)<200'] }) makes this trivial to express, and the exit code maps directly to CI pass/fail. Razorpay added this in 2024 and it caught three production-bound p99 regressions in the first month — each one a single PR introducing a previously-uncached database lookup on the hot path. The compute cost is about ₹400/month per service per environment, repaid many times over by a single avoided Diwali-night incident.
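The gate itself is a few lines of Python. This sketch assumes the summary.json produced by the k6 driver shown earlier and a baseline.json checked into the repo; both file formats are assumptions of the sketch, not a fixed k6 contract.

```python
# ci_gate.py — fail the PR if p99 regresses more than 10% from baseline.
import json
import sys

ALLOWED = 1.10   # 10% regression budget

current = json.load(open("summary.json"))["metrics"]["http_req_duration"]["p(99)"]
baseline = json.load(open("baseline.json"))["p99_ms"]

print(f"baseline p99 = {baseline:.1f} ms, this run = {current:.1f} ms")
if current > baseline * ALLOWED:
    print(f"FAIL: p99 regressed {current / baseline - 1:+.0%}")
    sys.exit(1)                       # CI reads the exit code
print("OK: within regression budget")
```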
Reproduce this on your laptop
```bash
# Install wrk2 and the Python parser
sudo apt install build-essential libssl-dev libz-dev git
git clone https://github.com/giltene/wrk2 && cd wrk2 && make && sudo cp wrk /usr/local/bin/wrk2
python3 -m venv .venv && source .venv/bin/activate
pip install hdrh
# Start a tiny test server (or point at a real one)
python3 -m http.server 8080 &
sleep 2
# Run the load test driver against it
python3 loadtest_driver.py http://localhost:8080/ 1000 30
```
You should see a CO-corrected percentile table from wrk2, plus the summary block printed by the Python driver. Edit TARGET_RPS upward until the achieved RPS falls below the target — that is the test server's cliff. Compare the corrected and uncorrected p99 columns to see the size of the coordinated-omission lie.
Where this leads next
This chapter is the load-testing foundation for the rest of Part 14. The next chapters build the full capacity-and-resilience pipeline.
- /wiki/headroom-peak-and-degraded-modes — the previous chapter; the headroom calculations this chapter's measurements feed into.
- /wiki/coordinated-omission-and-hdr-histograms — the deep dive into HdrHistogram internals and the CO-correction algorithm.
- /wiki/chaos-under-load — combining the load tests in this chapter with failure injection, to find the degraded-mode bugs that only surface under stress.
- /wiki/load-shedding-strategies — what your service does when the load test exceeds capacity; the patterns that turn cliff-edge load into graceful degradation.
- /wiki/autoscaling-metric-based-predictive — the autoscaler that the load test should also exercise; its reaction time matters as much as its trigger thresholds.
The closing rule: a load test reports two numbers — throughput and latency — and the latency number is wrong unless the tool is open-loop, the histogram is HdrHistogram, and the measurement window excludes the warmup. Hold those three properties together and the load-test result becomes a prediction about production. Skip any one and the result is a number that passes review and fails Diwali.
References
- Gil Tene, "How NOT to Measure Latency" (Strange Loop 2015) — the canonical talk that defines coordinated omission; required viewing before running any load test.
- wrk2 source and docs (Gil Tene's GitHub) — the constant-arrival-rate fork of wrk with HdrHistogram-corrected latency.
- k6 documentation, open-loop scenarios — the open-loop scenario engine that produces honest tail-latency numbers under realistic ramp shapes.
- Gatling injection profiles — the injection model that makes Gatling open-loop by default.
- HdrHistogram (Tene, original implementation) — the histogram data structure every honest latency tool uses.
- Brendan Gregg, Systems Performance (2nd ed., 2020), Chapter 12 — Benchmarking — the methodology framework for any benchmark, including load tests.
- Jeff Dean & Luiz Barroso, "The Tail at Scale" (CACM 2013) — why the latency distribution matters more than the average, and why your load test must measure it accurately.
- /wiki/coordinated-omission-and-hdr-histograms — the internal cross-link to the HdrHistogram + CO-correction deep dive.