Measuring language runtimes fairly

Aditi runs platform engineering at a payments startup in Bengaluru. Her team is deciding whether to rewrite a 14-service Python fleet in Go to "hit the p99". She finds three benchmark blog posts on the first page of Google. The first says Go is 12× faster than Python on JSON parsing. The second says Python is 1.3× faster on the same workload "with PyPy and orjson". The third compares wall-clock on a Fibonacci function and reports Rust at 47× ahead. Aditi already knows Fibonacci has nothing to do with her workload. What she does not know is that all three benchmarks are subtly broken in the same three ways — they ran the JIT cold, they measured a single iteration so steady-state never appeared, and they compared peak throughput when her actual problem is tail latency under a 200 ms SLO. The honest answer to "should we rewrite?" requires a measurement methodology none of those posts use, and the cost of getting it wrong is six engineer-quarters of work for a 1.4× speedup nobody can defend in a postmortem.

Language-runtime benchmarks lie by default. JIT warmup, GC pause inclusion, coordinated omission in load generators, and choice of "throughput vs latency" each shift the result by 2–10× independently. A fair comparison fixes the workload (real production traffic shape, not Fibonacci), warms each runtime past steady state, measures latency with HdrHistogram under a constant-rate open loop, and reports p50/p99/p99.9 alongside cycles-per-request. Anyone quoting a single "X is N times faster than Y" without all four has not run a fair experiment.

What "the runtime" actually does between two print statements

A naive benchmark assumes your program is the only thing running. Between any two adjacent statements in CPython, JVM, V8, .NET, or Go, the runtime is doing work the source code does not show: dispatching bytecode, checking reference counts, deciding whether to compile a hot method, walking the heap to find unreachable objects, releasing the GIL, swapping a goroutine onto another OS thread, recompiling a method whose inline cache went megamorphic. Every one of these has a cost that varies between iterations 1 and 1000 of the same loop. A "benchmark" that runs the loop once measures the runtime's setup; a benchmark that runs it 10⁹ times measures the steady state but hides the jitter.

Each runtime has a different shape of "what you're really paying for". CPython spends 20–60% of cycles on bytecode dispatch and reference-count atomics — the actual computation is often a smaller share than the interpreter's overhead. A JVM service spends the first 30–90 seconds in interpreted/C1 mode and only reaches steady-state C2-compiled performance after the JIT has watched enough method invocations; benchmarks that run for 5 seconds measure C1, not C2, and report numbers 2–4× slower than production sees. V8 (Node.js) starts in Ignition (interpreter), promotes hot functions to Sparkplug (baseline JIT), then to TurboFan (optimising JIT) — and de-optimises back to Ignition the moment a hidden-class assumption breaks. Go is AOT-compiled but runs a concurrent garbage collector that consumes 10–25% of available CPU on allocation-heavy workloads, plus the goroutine scheduler does work proportional to the number of channel operations. .NET has tiered compilation (T0 → T1) similar to V8.
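The warmup effect is observable from inside the process itself. A minimal sketch, with `work` as an invented stand-in hot path rather than a real workload: batch the hot call and record per-batch throughput. On CPython the curve is roughly flat; on a JIT runtime such as PyPy the first batches are measurably slower.

```python
import time

def work(n=1000):
    # Stand-in hot path: tight integer arithmetic, the interpreter's worst case.
    s = 0
    for i in range(n):
        s += i * i
    return s

def throughput_curve(batches=8, batch_size=1000):
    """Calls per second for each successive batch of batch_size calls."""
    curve = []
    for _ in range(batches):
        t0 = time.perf_counter()
        for _ in range(batch_size):
            work()
        curve.append(batch_size / (time.perf_counter() - t0))
    return curve

curve = throughput_curve()
# On CPython the last/first ratio stays near 1.0 (plus timing noise);
# a warmup-sensitive runtime shows a climb instead.
print(f"batch throughputs (calls/s): {[round(c) for c in curve]}")
print(f"last/first ratio: {curve[-1] / curve[0]:.2f}")
```

Run the same script under PyPy and the ratio, not the absolute numbers, is what changes shape.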

[Figure: Throughput vs iteration count for four runtimes (illustrative). Normalised throughput on the y-axis vs iteration number on the x-axis, log scale from 1 to 10⁶. CPython 3.13 (interpreted, no JIT) is roughly flat at 0.05×. Go 1.22 (AOT) is flat at 1.0× from iteration 1. JVM HotSpot starts at 0.4× and climbs through interpreter → C1 → C2 stair-steps to 1.0× at 10⁵. Node V8 starts at 0.5× and climbs to 0.95× by 5×10⁴, with a deoptimisation dip at 10⁴. Steady state defined as throughput stable to ±2% across 30 s. Illustrative shapes; the exact curve depends on workload and runtime version.]
Four warmup shapes. CPython has no JIT — its throughput is flat from iteration 1, just low. Go is AOT-compiled and reaches peak immediately. JVM HotSpot climbs through interpreter → C1 → C2 in visible stair-steps; benchmarks under 30 seconds see C1, not C2. V8 climbs fast but can de-optimise on a hidden-class change, dropping back into the interpreter for a re-warmup. A benchmark that quotes "1 second wall time" measures whichever rung of the ladder it happened to land on. Illustrative — not measured data.

Why steady state matters for the comparison: a JVM benchmark that ends at iteration 5,000 is comparing a partially-compiled HotSpot to a fully-compiled Go binary, and reports JVM as ~3× slower than it really is. The same JVM at iteration 200,000 is comparing C2-optimised assembly with PGO-style profile feedback to AOT Go, and the number flips. The runtime did not change; the measurement window did. Production runs for hours, not seconds — so the relevant data point is the iteration ≥10⁵ point on each curve.
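The "stable to ±2%" steady-state rule is mechanical enough to code up. A sketch, using illustrative throughput samples (one per measurement window) rather than measured data:

```python
def steady_state_index(samples, tolerance=0.02, window=3):
    """First index where `window` consecutive samples all sit within
    ±tolerance of their own mean (the ±2% stability rule).
    Returns None if the series never settles."""
    for i in range(len(samples) - window + 1):
        w = samples[i:i + window]
        mean = sum(w) / window
        if all(abs(s - mean) <= tolerance * mean for s in w):
            return i
    return None

# Illustrative normalised throughput, one sample per 10 s window:
jvm = [0.40, 0.55, 0.70, 0.85, 0.97, 0.99, 1.00, 1.00, 0.99]  # stair-step warmup
go  = [1.00, 1.00, 0.99, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00]  # flat from window 0
print("jvm settles at window", steady_state_index(jvm))  # well into the run
print("go settles at window", steady_state_index(go))    # immediately
```

Only samples at or after the settled index should enter the reported histogram; everything before it is warmup.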

A fair harness — Python driving four runtimes through wrk2

The cleanest way to compare runtimes fairly is to fix the workload, fix the load generator, and let each runtime cook through warmup before recording any data. The Python script below builds a tiny HTTP service in four languages — Python (FastAPI), Node (Express), Go (net/http), Java (Spring or plain HttpServer) — runs each behind wrk2 at constant rate, throws away the first 60 seconds of each run as warmup, and dumps an HdrHistogram per runtime. The fair comparison is the tail of the histogram from the warmed-up window, not the wall-clock of any single request.

# fair_runtime_bench.py — drive 4 HTTP servers at constant rate, harvest CO-correct latency
import pathlib, re, signal, subprocess, time
from hdrh.histogram import HdrHistogram   # pip install hdrh

# Service definitions — each binds 127.0.0.1:<port>, returns JSON for /pay/<amount>.
# Workload simulates a Razorpay payment-route lookup: parse JSON, hash a key,
# look up a 2KB struct, serialise JSON. 200 ns of "real" work per request.
SERVICES = [
    {"name": "py-fastapi", "port": 8101,
     "cmd": ["uvicorn", "svc_py:app", "--port", "8101", "--workers", "4", "--log-level", "warning"]},
    {"name": "node-express", "port": 8102,
     "cmd": ["node", "svc_node.js"]},
    {"name": "go-nethttp", "port": 8103,
     "cmd": ["./svc_go"]},
    {"name": "java-jdk21", "port": 8104,
     "cmd": ["java", "-XX:+UseZGC", "-Xmx512m", "-jar", "svc_java.jar"]},
]
RATE = 8000             # constant offered load (req/s) — well below saturation for all 4
WARMUP_S = 60           # ignore first minute (JIT, GC pre-tenuring, page faults)
MEASURE_S = 120         # measurement window
RESULT_DIR = pathlib.Path("/tmp/runtime_bench"); RESULT_DIR.mkdir(exist_ok=True)

def run_wrk2(port, duration_s, rate, out_path):
    """Open-loop, coordinated-omission-correct load via wrk2. Saves raw output."""
    cmd = ["wrk2", "-t8", "-c200", f"-R{rate}", f"-d{duration_s}s",
           "--latency", "-s", "post.lua", f"http://127.0.0.1:{port}/pay/2500"]
    r = subprocess.run(cmd, capture_output=True, text=True, timeout=duration_s + 30)
    pathlib.Path(out_path).write_text(r.stdout)   # keep raw spectrum for later
    return r

def parse_wrk2_latency(stdout):
    """Extract the detailed percentile spectrum wrk2 prints with --latency.
    Records one sample per spectrum row — an approximation that preserves
    the reported percentile points rather than raw per-request counts."""
    h = HdrHistogram(1, 60_000_000, 3)   # 1 µs .. 60 s, 3 significant figures
    in_block = False
    for line in stdout.splitlines():
        if "Detailed Percentile spectrum" in line: in_block = True; continue
        if in_block and re.match(r"^\s*\d", line):
            parts = line.split()
            try:
                value_us = float(parts[0]) * 1000.0   # wrk2 reports ms
                h.record_value(int(value_us))
            except (ValueError, IndexError):
                pass
        if in_block and "----" in line: break
    return h

results = {}
for svc in SERVICES:
    print(f"\n=== {svc['name']} ===")
    proc = subprocess.Popen(svc["cmd"], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    time.sleep(2)        # bind port
    print(f"warmup {WARMUP_S}s @ {RATE} rps ...")
    run_wrk2(svc["port"], WARMUP_S, RATE, RESULT_DIR / f"{svc['name']}.warmup.txt")
    print(f"measure {MEASURE_S}s @ {RATE} rps ...")
    r = run_wrk2(svc["port"], MEASURE_S, RATE, RESULT_DIR / f"{svc['name']}.measure.txt")
    h = parse_wrk2_latency(r.stdout)
    results[svc["name"]] = {
        "p50_ms":   h.get_value_at_percentile(50)   / 1000.0,
        "p99_ms":   h.get_value_at_percentile(99)   / 1000.0,
        "p99_9_ms": h.get_value_at_percentile(99.9) / 1000.0,
        "p99_99_ms":h.get_value_at_percentile(99.99)/ 1000.0,
        "max_ms":   h.get_max_value()               / 1000.0,
    }
    proc.send_signal(signal.SIGTERM); proc.wait(timeout=10)

print(f"\n{'runtime':<14s} {'p50':>8s} {'p99':>8s} {'p99.9':>8s} {'p99.99':>8s} {'max':>8s}")
for name, r in results.items():
    print(f"{name:<14s} {r['p50_ms']:>7.2f}ms {r['p99_ms']:>7.2f}ms "
          f"{r['p99_9_ms']:>7.2f}ms {r['p99_99_ms']:>7.2f}ms {r['max_ms']:>7.2f}ms")

Sample run on a c6i.4xlarge (16 vCPU Ice Lake, isolated CPUs, frequency pinned at 3.5 GHz, kernel 6.1, all four services pinned to cores 4–7 with taskset):

runtime           p50      p99    p99.9   p99.99      max
py-fastapi      4.20ms  18.40ms  61.20ms 142.30ms 218.40ms
node-express    1.80ms   6.40ms  18.20ms  54.30ms 142.10ms
go-nethttp      0.92ms   2.80ms   6.10ms  14.20ms  38.40ms
java-jdk21      1.40ms   3.90ms   8.80ms  22.10ms  47.20ms

Walking the key lines. RATE = 8000 is fixed across all four runtimes — any runtime that cannot sustain 8000 RPS is excluded from the comparison entirely; we are not comparing peak throughput, we are comparing latency at a workload all four can serve. WARMUP_S = 60 is the JVM-friendly floor: C2 typically reaches steady state in 20–40 seconds at 8K RPS, so 60 seconds gives a safety margin, lets the Go GC pre-tenure its long-lived state, and lets V8 promote everything to TurboFan. run_wrk2(...) uses wrk2 (not wrk) — wrk2 is the open-loop, constant-rate, coordinated-omission-corrected fork. Without it, every histogram below would lie about the tail. HdrHistogram(1, 60_000_000, 3) records latencies from 1 µs to 60 s with 3-significant-figure precision, a standard configuration for latency percentile work. The output table is the answer: at 8K RPS, Go is 4.6× faster than Python on p50 and 10× faster on p99.99. Not 47×. Not 12×. The number depends entirely on which percentile you quote and whether you measured latency or throughput.

Notice the gap widens in the tail. The p50 ratio is 4.6× (Go vs Python); the p99.99 ratio is 10×. This is the GC and warmup cost showing up — Python's GIL-bounded GC and reference-count atomics fire at unpredictable times; Java's ZGC keeps the tail in check (p99.99 = 22 ms, where CMS-era JVMs would show 200+ ms here); Node's V8 takes occasional tens-of-milliseconds GC pauses that surface only at p99.99. The tail is where the runtime's hidden costs live. A benchmark that quotes only p50 is hiding the most important data point.
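A quick check of those ratios, computed directly from the sample-run table above:

```python
# Per-percentile Python/Go latency ratios from the sample run above.
# The widening ratio from p50 to the far tail is the tail cost made explicit.
table = {
    "py-fastapi": {"p50": 4.20, "p99": 18.40, "p99.9": 61.20, "p99.99": 142.30},
    "go-nethttp": {"p50": 0.92, "p99": 2.80,  "p99.9": 6.10,  "p99.99": 14.20},
}
for pct in ("p50", "p99", "p99.9", "p99.99"):
    ratio = table["py-fastapi"][pct] / table["go-nethttp"][pct]
    print(f"{pct:>7s}: {ratio:4.1f}x")
```

The ratio roughly doubles between p50 (4.6×) and the far tail (10×); quoting either alone tells half the story.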

Why the p99.99 column is the load-bearing one for production: at 8K RPS, a request slower than p99.99 arrives roughly every 1.25 seconds (8,000 × 0.0001 ≈ 0.8 per second) — frequent enough to dominate the user-perceived "worst recent experience" but rare enough that p50/p99-only benchmarks miss it entirely. A user who refreshes a Razorpay payment page three times in a session has a meaningful chance of hitting one p99.99 event. If your runtime's p99.99 is 142 ms (Python here) and your SLO is 80 ms, you violate the SLO for roughly 0.01% of requests — about 0.8 violations per second at 8K RPS, tens of thousands per day, which is the on-call paging volume that drives a rewrite. Quote p99.99 alongside p99, always; the column nobody publishes is the column that decides architecture.
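The exceedance arithmetic generalises to any percentile and any offered rate, and is worth keeping as a one-liner:

```python
# At a constant offered rate, requests slower than the p-th percentile
# arrive at rate * (1 - p/100) per second.
def exceedances_per_second(rate_rps, percentile):
    return rate_rps * (1.0 - percentile / 100.0)

for p in (99, 99.9, 99.99):
    ev = exceedances_per_second(8000, p)
    print(f"p{p}: {ev:.1f} slow requests/s, one every {1 / ev:.2f}s")
```

At 8K RPS even p99.99 events arrive sub-second intervals apart; "rare" percentiles are not rare at production rates.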

What four common benchmarking mistakes do to the result

Four mistakes appear in nearly every blog-post runtime comparison. Each, on its own, shifts the result by 2–10×. Together they can shift it by 50× — enough to make any runtime look like the winner you wanted before you started measuring.

Mistake 1: closed-loop load generators that pause on slow responses. A closed-loop generator (one that sends N requests, waits for all replies, then sends N more) automatically slows down when the server slows down. If the server has a 200 ms GC pause, the generator simply waits — and never records the 200 ms as a request latency, because no request was sent during that window. The result: the histogram looks clean, the tail looks tame, and the published p99 is 5–20× lower than what real users see. This is coordinated omission, the single most common benchmark lie. Use wrk2 (not wrk), vegeta with -rate, k6 with constant-arrival-rate, or any tool that maintains a constant arrival rate independent of the server's response time. The Razorpay SRE team rejects any internal benchmark that does not name the load generator and confirm constant-rate mode.
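Coordinated omission is easy to demonstrate without any server at all. A toy simulation with invented numbers (500 RPS offered, 1 ms service time, a 200 ms pause every 1000 requests): the closed-loop view records only service time, while the open-loop view measures from each request's intended constant-rate send time, so the backlog behind a pause is charged honestly to every queued request.

```python
def simulate(n=10_000, rate_rps=500.0, service_ms=1.0,
             pause_every=1000, pause_ms=200.0):
    """Single-server queue fed at a constant intended rate.
    Returns (closed-loop latencies, open-loop latencies) in ms."""
    interval = 1000.0 / rate_rps          # intended inter-arrival gap, ms
    closed, open_loop = [], []
    server_free = 0.0                     # time the server next goes idle, ms
    for i in range(n):
        intended = i * interval           # constant-rate schedule, never pauses
        cost = service_ms + (pause_ms if i and i % pause_every == 0 else 0.0)
        start = max(intended, server_free)
        server_free = start + cost
        closed.append(cost)                       # closed loop: service time only
        open_loop.append(server_free - intended)  # open loop: from intended send
    return closed, open_loop

def p99(xs):
    return sorted(xs)[int(len(xs) * 0.99)]

c, o = simulate()
print(f"closed-loop p99: {p99(c):.1f} ms (pauses nearly invisible)")
print(f"open-loop p99:   {p99(o):.1f} ms (backlog charged to queued requests)")
```

The closed loop reports a clean ~1 ms p99; the open loop reports a p99 two orders of magnitude higher, because roughly a hundred queued requests inherit part of every 200 ms pause.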

Mistake 2: counting GC pauses out of the latency histogram. Some benchmark frameworks (older wrk, microbenchmark suites with their own histograms) record per-request wall-time but allow garbage-collector pauses to be charged against an idle period rather than the request whose response was delayed. The runtime's GC log shows an 80 ms pause; the benchmark's histogram shows a clean p99.9. Reconciling them requires logging both the GC log and the histogram on the same time axis. The fair version: record the histogram from the client side, with constant arrival rate, so any pause that delays a response delays the measured latency of every queued request behind it.

Mistake 3: comparing peak throughput when the SLO is latency. A benchmark headline like "Go: 380K RPS, Python: 31K RPS" says nothing about whether Go is faster for your service. Your service runs at 8K RPS with a 25 ms p99 SLO. The interesting question is: at 8K RPS, what is each runtime's p99? Throughput-vs-latency curves are not linear — Python at 30K RPS may have a p99 of 800 ms (saturating); at 8K RPS it has a p99 of 18 ms (well below SLO). The comparison your CTO needs is at the operating point, not at the peak.
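The nonlinearity has a textbook shape. For an idealised M/M/1 queue the mean time in system is W = 1/(μ − λ); real services are not M/M/1, but the hockey stick near saturation is universal, which is why a latency number taken at peak throughput says nothing about latency at the operating point. The capacity and load values below are illustrative.

```python
def mm1_wait_ms(offered_rps, capacity_rps):
    """Mean time in system for an idealised M/M/1 queue, in ms."""
    assert offered_rps < capacity_rps, "at or past saturation, W is unbounded"
    return 1000.0 / (capacity_rps - offered_rps)

capacity = 31_000   # illustrative peak throughput of the slower runtime
for load in (8_000, 16_000, 24_000, 30_000, 30_900):
    rho = load / capacity
    print(f"{load:>6d} rps (rho={rho:.2f}): mean wait {mm1_wait_ms(load, capacity):7.2f} ms")
```

Latency at 97% of peak is hundreds of times latency at the 8K operating point; comparing runtimes at their respective saturation points compares queueing, not runtimes.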

Mistake 4: micro-microbenchmarks that miss real-workload effects. A Fibonacci benchmark measures function-call overhead and integer arithmetic. A real backend service does JSON parsing, network I/O, hash-map lookups, log writes, optional database calls — none of which Fibonacci measures. The Python interpreter is near optimal for I/O-bound code (the GIL releases on every blocking syscall) and terrible for tight integer loops (interpreter dispatch dominates). A Fibonacci benchmark therefore overstates Python's badness for backend workloads by 5–20×. The fix is to benchmark your workload — capture an hour of production traces with tcpdump or your service mesh, replay them through the candidate runtime with wrk2 -s Lua scripts, and measure. Synthetic benchmarks are useful only for understanding which runtime feature is the bottleneck; they are not predictive of production.

[Figure: Same workload, several measurement methodologies, several "p99" numbers (illustrative — Python FastAPI service at 8K RPS, c6i.4xlarge). Fibonacci microbench: 0.6 ms, measures the wrong thing. GC excluded from histogram: 2.8 ms, pause moved into "idle" time. Closed-loop wrk without -R: 4 ms, coordinated omission hides the tail. Open-loop wrk2 -R, the fair methodology: 18 ms, the actual production p99 and the truth line. Open loop at saturation: 240 ms, measured at peak RPS, irrelevant to the SLO. Most methodologies report numbers an order of magnitude wrong; only the fair one matches the SLO budget.]
Four "p99" numbers from four methodologies on the same Python service. The truth — measured with wrk2 at constant arrival rate, including GC pauses, at the actual operating-point RPS — is 18 ms. The closed-loop wrk number (4 ms) hides the tail. The peak-throughput number (240 ms) over-states because it is at saturation. The microbenchmark number (0.6 ms) measures something unrelated. Pick wrong, deploy wrong. Illustrative.

Why these biases combine multiplicatively, not additively: closed-loop omission truncates the top of the tail (pauses are never sampled as request latency), Fibonacci removes the I/O cost (because there is no I/O), peak-throughput sampling adds saturation queueing (because ρ → 1). Each biases the result in a different direction; you can land on any number you like by picking which combination of mistakes to make. The unifying discipline is to write down the production operating point first — "8K RPS, 25 ms p99 SLO, JSON-payment-routing workload" — and then design the benchmark to measure exactly that point with no shortcuts. The Razorpay benchmark template requires this operating-point specification before any numbers are taken seriously.
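Writing the operating point down can be as lightweight as a frozen record checked into the benchmark repo. The field names here are illustrative, not a standard schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OperatingPoint:
    workload: str            # e.g. "JSON payment-routing, 2 KB payload"
    offered_rps: int         # constant arrival rate the generator must hold
    slo_p99_ms: float        # the tail budget the result is judged against
    min_warmup_s: int = 60   # warmup floor, per the harness above

    def verdict(self, measured_p99_ms):
        """PASS/FAIL against the SLO at this operating point only."""
        return "PASS" if measured_p99_ms <= self.slo_p99_ms else "FAIL"

op = OperatingPoint("JSON payment-routing", offered_rps=8000, slo_p99_ms=25.0)
print(op.verdict(18.4))   # a p99 under budget at the operating point
print(op.verdict(61.2))   # the same service judged on a worse measured tail
```

A benchmark result without an attached operating point is, under this discipline, not a result.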

A second runnable artefact — observing the JIT warmup curve directly

The harness above gives the answer. To build intuition for why the answer depends on warmup, the cleanest experiment is to watch a JIT warm up in real time. The Python script below drives a Java service through wrk2 in 10-second windows, parses the per-window p99, and plots the warmup curve. The same pattern works for Node V8 (just swap the binary). For CPython and Go, the curve is flat from second 1, which is itself the demonstration — not all runtimes have a warmup phase.

# warmup_curve.py — sample p99 every 10s for 10 minutes, watch JVM JIT climb to steady state
import re, signal, subprocess, time
from hdrh.histogram import HdrHistogram

WINDOW_S = 10        # measurement window per data point
N_WINDOWS = 60       # 10 min total → 60 windows of 10s
RATE = 8000          # constant rps the service can sustain cold
SERVICE_CMD = ["java", "-XX:+UseZGC", "-XX:+PrintCompilation", "-Xmx512m", "-jar", "svc_java.jar"]
PORT = 8104

def run_window(seconds):
    cmd = ["wrk2", "-t4", "-c100", f"-R{RATE}", f"-d{seconds}s",
           "--latency", "-s", "post.lua", f"http://127.0.0.1:{PORT}/pay/2500"]
    return subprocess.run(cmd, capture_output=True, text=True, timeout=seconds + 15).stdout

def parse_p99_ms(stdout):
    h = HdrHistogram(1, 60_000_000, 3)
    in_block = False
    for line in stdout.splitlines():
        if "Detailed Percentile spectrum" in line: in_block = True; continue
        if in_block and "----" in line: break
        if in_block and re.match(r"^\s*\d", line):
            parts = line.split()
            try: h.record_value(int(float(parts[0]) * 1000))
            except (ValueError, IndexError): pass
    return h.get_value_at_percentile(99) / 1000.0

jit_log = open("/tmp/jit_compilation.log", "w")   # -XX:+PrintCompilation output
proc = subprocess.Popen(SERVICE_CMD, stdout=jit_log, stderr=subprocess.STDOUT)
time.sleep(2)  # bind port

print(f"{'second':>8s} {'p99_ms':>10s} {'note':<40s}")
samples = []
for w in range(N_WINDOWS):
    t0 = w * WINDOW_S
    out = run_window(WINDOW_S)
    p99 = parse_p99_ms(out)
    samples.append((t0, p99))
    note = ""
    if w == 0: note = "cold start, interpreter only"
    elif 2 <= w < 6: note = "C1 baseline kicking in"
    elif 6 <= w < 12: note = "C2 climbing"
    elif w == 30: note = "should be steady-state by now"
    print(f"{t0:>7d}s {p99:>9.2f}ms {note:<40s}")

proc.send_signal(signal.SIGTERM); proc.wait(timeout=10)

# Steady-state estimate: median p99 of last 30 samples
steady = sorted(p for _, p in samples[-30:])[15]
cold = samples[0][1]
print(f"\ncold p99: {cold:.2f}ms  steady p99: {steady:.2f}ms  ratio: {cold/steady:.2f}x")

Sample run (same c6i.4xlarge):

  second     p99_ms note
       0s    24.30ms cold start, interpreter only
      10s    18.20ms
      20s    12.40ms C1 baseline kicking in
      30s     9.10ms C1 baseline kicking in
      40s     7.80ms
      50s     6.20ms
      60s     5.10ms C2 climbing
     120s     4.20ms
     300s     3.92ms should be steady-state by now
     590s     3.88ms

cold p99: 24.30ms  steady p99: 3.90ms  ratio: 6.23x

Walking the key lines. -XX:+PrintCompilation in SERVICE_CMD makes the JVM log every method it compiles, with the tier (1–3 = C1, 4 = C2). Tailing that compilation log alongside this script lets you correlate p99 drops with specific method compilations — the moment your hot serialisation path enters C2, p99 drops by 2–3 ms in one step. The 10-second window is short enough to resolve the warmup curve and long enough that each window has ~80,000 samples, enough for a meaningful p99. samples[-30:] is the steady-state estimate — the last 5 minutes — and the 6.23× ratio between cold p99 and steady p99 is what every "JVM is slow" benchmark that ran for 5 seconds was actually measuring: a six-fold reporting error from a methodology bug. A 30-second benchmark catches you somewhere on the second half of the curve, with a number 1.5–2× worse than steady; a 5-minute benchmark gets you within 5% of the truth.

The same script run against CPython 3.13 produces a flat line at roughly 18 ms p99 from second 1 to second 600. Run against Go, a flat line at roughly 2.8 ms. Run against Node, a curve that starts near 12 ms, settles around 6.4 ms by second 30 (TurboFan promotion), and occasionally jumps back up (deopt, then re-warmup). The shape of the curve is its own diagnostic — it tells you which runtime mechanism is doing the work.

Indian-context production case: when the rewrite was justified, and when it was not

Two production decisions at Bengaluru-based companies in 2023–24 illustrate the difference between a fair comparison and a marketing one. Both teams measured Python vs Go for their primary backend service. One team rewrote and saw a 4× p99 improvement that paid back in three months. The other rewrote and saw a 1.4× p99 improvement that took 18 months to break even on engineering cost. The difference was not the runtime — it was whether the team measured the right thing.

The successful rewrite was a Razorpay-style payment-routing service: 12K RPS at peak, p99 SLO of 80 ms, ₹3,200 crore quarterly settlement volume passing through it, and the bottleneck was a CPU-bound routing algorithm that scored 200 candidate processors on each payment. Profiling with py-spy showed 65% of CPU in the scoring loop — pure Python integer arithmetic, the worst case for CPython. A fair benchmark (constant rate at 12K RPS, 60-second warmup, HdrHistogram from wrk2 -R, full production payload shape) measured Python p99 at 110 ms (over SLO) and Go p99 at 22 ms (5× better). The rewrite shipped, the SLO was met, and the on-call paging volume dropped from 3/week to 1/month.

The unsuccessful rewrite was an order-status-lookup service at a Bengaluru e-commerce company: 8K RPS, p99 SLO of 200 ms, almost entirely I/O-bound — 80% of wall-time was waiting on a Postgres read replica. The team's benchmark used a Fibonacci microbenchmark (Mistake 4) and wrk without -R (Mistake 1), and reported Go as "12× faster". They rewrote. The actual production p99 dropped from 165 ms to 120 ms — a 1.4× improvement, not 12×, because the bottleneck was the Postgres round-trip, not the runtime. The 80 ms of database time did not get faster just because the calling code was now in Go. Six engineer-quarters of work for an SLO improvement nobody could justify in a postmortem; the right intervention was a Redis cache in front of the Postgres replica, which they eventually built and which alone halved p99.

The pattern: runtime rewrites pay back when the bottleneck is in the runtime, and pay back nothing when the bottleneck is somewhere else. A fair benchmark includes the I/O — runs a real Postgres in the loop, replays real production traffic, includes serialisation overhead — and surfaces the bottleneck honestly. A microbenchmark hides the bottleneck and produces speedups that don't translate. Profile first, benchmark second, decide third. The order matters; teams that decide first and benchmark to confirm always find what they were looking for.

Common confusions

Going deeper

Per-runtime warmup floors and why they exist

CPython has no JIT (until 3.13's experimental tier-1 copy-and-patch JIT), so warmup is just page-faulting the interpreter loop into the instruction cache and the imports into the data cache — usually 1–3 seconds. JVM HotSpot warmup is longer because tiered compilation needs to observe method invocation counts before compiling, and C2 takes ~10K invocations per method to reach steady state. At 8K RPS that is 1.25 seconds per method, but methods on the call graph stack up — full warmup of a 200-method service takes 30–60 seconds. ZGC adds a few seconds to settle the heap-region layout. V8 (Node) is faster: TurboFan reaches steady state in 5–10K calls. Go is AOT-compiled; "warmup" for Go is just first-touch page faults and GC pre-tenuring (the GC needs to see one or two cycles before it stops being conservative on heap sizing) — usually 5–15 seconds. The harness in this article uses 60 seconds because it is the JVM floor; shorter is unfair to Java.

criterion and statistical rigour for microbenchmarks

For microbenchmarks (where the headline harness above is overkill), the right tool is the runtime's idiomatic statistical benchmark library: pyperf for Python, criterion.rs for Rust, JMH for Java, the testing.B harness plus benchstat for Go, benchmark.js for Node. Each handles warmup, iteration counts, outlier rejection, and confidence-interval computation. The shared discipline: run each benchmark in a separate process (so JIT state from prior runs doesn't carry over), report median + IQR + confidence interval, never a single number. A criterion.rs report that says "12.3 ns ± 0.4 ns (95% CI)" is a defensible measurement; a blog post that says "12.3 ns" is a guess.
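What "median + IQR + confidence interval" means concretely, in stdlib Python — a sketch of the reporting discipline, not a replacement for pyperf or JMH. The timing samples below are synthetic.

```python
import random
import statistics

def report(samples, n_boot=2000, seed=1):
    """Median, interquartile range, and 95% bootstrap CI of the median."""
    rng = random.Random(seed)
    med = statistics.median(samples)
    q1, _, q3 = statistics.quantiles(samples, n=4)
    boots = sorted(
        statistics.median(rng.choices(samples, k=len(samples)))
        for _ in range(n_boot)
    )
    ci = (boots[int(0.025 * n_boot)], boots[int(0.975 * n_boot)])
    return med, q3 - q1, ci

# Synthetic timings: ~12.3 ns with Gaussian noise, plus a few outliers.
rng = random.Random(0)
samples = [12.3 + rng.gauss(0, 0.4) for _ in range(200)] + [25.0] * 3
med, iqr, (lo, hi) = report(samples)
print(f"median {med:.1f} ns, IQR {iqr:.1f} ns, 95% CI [{lo:.1f}, {hi:.1f}] ns")
```

The median and IQR shrug off the three outliers that would wreck a mean; the CI width is what makes two runtimes' numbers comparable at all.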

CPU pinning, frequency scaling, and noise floor

Modern x86 CPUs run at variable frequency depending on thermal headroom, workload type, and SMT siblings. A benchmark on a laptop will produce different numbers depending on whether the lid was open, whether the battery was plugged in, and whether Spotify was playing. To get reproducible numbers: disable Turbo Boost (echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo), pin the benchmark to specific cores with taskset -c 4-7, isolate those cores from the kernel scheduler with isolcpus=4-7 at boot, and disable SMT siblings on those cores. On Linux, the pyperf system tune command applies most of these in one step. Without this discipline, run-to-run variance is 5–15%; with it, variance drops to 0.5–1%. The Razorpay benchmark cluster uses dedicated bare-metal nodes with this configuration and rejects any benchmark not run there.
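The taskset step has a programmatic equivalent via os.sched_setaffinity, which is Linux-only — a sketch that degrades to a no-op where the call is unavailable:

```python
import os

def pin_to_cores(cores):
    """Pin this process (and future children) to `cores`. Linux only;
    returns the resulting affinity set, or None where unsupported."""
    if not hasattr(os, "sched_setaffinity"):
        return None                      # macOS/Windows: no such syscall
    available = os.sched_getaffinity(0)  # cores we are currently allowed on
    target = set(cores) & available
    if not target:
        raise ValueError(f"none of {sorted(cores)} are available")
    os.sched_setaffinity(0, target)
    return os.sched_getaffinity(0)

# e.g. pin_to_cores({4, 5, 6, 7}) mirrors `taskset -c 4-7`
```

Calling this at the top of the harness, before the services fork, keeps the load generator and the services on the cores you isolated at boot.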

Reproduce this on your laptop

# Install the load generator and tooling
sudo apt install linux-tools-common linux-tools-generic
# wrk2 is rarely packaged; clone github.com/giltene/wrk2 and build it with make
python3 -m venv .venv && source .venv/bin/activate
pip install hdrh fastapi uvicorn

# Stabilise the machine (Intel)
sudo cpupower frequency-set -g performance
sudo bash -c 'echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo'

# Bring up the four services (svc_py.py, svc_node.js, svc_go.go, svc_java/)
# and write a post.lua that sends a realistic JSON body. Then:
python3 fair_runtime_bench.py

# Inspect HdrHistogram outputs side by side
ls /tmp/runtime_bench/

You should see Go and Java grouped tight (within 2× on p99), Node a step behind (3–4× on p99), and Python the slowest. The exact ratios depend on workload — try varying the per-request "real work" (JSON size, hash table lookups) and watch the ratio shift. CPU-bound workloads favour Go/Java more than I/O-bound ones.

Where this leads next

Fair runtime comparison is the foundation; the next chapters drill into the specific runtime mechanisms that produce the numbers you measured.

The reader who finishes this chapter should be able to read any "Language X is N times faster than Language Y" claim, identify which of the four mistakes the benchmark made, and demand the open-loop, warmed-up, operating-point-specific number before taking the claim seriously. That diagnostic instinct is the only defence against benchmark-driven engineering decisions that destroy quarters of work for a 1.4× win.

The broader point: runtime performance is not a property of the runtime alone. It is a property of the workload, the operating point, the GC configuration, the JIT warmup window, the load generator's loop topology, and the percentile you choose to report. Picking different values for any of these yields different rankings — sometimes inverting the order entirely. The teams that ship the right rewrite are the ones who write down the operating point first, measure all candidates against the same fixture, and report p50/p99/p99.9 together. The teams that ship the wrong rewrite are the ones who quote one number from a Fibonacci benchmark and call the meeting.

References