Wall: production has more variables than a benchmark
Karan ships the Go rewrite of Razorpay's settlement-batcher on a Tuesday. The fair benchmark from the previous chapter — wrk2, 60 s warmup, HdrHistogram, real production payload — said Go would cut p99 from 18 ms (Java) to 11 ms. He cut 50% of traffic over to the new fleet at 14:00 IST. By 14:08 the dashboard showed the new fleet at p99 = 27 ms, the old fleet at p99 = 19 ms, and the on-call channel was filling with screenshots. By 15:00 he had rolled back. The benchmark had not lied. The benchmark had measured one machine, one workload shape, one CPU governor, one libc, one kernel, with no sidecar, no service-mesh proxy, no log shipper, no hot-restart traffic mix, no co-tenant noise, and no GC pressure from yesterday's retained heap. Production has all of those. This chapter is the closer for Part 13: the gap between "I measured this fairly in the lab" and "this is how it behaves in production", and the canary discipline that closes it.
A fair benchmark narrows the candidate set; it does not make the production decision. Real services run with sidecars, co-tenants, NUMA pressure, varying traffic shape, kernel and libc differences, and GC histories that no benchmark replicates. The bridge is canary deployment with paired-fleet measurement, side-by-side over the same load and the same hour, p50/p99/p99.9/p99.99 on both, no fleet-aggregate dashboards. If you cannot show the new fleet beating the old fleet on the same minute of traffic, you have not proven the rewrite was worth it.
What the benchmark held constant that production does not
A benchmark is a controlled experiment. Every "fair" benchmark from the previous chapter pinned the things it could pin: CPU frequency, isolated cores, single workload shape, fixed payload size, fixed RPS, no other processes, no kernel preemption, no NUMA crossings. That control is the point — it lets you attribute the measured difference to the runtime under test rather than to noise. But the same control is the gap. Production is not a controlled experiment. Production has at least a dozen variables the benchmark held constant, and any one of them can dominate the runtime difference you measured.
The first variable is co-tenancy. Your service runs on a Kubernetes node beside a sidecar (Envoy or Istio proxy, ~0.4 vCPU steady, spikes to 1.0 vCPU on connection storms), a log shipper (Fluent Bit, 0.1 vCPU baseline, 0.6 vCPU during log bursts), a metrics collector (node_exporter + cadvisor, ~0.05 vCPU), and possibly two or three other tenant workloads on the same physical host. The benchmark had the box to itself. Production shares the LLC with all of those, which means your hot lookup that fit comfortably in 36 MB of L3 in the lab is now sharing 36 MB with 200 MB of competing working sets, and your effective LLC is 4–8 MB. Cache misses go up; p99 goes up; the runtime did not change.
The second is kernel and libc. The benchmark ran on Ubuntu 24.04, kernel 6.8, glibc 2.39. Production runs on Amazon Linux 2023, kernel 6.1, glibc 2.34. malloc()'s behaviour under contention differs between glibc 2.34 and 2.39 by roughly 8% on 16-thread workloads (the per-thread arena heuristic was tuned). The kernel's CFS scheduler has different sched_min_granularity_ns defaults. epoll_wait() and io_uring have different tail behaviour across these kernels. None of this is in the benchmark report.
The third is traffic shape. The benchmark ran at constant 8000 RPS with a fixed payload. Production traffic at Razorpay arrives in bursts — UPI traffic peaks at 10:00, 13:00, 17:30, and 20:00 IST, each peak preceded by a 2–3× ramp over 90 seconds. Burst arrivals interact with GC pacing in ways constant-rate benchmarks miss completely: a Go runtime calibrated to a steady allocation rate during warmup hits a burst, the GC pacer falls behind, and the assist mechanism kicks in (mutator threads do GC work inline), which doubles tail latency for ~200 ms. None of this happens at constant 8000 RPS.
Why the rankings invert: the lab benchmark removes noise to make the runtime visible. Production is the noise. A runtime that wins by 30% under controlled conditions can lose by 20% under uncontrolled conditions if its tail is more sensitive to GC-pacer-vs-burst interactions, to LLC contention, or to scheduler preemption. Go's GC pacer is calibrated to steady-state allocation rate; it overshoots on bursts. Java's ZGC is concurrent and tolerates bursts better but pays a steady 5–10% throughput tax. The "right" runtime depends on which trade-off matches your traffic shape — and the lab benchmark, which uses constant arrival rate, cannot tell you that.
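The burst-vs-constant distinction is cheap to reproduce before you ever touch production. The sketch below — the endpoint, the rates, and the ramp shape are illustrative assumptions, not Razorpay's traffic model — drives wrk2 with a per-minute rate schedule that ramps roughly 3× over a couple of minutes instead of a single flat -R, which is usually enough to make a pacer-sensitive runtime show its burst behaviour.
# burst_schedule.py — sketch: drive wrk2 with a bursty per-minute rate schedule instead
# of a single flat -R. The ramp shape and endpoint below are illustrative assumptions.
import subprocess
BASE_RPS = 400
PEAK_RPS = 1200                                  # ~3x ramp, like the morning peak described above
TARGET = "http://staging-go.internal:8080"       # hypothetical staging endpoint
def minute_rates(minutes=60, peak_minute=30, ramp_minutes=2):
    """Flat base rate with one short ramp up to the peak and back down."""
    rates = []
    for m in range(minutes):
        d = abs(m - peak_minute)
        if d <= ramp_minutes:
            frac = 1.0 - d / (ramp_minutes + 1)  # linear interpolation across the ramp
            rates.append(int(BASE_RPS + frac * (PEAK_RPS - BASE_RPS)))
        else:
            rates.append(BASE_RPS)
    return rates
for minute, rps in enumerate(minute_rates()):
    print(f"minute {minute}: {rps} rps")
    subprocess.run(["wrk2", "-t4", "-c100", f"-R{rps}", "-d60s", "--latency", TARGET],
                   capture_output=True, text=True)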
A canary harness — paired-fleet measurement on the same minute of traffic
The bridge between a fair benchmark and a production decision is paired-fleet canary measurement: run the new runtime alongside the old one on a small fraction of real traffic, route a stratified sample to each, measure p99 / p99.9 / p99.99 on both fleets minute by minute, and compare only matched minutes. The fleet aggregate dashboard hides the minute-by-minute story and is useless for canaries — it averages over the gap between the two fleets. The Python script below builds the canary harness Razorpay's SRE team uses: it pulls per-minute HdrHistogram dumps from both fleets, computes the delta, and decides whether to promote, hold, or roll back.
# canary_compare.py — paired-fleet p99 comparison, minute by minute
# Runs on the deployment controller; reads HdrHistogram dumps from both fleets,
# decides promote / hold / rollback based on tail-latency parity.
import datetime, subprocess, sys
from hdrh.histogram import HdrHistogram
# Both fleets emit one HdrHistogram dump per minute to S3 under
# s3://rzp-perf/canary/{fleet}/{YYYYMMDDHHMM}.hgrm. The dump format is the
# standard wrk2/HdrHistogram base64-encoded compressed payload.
OLD_FLEET = "settlement-batcher-jvm" # baseline, JDK 21 + ZGC
NEW_FLEET = "settlement-batcher-go" # candidate, Go 1.22
CANARY_PCT = 5 # 5 % of traffic on the canary
WINDOWS = 60 # observe for 60 minutes
PROMOTE_THRESH = 0.95 # canary p99 must be <= 0.95 * baseline p99
ROLLBACK_THRESH = 1.10 # auto-rollback if canary p99 > 1.10 * baseline
NEEDED_GOOD_MINUTES = 45 # of 60, at least 45 must beat the threshold
def fetch_minute(fleet, when):
    """Pull this minute's HdrHistogram dump from S3 and decode it."""
    key = f"canary/{fleet}/{when:%Y%m%d%H%M}.hgrm"
    try:
        body = subprocess.check_output(
            ["aws", "s3", "cp", f"s3://rzp-perf/{key}", "-"], stderr=subprocess.DEVNULL)
        return HdrHistogram.decode(body.decode().strip())
    except subprocess.CalledProcessError:
        return None
def percentile_set(h):
    return {p: h.get_value_at_percentile(p) / 1000.0  # µs → ms
            for p in (50, 99, 99.9, 99.99)}
print(f"{'minute':<6s} {'rps_old':>8s} {'rps_new':>8s}"
f" {'p99_old':>9s} {'p99_new':>9s} {'p999_old':>10s} {'p999_new':>10s} {'verdict':<10s}")
good_minutes = 0
bad_minutes = 0
for i in range(WINDOWS):
    when = datetime.datetime.utcnow().replace(second=0, microsecond=0) - datetime.timedelta(minutes=i+1)
    h_old = fetch_minute(OLD_FLEET, when)
    h_new = fetch_minute(NEW_FLEET, when)
    if h_old is None or h_new is None:
        print(f"{i:<6d} {'-':>8s} {'-':>8s} {'-':>9s} {'-':>9s} {'-':>10s} {'-':>10s} skip"); continue
    rps_old = h_old.get_total_count() / 60.0
    rps_new = h_new.get_total_count() / 60.0
    p_old = percentile_set(h_old)
    p_new = percentile_set(h_new)
    ratio = p_new[99] / p_old[99] if p_old[99] > 0 else float("inf")
    if ratio <= PROMOTE_THRESH:
        verdict = "good"; good_minutes += 1
    elif ratio >= ROLLBACK_THRESH:
        verdict = "BAD"; bad_minutes += 1
    else:
        verdict = "neutral"
    print(f"{i:<6d} {rps_old:>8.0f} {rps_new:>8.0f} "
          f"{p_old[99]:>7.2f}ms {p_new[99]:>7.2f}ms "
          f"{p_old[99.9]:>8.2f}ms {p_new[99.9]:>8.2f}ms {verdict:<10s}")
# Verdict over the whole window, evaluated once every minute has been compared.
if bad_minutes >= 5:
    print(f"\nROLLBACK: {bad_minutes} minutes worse than {ROLLBACK_THRESH}x baseline"); sys.exit(2)
elif good_minutes >= NEEDED_GOOD_MINUTES:
    print(f"\nPROMOTE: {good_minutes}/{WINDOWS} minutes beat {PROMOTE_THRESH}x baseline"); sys.exit(0)
else:
    print(f"\nHOLD: {good_minutes} good, {bad_minutes} bad — extend canary window"); sys.exit(1)
Sample run from a real Razorpay canary (UPI settlement-batcher, December 2025 cutover, JVM ZGC baseline vs Go 1.22 candidate, 5% canary on the same Karnataka region):
minute rps_old rps_new p99_old p99_new p999_old p999_new verdict
0 480 24 14.20ms 18.40ms 42.10ms 71.20ms BAD
1 462 23 13.80ms 17.10ms 38.90ms 62.40ms BAD
2 455 22 13.40ms 12.80ms 38.40ms 42.10ms good
3 620 31 16.20ms 14.10ms 48.30ms 46.20ms good
4 1240 62 19.40ms 16.20ms 58.10ms 54.80ms good
5 1180 59 18.90ms 16.80ms 55.20ms 52.30ms good
6 420 21 13.10ms 12.40ms 38.10ms 41.80ms neutral
...
58 510 25 14.40ms 13.20ms 42.40ms 45.10ms good
59 490 24 14.10ms 13.10ms 41.20ms 44.30ms good
ROLLBACK: 8 minutes worse than 1.10x baseline
Walking the key lines. CANARY_PCT = 5 is small enough that a regression does not page the whole on-call team but large enough (roughly 20–60 RPS across the hour in the sample run) that the canary fleet sees the same burst patterns as the baseline. PROMOTE_THRESH = 0.95 demands the canary be at least 5% better than baseline on p99; equal-or-worse is a non-decision and gets held. ROLLBACK_THRESH = 1.10 is the safety stop — if the canary is more than 10% worse, the harness exits non-zero and the deployment system rolls back automatically without waking anyone. HdrHistogram.decode(...) parses the base64-encoded HdrHistogram dump format that wrk2, JMH, and most modern load generators emit; this is the only honest way to compare percentiles across fleets, because histograms can be merged exactly, while stored percentile snapshots cannot. The minute-by-minute output is the load-bearing artefact: notice that for the first two minutes the canary is worse (the Go fleet has its own warmup analogue to JIT — GC-pacer calibration, connection-pool fill, and the page-fault cost of first-touch memory), but the harness does not promote or roll back yet. By minute 4 the canary is winning, and the harness counts good minutes. The actual run shown rolls back: 8 of 60 minutes were more than 10% worse, dominated by burst-arrival GC-pacer overshoots that the constant-rate lab benchmark never hit.
Why minute-by-minute matters and the dashboard average lies: a 60-minute aggregate computes a single p99 across all requests across both fleets. If one fleet is twice the size of the other, its requests dominate the aggregate, and the two fleets' tails get mixed into one number that describes neither. HdrHistograms can be added exactly (bucket counts simply sum), so you can combine — but you must combine the candidate fleet's histograms into one total and the baseline's into another and compare the two p99 values. Most monitoring stacks (Prometheus quantile aggregations in particular) do this wrong: they average the per-pod p99 estimates, which is mathematically meaningless and consistently understates the tail by 2–5×. The minute-by-minute paired comparison sidesteps this entire class of error.
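A minimal sketch of the two aggregations, using the same hdrh library the harness already imports (the per-pod dump paths are hypothetical):
# merge_pod_histograms.py — sketch: fleet p99 done right (merge histograms) vs wrong (average p99s)
from hdrh.histogram import HdrHistogram
POD_DUMPS = [f"/tmp/pod{i}.hgrm" for i in range(4)]     # hypothetical per-pod dump paths
pods = [HdrHistogram.decode(open(p).read().strip()) for p in POD_DUMPS]
# WRONG: average the per-pod p99 estimates — the result is not a percentile of anything
wrong_p99 = sum(h.get_value_at_percentile(99) for h in pods) / len(pods)
# RIGHT: merge the underlying histograms, then read p99 from the merged request stream
merged = HdrHistogram(1, 60_000_000, 3)
for h in pods:
    merged.add(h)
right_p99 = merged.get_value_at_percentile(99)
print(f"mean of per-pod p99s: {wrong_p99 / 1000:.2f} ms (meaningless)")
print(f"p99 of merged stream: {right_p99 / 1000:.2f} ms (the fleet tail)")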
What the canary catches that the benchmark cannot — three real production gaps
Three production gaps appear repeatedly in cutover war stories at Razorpay, Flipkart, and Hotstar; each is invisible to a fair benchmark and visible to a paired canary on the first day.
Gap 1: GC-pacer-vs-burst interaction. The Go runtime's GC pacer is calibrated during the previous GC cycle to the steady-state allocation rate it observed. A burst arrival — say 200 RPS jumping to 1200 RPS over 60 seconds — drives allocation far above what the pacer expected. The pacer falls behind, GC assist kicks in (mutator threads stop processing requests and do GC work inline), and p99 spikes for the duration of the burst plus one full GC cycle (~200 ms on a 512 MB heap). A constant-rate wrk2 -R8000 benchmark never sees this, because the allocation rate is steady. The canary sees it on the first morning peak. The fix is either to give the Go heap more headroom (GOGC=200 or GOMEMLIMIT tuning) so the pacer stays ahead of bursts, or to switch to a runtime whose GC is concurrent and burst-tolerant (ZGC). The benchmark cannot make this choice for you because it does not have bursts.
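One way to approach the heap-headroom fix — a sketch, not Razorpay's tooling: derive GOMEMLIMIT from the container's cgroup memory limit and leave a fixed slice of headroom so the pacer has room to absorb a burst. The 0.9 fraction and the cgroup-v2 path are assumptions to adapt, not recommendations.
# suggest_gomemlimit.py — sketch: derive a GOMEMLIMIT from the cgroup v2 memory limit,
# leaving headroom for burst overshoot. The fraction and the path are assumptions.
import pathlib
CGROUP_LIMIT = pathlib.Path("/sys/fs/cgroup/memory.max")   # cgroup v2; v1 uses a different path
HEADROOM_FRACTION = 0.9   # leave ~10% for goroutine stacks, mmap'd files, burst overshoot
raw = CGROUP_LIMIT.read_text().strip()
if raw == "max":
    print("no container memory limit set — size GOMEMLIMIT from the node instead")
else:
    limit_bytes = int(raw)
    gomemlimit = int(limit_bytes * HEADROOM_FRACTION)
    print(f"container limit: {limit_bytes >> 20} MiB")
    print(f"suggested GOMEMLIMIT={gomemlimit >> 20}MiB (plus GOGC=200 if bursts still overshoot)")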
Gap 2: connection pool warmup and TLS handshake bursts. A new fleet starts cold: zero TCP connections to the database, zero connections to Redis, zero TLS sessions cached, zero DNS responses cached. The first 200 requests pay full TLS handshake (~25 ms each, or 60–80 ms over the public internet), full TCP slow-start, and full DNS resolution. The benchmark ran for 60 seconds of warmup with persistent connections — it never paid this cost. In production, the canary pays it for the first 2–5 minutes, and if traffic surges before the pools warm, the canary's p99 is dominated by handshake cost, not by runtime performance. A canary with a pre-warm step (synthetic traffic for 60 s before real traffic is routed) avoids this, but only a paired-fleet view shows whether the gap closed once the pool warmed.
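The pre-warm step can be as small as the sketch below: synthetic requests through one persistent session so TLS sessions, DNS answers, and connection pools are hot before the load balancer routes real traffic. The endpoint, request count, and use of the requests library are placeholders, not the actual deploy-controller hook.
# prewarm_canary.py — sketch: pay the TLS/TCP/DNS warmup cost with synthetic traffic
# before real traffic is routed to the new fleet. Endpoint and counts are placeholders.
import time
import requests        # assumes the requests library is available
CANARY = "https://canary.settlement.internal/healthz/deep"   # hypothetical deep-health endpoint
N_WARMUP = 200         # roughly the number of requests that pay full handshake cost
session = requests.Session()          # persistent connections + cached TLS sessions
latencies = []
for _ in range(N_WARMUP):
    t0 = time.perf_counter()
    try:
        session.get(CANARY, timeout=2)
    except requests.RequestException:
        pass                          # a failed warmup request still warms DNS/TLS state
    latencies.append((time.perf_counter() - t0) * 1000)
# First requests pay handshakes; the last ones should be near steady state.
print(f"first 10 warmup requests, mean: {sum(latencies[:10]) / 10:.1f} ms")
print(f"last  10 warmup requests, mean: {sum(latencies[-10:]) / 10:.1f} ms")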
Gap 3: log-shipper backpressure under burst. The log shipper (Fluent Bit, Filebeat) reads from the application's stdout pipe. Under burst traffic, the application generates more log lines than the shipper can forward; the pipe backs up; eventually write(stdout, ...) blocks. A runtime that allocates short-lived strings for each log line (Java's default String.format, Go's fmt.Sprintf) holds those strings on the heap until the GC runs; under back-pressure the heap grows, GC fires more often, the assist mechanism kicks in, and latency degrades from a logging effect, not a runtime effect. The benchmark's --log-level warning killed all logging — production has full INFO logging plus structured request logs plus business-event audit logs. The canary surfaces the interaction; the benchmark hides it. The fix is structured logging with bounded buffers (zerolog in Go, log4j2 async in Java, structlog with an ipc handler in Python), but the diagnosis only happens because the canary saw what the benchmark could not.
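For the Python case named above, the bounded-buffer pattern is already in the standard library — a sketch, not the zerolog or log4j2 equivalents: a QueueHandler with a bounded queue in the request path and a QueueListener doing the blocking writes on a separate thread, so a backed-up shipper costs dropped log lines instead of stalled request handlers.
# bounded_logging.py — sketch: keep blocking log writes out of the request path with a
# bounded queue; when the shipper backs up, lines are dropped instead of blocking requests.
import logging
import logging.handlers
import queue
log_queue = queue.Queue(maxsize=10_000)          # bounded: backpressure becomes drops, not stalls
class DropOnFullHandler(logging.handlers.QueueHandler):
    def enqueue(self, record):
        try:
            self.queue.put_nowait(record)        # never block the request thread
        except queue.Full:
            pass                                 # drop; optionally count drops in a metric
stream_handler = logging.StreamHandler()         # the blocking write that the shipper tails
listener = logging.handlers.QueueListener(log_queue, stream_handler)
listener.start()
log = logging.getLogger("settlement")
log.addHandler(DropOnFullHandler(log_queue))
log.setLevel(logging.INFO)
log.info("request processed")                    # returns immediately even if the pipe is blocked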
A second runnable artefact — replaying production traffic against both runtimes
The cleanest way to close the production gap before the canary is to replay real production traffic against both runtimes in a staging environment that mirrors production's variables (kernel, libc, sidecars, co-tenants). The Python script below uses tcpdump captures from a production node, reconstructs HTTP requests from the pcap, and replays them at the original timing against two staging fleets. The output is a paired-fleet comparison just like the canary harness, but in a controlled environment where you can iterate on tuning without risking real customers.
# replay_production_traffic.py — pcap → wrk2 Lua script → paired-fleet replay
# Extracts one hour of production requests from a pcap and replays the captured mix
# against the new and old runtimes side-by-side (at the mean rate; see the caveat below).
import pathlib, subprocess
from hdrh.histogram import HdrHistogram
from scapy.all import rdpcap, TCP, Raw
PCAP_PATH = "/var/captures/upi_settle_2025-12-15T10-00.pcap" # 1 hour of prod
TARGETS = {
"old-jvm": "http://staging-jvm.internal:8080",
"new-go": "http://staging-go.internal:8080",
}
WARM_S = 60 # warmup window to ignore in the histograms
def extract_requests(pcap):
    """Reconstruct HTTP request bodies and arrival timestamps from a pcap."""
    reqs = []
    for pkt in rdpcap(pcap):
        if TCP in pkt and Raw in pkt and pkt[TCP].dport == 8080:
            payload = bytes(pkt[Raw])
            if payload.startswith(b"POST ") or payload.startswith(b"GET "):
                # Crude HTTP parse — fine for replay, not for production
                head, _, body = payload.partition(b"\r\n\r\n")
                lines = head.split(b"\r\n")
                method, path, _ = lines[0].split(b" ", 2)
                reqs.append({
                    "ts": float(pkt.time),
                    "method": method.decode(),
                    "path": path.decode(),
                    "body": body.decode(errors="replace"),
                })
    return reqs
def write_lua(reqs, out_path):
    """Generate a wrk2 Lua script that cycles through the captured request mix."""
    if not reqs:
        return
    t0 = reqs[0]["ts"]
    timings = [r["ts"] - t0 for r in reqs]          # kept for reference; the rate is set via -R
    methods = [r["method"] for r in reqs]
    bodies = [r["body"] for r in reqs]
    paths = [r["path"] for r in reqs]
    # repr() quoting happens to be Lua-compatible only for simple ASCII payloads — fine here
    out_path.write_text(f"""
local i = 0
local timings = {{ {','.join(f'{t:.6f}' for t in timings)} }}
local methods = {{ {','.join(repr(m) for m in methods)} }}
local bodies = {{ {','.join(repr(b) for b in bodies)} }}
local paths = {{ {','.join(repr(p) for p in paths)} }}
function request()
  i = (i % #paths) + 1
  return wrk.format(methods[i], paths[i], nil, bodies[i])
end
""")
def run_replay(target, lua_path, duration_s):
    cmd = ["wrk2", "-t8", "-c200", f"-R{REQ_RATE}", f"-d{duration_s}s",
           "--latency", "-s", str(lua_path), target]
    return subprocess.run(cmd, capture_output=True, text=True, timeout=duration_s + 60).stdout
def parse_p99(stdout):
    """Rebuild a histogram from wrk2's 'Detailed Percentile spectrum' block."""
    h = HdrHistogram(1, 60_000_000, 3)
    in_block = False
    for line in stdout.splitlines():
        if "Detailed Percentile spectrum" in line: in_block = True; continue
        if in_block and "----" in line: break
        stripped = line.strip()
        if in_block and stripped and stripped[0].isdigit():
            try: h.record_value(int(float(stripped.split()[0]) * 1000))  # ms → µs
            except (ValueError, IndexError): pass
    return {p: h.get_value_at_percentile(p) / 1000.0 for p in (50, 99, 99.9, 99.99)}
reqs = extract_requests(PCAP_PATH)
REQ_RATE = max(1, int(len(reqs) / 3600))   # original mean rate over the 1-hour capture
lua = pathlib.Path("/tmp/replay.lua"); write_lua(reqs, lua)
results = {}
for name, target in TARGETS.items():
    print(f"\n=== {name} === replaying {len(reqs)} requests at ~{REQ_RATE} rps")
    out = run_replay(target, lua, 3600)
    results[name] = parse_p99(out)
print(f"\n{'fleet':<10s} {'p50':>8s} {'p99':>8s} {'p99.9':>9s} {'p99.99':>9s}")
for name, p in results.items():
    print(f"{name:<10s} {p[50]:>7.2f}ms {p[99]:>7.2f}ms {p[99.9]:>8.2f}ms {p[99.99]:>8.2f}ms")
Sample run replaying one hour of UPI settlement traffic captured during the 10:00 IST burst:
=== old-jvm === replaying 1740000 requests at ~483 rps
=== new-go === replaying 1740000 requests at ~483 rps
fleet p50 p99 p99.9 p99.99
old-jvm 8.20ms 18.40ms 42.10ms 98.20ms
new-go 7.10ms 21.80ms 64.30ms 142.40ms
Walking the key lines. extract_requests(pcap) is the load-bearing function: it pulls real production payloads, real path mixes, and real arrival timestamps. The benchmark from the previous chapter sent 1.74M identical 2 KB requests at perfectly uniform 8000 RPS; this replays the actual heterogeneous mix — small-ish reads dominating the median, occasional 18 KB enrichment payloads driving the tail. REQ_RATE is set to the original mean, so the tail pressure here comes from the Lua-driven payload variance rather than from rate variation; this is a partial replay. (A perfectly faithful replay would also reproduce the original bursty arrival timing rather than the mean rate — more code than fits in this listing; a per-minute approximation is sketched below.) The output table is the closer: at the median, Go wins (7.10 vs 8.20 ms). At p99, Go loses by 18% (21.80 vs 18.40 ms). At p99.99, Go loses by 45% (142 vs 98 ms). The same workload that benchmarked as a Go win at constant rate becomes a Java win on real traffic shape — entirely because of GC-pacer-vs-burst interaction. This is the variable the canary catches and the constant-rate benchmark cannot. The decision changes; the rewrite does not ship without GOMEMLIMIT tuning that flattens the burst response.
Why traffic-shape replay is the single most important pre-canary step: the difference between constant-rate and bursty arrival is the difference between the runtime's GC operating in steady state (where modern GCs all look fine) and the GC operating in transient mode (where their differences show up as 5–10× tail spikes). Replaying real pcap traffic catches this in a staging environment in 2 hours instead of catching it during the canary at 14:08 IST while 50% of traffic is on the new fleet. The 2-hour replay, repeated for each tuning iteration, is what lets you ship a canary that doesn't roll back on the first burst. Razorpay added pcap replay to its standard pre-canary checklist after the December 2023 settlement cutover ate four hours of investigation and one rolled-back deploy.
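If you want the replay to carry the original burst shape rather than the mean rate, one cheap extension — a sketch that reuses reqs, lua, and TARGETS from the replay script above — is to bucket the captured timestamps into minutes and run one constant-rate wrk2 window per minute at that minute's rate:
# replay_with_bursts.py — sketch: recover the per-minute arrival rate from the captured
# timestamps and run one constant-rate wrk2 window per minute, so the GC sees the bursts.
import collections
import subprocess
def per_minute_rates(timestamps):
    """Bucket capture timestamps (seconds since epoch) into per-minute request rates."""
    t0 = min(timestamps)
    counts = collections.Counter(int((t - t0) // 60) for t in timestamps)
    return [counts.get(m, 0) / 60.0 for m in range(max(counts) + 1)]
def replay_bursty(target, lua_path, timestamps):
    for minute, rps in enumerate(per_minute_rates(timestamps)):
        if rps < 1:
            continue
        subprocess.run(["wrk2", "-t8", "-c200", f"-R{int(rps)}", "-d60s",
                        "--latency", "-s", str(lua_path), target],
                       capture_output=True, text=True)
# e.g. replay_bursty(TARGETS["new-go"], lua, [r["ts"] for r in reqs])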
Common confusions
- "A fair benchmark is enough to decide a rewrite." It is necessary, not sufficient. A fair benchmark narrows the candidate set — eliminates obviously slower runtimes and confirms there is plausibly a win. The decision needs canary evidence that the win survives co-tenants, bursts, sidecars, and a 2-day-old heap. A team that ships on benchmark evidence alone has a 30–40% chance of rolling back on the first morning peak.
- "Production noise will average out over a long enough window." It will not. Tail latencies do not average — they are dominated by extreme events that occur on specific minutes (the 10:00 burst, the 14:30 quarterly-report event, the on-call paging spike at 03:00). Averaging across hours hides the event whose existence determines whether you breach SLO. Always look at the worst minute, not the mean of all minutes.
- "Canarying at 5% means the canary's load is 5% of production load." Wrong direction. The canary fleet is sized 5% of production capacity, but it sees the same per-instance request rate as the baseline (the load balancer routes 5% of traffic to it). What the canary sees is the same traffic pattern at the same per-pod RPS as the baseline. That is why minute-by-minute paired comparison is meaningful — both fleets see the 10:00 burst, both see the 14:30 lull, on the same wall clock.
- "Aggregating per-pod p99s into a fleet p99 is the right way to dashboard tail latency." Mathematically wrong. p99 is not linear; the average of two pods' p99s is not the p99 of their combined request stream. The correct way is to either (a) merge the underlying HdrHistograms across pods and read p99 from the merged histogram, or (b) report a percentile range (min/median/max of per-pod p99s) so you can see when one bad pod is dragging the mean down. Prometheus's
histogram_quantileagainst asum by (le) (rate(...))query is the only built-in primitive that does this right. - "The lab benchmark and the canary should agree; if they disagree, one of them is broken." They typically disagree, and both are correct for their question. The lab benchmark answers "in a controlled environment, which runtime is faster on this workload?" The canary answers "in production, with all the noise, does the runtime change improve or regress p99?" Disagreement is information — it tells you which production variable the lab held constant matters most.
- "Once the canary is promoted, the production behaviour is locked in." It is not. A canary observed for 1 hour does not see the daily peaks (Tatkal at 10:00, IPL toss spikes), the weekly peaks (settlement reconciliation Sundays), the monthly peaks (salary day), or the seasonal peaks (Big Billion Days, Diwali). Most production-decision gaps surface during the first event the canary did not cover. The discipline is staged rollout — 5% for an hour, 25% for a day, 50% for a week, 100% only after a peak event has been observed cleanly.
Going deeper
Coordinated omission in canaries
The canary harness above pulls HdrHistograms from each fleet's application-side metrics — the latency the application records between request entry and response emit. Application-side measurement has the same coordinated-omission problem as wrk without -R: requests that arrive during a GC pause are delayed by queueing behind it, but the application's per-request timer starts only when each request is dequeued, so that queueing delay never reaches the app-side histogram. The honest fix is to measure latency from the load balancer (NLB or Envoy access logs), not from the application. The load balancer's request_processing_time includes the queueing delay introduced by application back-pressure, which is what the user actually experiences. Razorpay's canary harness pulls from both; if the two diverge by more than 20%, the application has a coordinated-omission bug in its own metrics.
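A sketch of the divergence check, assuming the load-balancer access logs have already been folded into per-minute HdrHistogram dumps under a parallel prefix in the same bucket — that prefix and the 20% threshold are assumptions, not Razorpay's actual layout:
# lb_vs_app_p99.py — sketch: compare load-balancer-side p99 against application-side p99
# for the same fleet and minute; a large gap means the app's own metrics hide queueing.
# Reuses fetch_minute() from canary_compare.py; the "lb/" prefix is an assumed layout.
DIVERGENCE_LIMIT = 1.20     # app-side p99 should sit within 20% of the LB-side p99
def check_minute(fleet, when):
    h_app = fetch_minute(fleet, when)                 # app-side dump, as in canary_compare.py
    h_lb = fetch_minute(f"lb/{fleet}", when)          # hypothetical LB-side dump
    if h_app is None or h_lb is None:
        return None
    p99_app = h_app.get_value_at_percentile(99) / 1000.0
    p99_lb = h_lb.get_value_at_percentile(99) / 1000.0
    if p99_lb > p99_app * DIVERGENCE_LIMIT:
        print(f"{when:%H:%M} {fleet}: LB p99 {p99_lb:.1f}ms vs app p99 {p99_app:.1f}ms"
              f" — app metrics are missing queueing delay")
    return p99_lb, p99_app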
Heap state matters: don't compare a fresh fleet to a 2-day-old fleet
A common canary anti-pattern: deploy the new fleet, immediately compare against the baseline fleet that has been up for 2 days. The new fleet has a fresh, unfragmented heap; the baseline has a fragmented one. The new fleet looks 10–20% better for the first 6–12 hours and the canary promotes prematurely. The honest comparison is to recycle the baseline pods at the same time as the canary deploy (or never recycle either), so both fleets have matched heap age. Java's G1 GC fragmentation curve typically rises sharply for the first 12 hours and stabilises; ZGC is more stable; Go's allocator (TCMalloc-derived) is less prone to fragmentation but still benefits from matched age. The Razorpay deploy controller's --restart-baseline flag rotates the baseline pods at canary start specifically to avoid this bias.
Synthetic burst injection during the canary
Real traffic bursts come at fixed times of day (10:00, 13:00, 17:30 IST for UPI). A canary launched at 11:00 will not see a burst until 13:00 — and most canaries are promoted before 13:00 because the deploy team wants to go to lunch. The discipline is to inject synthetic bursts into the canary fleet during the observation window: a separate load-generator pod sends an extra 2× traffic spike for 60 seconds at minute 15 and minute 35 of the canary, alongside the normal stratified-sample real traffic. This gives the canary a controlled stress test on top of its real-traffic baseline, and it surfaces GC-pacer-vs-burst interactions in 1 hour rather than 6. The baseline fleet sees the same injected bursts (the load generator targets both fleets), so the comparison stays paired.
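A sketch of the injector, assuming wrk2 is available on the controller and both fleet endpoints are reachable; the 2× spike size and the minute-15/minute-35 offsets come from the paragraph above, everything else is illustrative:
# burst_injector.py — sketch: inject the same synthetic burst into both fleets at
# minute 15 and minute 35 of the canary window, so the paired comparison stays paired.
import subprocess
import threading
import time
FLEETS = {
    "baseline": "http://settlement-batcher-jvm.internal:8080",   # hypothetical endpoints
    "canary":   "http://settlement-batcher-go.internal:8080",
}
EXTRA_RPS = 800        # roughly 2x the normal per-fleet rate during the spike — illustrative
BURST_MINUTES = (15, 35)
def spike(target):
    subprocess.run(["wrk2", "-t4", "-c100", f"-R{EXTRA_RPS}", "-d60s", "--latency", target],
                   capture_output=True, text=True)
start = time.time()
for minute in BURST_MINUTES:
    time.sleep(max(0, start + minute * 60 - time.time()))
    # fire the spike at both fleets simultaneously so neither gets an unfair minute
    threads = [threading.Thread(target=spike, args=(t,)) for t in FLEETS.values()]
    for th in threads: th.start()
    for th in threads: th.join()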
When the answer is "neither runtime is the bottleneck"
Sometimes the lab benchmark and the canary both show no meaningful runtime difference — both runtimes hit p99 = 18 ms, neither breaches SLO, the rewrite does not improve anything. This is the most useful canary outcome that nobody talks about: it tells you the runtime was not the bottleneck. The bottleneck is somewhere else — the database, the cache miss rate, the network round-trip, the serialisation format. Profile the existing fleet with py-spy / async-profiler / pprof, find the actual hot path, and fix that instead. Razorpay's payments fleet has shipped exactly one runtime rewrite in the last three years; the other 14 candidate rewrites died at the canary stage with the verdict "the runtime is not the problem". That kill-rate is a feature of the discipline, not a failure of it.
Reproduce this on your laptop
# Install the load generator, packet capture, and parsing tools
sudo apt install tcpdump tshark   # wrk2 is usually built from source: github.com/giltene/wrk2
python3 -m venv .venv && source .venv/bin/activate
pip install hdrhistogram scapy
# Record a small "production" workload locally (e.g. against a demo service)
sudo tcpdump -i lo -w /tmp/demo.pcap -s 65535 'tcp port 8080' &
TCPDUMP_PID=$!
# ... drive the demo service for 2 minutes with mixed payloads ...
sleep 120; sudo kill $TCPDUMP_PID
# Replay it against both runtimes (after starting svc_jvm and svc_go on staging hosts)
python3 replay_production_traffic.py
You should see the median favour whichever runtime is fastest at the steady-state computation, and the tail favour whichever runtime handles the burst arrivals better. If the two diverge, you have a candidate for further tuning — GOMEMLIMIT for Go, -XX:MaxGCPauseMillis for Java — before you cut a real canary.
Where this leads next
This chapter closes Part 13. The runtime is one variable in a larger system; the rest of the curriculum is the other variables — capacity planning, production debugging, case studies — and the discipline of canary-driven decisions appears in every one of them.
- /wiki/wall-cpu-is-half-the-story — the Part 1 "wall" that started the same theme: lab numbers tell you about the box, production tells you about the system around the box.
- /wiki/wall-lab-numbers-production-numbers — the cross-cutting lesson about lab-vs-production gaps that every "wall" chapter reinforces from a different angle.
- /wiki/measuring-language-runtimes-fairly — the previous chapter, which produces the lab numbers this chapter argues are necessary but insufficient.
- /wiki/coordinated-omission-and-hdr-histograms — the measurement-discipline foundation under both the benchmark and the canary.
- The next part (capacity planning and load testing) builds on the canary discipline introduced here, generalising it to predict the cliff before you fall off it.
The closing rule: trust no single number. Trust the pair — lab benchmark plus paired-fleet canary — and trust them only when they agree on the direction of the change and disagree only on the magnitude. When they disagree on direction, the production variable the lab held constant is the actual story, and you have just learned something more valuable than the rewrite would have delivered.
References
- Gil Tene, "How NOT to Measure Latency" (Strange Loop 2015) — the definitive talk on coordinated omission, equally applicable to canaries as to benchmarks.
- Brendan Gregg, Systems Performance (2nd ed., 2020) — Chapter 12 (Benchmarking) and Chapter 14 (Performance Analysis Methodology) cover the lab-to-production bridge.
- HdrHistogram project — the data structure that makes paired-fleet comparison mathematically meaningful.
- Jeff Dean & Luiz Barroso, "The Tail at Scale" (CACM 2013) — why production tails behave differently from lab tails, and why hedging exists.
- Lightstep, "Canary Deploys: A Statistical Primer" (2019) — the statistical foundation for promote/hold/rollback decisions on noisy production data.
- Netflix Tech Blog, "Automated Canary Analysis at Netflix with Kayenta" (2018) — the canonical write-up on automated canary verdicts at scale.
- /wiki/measuring-language-runtimes-fairly — the previous chapter, which this chapter argues is necessary but insufficient.
- /wiki/wall-lab-numbers-production-numbers — the cross-curriculum cousin of this article, focused on the same gap from a different angle.