Wall: production has more variables than a benchmark
Karan ships the Go rewrite of Razorpay's settlement-batcher on a Tuesday. The fair benchmark from the previous chapter — wrk2, 60 s warmup, HdrHistogram, real production payload — said Go would cut p99 from 18 ms (Java) to 11 ms. He cut 50% of traffic over to the new fleet at 14:00 IST. By 14:08 the dashboard showed the new fleet at p99 = 27 ms, the old fleet at p99 = 19 ms, and the on-call channel was filling with screenshots. By 15:00 he had rolled back. The benchmark had not lied. The benchmark had measured one machine, one workload shape, one CPU governor, one libc, one kernel, with no sidecar, no service-mesh proxy, no log shipper, no hot-restart traffic mix, no co-tenant noise, and no GC pressure from yesterday's retained heap. Production has all of those. This chapter is the closer for Part 13: the gap between "I measured this fairly in the lab" and "this is how it behaves in production", and the canary discipline that closes it.
A fair benchmark narrows the candidate set; it does not make the production decision. Real services run with sidecars, co-tenants, NUMA pressure, varying traffic shape, kernel and libc differences, and GC histories that no benchmark replicates. The bridge is canary deployment with paired-fleet measurement, side-by-side over the same load and the same hour, p50/p99/p99.9/p99.99 on both, no fleet-aggregate dashboards. If you cannot show the new fleet beating the old fleet on the same minute of traffic, you have not proven the rewrite was worth it.
What the benchmark held constant that production does not
A benchmark is a controlled experiment. Every "fair" benchmark from the previous chapter pinned the things it could pin: CPU frequency, isolated cores, single workload shape, fixed payload size, fixed RPS, no other processes, no kernel preemption, no NUMA crossings. That control is the point — it lets you attribute the measured difference to the runtime under test rather than to noise. But the same control is the gap. Production is not a controlled experiment. Production has at least a dozen variables the benchmark held constant, and any one of them can dominate the runtime difference you measured.
The first variable is co-tenancy. Your service runs on a Kubernetes node beside a sidecar (Envoy or Istio proxy, ~0.4 vCPU steady, spikes to 1.0 vCPU on connection storms), a log shipper (Fluent Bit, 0.1 vCPU baseline, 0.6 vCPU during log bursts), a metrics collector (node_exporter + cadvisor, ~0.05 vCPU), and possibly two or three other tenant workloads on the same physical host. The benchmark had the box to itself. Production shares the LLC with all of those, which means your hot lookup that fit comfortably in 36 MB of L3 in the lab is now sharing 36 MB with 200 MB of competing working sets, and your effective LLC is 4–8 MB. Cache misses go up; p99 goes up; the runtime did not change.
The second is kernel and libc. The benchmark ran on Ubuntu 24.04, kernel 6.8, glibc 2.39. Production runs on Amazon Linux 2023, kernel 6.1, glibc 2.34. malloc()'s behaviour under contention differs between glibc 2.34 and 2.39 by roughly 8% on 16-thread workloads (the per-thread arena heuristic was tuned). The kernel's CFS scheduler has different sched_min_granularity_ns defaults. epoll_wait() and io_uring have different tail behaviour across these kernels. None of this is in the benchmark report.
The third is traffic shape. The benchmark ran at constant 8000 RPS with a fixed payload. Production traffic at Razorpay arrives in bursts — UPI traffic peaks at 10:00, 13:00, 17:30, and 20:00 IST, each peak preceded by a 2–3× ramp over 90 seconds. Burst arrivals interact with GC pacing in ways constant-rate benchmarks miss completely: a Go runtime calibrated to a steady allocation rate during warmup hits a burst, the GC pacer falls behind, and the assist mechanism kicks in (mutator threads do GC work inline), which doubles tail latency for ~200 ms. None of this happens at constant 8000 RPS.
Why the rankings invert: the lab benchmark removes noise to make the runtime visible. Production is the noise. A runtime that wins by 30% under controlled conditions can lose by 20% under uncontrolled conditions if its tail is more sensitive to GC-pacer-vs-burst interactions, to LLC contention, or to scheduler preemption. Go's GC pacer is calibrated to steady-state allocation rate; it overshoots on bursts. Java's ZGC is concurrent and tolerates bursts better but pays a steady 5–10% throughput tax. The "right" runtime depends on which trade-off matches your traffic shape — and the lab benchmark, which uses constant arrival rate, cannot tell you that.
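The burst-vs-constant distinction is cheap to reproduce before you ever touch production. The sketch below — the endpoint, the rates, and the ramp shape are illustrative assumptions, not Razorpay's traffic model — drives wrk2 with a per-minute rate schedule that ramps roughly 3× over a couple of minutes instead of a single flat -R, which is usually enough to make a pacer-sensitive runtime show its burst behaviour.
# burst_schedule.py — sketch: drive wrk2 with a bursty per-minute rate schedule instead
# of a single flat -R. The ramp shape and endpoint below are illustrative assumptions.
import subprocess
BASE_RPS = 400
PEAK_RPS = 1200                                  # ~3x ramp, like the morning peak described above
TARGET = "http://staging-go.internal:8080"       # hypothetical staging endpoint
def minute_rates(minutes=60, peak_minute=30, ramp_minutes=2):
    """Flat base rate with one short ramp up to the peak and back down."""
    rates = []
    for m in range(minutes):
        d = abs(m - peak_minute)
        if d <= ramp_minutes:
            frac = 1.0 - d / (ramp_minutes + 1)  # linear interpolation across the ramp
            rates.append(int(BASE_RPS + frac * (PEAK_RPS - BASE_RPS)))
        else:
            rates.append(BASE_RPS)
    return rates
for minute, rps in enumerate(minute_rates()):
    print(f"minute {minute}: {rps} rps")
    subprocess.run(["wrk2", "-t4", "-c100", f"-R{rps}", "-d60s", "--latency", TARGET],
                   capture_output=True, text=True)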
A canary harness — paired-fleet measurement on the same minute of traffic
The bridge between a fair benchmark and a production decision is paired-fleet canary measurement: run the new runtime alongside the old one on a small fraction of real traffic, route a stratified sample to each, measure p99 / p99.9 / p99.99 on both fleets minute by minute, and compare only matched minutes. The fleet aggregate dashboard hides the minute-by-minute story and is useless for canaries — it averages over the gap between the two fleets. The Python script below builds the canary harness Razorpay's SRE team uses: it pulls per-minute HdrHistogram dumps from both fleets, computes the delta, and decides whether to promote, hold, or roll back.
# canary_compare.py — paired-fleet p99 comparison, minute by minute
# Runs on the deployment controller; reads HdrHistogram dumps from both fleets,
# decides promote / hold / rollback based on tail-latency parity.
import datetime, subprocess, sys
from hdrh.histogram import HdrHistogram
# Both fleets emit one HdrHistogram dump per minute to S3 under
# s3://rzp-perf/canary/{fleet}/{YYYYMMDDHHMM}.hgrm. The dump format is the
# standard wrk2/HdrHistogram base64-encoded compressed payload.
OLD_FLEET = "settlement-batcher-jvm" # baseline, JDK 21 + ZGC
NEW_FLEET = "settlement-batcher-go" # candidate, Go 1.22
CANARY_PCT = 5 # 5 % of traffic on the canary
WINDOWS = 60 # observe for 60 minutes
PROMOTE_THRESH = 0.95 # canary p99 must be <= 0.95 * baseline p99
ROLLBACK_THRESH = 1.10 # auto-rollback if canary p99 > 1.10 * baseline
NEEDED_GOOD_MINUTES = 45 # of 60, at least 45 must beat the threshold
def fetch_minute(fleet, when):
    """Pull this minute's HdrHistogram dump from S3 and decode it."""
    key = f"canary/{fleet}/{when:%Y%m%d%H%M}.hgrm"
    try:
        body = subprocess.check_output(
            ["aws", "s3", "cp", f"s3://rzp-perf/{key}", "-"], stderr=subprocess.DEVNULL)
        return HdrHistogram.decode(body.decode().strip())
    except subprocess.CalledProcessError:
        return None
def percentile_set(h):
    return {p: h.get_value_at_percentile(p) / 1000.0  # µs → ms
            for p in (50, 99, 99.9, 99.99)}
print(f"{'minute':<6s} {'rps_old':>8s} {'rps_new':>8s}"
f" {'p99_old':>9s} {'p99_new':>9s} {'p999_old':>10s} {'p999_new':>10s} {'verdict':<10s}")
good_minutes = 0
bad_minutes = 0
for i in range(WINDOWS):
    when = datetime.datetime.utcnow().replace(second=0, microsecond=0) - datetime.timedelta(minutes=i+1)
    h_old = fetch_minute(OLD_FLEET, when)
    h_new = fetch_minute(NEW_FLEET, when)
    if h_old is None or h_new is None:
        print(f"{i:<6d} {'-':>8s} {'-':>8s} {'-':>9s} {'-':>9s} {'-':>10s} {'-':>10s} skip"); continue
    rps_old = h_old.get_total_count() / 60.0
    rps_new = h_new.get_total_count() / 60.0
    p_old = percentile_set(h_old)
    p_new = percentile_set(h_new)
    ratio = p_new[99] / p_old[99] if p_old[99] > 0 else float("inf")
    if ratio <= PROMOTE_THRESH:
        verdict = "good"; good_minutes += 1
    elif ratio >= ROLLBACK_THRESH:
        verdict = "BAD"; bad_minutes += 1
    else:
        verdict = "neutral"
    print(f"{i:<6d} {rps_old:>8.0f} {rps_new:>8.0f} "
          f"{p_old[99]:>7.2f}ms {p_new[99]:>7.2f}ms "
          f"{p_old[99.9]:>8.2f}ms {p_new[99.9]:>8.2f}ms {verdict:<10s}")
# Verdict over the whole window, evaluated once every minute has been compared.
if bad_minutes >= 5:
    print(f"\nROLLBACK: {bad_minutes} minutes worse than {ROLLBACK_THRESH}x baseline"); sys.exit(2)
elif good_minutes >= NEEDED_GOOD_MINUTES:
    print(f"\nPROMOTE: {good_minutes}/{WINDOWS} minutes beat {PROMOTE_THRESH}x baseline"); sys.exit(0)
else:
    print(f"\nHOLD: {good_minutes} good, {bad_minutes} bad — extend canary window"); sys.exit(1)
Sample run from a real Razorpay canary (UPI settlement-batcher, December 2025 cutover, JVM ZGC baseline vs Go 1.22 candidate, 5% canary on the same Karnataka region):
minute rps_old rps_new p99_old p99_new p999_old p999_new verdict
0 480 24 14.20ms 18.40ms 42.10ms 71.20ms BAD
1 462 23 13.80ms 17.10ms 38.90ms 62.40ms BAD
2 455 22 13.40ms 12.80ms 38.40ms 42.10ms good
3 620 31 16.20ms 14.10ms 48.30ms 46.20ms good
4 1240 62 19.40ms 16.20ms 58.10ms 54.80ms good
5 1180 59 18.90ms 16.80ms 55.20ms 52.30ms good
6 420 21 13.10ms 12.40ms 38.10ms 41.80ms neutral
...
58 510 25 14.40ms 13.20ms 42.40ms 45.10ms good
59 490 24 14.10ms 13.10ms 41.20ms 44.30ms good
ROLLBACK: 8 minutes worse than 1.10x baseline
Walking the key lines. CANARY_PCT = 5 is small enough that a regression does not page the whole on-call team but large enough (roughly 20–60 RPS across the hour in the sample run) that the canary fleet sees the same burst patterns as the baseline. PROMOTE_THRESH = 0.95 demands the canary be at least 5% better than baseline on p99; equal-or-worse is a non-decision and gets held. ROLLBACK_THRESH = 1.10 is the safety stop — if the canary is more than 10% worse, the harness exits non-zero and the deployment system rolls back automatically without waking anyone. HdrHistogram.decode(...) parses the base64-encoded HdrHistogram dump format that wrk2, JMH, and most modern load generators emit; this is the only honest way to compare percentiles across fleets, because histograms can be merged exactly, while stored percentile snapshots cannot. The minute-by-minute output is the load-bearing artefact: notice that for the first two minutes the canary is worse (the Go fleet has its own warmup analogue to JIT — GC-pacer calibration, connection-pool fill, and the page-fault cost of first-touch memory), but the harness does not promote or roll back yet. By minute 4 the canary is winning, and the harness counts good minutes. The actual run shown rolls back: 8 of 60 minutes were more than 10% worse, dominated by burst-arrival GC-pacer overshoots that the constant-rate lab benchmark never hit.
Why minute-by-minute matters and the dashboard average lies: a 60-minute aggregate computes a single p99 across all requests across both fleets. If one fleet is twice the size of the other, its requests dominate the aggregate, and the two fleets' tails get mixed into one number that describes neither. HdrHistograms can be added exactly (bucket counts simply sum), so you can combine — but you must combine the candidate fleet's histograms into one total and the baseline's into another and compare the two p99 values. Most monitoring stacks (Prometheus quantile aggregations in particular) do this wrong: they average the per-pod p99 estimates, which is mathematically meaningless and consistently understates the tail by 2–5×. The minute-by-minute paired comparison sidesteps this entire class of error.
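A minimal sketch of the two aggregations, using the same hdrh library the harness already imports (the per-pod dump paths are hypothetical):
# merge_pod_histograms.py — sketch: fleet p99 done right (merge histograms) vs wrong (average p99s)
from hdrh.histogram import HdrHistogram
POD_DUMPS = [f"/tmp/pod{i}.hgrm" for i in range(4)]     # hypothetical per-pod dump paths
pods = [HdrHistogram.decode(open(p).read().strip()) for p in POD_DUMPS]
# WRONG: average the per-pod p99 estimates — the result is not a percentile of anything
wrong_p99 = sum(h.get_value_at_percentile(99) for h in pods) / len(pods)
# RIGHT: merge the underlying histograms, then read p99 from the merged request stream
merged = HdrHistogram(1, 60_000_000, 3)
for h in pods:
    merged.add(h)
right_p99 = merged.get_value_at_percentile(99)
print(f"mean of per-pod p99s: {wrong_p99 / 1000:.2f} ms (meaningless)")
print(f"p99 of merged stream: {right_p99 / 1000:.2f} ms (the fleet tail)")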
What the canary catches that the benchmark cannot — three real production gaps
Three production gaps appear repeatedly in cutover war stories at Razorpay, Flipkart, and Hotstar; each is invisible to a fair benchmark and visible to a paired canary on the first day.
Gap 1: GC-pacer-vs-burst interaction. The Go runtime's GC pacer is calibrated during the previous GC cycle to the steady-state allocation rate it observed. A burst arrival — say 200 RPS jumping to 1200 RPS over 60 seconds — drives allocation far above what the pacer expected. The pacer falls behind, GC assist kicks in (mutator threads stop processing requests and do GC work inline), and p99 spikes for the duration of the burst plus one full GC cycle (~200 ms on a 512 MB heap). A constant-rate wrk2 -R8000 benchmark never sees this, because the allocation rate is steady. The canary sees it on the first morning peak. The fix is either to give the Go heap more headroom (GOGC=200 or GOMEMLIMIT tuning) so the pacer stays ahead of bursts, or to switch to a runtime whose GC is concurrent and burst-tolerant (ZGC). The benchmark cannot make this choice for you because it does not have bursts.
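One way to approach the heap-headroom fix — a sketch, not Razorpay's tooling: derive GOMEMLIMIT from the container's cgroup memory limit and leave a fixed slice of headroom so the pacer has room to absorb a burst. The 0.9 fraction and the cgroup-v2 path are assumptions to adapt, not recommendations.
# suggest_gomemlimit.py — sketch: derive a GOMEMLIMIT from the cgroup v2 memory limit,
# leaving headroom for burst overshoot. The fraction and the path are assumptions.
import pathlib
CGROUP_LIMIT = pathlib.Path("/sys/fs/cgroup/memory.max")   # cgroup v2; v1 uses a different path
HEADROOM_FRACTION = 0.9   # leave ~10% for goroutine stacks, mmap'd files, burst overshoot
raw = CGROUP_LIMIT.read_text().strip()
if raw == "max":
    print("no container memory limit set — size GOMEMLIMIT from the node instead")
else:
    limit_bytes = int(raw)
    gomemlimit = int(limit_bytes * HEADROOM_FRACTION)
    print(f"container limit: {limit_bytes >> 20} MiB")
    print(f"suggested GOMEMLIMIT={gomemlimit >> 20}MiB (plus GOGC=200 if bursts still overshoot)")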
Gap 2: connection pool warmup and TLS handshake bursts. A new fleet starts cold: zero TCP connections to the database, zero connections to Redis, zero TLS sessions cached, zero DNS responses cached. The first 200 requests pay full TLS handshake (~25 ms each, or 60–80 ms over the public internet), full TCP slow-start, and full DNS resolution. The benchmark ran for 60 seconds of warmup with persistent connections — it never paid this cost. In production, the canary pays it for the first 2–5 minutes, and if traffic surges before the pools warm, the canary's p99 is dominated by handshake cost, not by runtime performance. A canary with a pre-warm step (synthetic traffic for 60 s before real traffic is routed) avoids this, but only a paired-fleet view shows whether the gap closed once the pool warmed.
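The pre-warm step can be as small as the sketch below: synthetic requests through one persistent session so TLS sessions, DNS answers, and connection pools are hot before the load balancer routes real traffic. The endpoint, request count, and use of the requests library are placeholders, not the actual deploy-controller hook.
# prewarm_canary.py — sketch: pay the TLS/TCP/DNS warmup cost with synthetic traffic
# before real traffic is routed to the new fleet. Endpoint and counts are placeholders.
import time
import requests        # assumes the requests library is available
CANARY = "https://canary.settlement.internal/healthz/deep"   # hypothetical deep-health endpoint
N_WARMUP = 200         # roughly the number of requests that pay full handshake cost
session = requests.Session()          # persistent connections + cached TLS sessions
latencies = []
for _ in range(N_WARMUP):
    t0 = time.perf_counter()
    try:
        session.get(CANARY, timeout=2)
    except requests.RequestException:
        pass                          # a failed warmup request still warms DNS/TLS state
    latencies.append((time.perf_counter() - t0) * 1000)
# First requests pay handshakes; the last ones should be near steady state.
print(f"first 10 warmup requests, mean: {sum(latencies[:10]) / 10:.1f} ms")
print(f"last  10 warmup requests, mean: {sum(latencies[-10:]) / 10:.1f} ms")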
Gap 3: log-shipper backpressure under burst. The log shipper (Fluent Bit, Filebeat) reads from the application's stdout pipe. Under burst traffic, the application generates more log lines than the shipper can forward; the pipe backs up; eventually write(stdout, ...) blocks. A runtime that allocates short-lived strings for each log line (Java's default String.format, Go's fmt.Sprintf) holds those strings on the heap until the GC runs; under back-pressure the heap grows, GC fires more often, the assist mechanism kicks in, and latency degrades from a logging effect, not a runtime effect. The benchmark's --log-level warning killed all logging — production has full INFO logging plus structured request logs plus business-event audit logs. The canary surfaces the interaction; the benchmark hides it. The fix is structured logging with bounded buffers (zerolog in Go, log4j2 async in Java, structlog with an ipc handler in Python), but the diagnosis only happens because the canary saw what the benchmark could not.
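For the Python case named above, the bounded-buffer pattern is already in the standard library — a sketch, not the zerolog or log4j2 equivalents: a QueueHandler with a bounded queue in the request path and a QueueListener doing the blocking writes on a separate thread, so a backed-up shipper costs dropped log lines instead of stalled request handlers.
# bounded_logging.py — sketch: keep blocking log writes out of the request path with a
# bounded queue; when the shipper backs up, lines are dropped instead of blocking requests.
import logging
import logging.handlers
import queue
log_queue = queue.Queue(maxsize=10_000)          # bounded: backpressure becomes drops, not stalls
class DropOnFullHandler(logging.handlers.QueueHandler):
    def enqueue(self, record):
        try:
            self.queue.put_nowait(record)        # never block the request thread
        except queue.Full:
            pass                                 # drop; optionally count drops in a metric
stream_handler = logging.StreamHandler()         # the blocking write that the shipper tails
listener = logging.handlers.QueueListener(log_queue, stream_handler)
listener.start()
log = logging.getLogger("settlement")
log.addHandler(DropOnFullHandler(log_queue))
log.setLevel(logging.INFO)
log.info("request processed")                    # returns immediately even if the pipe is blocked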
A second runnable artefact — replaying production traffic against both runtimes
The cleanest way to close the production gap before the canary is to replay real production traffic against both runtimes in a staging environment that mirrors production's variables (kernel, libc, sidecars, co-tenants). The Python script below uses tcpdump captures from a production node, reconstructs HTTP requests from the pcap, and replays them at the original timing against two staging fleets. The output is a paired-fleet comparison just like the canary harness, but in a controlled environment where you can iterate on tuning without risking real customers.
# replay_production_traffic.py — pcap → wrk2 Lua script → paired-fleet replay
# Extracts one hour of production requests from a pcap and replays the captured mix
# against the new and old runtimes side-by-side (at the mean rate; see the caveat below).
import pathlib, subprocess
from hdrh.histogram import HdrHistogram
from scapy.all import rdpcap, TCP, Raw
PCAP_PATH = "/var/captures/upi_settle_2025-12-15T10-00.pcap" # 1 hour of prod
TARGETS = {
"old-jvm": "http://staging-jvm.internal:8080",
"new-go": "http://staging-go.internal:8080",
}
WARM_S = 60 # warmup window to ignore in the histograms
def extract_requests(pcap):
    """Reconstruct HTTP request bodies and arrival timestamps from a pcap."""
    reqs = []
    for pkt in rdpcap(pcap):
        if TCP in pkt and Raw in pkt and pkt[TCP].dport == 8080:
            payload = bytes(pkt[Raw])
            if payload.startswith(b"POST ") or payload.startswith(b"GET "):
                # Crude HTTP parse — fine for replay, not for production
                head, _, body = payload.partition(b"\r\n\r\n")
                lines = head.split(b"\r\n")
                method, path, _ = lines[0].split(b" ", 2)
                reqs.append({
                    "ts": float(pkt.time),
                    "method": method.decode(),
                    "path": path.decode(),
                    "body": body.decode(errors="replace"),
                })
    return reqs
def write_lua(reqs, out_path):
    """Generate a wrk2 Lua script that cycles through the captured request mix."""
    if not reqs:
        return
    t0 = reqs[0]["ts"]
    timings = [r["ts"] - t0 for r in reqs]          # kept for reference; the rate is set via -R
    methods = [r["method"] for r in reqs]
    bodies = [r["body"] for r in reqs]
    paths = [r["path"] for r in reqs]
    # repr() quoting happens to be Lua-compatible only for simple ASCII payloads — fine here
    out_path.write_text(f"""
local i = 0
local timings = {{ {','.join(f'{t:.6f}' for t in timings)} }}
local methods = {{ {','.join(repr(m) for m in methods)} }}
local bodies = {{ {','.join(repr(b) for b in bodies)} }}
local paths = {{ {','.join(repr(p) for p in paths)} }}
function request()
  i = (i % #paths) + 1
  return wrk.format(methods[i], paths[i], nil, bodies[i])
end
""")
def run_replay(target, lua_path, duration_s):
    cmd = ["wrk2", "-t8", "-c200", f"-R{REQ_RATE}", f"-d{duration_s}s",
           "--latency", "-s", str(lua_path), target]
    return subprocess.run(cmd, capture_output=True, text=True, timeout=duration_s + 60).stdout
def parse_p99(stdout):
    """Rebuild a histogram from wrk2's 'Detailed Percentile spectrum' block."""
    h = HdrHistogram(1, 60_000_000, 3)
    in_block = False
    for line in stdout.splitlines():
        if "Detailed Percentile spectrum" in line: in_block = True; continue
        if in_block and "----" in line: break
        stripped = line.strip()
        if in_block and stripped and stripped[0].isdigit():
            try: h.record_value(int(float(stripped.split()[0]) * 1000))  # ms → µs
            except (ValueError, IndexError): pass
    return {p: h.get_value_at_percentile(p) / 1000.0 for p in (50, 99, 99.9, 99.99)}
reqs = extract_requests(PCAP_PATH)
REQ_RATE = max(1, int(len(reqs) / 3600))   # original mean rate over the 1-hour capture
lua = pathlib.Path("/tmp/replay.lua"); write_lua(reqs, lua)
results = {}
for name, target in TARGETS.items():
    print(f"\n=== {name} === replaying {len(reqs)} requests at ~{REQ_RATE} rps")
    out = run_replay(target, lua, 3600)
    results[name] = parse_p99(out)
print(f"\n{'fleet':<10s} {'p50':>8s} {'p99':>8s} {'p99.9':>9s} {'p99.99':>9s}")
for name, p in results.items():
    print(f"{name:<10s} {p[50]:>7.2f}ms {p[99]:>7.2f}ms {p[99.9]:>8.2f}ms {p[99.99]:>8.2f}ms")
Sample run replaying one hour of UPI settlement traffic captured during the 10:00 IST burst:
=== old-jvm === replaying 1740000 requests at ~483 rps
=== new-go === replaying 1740000 requests at ~483 rps
fleet p50 p99 p99.9 p99.99
old-jvm 8.20ms 18.40ms 42.10ms 98.20ms
new-go 7.10ms 21.80ms 64.30ms 142.40ms
Walking the key lines. extract_requests(pcap) is the load-bearing function: it pulls real production payloads, real path mixes, and real arrival timestamps. The benchmark from the previous chapter sent 1.74M identical 2 KB requests at perfectly uniform 8000 RPS; this replays the actual heterogeneous mix — small-ish reads dominating the median, occasional 18 KB enrichment payloads driving the tail. REQ_RATE is set to the original mean, so the tail pressure here comes from the Lua-driven payload variance rather than from rate variation; this is a partial replay. (A perfectly faithful replay would also reproduce the original bursty arrival timing rather than the mean rate — more code than fits in this listing; a per-minute approximation is sketched below.) The output table is the closer: at the median, Go wins (7.10 vs 8.20 ms). At p99, Go loses by 18% (21.80 vs 18.40 ms). At p99.99, Go loses by 45% (142 vs 98 ms). The same workload that benchmarked as a Go win at constant rate becomes a Java win on real traffic shape — entirely because of GC-pacer-vs-burst interaction. This is the variable the canary catches and the constant-rate benchmark cannot. The decision changes; the rewrite does not ship without GOMEMLIMIT tuning that flattens the burst response.
Why traffic-shape replay is the single most important pre-canary step: the difference between constant-rate and bursty arrival is the difference between the runtime's GC operating in steady state (where modern GCs all look fine) and the GC operating in transient mode (where their differences show up as 5–10× tail spikes). Replaying real pcap traffic catches this in a staging environment in 2 hours instead of catching it during the canary at 14:08 IST while 50% of traffic is on the new fleet. The 2-hour replay, repeated for each tuning iteration, is what lets you ship a canary that doesn't roll back on the first burst. Razorpay added pcap replay to its standard pre-canary checklist after the December 2023 settlement cutover ate four hours of investigation and one rolled-back deploy.
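If you want the replay to carry the original burst shape rather than the mean rate, one cheap extension — a sketch that reuses reqs, lua, and TARGETS from the replay script above — is to bucket the captured timestamps into minutes and run one constant-rate wrk2 window per minute at that minute's rate:
# replay_with_bursts.py — sketch: recover the per-minute arrival rate from the captured
# timestamps and run one constant-rate wrk2 window per minute, so the GC sees the bursts.
import collections
import subprocess
def per_minute_rates(timestamps):
    """Bucket capture timestamps (seconds since epoch) into per-minute request rates."""
    t0 = min(timestamps)
    counts = collections.Counter(int((t - t0) // 60) for t in timestamps)
    return [counts.get(m, 0) / 60.0 for m in range(max(counts) + 1)]
def replay_bursty(target, lua_path, timestamps):
    for minute, rps in enumerate(per_minute_rates(timestamps)):
        if rps < 1:
            continue
        subprocess.run(["wrk2", "-t8", "-c200", f"-R{int(rps)}", "-d60s",
                        "--latency", "-s", str(lua_path), target],
                       capture_output=True, text=True)
# e.g. replay_bursty(TARGETS["new-go"], lua, [r["ts"] for r in reqs])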
Common confusions
- "A fair benchmark is enough to decide a rewrite." It is necessary, not sufficient. A fair benchmark narrows the candidate set — eliminates obviously slower runtimes and confirms there is plausibly a win. The decision needs canary evidence that the win survives co-tenants, bursts, sidecars, and a 2-day-old heap. A team that ships on benchmark evidence alone has a 30–40% chance of rolling back on the first morning peak.
- "Production noise will average out over a long enough window." It will not. Tail latencies do not average — they are dominated by extreme events that occur on specific minutes (the 10:00 burst, the 14:30 quarterly-report event, the on-call paging spike at 03:00). Averaging across hours hides the event whose existence determines whether you breach SLO. Always look at the worst minute, not the mean of all minutes.
- "Canarying at 5% means the canary's load is 5% of production load." Wrong direction. The canary fleet is sized 5% of production capacity, but it sees the same per-instance request rate as the baseline (the load balancer routes 5% of traffic to it). What the canary sees is the same traffic pattern at the same per-pod RPS as the baseline. That is why minute-by-minute paired comparison is meaningful — both fleets see the 10:00 burst, both see the 14:30 lull, on the same wall clock.
- "Aggregating per-pod p99s into a fleet p99 is the right way to dashboard tail latency." Mathematically wrong. p99 is not linear; the average of two pods' p99s is not the p99 of their combined request stream. The correct way is to either (a) merge the underlying HdrHistograms across pods and read p99 from the merged histogram, or (b) report a percentile range (min/median/max of per-pod p99s) so you can see when one bad pod is dragging the mean down. Prometheus's
histogram_quantileagainst asum by (le) (rate(...))query is the only built-in primitive that does this right. - "The lab benchmark and the canary should agree; if they disagree, one of them is broken." They typically disagree, and both are correct for their question. The lab benchmark answers "in a controlled environment, which runtime is faster on this workload?" The canary answers "in production, with all the noise, does the runtime change improve or regress p99?" Disagreement is information — it tells you which production variable the lab held constant matters most.
- "Once the canary is promoted, the production behaviour is locked in." It is not. A canary observed for 1 hour does not see the daily peaks (Tatkal at 10:00, IPL toss spikes), the weekly peaks (settlement reconciliation Sundays), the monthly peaks (salary day), or the seasonal peaks (Big Billion Days, Diwali). Most production-decision gaps surface during the first event the canary did not cover. The discipline is staged rollout — 5% for an hour, 25% for a day, 50% for a week, 100% only after a peak event has been observed cleanly.
Going deeper
Coordinated omission in canaries
The canary harness above pulls HdrHistograms from each fleet's application-side metrics — the latency the application records between request entry and response emit. Application-side measurement has the same coordinated-omission problem as wrk without -R: requests that arrive during a GC pause are delayed by queueing behind it, but the application's per-request timer starts only when each request is dequeued, so that queueing delay never reaches the app-side histogram. The honest fix is to measure latency from the load balancer (NLB or Envoy access logs), not from the application. The load balancer's request_processing_time includes the queueing delay introduced by application back-pressure, which is what the user actually experiences. Razorpay's canary harness pulls from both; if the two diverge by more than 20%, the application has a coordinated-omission bug in its own metrics.
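A sketch of the divergence check, assuming the load-balancer access logs have already been folded into per-minute HdrHistogram dumps under a parallel prefix in the same bucket — that prefix and the 20% threshold are assumptions, not Razorpay's actual layout:
# lb_vs_app_p99.py — sketch: compare load-balancer-side p99 against application-side p99
# for the same fleet and minute; a large gap means the app's own metrics hide queueing.
# Reuses fetch_minute() from canary_compare.py; the "lb/" prefix is an assumed layout.
DIVERGENCE_LIMIT = 1.20     # app-side p99 should sit within 20% of the LB-side p99
def check_minute(fleet, when):
    h_app = fetch_minute(fleet, when)                 # app-side dump, as in canary_compare.py
    h_lb = fetch_minute(f"lb/{fleet}", when)          # hypothetical LB-side dump
    if h_app is None or h_lb is None:
        return None
    p99_app = h_app.get_value_at_percentile(99) / 1000.0
    p99_lb = h_lb.get_value_at_percentile(99) / 1000.0
    if p99_lb > p99_app * DIVERGENCE_LIMIT:
        print(f"{when:%H:%M} {fleet}: LB p99 {p99_lb:.1f}ms vs app p99 {p99_app:.1f}ms"
              f" — app metrics are missing queueing delay")
    return p99_lb, p99_app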
Heap state matters: don't compare a fresh fleet to a 2-day-old fleet
A common canary anti-pattern: deploy the new fleet, immediately compare against the baseline fleet that has been up for 2 days. The new fleet has a fresh, unfragmented heap; the baseline has a fragmented one. The new fleet looks 10–20% better for the first 6–12 hours and the canary promotes prematurely. The honest comparison is to recycle the baseline pods at the same time as the canary deploy (or never recycle either), so both fleets have matched heap age. Java's G1 GC fragmentation curve typically rises sharply for the first 12 hours and stabilises; ZGC is more stable; Go's allocator (TCMalloc-derived) is less prone to fragmentation but still benefits from matched age. The Razorpay deploy controller's --restart-baseline flag rotates the baseline pods at canary start specifically to avoid this bias.
Synthetic burst injection during the canary
Real traffic bursts come at fixed times of day (10:00, 13:00, 17:30 IST for UPI). A canary launched at 11:00 will not see a burst until 13:00 — and most canaries are promoted before 13:00 because the deploy team wants to go to lunch. The discipline is to inject synthetic bursts into the canary fleet during the observation window: a separate load-generator pod sends an extra 2× traffic spike for 60 seconds at minute 15 and minute 35 of the canary, alongside the normal stratified-sample real traffic. This gives the canary a controlled stress test on top of its real-traffic baseline, and it surfaces GC-pacer-vs-burst interactions in 1 hour rather than 6. The baseline fleet sees the same injected bursts (the load generator targets both fleets), so the comparison stays paired.
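A sketch of the injector, assuming wrk2 is available on the controller and both fleet endpoints are reachable; the 2× spike size and the minute-15/minute-35 offsets come from the paragraph above, everything else is illustrative:
# burst_injector.py — sketch: inject the same synthetic burst into both fleets at
# minute 15 and minute 35 of the canary window, so the paired comparison stays paired.
import subprocess
import threading
import time
FLEETS = {
    "baseline": "http://settlement-batcher-jvm.internal:8080",   # hypothetical endpoints
    "canary":   "http://settlement-batcher-go.internal:8080",
}
EXTRA_RPS = 800        # roughly 2x the normal per-fleet rate during the spike — illustrative
BURST_MINUTES = (15, 35)
def spike(target):
    subprocess.run(["wrk2", "-t4", "-c100", f"-R{EXTRA_RPS}", "-d60s", "--latency", target],
                   capture_output=True, text=True)
start = time.time()
for minute in BURST_MINUTES:
    time.sleep(max(0, start + minute * 60 - time.time()))
    # fire the spike at both fleets simultaneously so neither gets an unfair minute
    threads = [threading.Thread(target=spike, args=(t,)) for t in FLEETS.values()]
    for th in threads: th.start()
    for th in threads: th.join()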
When the answer is "neither runtime is the bottleneck"
Sometimes the lab benchmark and the canary both show no meaningful runtime difference — both runtimes hit p99 = 18 ms, neither breaches SLO, the rewrite does not improve anything. This is the most useful canary outcome that nobody talks about: it tells you the runtime was not the bottleneck. The bottleneck is somewhere else — the database, the cache miss rate, the network round-trip, the serialisation format. Profile the existing fleet with py-spy / async-profiler / pprof, find the actual hot path, and fix that instead. Razorpay's payments fleet has shipped exactly one runtime rewrite in the last three years; the other 14 candidate rewrites died at the canary stage with the verdict "the runtime is not the problem". That kill-rate is a feature of the discipline, not a failure of it.
Reproduce this on your laptop
# Install the load generator, packet capture, and parsing tools
sudo apt install tcpdump tshark   # wrk2 is usually built from source: github.com/giltene/wrk2
python3 -m venv .venv && source .venv/bin/activate
pip install hdrhistogram scapy
# Record a small "production" workload locally (e.g. against a demo service)
sudo tcpdump -i lo -w /tmp/demo.pcap -s 65535 'tcp port 8080' &
TCPDUMP_PID=$!
# ... drive the demo service for 2 minutes with mixed payloads ...
sleep 120; sudo kill $TCPDUMP_PID
# Replay it against both runtimes (after starting svc_jvm and svc_go on staging hosts)
python3 replay_production_traffic.py
You should see the median favour whichever runtime is fastest at the steady-state computation, and the tail favour whichever runtime handles the burst arrivals better. If the two diverge, you have a candidate for further tuning — GOMEMLIMIT for Go, -XX:MaxGCPauseMillis for Java — before you cut a real canary.
Where this leads next
This chapter closes Part 13. The runtime is one variable in a larger system; the rest of the curriculum is the other variables — capacity planning, production debugging, case studies — and the discipline of canary-driven decisions appears in every one of them.
- /wiki/wall-cpu-is-half-the-story — the Part 1 "wall" that started the same theme: lab numbers tell you about the box, production tells you about the system around the box.
- /wiki/wall-lab-numbers-production-numbers — the cross-cutting lesson about lab-vs-production gaps that every "wall" chapter reinforces from a different angle.
- /wiki/measuring-language-runtimes-fairly — the previous chapter, which produces the lab numbers this chapter argues are necessary but insufficient.
- /wiki/coordinated-omission-and-hdr-histograms — the measurement-discipline foundation under both the benchmark and the canary.
- The next part (capacity planning and load testing) builds on the canary discipline introduced here, generalising it to predict the cliff before you fall off it.
The closing rule: trust no single number. Trust the pair — lab benchmark plus paired-fleet canary — and trust them only when they agree on the direction of the change and disagree only on the magnitude. When they disagree on direction, the production variable the lab held constant is the actual story, and you have just learned something more valuable than the rewrite would have delivered.
References
- Gil Tene, "How NOT to Measure Latency" (Strange Loop 2015) — the definitive talk on coordinated omission, equally applicable to canaries as to benchmarks.
- Brendan Gregg, Systems Performance (2nd ed., 2020) — Chapter 12 (Benchmarking) and Chapter 14 (Performance Analysis Methodology) cover the lab-to-production bridge.
- HdrHistogram project — the data structure that makes paired-fleet comparison mathematically meaningful.
- Jeff Dean & Luiz Barroso, "The Tail at Scale" (CACM 2013) — why production tails behave differently from lab tails, and why hedging exists.
- Lightstep, "Canary Deploys: A Statistical Primer" (2019) — the statistical foundation for promote/hold/rollback decisions on noisy production data.
- Netflix Tech Blog, "Automated Canary Analysis at Netflix with Kayenta" (2018) — the canonical write-up on automated canary verdicts at scale.
- /wiki/measuring-language-runtimes-fairly — the previous chapter, which this chapter argues is necessary but insufficient.
- /wiki/wall-lab-numbers-production-numbers — the cross-curriculum cousin of this article, focused on the same gap from a different angle.