Case: p99 spike that was a GC tuning flag
It is 09:14 IST on a Tuesday. Razorpay's payments-API has been serving the morning UPI rush for forty-four minutes, and Aditi, on call for the order-creation service, is staring at a Grafana panel that has flatlined at the worst possible value: p99 latency, normally 180 ms, has climbed to 1.4 seconds and is holding there. Throughput is unchanged — 28,000 requests per second across the fleet, exactly yesterday's number for this hour. CPU utilisation is unchanged — 38% average, 54% on the busiest pod. Heap usage is unchanged — a clean 12 GB sawtooth oscillating between 7 GB post-GC and 11 GB pre-GC. Every signal that usually screams during an outage is silent. The only thing that changed overnight was a JVM upgrade: OpenJDK 17.0.8 to 21.0.2, rolled out at 04:15 IST under the standard low-traffic deploy window, and the engineer who pushed it had verified that all 412 integration tests passed and the canary pod ran clean for ninety minutes. This chapter is how a single GC tuning flag — MaxGCPauseMillis=200, a flag that had been correct for the OpenJDK 17 G1 collector and was catastrophically wrong for the OpenJDK 21 collector on the same heap — was found, named, and removed in fifty minutes once someone finally opened the GC log instead of the dashboard.
A p99 latency cliff with flat CPU, flat throughput, and flat heap usage is almost always a stop-the-world pause that the dashboard does not surface — usually a GC pause, sometimes a safepoint, occasionally kernel scheduler latency. The diagnostic ladder for these incidents is GC log → pause-time histogram → root-cause flag (heap size, target pause, region size, concurrent thread count), and the pause-time histogram is the discriminating instrument the way the dominator tree is for retention bugs.
What the dashboards showed and why they were silent
Aditi's first move when the alert fired was the standard four-panel grid: requests-per-second, p50/p99/p99.9, CPU utilisation, heap usage. The picture was the kind on-callers dread, because none of the panels had a story.
Three things about this picture are worth dwelling on, because they recur in every "silent-dashboard" incident. First, throughput is unchanged because the offered load is exactly yesterday's load — UPI traffic at 09:00 IST on a Tuesday is one of the most predictable curves Razorpay sees, and the load generator (in this case, the actual customers) is closed-loop with retries that absorb pauses without dropping requests. Second, CPU is unchanged because the dashboard's CPU panel reports user-mode percentage averaged over five-second windows; a 200 ms STW pause every two seconds is 10% of wall-clock time, but it surfaces (if anywhere) as kernel-mode or JVM-internal time that the dashboard does not plot separately. Third, heap is unchanged because the amount of garbage being produced is unchanged — what changed is how long each collection takes, which is invisible to a heap-size dashboard.
Why heap-size dashboards miss GC-pause regressions: heap usage is a space metric, GC pauses are a time metric, and the two are decoupled. A collector that produces a 50 ms pause and a collector that produces a 500 ms pause can run on the same heap, with the same throughput, with the same allocation rate, and produce the same heap-usage chart. The difference shows up only in the application's response-latency distribution. This is the recurring trap of latency debugging: the question "what changed in the heap" is well-instrumented and almost always the wrong question; the question "what changed in the GC pause distribution" is poorly instrumented and almost always the right one.
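A toy model makes the decoupling concrete. The sketch below is illustrative only (the 2-second collection cycle, 40 ms service time, and pause lengths are invented, not measured from the incident): it holds the allocation pattern fixed, so both runs would draw the same heap sawtooth, and varies only the pause length.
# pause_vs_heap_toy.py — same heap behaviour, different pause length, different request p99
# A sketch with invented numbers; not Razorpay's code or data.
import random
def request_p99(pause_ms, cycle_ms=2000.0, service_ms=40.0, n=200_000):
    random.seed(42)
    lats = []
    for _ in range(n):
        arrival = random.uniform(0.0, cycle_ms)  # where in the GC cycle the request lands
        wait = pause_ms - arrival if arrival < pause_ms else 0.0  # stalled until the pause ends
        lats.append(service_ms + wait)
    lats.sort()
    return lats[int(0.99 * n)]
for pause in (50, 500):
    print(f"{pause} ms pause every 2 s -> request p99 ~ {request_p99(pause):.0f} ms")
With a 50 ms pause the printed p99 lands near 70 ms; with a 500 ms pause it lands near 520 ms, on what would be an identical heap chart.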
The team's first hypothesis, formed in the four minutes between alert and war-room, was network — a peer dependency at p99 had spiked, dragging the whole request graph. The downstream-service latency panel ruled this out: every dependency was within its normal band, and the slow time was being spent inside the JVM, not waiting on a socket. The second hypothesis was a thread-pool starvation event, perhaps a deadlock or a slow allocator. Aditi pulled a jstack dump from one of the loud pods. The threads were spread across RUNNABLE and WAITING (on a condition) in roughly the proportion the team saw under steady load. No deadlock, no obvious starvation pattern, no thread blocked on a slow lock. The third hypothesis — and the one that took the team forty minutes too long to reach — was garbage collection. The reason it took so long is instructive: every dashboard the team had built over four years emphasised heap size, not GC pause time, because every prior memory incident had been a leak (rising floor) and the team had calibrated its instruments to that family of bug.
The dashboard's silence was not a failure of the system; it was a failure of the instruments aimed at the system. Different bug-families need different instruments, and this one wanted GC log analysis. The dashboards, optimised for the prior incident's shape, had nothing to say about this one's shape.
The GC log and the pause-time histogram
JVM applications running G1GC produce a structured log of every garbage-collection event. Razorpay's payments-API was already running with -Xlog:gc*:file=/var/log/jvm/gc.log:time,level,tags, capturing every Young, Mixed, and Full pause along with the concurrent-mark events between them. The log file from the loud pod was 340 MB by the time Aditi pulled it. Reading 340 MB of GC log line-by-line is not the right move; the right move is to parse it into a pause-time histogram and read the histogram.
# gc_pause_histogram.py — parse a G1GC log into a percentile histogram
# Run: python3 gc_pause_histogram.py /var/log/jvm/gc.log
import argparse, re
from collections import defaultdict
from hdrh.histogram import HdrHistogram
# Match lines like:
# [2026-04-25T09:14:03.412+0530][info][gc] GC(8421) Pause Young (Normal) (G1 Evacuation Pause) 15M->7M(32M) 412.7ms
# Under unified logging, Mixed collections log as "Pause Young (Mixed)", so the kind
# group must capture the qualifier; anchoring the duration to the end of the line keeps
# the heap-transition figures (15M->7M(32M)) from being mistaken for it.
PAUSE_RE = re.compile(
    r"\[(?P<ts>[\dT:.+\-]+)\].*?GC\(\d+\) Pause "
    r"(?P<kind>Young \((?:Normal|Mixed|Prepare Mixed|Concurrent Start)\)|Full|Remark|Cleanup)"
    r".*?(?P<dur>[\d.]+)ms\s*$"
)
def parse_pause(line):
    """Return (timestamp, kind, duration_us) for a G1 pause line, else None."""
    m = PAUSE_RE.search(line)
    if not m:
        return None
    kind = m.group("kind")
    if kind.startswith("Young"):  # normalise: "Young (Mixed)" is a Mixed pause
        kind = "Mixed" if kind == "Young (Mixed)" else "Young"
    return m.group("ts"), kind, int(float(m.group("dur")) * 1000)
def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("logfile")
    ap.add_argument("--since", help="ISO timestamp to start from")
    a = ap.parse_args()
    hist_by_kind = defaultdict(lambda: HdrHistogram(1, 60_000_000, 3))  # 1us..60s, 3 sig figs
    counts = defaultdict(int)
    with open(a.logfile) as f:
        for line in f:
            parsed = parse_pause(line)
            if not parsed:
                continue
            ts, kind, dur_us = parsed
            if a.since and ts < a.since:  # ISO-8601 strings compare lexicographically
                continue
            hist_by_kind[kind].record_value(dur_us)
            counts[kind] += 1
    print(f"{'kind':<10}{'count':>8}{'p50':>10}{'p99':>10}{'p99.9':>10}{'max':>10}")
    for kind, h in sorted(hist_by_kind.items()):
        p50 = h.get_value_at_percentile(50) / 1000
        p99 = h.get_value_at_percentile(99) / 1000
        p999 = h.get_value_at_percentile(99.9) / 1000
        mx = h.get_max_value() / 1000
        print(f"{kind:<10}{counts[kind]:>8}{p50:>9.1f}{p99:>9.1f}{p999:>9.1f}{mx:>9.1f} ms")
if __name__ == "__main__":
    main()
# Sample run on the payments-API pod GC log, 09:00–09:15 IST window:
# python3 gc_pause_histogram.py gc.log --since 2026-04-25T09:00
kind count p50 p99 p99.9 max
Young 412 18.4 412.7 798.2 1182.0 ms
Mixed 37 142.0 1180.4 1340.0 1380.0 ms
Remark 12 8.2 14.1 18.3 19.0 ms
Cleanup 12 2.1 3.4 4.0 4.2 ms
Why the histogram is the right lens, not the per-event log: a 340 MB GC log has roughly 470,000 lines and 12,000 pause events over a 90-minute window. Eyeballing the log line-by-line for the slow pauses does not scale and biases toward whichever pauses your eye happens to land on. The percentile histogram answers the actionable question — "what does the tail of pause time look like?" — directly, and turns 12,000 events into six numbers per pause-kind. The shape of those numbers (Young at p99 = 412 ms, p99.9 = 798 ms; Mixed at p99 = 1180 ms) is the diagnosis: Young pauses, which are normally 10–30 ms on this heap, are spending an order of magnitude more time at the tail than they should, and the Mixed pauses are catastrophic.
Walk through six lines of the parser:
- PAUSE_RE: The regex captures the three pieces of every G1 pause line the histogram needs — timestamp, kind, and duration in milliseconds — and skips over the cause-string in parentheses (irrelevant to the histogram, useful for incident narration). G1 logs are line-oriented and stable across JVM minor versions, so a regex is the right tool; no need to pull in a full parser. One wrinkle: under unified logging a Mixed collection appears as "Pause Young (Mixed)", so the kind group captures the qualifier and parse_pause normalises it to Mixed.
- HdrHistogram(1, 60_000_000, 3): HdrHistogram with a 1 µs to 60 s range and 3 significant figures of precision. This is the standard configuration the data-engineering and database curricula use and the same one the rest of this chapter expects. The 3-sig-fig precision means p99 is reported to within 0.1% of its true value — well below the noise floor of the underlying measurement.
- record_value(dur_us): The histogram is kept in microseconds because that is HdrHistogram's natural unit. Converting from milliseconds to microseconds at ingest keeps the get_value_at_percentile outputs scaled correctly.
- get_value_at_percentile(99): This is the percentile of pause times, not the percentile of request latencies. The two are distinct. A pause-time p99 of 412 ms means 1% of GC pauses last longer than 412 ms; the corresponding request p99 depends on how often pauses fall on hot requests, which is what the rest of the diagnosis works out.
- The output table: Read column-by-column. The Mixed pause p99 of 1180 ms — 1.18 seconds — is the buried lead. Mixed collections in G1 are supposed to be tens of milliseconds; the only way they reach a second is if the collector is failing to meet its pause-time target.
- Per-kind separation matters: pooling Young and Mixed into one histogram would average out the Mixed-collection signal because Young dominates the count. Always separate by GC kind when reading G1 logs.
The diagnosis was now concrete: Mixed collections, which the OpenJDK 17 G1 collector had been completing in 80–200 ms on this heap, were taking over a second on the OpenJDK 21 G1 collector. The next question is why, and the answer is in one tuning flag.
The tuning flag and why the JVM upgrade exposed it
Razorpay's payments-API JVM had been launched, for the last three years, with this set of GC flags:
-Xms32g -Xmx32g
-XX:+UseG1GC
-XX:MaxGCPauseMillis=200
-XX:G1HeapRegionSize=4m
-XX:InitiatingHeapOccupancyPercent=45
-XX:ConcGCThreads=4
-XX:ParallelGCThreads=16
Every flag was set by an SRE in 2023 based on a tuning exercise against OpenJDK 17. The exercise had measured pause times against a representative load, swept the flags, and landed on this configuration as the one that produced p99 pause times under 200 ms while keeping throughput above 95% of the unconstrained baseline. The configuration shipped, the team moved on, and for three years it served the morning UPI rush at p99 = 180 ms request latency — a number the team was justifiably proud of.
The OpenJDK 21 release notes called out, in section 4.7, a behavioural change in G1's region-selection algorithm during Mixed collections. The change is genuinely an improvement in the average case (it tracks per-region remembered-set scan cost more accurately, leading to better pause-time prediction on most workloads) but it interacts badly with one specific configuration: a MaxGCPauseMillis setting that is aggressive relative to the heap size, on a workload with a large old-generation working set. The new heuristic, when asked to hit a 200 ms pause target on a 32 GB heap with an 18 GB old-generation working set, fails to evacuate enough regions per Mixed pause because it pessimistically estimates each region's evacuation cost. As the collector falls behind on Mixed evacuation, the old generation grows, the heap fills toward the IHOP threshold, and the next Mixed pause has more work to do — but the same 200 ms budget. The pause overshoots, the next overshoots more, and within forty seconds the service is in the regime where every Mixed pause is a stop-the-world second.
The fix was to remove MaxGCPauseMillis=200 and let G1's adaptive defaults pick a target — which on a 32 GB heap with this workload settled at roughly 350 ms pause-target during normal operation and adaptively widened to 600 ms during Mixed-heavy periods. Counterintuitively, increasing the target pause time made the request p99 better: 350 ms pauses that always met their target produced a request p99 of 195 ms, while 200 ms pauses that overshot to 1.4 s produced a request p99 of 1.4 s. The arithmetic is brutal — a tail-pause that is seven times the target dominates the entire request distribution, while a pause that is twice as long but consistent produces only a small p99 tax.
Why "tighter target = worse p99" is not a paradox: a pause-time target in G1 is a budget, not a hard limit. The collector tries to evacuate as many regions as it estimates will fit in the budget. If the estimate is wrong, the pause overshoots — and the overshoot is unbounded. A tight budget on a workload the collector cannot accurately estimate produces more overshoots and bigger overshoots than a loose budget on the same workload. The mental model "smaller is better, the JVM will achieve what I ask" breaks down when the JVM's estimator is wrong. The defensive setting — particularly across JVM upgrades — is to leave MaxGCPauseMillis unset and accept the JVM's self-tuned default, which the JVM revises with each release based on whatever the new estimator's accuracy looks like. The team that explicitly configures MaxGCPauseMillis is signing up to re-validate the configuration on every JVM upgrade. In practice, no team has the bandwidth to do this re-validation reliably, so the setting drifts into wrongness on the next release.
The fix rolled out at 10:04 IST. Within twelve minutes every restarted pod's request p99 was back at 195 ms, slightly above the previous baseline of 180 ms because the new collector spends slightly more CPU on concurrent marking. Throughput, CPU, and heap usage continued to be unchanged from the morning's silent baseline — the only signal that ever moved was the one the dashboards had not been showing.
It is worth pausing on the timeline. The deploy landed at 04:15 IST. The first customer-visible degradation was at 09:00 IST when UPI traffic ramped. The first page fired at 09:14 IST. The team formed a war-room at 09:18 IST. The first forty minutes were spent hypothesising about network and thread-pool issues. The GC log was opened at 09:54 IST and the histogram was generated by 10:00 IST. The fix was identified at 10:02 IST and rolled out by 10:04 IST. From the first read of the GC log to the rollout was eight minutes; from the first page to the first read of the GC log was forty minutes. The lesson, which the post-incident review codified explicitly: the on-call should reach for the GC log before the dashboard on any latency-tail incident on a JVM service. Reversing the default order of investigation — log first, dashboard second — would have shaved most of the forty-minute pre-fix window.
What the team changed beyond removing the flag
The single-flag fix stopped the bleeding, but the post-incident review at 14:00 IST produced four follow-ups that mattered more than the hot-fix. Performance debugging culture is graded on what you change after the fire, not on how fast you put the fire out — and Razorpay's payments-platform team has a specific cultural emphasis on this, having published their incident-postmortem template in 2024.
The first follow-up was a GC pause-time SLO panel. The dashboard now plots pause-time p99 by GC kind — Young, Mixed, Full — alongside the existing request-latency panels. The pause-time panel uses the same percentile-histogram code as the request-latency panel (the gc_pause_histogram.py script above, run continuously by a sidecar, exporting to Prometheus). The team set an SLO of pause-time p99 < 250 ms for Young, < 500 ms for Mixed; alerts fire when either is breached for more than three minutes. The morning of the OpenJDK 21 incident, the alert would have fired at 04:35 IST — twenty minutes after the deploy and four hours before any customer noticed.
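A minimal sketch of that sidecar, reusing parse_pause from the parser above. The metric name, the port, and the five-minute window are illustrative assumptions, as is the naive tail that ignores log rotation; it relies on the prometheus_client and hdrhistogram packages.
# pause_slo_exporter.py — minimal sidecar sketch; metric name, port, and window are
# illustrative assumptions, not Razorpay's actual configuration.
import time
from collections import defaultdict
from hdrh.histogram import HdrHistogram
from prometheus_client import Gauge, start_http_server
from gc_pause_histogram import parse_pause  # the parser defined earlier in this chapter
PAUSE_P99 = Gauge("jvm_gc_pause_p99_ms", "GC pause p99 over the trailing window", ["kind"])
def follow(path):
    # tail -f, naively: no rotation handling; good enough for a sketch
    with open(path) as f:
        f.seek(0, 2)
        while True:
            line = f.readline()
            if line:
                yield line
            else:
                time.sleep(0.5)
def main(logfile="/var/log/jvm/gc.log", window_s=300):
    start_http_server(9404)  # Prometheus scrapes this port
    hists = defaultdict(lambda: HdrHistogram(1, 60_000_000, 3))
    deadline = time.time() + window_s
    for line in follow(logfile):
        parsed = parse_pause(line)
        if parsed:
            _, kind, dur_us = parsed
            hists[kind].record_value(dur_us)
        if time.time() >= deadline:
            # Publish the window and start a new one. The check only runs when a log
            # line arrives, which on a busy pod is every few seconds.
            for kind, h in hists.items():
                PAUSE_P99.labels(kind=kind).set(h.get_value_at_percentile(99) / 1000)
            hists.clear()
            deadline = time.time() + window_s
if __name__ == "__main__":
    main()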
The second follow-up was a staged JVM-upgrade protocol. The previous protocol — a single canary pod for ninety minutes, full rollout if no errors — turned out to be inadequate because the canary pod's traffic mix at 04:15 IST was lower than the morning rush, and the GC pathology only emerged under the morning rush's allocation rate. The new protocol stages the rollout: 1 pod for 24 hours, 10% of pods for 24 hours, 50% for 24 hours, 100% only after a full UPI-morning-rush has been observed at each stage. The pause-time SLO panel is the gating signal at each stage; if pause-time p99 for any GC kind moves outside its band, the stage halts and the JVM upgrade reverts. Why staged rollout is not just paranoia: the JVM is a piece of software with hundreds of tunables, and any single release contains dozens of behavioural changes that interact non-linearly with workload patterns. The probability that some tunable is wrongly configured for some workload after a major-version JVM upgrade is, empirically, near 1. Catching the misconfiguration costs hours of staged-rollout time per upgrade; missing it costs an entire morning of degraded p99 plus the engineering cost of incident response. The expected-value math favours staged rollout for any workload at Razorpay-scale or above.
The third follow-up was removal of explicit GC tuning flags across the JVM-services fleet. The audit found 47 services with MaxGCPauseMillis explicitly set, 31 with G1HeapRegionSize explicitly set, 22 with InitiatingHeapOccupancyPercent explicitly set, and 14 with ConcGCThreads explicitly set. Each of these had been added at some point by an SRE who had run a tuning exercise against whatever JVM was current at the time. The audit's recommendation was to remove every flag whose justification could not be traced to a current, reproduced benchmark on the current JVM version against the current workload. Sixty-one of the 114 explicit settings were removed without measurable degradation. The remaining 53 were kept but tagged with a JVM-VERSION-LAST-VALIDATED annotation in the deployment config, with a 12-month expiry that auto-flags the setting for re-validation. The cultural change is from "tune once, set forever" to "a tune is a perishable artefact that decays on each JVM upgrade".
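The mechanical half of such an audit is small enough to sketch. The deploy/<service>/jvm.flags layout and the annotation spelling below are invented for illustration; a real audit would read whatever config format the fleet actually uses.
# gc_flag_audit.py — fleet sweep for explicit GC tuning flags; the file layout and
# annotation name are invented examples, not Razorpay's deployment schema.
import pathlib, re
TUNING_FLAG = re.compile(
    r"-XX:(?:MaxGCPauseMillis|G1HeapRegionSize|InitiatingHeapOccupancyPercent|ConcGCThreads)=\S+"
)
VALIDATED = re.compile(r"JVM-VERSION-LAST-VALIDATED:\s*(\S+)")
def audit(root="deploy"):
    for path in sorted(pathlib.Path(root).glob("*/jvm.flags")):
        text = path.read_text()
        flags = TUNING_FLAG.findall(text)
        if not flags:
            continue  # no explicit tuning: nothing to audit
        tag = VALIDATED.search(text)
        verdict = (f"validated on {tag.group(1)}" if tag
                   else "no validation record -> remove or re-benchmark")
        print(f"{path.parent.name}: {', '.join(flags)} [{verdict}]")
if __name__ == "__main__":
    audit()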
The fourth follow-up was a GC log retention upgrade. Pre-incident, GC logs were rotated daily and kept for seven days on the pod's local volume — long enough to cover a normal incident, but lost when the pod restarted. Post-incident, every JVM service writes GC logs to a sidecar that ships them to S3 with a 90-day retention. Total volume is roughly 12 GB per service per day; the bill is about ₹4,800 per month across all 47 JVM services at S3 Glacier Instant Retrieval pricing. The benefit is that for any incident in the past 90 days, the team has the full GC log to parse — which means the diagnostic ladder always has rung two available, regardless of when the incident occurred or whether the pod has since restarted. Many of the most expensive incidents in the team's history would have been resolved hours faster had this retention existed; the cost is a rounding error against the engineering hours saved.
The combined effect over the six months following the incident: zero GC-related p99 incidents, two pause-time SLO alerts that caught regressions before customer impact (one of which was a different MaxGCPauseMillis issue on a different service that the audit had missed), and a measurable reduction in the on-call team's time-to-diagnose for any latency-tail incident — from a median of 47 minutes (pre-incident) to 14 minutes (post-incident), driven primarily by the new pause-time panel pointing the on-caller at the GC log within the first five minutes.
The four follow-ups generalise. Any team running a JVM service in production should have a pause-time SLO panel, a staged JVM-upgrade protocol, an audit of explicit GC flags with a re-validation cadence, and a long-retention GC log shipped to durable storage. None of these are JVM-specific; the same shape applies to any runtime with stop-the-world events (V8 GC for Node.js services, CPython GC for long-lived Python services, Erlang BEAM scheduler pauses on large message queues). The instrument differs, the discipline does not.
The taxonomy of "p99 cliffs without obvious cause"
The Razorpay incident is one instance of a recurring class — incidents where the dashboards are silent except for the latency tail. Across the published postmortems from Razorpay, Hotstar, Zerodha, Flipkart, and Swiggy over the last four years, the same five sources account for the bulk of these silent-dashboard p99 cliffs: GC pauses, safepoint pile-ups, JIT deoptimisation storms, kernel scheduler latency, and retry tail amplification.
Why this taxonomy matters operationally: the on-caller's first instinct on a p99 cliff is to look at the application metrics that produced the previous incident's signal. A team whose previous three incidents were leaks will look at heap; a team whose previous three were CPU saturation will look at flame graphs. The category of bug that breaks p99 silently — pauses, scheduler latency, retry amplification — is precisely the category that the request-handler's own metrics do not surface. The instruments live one layer down: the JVM's GC log, the kernel's scheduler trace, the request-graph trace. Diagnosing this family of incident requires the team to have built — before the incident — the second-tier dashboards that surface those instruments. Building them after the incident is normal; not having a plan for which instruments to build is the failure mode.
The diagnostic ladder generalises across the five flavours:
- Confirm the dashboards are silent on every other panel. A latency cliff with a CPU spike is a CPU problem; a latency cliff with a heap-floor climb is a memory problem; the silent-dashboard family is the one where every other signal is flat. Confirming the silence is the unblocking step — it tells you the instrument is not in the request-handler's metrics.
- Open the GC log. For JVM applications, this is rung two regardless of which sub-flavour the bug ends up being. The GC log captures pauses (flavour 1), safepoints (flavour 2), and JIT events (flavour 3), and the absence of pause-time anomalies in the log redirects the search to scheduler or retry causes.
- Reach for the kernel-level instruments. bpftrace's runqlat.bt for scheduler latency, perf sched for context-switch storms, dmesg for memory-pressure events. These are the rung-three instruments when rung two clears.
- Trace one slow request end-to-end. A single distributed trace of a request that took 1.4 seconds, with span-level timing at every hop, will name the source of the latency in seconds — but you must have distributed tracing wired in before the incident.
The fix is then matched to the flavour: GC pause regressions get a flag review or a JVM downgrade; safepoint pile-ups get a lock-review or biased-locking disable; JIT deopt storms get a warmup or inline-budget tune; scheduler latency gets CPU pinning or cgroup adjustment; retry tail amplification gets hedged requests or capped retry budgets. None of these are deploys to the application code, none are scaling decisions, and all of them are recoverable in minutes once the right instrument has named the cause.
Common confusions
- "A latency cliff with flat CPU is impossible." CPU dashboards typically report user-mode utilisation averaged over five-second windows. A 200 ms STW pause every two seconds adds 10% to total CPU time, but if the pause is reported as kernel-mode (or as separate JVM-internal time the dashboard does not aggregate), the user-mode panel stays flat. Latency-with-flat-CPU is the signature of a stop-the-world event, not an impossibility.
- "Lower
MaxGCPauseMillisis better." A pause-time target in G1 is a budget, not a hard limit. A target the collector cannot meet on the current workload produces overshoots that are worse than a relaxed target the collector can hit consistently. Tighter targets only help when the collector's pause-time estimator is accurate, which is workload-dependent and JVM-version-dependent. The defensive setting is to leave the flag unset and accept the JVM-specific default. - "GC is a solved problem with ZGC / Shenandoah." ZGC and Shenandoah have sub-millisecond pauses, which solves the pause problem at the cost of higher CPU overhead during concurrent collection (typically 10–20% of one core). They do not eliminate pauses entirely — read-barrier costs are amortised but still present, and pathological cases (very short-lived objects allocated faster than the concurrent collector can keep up) still produce visible latency. ZGC is excellent for this curriculum's family of bug, but it is not a free lunch.
- "A canary pod for 90 minutes catches JVM-upgrade regressions." A canary pod under low traffic does not exercise the same allocation rate, the same heap occupancy, or the same Mixed-collection frequency as the production workload at peak. Canaries that pass under low traffic and fail under peak traffic are the recurring shape of "the canary lied" incidents. Staged rollouts that observe each stage under a peak window — at Razorpay, at least one full UPI morning rush — are the protocol that catches load-dependent regressions.
- "
jstackshows the cause of a slow request."jstackcaptures thread states at one instant. A 1.4-second request spends most of its time in paused (not stack-recorded) state during a GC pause, then resumes and completes normally — by the time youjstack, the request is already inRUNNABLEpost-pause. The instant-in-time view ofjstackis the wrong instrument for stop-the-world events; the GC log is the right one. - "More heap fixes pause-time problems." Increasing
-Xmxusually increases pause times because the GC has more memory to scan during each collection. The relationship is non-monotonic — going from 8 GB to 16 GB might reduce GC frequency enough to net-improve latency, but going from 16 GB to 32 GB on the same workload usually increases per-pause time more than it reduces frequency. There is no safe rule that "more heap is better" for latency-sensitive applications.
Going deeper
How MaxGCPauseMillis interacts with G1's region-selection algorithm
G1's Mixed-collection algorithm proceeds in two phases on every Mixed pause: it picks which old-generation regions to evacuate, then it evacuates them. The picking phase uses a per-region cost estimate — how long the collector predicts each region will take to evacuate — and it sums estimates until it hits the pause-time budget. If the budget is 200 ms and each region is estimated at 8 ms, the collector picks 25 regions; if the actual cost is 15 ms per region, the pause overshoots to 375 ms.
The OpenJDK 21 change to the cost estimator is a refinement: it tracks per-region remembered-set size and uses it to predict scan cost more accurately. On most workloads this produces better pause-time prediction. On workloads with a large old-generation working set and skewed reference patterns — payments-API has both, because each payment record references a customer record which references a merchant record — the new estimator overestimates the cost of high-RSet regions. The collector picks too few regions per Mixed pause, falls behind on old-generation evacuation, the IHOP threshold triggers more Mixed pauses, and each one has more work backed up than the last. The cascade ends only when the application is in degraded steady-state with second-long pauses every few seconds.
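A toy model shows the shape of the cascade. The sketch below is not G1's real heuristic; the region counts, costs, and pressure threshold are invented. A pessimistic estimator keeps each budgeted pause comfortably short while the evacuation backlog grows, until heap pressure forces budget-ignoring pauses far past the target.
# mixed_cascade_toy.py — toy model of the budget/estimator interaction; all numbers
# are invented, and this is not the real G1 region-selection code.
def simulate(label, budget_ms, est_cost_ms, true_cost_ms,
             promoted_per_cycle=20, pressure_limit=200, cycles=30):
    backlog = 100  # old regions awaiting evacuation
    print(f"--- {label} ---")
    for c in range(cycles):
        if backlog > pressure_limit:  # heap pressure: the budget is ignored
            picked = backlog // 2
        else:  # normal: pick regions up to the *estimated* budget
            picked = min(backlog, max(1, int(budget_ms // est_cost_ms)))
        pause_ms = picked * true_cost_ms  # what the pause actually costs
        backlog += promoted_per_cycle - picked
        if pause_ms > budget_ms or c % 5 == 0:
            print(f"cycle {c:2d}: picked={picked:3d} pause={pause_ms:6.0f} ms backlog={backlog:3d}")
# Accurate estimator (17-like): picks ~25 regions per cycle, keeps up, stays on budget.
simulate("accurate estimator", budget_ms=200, est_cost_ms=8, true_cost_ms=8)
# Pessimistic estimator (21-on-this-workload-like): picks 10, falls behind, and the
# recurring pressure pauses blow far past the 200 ms budget.
simulate("pessimistic estimator", budget_ms=200, est_cost_ms=20, true_cost_ms=8)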
The fix is structural: removing MaxGCPauseMillis lets the JVM's adaptive sizing pick a target the estimator can actually hit on this workload, which on the payments-API turned out to be 350 ms — wide enough to allow the collector to stay ahead of old-generation growth.
Why -Xlog:gc* is non-negotiable for production JVMs
Every production JVM should run with the verbose GC log enabled to a file with rotation. The cost is approximately 50 KB per pod per minute of disk I/O — well below the noise floor of any real workload — and the benefit is that any latency-tail incident in the past N days has the GC log available for analysis. Without the log, the diagnostic ladder for GC-pause incidents stops at rung one (the silent dashboard) and the on-caller is reduced to inference.
The standard configuration is -Xlog:gc*:file=/var/log/jvm/gc.log:time,level,tags:filecount=10,filesize=100M. This produces ten 100 MB files rotated, so the last 1 GB of GC log is always available — typically 5–10 days of activity on a busy pod. Pair with a daily ship-to-S3 sidecar for retention beyond pod restarts.
Reading a G1 log: the four event kinds that matter
A G1 log has many event kinds; four of them dominate diagnosis:
- Pause Young (Normal): Eden-region evacuation, fired when Eden fills. Should be 5–30 ms on a well-tuned heap. Anomalous when p99 climbs above 100 ms.
- Pause Young (Mixed): Eden plus a slice of old-generation regions. Should be 30–150 ms. Anomalous when p99 climbs above 500 ms — that is the Razorpay signature.
- Concurrent Mark Cycle: the concurrent phase that identifies which old regions are mostly garbage. Not a pause, but its frequency tells you how aggressively the old generation is being scanned.
- Pause Full: the all-stop full collection, fired only when concurrent collection cannot keep up with allocation. Should be zero on a healthy heap. Any Full pause is a bug.
A pause-time histogram by kind, as in the script above, surfaces these four signals at a glance. Reading the histogram before reading the raw log is the time-saving move on every GC incident.
The IHOP threshold and why it matters more than the team realised
The InitiatingHeapOccupancyPercent flag (IHOP) controls when G1 starts a concurrent-mark cycle: when the heap reaches IHOP percent of -Xmx, the cycle begins, and Mixed pauses follow once the cycle completes. The default is adaptive — the JVM picks a value based on observed allocation rate and concurrent-mark duration — but the Razorpay configuration had it pinned at 45%, set by the same 2023 tuning exercise that fixed MaxGCPauseMillis.
A pinned IHOP at 45% on a 32 GB heap means concurrent-mark starts when 14.4 GB is allocated. On the OpenJDK 17 collector, this was the right setting: concurrent-mark took roughly four seconds, during which another 2 GB of allocation occurred, so Mixed pauses began with 16.4 GB of old-generation occupancy and roughly 13 GB of headroom to work through. On the OpenJDK 21 collector, with the new estimator falling behind on Mixed evacuation, the headroom was consumed faster than expected, the heap grew through the 80% mark mid-Mixed-cycle, and the collector entered an emergency mode that further slowed Mixed pauses. Removing the IHOP pin alongside the MaxGCPauseMillis removal let the JVM raise IHOP to 55% adaptively, giving the new estimator more headroom to work with. The interaction between two pinned flags is the subtle hazard of explicit GC tuning — each flag was defensible in isolation; their interaction on a different JVM version was not.
Reproduce this on your laptop
sudo apt install openjdk-21-jdk
python3 -m venv .venv && source .venv/bin/activate
pip install hdrhistogram
# Run a synthetic allocation-heavy workload and capture its GC log:
cat > AllocBench.java <<'EOF'
import java.util.*;
public class AllocBench {
    public static void main(String[] a) throws Exception {
        // Bounded LRU: ~200k live entries of 1-9 KB each, roughly 1 GB resident --
        // a large old-generation working set relative to the 2 GB heap used below.
        Map<Long,byte[]> live = new LinkedHashMap<>() {
            protected boolean removeEldestEntry(Map.Entry<Long,byte[]> e) { return size() > 200_000; }};
        Random r = new Random();
        long start = System.currentTimeMillis();
        // Each put allocates a fresh array; once the map is full the eldest entry is
        // evicted, so old-generation regions churn and Mixed collections have real work.
        for (long i = 0; i < 50_000_000L; i++) {
            live.put(r.nextLong(), new byte[1024 + r.nextInt(8192)]);
            if (i % 1_000_000 == 0)
                System.out.printf("i=%d elapsed=%ds%n", i, (System.currentTimeMillis()-start)/1000);
        }
    }
}
EOF
javac AllocBench.java
# Run with aggressive pause target — observe Mixed-pause overshoot:
java -Xms2g -Xmx2g -XX:+UseG1GC -XX:MaxGCPauseMillis=50 \
-Xlog:gc*:file=gc-tight.log:time,level,tags AllocBench
# Run without pause-time target — observe stable pauses:
java -Xms2g -Xmx2g -XX:+UseG1GC \
-Xlog:gc*:file=gc-default.log:time,level,tags AllocBench
# Compare pause-time histograms:
python3 gc_pause_histogram.py gc-tight.log
python3 gc_pause_histogram.py gc-default.log
The synthetic reproduction shows the same shape as the Razorpay incident in miniature: the tight-target run produces Mixed-pause p99 above 500 ms and frequent pauses near 1 s, while the default-target run produces Mixed-pause p99 below 200 ms and a smoother distribution. The aggregate runtime of the tight-target version is also longer, because pauses dominate. This is the empirical demonstration of the "tighter target = worse p99" point above.
Where this leads next
This is the third and final case study in section 15.1. The triad covers the three families that account for the bulk of paging incidents in steady-state JVM services: CPU saturation, memory growth, and tail-latency cliffs.
- /wiki/case-cpu-saturation-without-user-load — the first case, where the CPU is hot but the request graph is flat. The diagnostic ladder ends at the flame graph.
- /wiki/case-memory-leak-that-wasnt — the second case, where heap usage climbs over six trading days and the dominator tree names the owner. The ladder ends at the reference path.
- /wiki/wall-performance-engineering-is-culture — the part-closing wall that ties all three cases together: the diagnostic ladder is a cultural artefact, not a tool. Teams that build the dashboards and run the audits before the incident have ten-minute incidents; teams that build them during the incident have hours-long ones.
Across the triad: the CPU case ends at the flame graph (rung three of the on-CPU ladder); the memory case ends at the dominator tree and reference path (rung four of the heap ladder); this case ends at the GC pause-time histogram (rung two of the pause ladder, with rungs three and four available for non-GC-pause sub-flavours). The reader who reads all three has a complete production-debug ladder for steady-state-service incidents.
References
- Brendan Gregg, Systems Performance (2nd ed., 2020), Chapter 6 — the canonical chapter on CPUs and runtime overhead, including the framework for distinguishing user-mode from kernel-mode latency contributions.
- OpenJDK 21 G1GC release notes — the public documentation of the region-selection refinement that exposed the Razorpay configuration's fragility, with the per-version behavioural changes the upgrade-protocol audit should consult.
- Monica Beckwith, "G1 GC Tuning: A Case Study" — the foundational practitioner's guide to G1 tuning that this chapter's framing of "remove explicit flags by default" pushes against, with the cases where explicit tuning is genuinely warranted.
- Gil Tene, "Understanding GC Latency" — the talk that first popularised the pause-time-distribution-by-kind framing that the histogram in this chapter implements.
- JEP 189: Shenandoah GC and JEP 333: ZGC — the modern low-pause collectors that solve this chapter's family of bug at the cost of higher concurrent CPU; recommended reading for the next step of any team hitting GC-pause incidents repeatedly.
- Razorpay engineering blog — the public engineering writeups from the payments-platform team on production incidents in the Indian fintech domain.
- /wiki/case-memory-leak-that-wasnt — the prior chapter in this triad, on retention-graph diagnostics, which this chapter's pause-time framing complements.
- /wiki/coordinated-omission-and-hdr-histograms — the chapter on HdrHistogram methodology that this case study's parser builds on; the same percentile-precision contract applies to GC pauses as to request latencies.