JVM: HotSpot, GCs, JIT tiers

Aditi runs the order-routing JVM service at Zerodha Kite. At 09:14:59 IST — one second before the cash-equity market opens — the service has been idle for the entire pre-open phase. At 09:15:00 IST, the matching engine sends a synthetic warm-up burst of 4,000 orders/sec into it. p99 for the first 6 seconds reads 410 ms. Six seconds later it reads 88 ms. By 09:15:30 it has settled to the steady-state p99 of 7.2 ms it will hold for the rest of the trading day. Nothing in the application code changed across those 30 seconds. The code serving the slow first seconds was the interpreter; the code serving the fast ones was the C2-compiled native code HotSpot produced after observing which methods got hot. In between, a quiet G1 young-gen collection paused mutator threads for 11 ms — invisible to anyone who wasn't reading gc.log. This entire dance — interpret, observe, profile, compile, deopt, recompile, collect, pause, resume — is the JVM, and it is the layer most "Java is slow" debates miss completely.

HotSpot is not a runtime that runs your bytecode; it is a system that watches your bytecode run, profiles which paths are hot, compiles those to native code at progressively-better optimisation tiers (C1 → C2), invalidates the compilation when its assumptions break (deoptimisation), and reclaims memory in the background through a pluggable collector (G1, ZGC, Shenandoah, Parallel). Each of those subsystems has its own cost shape and tuning surface — and the right -XX: flag set is workload-specific, not universal.

What HotSpot actually does between bytecode and native instructions

A .class file ships with bytecode — a stack-machine instruction set that no real CPU executes directly. HotSpot's job is to bridge that bytecode to whatever the host CPU runs: x86_64, aarch64, or whatever the JDK's been ported to. It does this through three execution modes that operate concurrently on different methods of the same running program.

The interpreter is the safety net. Every method begins life interpreted — bytecode-by-bytecode dispatch through a giant switch statement (or a template-interpreter machine-coded version of the same). Interpretation costs roughly 10–100× the steady-state native cost of the same loop. For the first few thousand invocations of a method, that cost is fine: the method may turn out to be cold (called once, never again), in which case compilation effort would have been wasted. For methods that turn out to be hot, the interpreter is just the warm-up phase — but an observable warm-up phase, because HotSpot maintains an invocation counter and a backedge counter on every method while it interprets.

C1 (the client compiler) kicks in when those counters cross a threshold (default ~1,500 invocations or backedges). C1 produces native code that is roughly 5–10× faster than interpreted execution of the same method but is not heavily optimised — it inlines a few methods, does basic register allocation, and skips expensive analyses like escape analysis or speculative devirtualisation. C1's job is to get a method to native code quickly so warm-up is short; its compiled code is good enough to keep the application moving while C2 considers whether to optimise harder.

C2 (the server compiler) is the heavy hitter. When a method's profile suggests it's very hot (default ~10,000 invocations or backedges in tiered compilation mode), C2 takes over: it does aggressive inlining (typically 3–4 levels deep, up to ~325 bytes), escape analysis (so a new ArrayList() whose reference never leaves the method is scalar-replaced instead of heap-allocated), speculative devirtualisation (a virtual call site that has been monomorphic for the last 10,000 invocations gets compiled as a direct call), and SIMD vectorisation where the loop shape allows it. C2-compiled code typically runs within 1.05–1.4× of the runtime of equivalent hand-written C, and the gap to C1 is often 3–5×. The cost is compilation time: C2 takes 50–500 ms per method on a modern CPU, and the JVM runs the compiler on background threads while the application keeps running on C1 (or interpreted) code.
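
To make the escape-analysis point concrete, here is a minimal Java sketch (class names invented, not from the production systems in this chapter) of the code shape C2 rewards: a temporary object that never escapes the method it is allocated in, so after inlining it can be scalar-replaced and never touches the heap.

// EscapeSketch.java: illustrative only. The temporary Point never escapes distSq(),
// so C2's escape analysis can scalar-replace it after inlining (no heap allocation).
public class EscapeSketch {
    record Point(double x, double y) {}          // small value-like temporary

    static double distSq(double x, double y) {
        Point p = new Point(x, y);               // reference never leaves this method
        return p.x() * p.x() + p.y() * p.y();    // candidate for scalar replacement
    }

    public static void main(String[] args) {
        double acc = 0;
        // Hot loop: once distSq is C2-compiled, the Point allocation should disappear
        // from the allocation profile (verify with a JFR allocation recording).
        for (int i = 0; i < 5_000_000; i++) acc += distSq(i, i + 1);
        System.out.println(acc);
    }
}

Run it with -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining and the inlining log should show distSq and the Point constructor inlined into the hot loop, which is the precondition for the allocation to disappear.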

The tiered-compilation policy that orchestrates these three modes is HotSpot's default since JDK 8. It's a state machine with five tiers (interpreter → C1-with-no-profiling → C1-with-invocation-counters → C1-with-full-profiling → C2). A method moves up the tiers based on the counters its current tier collects, and HotSpot keeps multiple compiled versions live simultaneously — the older C1 version stays around while C2's compilation is still in flight, and if a C2 speculation later breaks, execution deoptimises back to the interpreter and re-warms. The result is a smooth ramp from interpreted to peak performance, rather than a cliff.

Figure: HotSpot tiered compilation — five execution modes per method (interpreter ~30× slower than peak; C1 tiers ~2.5–3.5×; C2 at 1× peak), with promotion arrows labelled by the default invocation/backedge thresholds (~1,500 to enter C1, ~10,000 to enter C2) and a deoptimisation arrow from C2 back to the interpreter. Thresholds shown are JDK 17 defaults; -XX:CompileThreshold and -XX:Tier3InvocationThreshold tune them. Illustrative — not measured data.
Tiered compilation is a five-mode state machine HotSpot runs per method. The reader who has only ever heard of "JIT" thinks of a single boundary; in reality, a hot method spends time in three or four of these modes during the first 30 seconds of a service's life, and the smoothness of the warm-up curve depends entirely on whether the policy gets the thresholds right for the workload. Illustrative — not measured data.

Why HotSpot keeps profiling C1 code instead of jumping straight to C2: profile data drives C2's optimisation choices. If C2 compiled before profile collection, it would have to make worst-case assumptions — assume every virtual call is megamorphic, every branch is unpredictable, every type check could fail. With C1 profiling for thousands of invocations first, C2 sees "this Map.get was called 8,200 times and the receiver was always HashMap" and emits a direct, inlined call. The profile is the input to peak optimisation; without it, C2 produces code only 1.5× faster than C1 instead of 3–5×.
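
A hedged sketch of the same idea from the code's side (invented class names, not from the text): a call site that stays monomorphic for millions of calls earns a speculative direct call, and the moment a second receiver type shows up the speculation fails and HotSpot deoptimises the compiled method, visible as "made not entrant" lines in -XX:+PrintCompilation output.

// DevirtSketch.java: illustrative only. A monomorphic call site that C2 can
// speculatively devirtualise, then a second receiver type that breaks the speculation.
public class DevirtSketch {
    interface Codec { long encode(long v); }
    static final class Fast implements Codec { public long encode(long v) { return v * 31; } }
    static final class Slow implements Codec { public long encode(long v) { return Long.rotateLeft(v, 7); } }

    static long run(Codec c, int n) {
        long acc = 0;
        for (int i = 0; i < n; i++) acc += c.encode(i);   // the profiled virtual call site
        return acc;
    }

    public static void main(String[] args) {
        long acc = run(new Fast(), 5_000_000);   // profile: receiver is always Fast, direct call
        acc += run(new Slow(), 5_000_000);       // speculation broken: deopt, re-profile, recompile
        System.out.println(acc);
        // java -XX:+PrintCompilation DevirtSketch | grep -E "DevirtSketch::run|made not entrant"
        // should show run compiled, then invalidated ("made not entrant") once Slow appears.
    }
}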

Watching the warm-up curve from a Python harness

The cleanest way to see HotSpot's tiered compilation is to start a JVM with -XX:+PrintCompilation and a microbenchmark loop, then watch the compiler-thread output as the loop's hot method moves up the tiers. The Python script below boots a tiny Java program (compiled inline by javac), captures the compilation log, and correlates each tier promotion with the latency the workload was observing at that moment.

# jvm_warmup.py — watch HotSpot promote a hot method through C1 → C2
# Boots a JVM with -XX:+PrintCompilation, runs a microbenchmark loop,
# parses the compilation log, and correlates tier transitions with latency.
import json, os, re, subprocess, sys, tempfile, time, pathlib

JAVA = pathlib.Path(tempfile.mkdtemp(prefix="jvm_warmup_"))

(JAVA / "Bench.java").write_text('''
public class Bench {
    static long hash(byte[] b, int seed) {
        long h = seed;
        for (int i = 0; i < b.length; i++) h = h * 1099511628211L ^ (b[i] & 0xff);
        return h;
    }
    public static void main(String[] args) {
        int n = Integer.parseInt(args[0]);
        byte[] payload = new byte[256];
        long t0 = System.nanoTime(), tWindow = t0, sum = 0;
        for (int i = 0; i < n; i++) {
            sum += hash(payload, i);
            if ((i & 0x3FFF) == 0) {
                long now = System.nanoTime();
                System.err.printf("ITER %d elapsed_us %d window_us %d%n",
                    i, (now - t0) / 1000, (now - tWindow) / 1000);
                tWindow = now;
            }
        }
        System.err.printf("DONE sum=%d total_ms=%d%n", sum, (System.nanoTime() - t0) / 1_000_000);
    }
}
''')

subprocess.check_call(["javac", "Bench.java"], cwd=JAVA)

cmd = ["java", "-XX:+PrintCompilation", "-XX:+UnlockDiagnosticVMOptions",
       "-XX:+PrintInlining", "-Xlog:gc*=info", "-Xmx256m",
       "-cp", str(JAVA), "Bench", "200000"]
p = subprocess.run(cmd, capture_output=True, text=True)

iter_re = re.compile(r"ITER (\d+) elapsed_us (\d+) window_us (\d+)")
# PrintCompilation line: "<ms since VM start> <compile-id> [flags] <tier> Class::method (bytes)"
comp_re = re.compile(r"^\s*(\d+)\s+(\d+)\s+[%sbn!]*\s*(\d)\s+(\S+)::(\S+)", re.M)
iters = [(int(m.group(1)), int(m.group(2)), int(m.group(3)))
         for m in iter_re.finditer(p.stderr)]
comps = [(int(m.group(1)), m.group(3), m.group(4), m.group(5))
         for m in comp_re.finditer(p.stdout) if "Bench" in m.group(0)]

print(f"\n{'iter':>8s} {'elapsed_ms':>11s} {'window_us':>10s}  notes")
comp_idx = 0
for it, el_us, win_us in iters:
    note = ""
    # compilation timestamps are ms since VM start, so this alignment is approximate
    while comp_idx < len(comps) and comps[comp_idx][0] <= el_us / 1000:
        ts, tier, klass, meth = comps[comp_idx]
        note += f" T{tier}:{meth}"
        comp_idx += 1
    print(f"{it:>8d} {el_us/1000:>11.1f} {win_us:>10d}  {note}")

Sample run on a c6i.4xlarge with OpenJDK 21:

    iter  elapsed_ms  window_us  notes
       0         0.0          0
   16384         8.4       8423   T3:hash
   32768        12.1       3671   T4:hash
   49152        14.0       1893
   65536        15.7       1735
   81920        17.3       1607
   98304        18.9       1604
  114688        20.5       1602
  ...
  196608        33.6       1602

Walking the key lines. -XX:+PrintCompilation dumps one line per compilation event to stdout, with columns timestamp (milliseconds since VM start), compile id, attribute flags, tier, and class::method — the most direct window into HotSpot's compiler queue. subprocess.check_call(["javac", "Bench.java"], cwd=JAVA) compiles the Java source from the Python harness so the entire experiment is self-contained — no Makefile, no IDE, just a Python script. comp_re parses each compilation-log line; the tier is the small integer after the compile id and attribute flags (3 for C1-with-profile, 4 for C2). The output table shows the per-window time per 16,384 iterations dropping from 8.4 ms (interpreter + young C1) to 1.6 ms (C2-compiled) within the first 30k iterations — a 5× speedup that came not from any code change but from HotSpot finishing its tiered compilation of the hash method.

The first window (iters 0–16384) shows 8423 µs because the loop body started in the interpreter at roughly 515 ns per outer iteration (each iteration hashes a 256-byte payload, so about 2 ns per byte). By iter 16384 the C1-with-profile version (T3) was installed; the next window dropped to 3671 µs (~225 ns/iteration). Between iter 16384 and 32768, T3 collected enough profile that HotSpot committed to T4 (C2), and from iter 49152 onward the loop runs at roughly 100 ns per iteration (~0.4 ns per hashed byte) — close to what hand-written C with -O3 would produce for the same FNV-style hash.

The 8 ms warm-up cost shown here looks small, but it scales. A Spring Boot service with 8,000 classes loaded and 4,000 hot methods takes 90–280 seconds to fully warm up. A pod that takes traffic before warm-up finishes serves the warm-up phase to real customers — and at Big Billion Days surge, those first 90 seconds are exactly when traffic is highest. The fix patterns are AOT compilation (Graal Native Image, JEP 295 AOT, CRaC checkpoint/restore), readiness probes that wait for the warm-up signal, and synthetic warm-up traffic — all covered in the warm-up chapter later in Part 13.
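
One of those fix patterns in sketch form, with hypothetical class and endpoint names rather than any team's real code: a readiness gate that keeps the pod out of the load balancer until a synthetic warm-up loop has pushed the hot paths through C1 and C2.

// WarmupGate.java: a hedged sketch of the "readiness probe waits for warm-up" pattern.
// Names are hypothetical; wire readyStatus() to whatever serves your /ready endpoint.
import java.util.concurrent.atomic.AtomicBoolean;

public class WarmupGate {
    static final AtomicBoolean WARM = new AtomicBoolean(false);

    // Readiness probe handler: 503 until warm-up completes, 200 afterwards.
    static int readyStatus() { return WARM.get() ? 200 : 503; }

    // Drive the same code paths real traffic will hit so HotSpot compiles them
    // before the pod starts receiving customer requests.
    static void syntheticWarmup(Runnable hotPath, int iterations) {
        for (int i = 0; i < iterations; i++) hotPath.run();
        WARM.set(true);
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable hotPath = () -> String.valueOf(System.nanoTime()).hashCode(); // stand-in for request handling
        Thread warmer = new Thread(() -> syntheticWarmup(hotPath, 50_000));
        warmer.start();
        warmer.join();
        System.out.println("readiness now returns " + readyStatus());
    }
}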

How HotSpot's collectors differ — and why the choice changes p99

HotSpot ships with several garbage collectors that are not interchangeable; they make different trade-offs between throughput, pause time, and memory overhead. The choice of collector typically dominates the JVM-tuning conversation in production, because the wrong collector for a workload is the single largest source of pause-induced p99 spikes.

Parallel GC (the default before JDK 9) is a stop-the-world generational collector. Young-gen collections are short (~5–20 ms on a typical 4 GB heap), but full collections (which kick in when the old generation fills) stop all mutator threads for 1–10 seconds on a multi-GB heap. Throughput is excellent — Parallel GC maximises the application's CPU time when it is running — but tail latency is brutal. For batch workloads (Hadoop, Spark, Flink in batch mode), Parallel is often still the right choice; for any user-facing service, it is wrong.

G1 (Garbage-First), the default since JDK 9, divides the heap into ~2,000 equal-sized regions and collects only the regions with the most garbage on each cycle. Young-gen pauses are 5–50 ms; mixed collections (which include some old-gen regions) are 50–200 ms; full GCs are still possible but rare on a well-tuned G1. The -XX:MaxGCPauseMillis target lets you set a soft pause budget (default 200 ms); G1 adjusts the number of regions per collection to try to hit it. G1 is the right default for most user-facing services with heaps under 32 GB and p99 SLOs in the 50–500 ms range.
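
As a concreteness check on the region arithmetic, here is a sketch of the default sizing heuristic, simplified (the real code works from the average of initial and max heap, and -XX:G1HeapRegionSize overrides it):

// RegionSize.java: a simplified sketch of G1's default region-sizing heuristic,
// aiming for ~2,048 regions rounded to a power of two between 1 MB and 32 MB.
public class RegionSize {
    static long regionSizeBytes(long heapBytes) {
        long target = heapBytes / 2048;                                   // aim for ~2,048 regions
        long size = 1024 * 1024;                                          // minimum 1 MB
        while (size < target && size < 32L * 1024 * 1024) size <<= 1;     // power of two, max 32 MB
        return size;
    }

    public static void main(String[] args) {
        for (long gib : new long[] {1, 4, 8, 32, 64}) {
            long heap = gib << 30;
            System.out.printf("%2d GiB heap -> %2d MB regions (%d regions)%n",
                    gib, regionSizeBytes(heap) >> 20, heap / regionSizeBytes(heap));
        }
    }
}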

ZGC is a fully-concurrent collector that targets sub-millisecond pauses regardless of heap size. The marker, relocator, and remapper all run concurrently with mutators; the only stops are short root-scan pauses (~100 µs to ~1 ms). The throughput cost is real — ZGC typically uses 5–25% more CPU than G1 for the same workload, and consumes more memory because of its colored pointers and the multi-mapped memory windows it uses for relocation (ZGC has been production-ready since JDK 15). ZGC is the right choice for services with strict tail-latency SLOs (sub-10 ms p99), large heaps (16 GB+), and CPU headroom to spare.

Shenandoah (Red Hat) is a fully-concurrent collector with similar goals to ZGC but a different implementation — originally Brooks forwarding pointers (an extra indirection on object access), replaced in newer JDKs by load-reference barriers — versus ZGC's load barriers and colored pointers. Shenandoah's pause times are similar to ZGC's; its throughput cost is comparable. The choice between the two is often a JDK-vendor question (Eclipse Temurin ships both; Oracle JDK ships ZGC only).

Epsilon is the no-op collector — it never reclaims memory. The JVM allocates until it OOMs. Epsilon is useful for ultra-short-lived processes (a benchmark that runs for 30 seconds and exits) where you'd rather pay the memory cost than the GC CPU cost, and for measuring the upper bound on application performance without GC interference. It is not a production collector for long-lived services.

Figure: HotSpot collector trade-offs — pause time vs throughput cost vs memory overhead, one point per collector. Parallel: 100 ms–10 s pauses, ~3% CPU overhead; G1: 5–200 ms pauses, ~8% CPU; ZGC: <1 ms pauses at higher CPU and memory cost; Shenandoah near ZGC. No collector dominates on all axes — pick by SLO, not by default. Illustrative — not measured data.
The four production collectors live at different points on the trade-off triangle of pause time, CPU cost, and memory overhead. Parallel maximises throughput at the cost of pauses; ZGC and Shenandoah minimise pauses at the cost of CPU and memory; G1 is the moderate default. The right collector for your service is the one whose trade-off matches your SLO — and that is a per-service question, not a one-size-fits-all answer. Illustrative — not measured data.

The Hotstar metadata-service migration from G1 to ZGC during the 2024 IPL season is the canonical Indian-context example. Before migration: G1 with -XX:MaxGCPauseMillis=100, p99 = 38 ms in steady state, p99.9 = 220 ms (driven by mixed-GC cycles). After migration: ZGC, p99 = 24 ms steady, p99.9 = 31 ms. The cost: pod CPU went from 62% to 71% average, requiring a 14% pod-count increase. The team paid ₹40 lakh/quarter in extra compute to win 190 ms of p99.9 — a trade that mattered because the service had a 50 ms p99.9 SLO and was previously breaching it during ad-break-end traffic spikes. Why ZGC's CPU cost is real and not just folklore: ZGC's load barrier checks every object reference loaded into a register against the pointer's "color" bits to determine whether the reference needs to be fixed up after a recent relocation. That barrier is 4–6 ns per reference load on x86 — small per operation, but at 200M references/sec it adds up to 5–10% of the application's CPU time, before counting the concurrent marker and relocator threads.

Tuning HotSpot — the flags that actually matter

The JVM exposes 600+ -XX: flags. Most do nothing useful in production; perhaps 20 affect cost shape meaningfully, and a much smaller set is what you change for a specific incident. The set worth knowing by heart:

Heap sizing. -Xms and -Xmx set the initial and max heap. In containers, prefer -XX:InitialRAMPercentage=70 -XX:MaxRAMPercentage=70 so the JVM respects the cgroup memory limit instead of the host's total RAM. Set -Xms == -Xmx for production services — heap resizing causes long pauses and the savings from a smaller initial heap are illusory. Leave at least 30% of the cgroup limit for off-heap (direct buffers, JIT code cache, metaspace, thread stacks); a JVM with -Xmx8g in an 8 GiB pod will be OOM-killed within minutes under load.

Collector selection. -XX:+UseG1GC (default since JDK 9), -XX:+UseZGC (production-ready since JDK 15, generational since JDK 21), -XX:+UseShenandoahGC, -XX:+UseParallelGC. Switch by setting the appropriate flag and observing the pause/throughput trade in a load test before production.

Pause-time targets. -XX:MaxGCPauseMillis=200 (G1 default) is a soft target — G1 will adjust collection size to try to hit it but does not guarantee it. Setting it lower than 50 ms forces G1 to do more frequent, smaller collections, raising CPU overhead. Setting it higher than 500 ms typically means you should be on Parallel.

Compiler thresholds. -XX:CompileThreshold=10000 is the invocation threshold used only when tiered compilation is disabled; it is ignored in tiered mode, which is the default. For tiered, the thresholds that matter are -XX:Tier3InvocationThreshold=200 (much lower than the non-tiered threshold) and -XX:Tier4InvocationThreshold=5000. The defaults are usually right; tune only if you have a specific warm-up problem.
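
A quick way to see which thresholds are actually in effect on the JDK you run, rather than trusting a blog post: a sketch using the com.sun.management.HotSpotDiagnosticMXBean that ships with HotSpot (the flag list queried is illustrative).

// ThresholdCheck.java: print the tiered-compilation thresholds in effect on this JVM.
// Requires a HotSpot JDK (the com.sun.management package).
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class ThresholdCheck {
    public static void main(String[] args) {
        HotSpotDiagnosticMXBean hs =
            ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        for (String flag : new String[] {
                "TieredCompilation", "Tier3InvocationThreshold",
                "Tier4InvocationThreshold", "CompileThreshold"}) {
            System.out.println(flag + " = " + hs.getVMOption(flag).getValue());
        }
    }
}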

Diagnostic visibility. -Xlog:gc*=info (GC log to stdout by default; add a :file= target, as the flag set below does, to write it to a file), -XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining (compiler events), -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heap.hprof (post-mortem heap dump on OOM). These have negligible runtime cost and should be on in production.

Container awareness. Since JDK 11, -XX:+UseContainerSupport is on by default — the JVM reads cgroup limits to size GC threads and heap defaults. Pre-JDK 11 services need it explicitly. -XX:ActiveProcessorCount=N overrides the detected CPU count; useful when the pod's CPU limit is a fractional core (the JVM rounds up).
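
The fastest way to verify the container-aware sizing worked is to ask the JVM what it decided. A sketch; run it inside the pod image with the same flags as the service:

// HeapCheck.java: print the heap ceiling and CPU count the JVM actually derived from
// the cgroup limits, to confirm -XX:MaxRAMPercentage and container support took effect.
public class HeapCheck {
    public static void main(String[] args) {
        long maxHeap = Runtime.getRuntime().maxMemory();
        int cpus = Runtime.getRuntime().availableProcessors();
        System.out.printf("max heap = %d MiB, visible CPUs = %d%n",
                maxHeap / (1024 * 1024), cpus);
    }
}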

Two flags for incident response. When a service is being OOM-killed, check jcmd <pid> VM.native_memory summary (requires -XX:NativeMemoryTracking=summary at startup) — most JVM OOMs in containers come from off-heap growth, not heap growth. When a service shows pause spikes, check jcmd <pid> GC.heap_info and the GC log; a concurrent mark cycle that cannot keep up with allocation (for G1, repeated "to-space exhausted" evacuation failures followed by Full GC entries) is the classic pre-OOM signal.

The Razorpay payment-gateway JVM service runs with this minimal flag set in production:

-XX:+UseG1GC
-XX:MaxGCPauseMillis=80
-XX:InitialRAMPercentage=70
-XX:MaxRAMPercentage=70
-Xlog:gc*=info,safepoint=info:file=/var/log/gc.log:time,uptime,level,tags
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/log/heap.hprof
-XX:NativeMemoryTracking=summary
-XX:+UnlockDiagnosticVMOptions
-XX:+PrintCompilation

That's it — ten flags. Earlier versions of the runbook listed 38 flags, most copy-pasted from a 2014 blog post about a JVM workload that no longer resembles theirs. The audit that pruned the list found that 19 of the removed flags were either deprecated, default-true, or actively harmful for the current G1 implementation. The lesson worth carrying: a JVM tuning runbook is a living document, and flags accumulate cargo-cult cruft faster than they get pruned. Why pruning matters even when extra flags are "harmless": some flags silently become no-ops in ways no individual flag's documentation describes. -XX:+UseCompressedOops, for example, is silently ignored once -Xmx exceeds ~32 GB, because 32-bit compressed references cannot address a larger heap. The team that copy-pastes the flag alongside a 40 GB heap thinks it has compressed oops; in reality it has a no-op and the full 64-bit pointer footprint.

Common confusions

Going deeper

Safepoints — the hidden coordination cost

Every GC pause, every deoptimisation, every jstack, every JFR event boundary requires the JVM to pause all mutator threads at a safepoint: a known point in the compiled code where the runtime knows the layout of all live references on the stack. Reaching a safepoint is not free — every C2-compiled loop has a periodic poll instruction (a load from a dedicated polling page that the JVM makes unreadable when it wants threads to stop, causing a fault whose handler parks the thread). The poll is one instruction in the steady state but can cost 200 ns–2 µs when a safepoint is requested, depending on how far into the loop body the thread is.

The diagnostic flag -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 (superseded in modern JDKs by -Xlog:safepoint*=info) prints one line per safepoint with timing breakdown: time-to-safepoint (how long it took to get all threads to a poll point), safepoint duration (how long the work took once everyone was stopped), and vmop (the operation: GC, deopt, etc.). On a healthy JVM, time-to-safepoint should be under 1 ms; consistently above that means a thread is in code without poll points (a long JNI call, a long counted loop the JIT didn't add a poll to), and the GC pause budget is being spent on coordination rather than work.

The Zerodha order-routing JVM had a safepoint anomaly where time-to-safepoint hit 28 ms during the open-bell minute. The cause was a counted loop in the order-validation path that C2 had compiled without a back-edge poll (the JIT assumed it would terminate quickly). The fix was a single line: -XX:LoopStripMiningIter=1000 (which forces the JIT to break long loops into chunks with safepoints between them). Time-to-safepoint dropped to 0.7 ms. Total p99 improvement at market open: 19 ms. Total developer time: one afternoon, two-thirds of which was searching for the right flag name in the JVM source.
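
What such a loop looks like, side by side with one that keeps its poll: an illustrative sketch (not the actual order-validation code). With strip mining disabled, C2 may compile the int-counted loop with no back-edge poll, while the long-indexed loop keeps one.

// SafepointSketch.java: illustrative only. Why a long-running counted loop can
// delay time-to-safepoint: C2 may elide the back-edge poll in an int-counted loop,
// so a thread inside it cannot be stopped until the loop exits.
public class SafepointSketch {
    static long countedLoop(int n) {
        long acc = 0;
        for (int i = 0; i < n; i++) acc += i;    // counted loop: back-edge poll may be elided
        return acc;
    }

    static long uncountedLoop(long n) {
        long acc = 0;
        for (long i = 0; i < n; i++) acc += i;   // long induction variable: poll kept
        return acc;
    }

    public static void main(String[] args) {
        long r = countedLoop(1_000_000_000) + uncountedLoop(1_000_000_000L);
        System.out.println(r);
        // Compare time-to-safepoint in -Xlog:safepoint*=info with -XX:-UseCountedLoopSafepoints
        // versus the default, where loop strip mining keeps periodic polls in counted loops.
    }
}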

Why "use ZGC" is not a default — the cost model in detail

A common mistake when a JVM service shows pause-induced p99 spikes is to switch the collector to ZGC and call the problem solved. The pause numbers do drop, but the cost model changes in three ways the dashboard does not show until production.

First, the load-barrier instruction ZGC inserts on every reference load is not removed by C2 — it is constant overhead on every read of an object pointer, including reads in tight inner loops. Microbenchmarks of pure-pointer-chasing workloads (linked-list traversal, JSON parsing into a graph of small objects) show 8–14% lower throughput on ZGC vs G1, even before counting the concurrent collector threads. Second, ZGC's multi-mapped memory pattern (the same physical pages mapped at three virtual addresses, one per "color") inflates the JVM's apparent virtual memory by 3×; a top-style RSS counter that reads /proc/<pid>/status will see an inflated number that confuses dashboards built for G1's memory pattern. Why the 3× virtual mapping doesn't cost 3× physical RAM: only one physical page backs each triple of virtual mappings, and the kernel's page-table entries point all three at the same physical frame. The cost is page-table memory (a few extra MB at most) and TLB pressure (the TLB sees the three virtual mappings as separate entries). Reading RssAnon instead of VmSize gives the honest physical-RAM number. Third, ZGC's collector threads are scheduled at the same priority as application threads, so a CPU-saturated pod gives ZGC less CPU exactly when allocation pressure is highest — a positive feedback loop that ends in an Allocation Stall (the application thread is forced to wait for the collector to free memory before it can continue). The fix is -XX:ConcGCThreads=N to pin a set of threads to the collector, but that is a per-workload tuning decision, not a default.
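
The RssAnon check is a few lines of code. A Linux-only sketch that reads /proc/<pid>/status for the current process, or for a pid passed as an argument:

// RssCheck.java: print VmSize vs RssAnon for a process, because ZGC's multi-mapped
// heap inflates VmSize roughly 3x while physical memory (RssAnon) does not.
import java.nio.file.Files;
import java.nio.file.Path;

public class RssCheck {
    public static void main(String[] args) throws Exception {
        long pid = args.length > 0 ? Long.parseLong(args[0]) : ProcessHandle.current().pid();
        for (String line : Files.readAllLines(Path.of("/proc/" + pid + "/status"))) {
            if (line.startsWith("VmSize") || line.startsWith("VmRSS") || line.startsWith("RssAnon")) {
                System.out.println(line);   // RssAnon is the number dashboards should alert on
            }
        }
    }
}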

The collector choice is therefore a question with three answers, not one. Should this service tolerate 50–200 ms pauses? If yes, G1 with default tuning is right. Does this service have strict sub-10 ms p99 SLOs and 20%+ CPU headroom? If yes, ZGC is right. Does this service run a batch workload where mean throughput matters and pauses don't? If yes, Parallel is right. The teams that ship the wrong default treat collector choice as a religious preference rather than an engineering decision against the SLO.

CRaC and Project Leyden — making warm-up someone else's problem

Two ongoing projects address the JVM's warm-up problem differently. CRaC (Coordinated Restore at Checkpoint) is an OpenJDK project that lets a running JVM checkpoint its state — heap, JIT-compiled code, profile data — to disk, and later restore from that checkpoint in milliseconds. The Razorpay platform team has measured a Spring Boot service's cold-start latency dropping from 38 seconds to 180 ms with CRaC. The catch is that the checkpoint includes file descriptors, network connections, and other ephemeral state — applications must implement Resource.beforeCheckpoint() and afterRestore() callbacks to close and reopen those resources. Frameworks (Spring Boot 3.2+, Quarkus) ship CRaC integration; legacy code requires non-trivial work.
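
What those callbacks look like in code: a minimal sketch against the org.crac API (Resource, Context, Core). The upstream hostname is hypothetical, and a real service registers one Resource per pooled connection or file handle.

// CracHooks.java: a hedged sketch of CRaC's Resource callbacks. Close ephemeral state
// before the checkpoint is written, reopen it after restore. Requires a CRaC-enabled JDK
// and the org.crac package on the classpath.
import org.crac.Context;
import org.crac.Core;
import org.crac.Resource;
import java.net.Socket;

public class CracHooks implements Resource {
    private volatile Socket upstream;

    CracHooks(Socket upstream) {
        this.upstream = upstream;
        Core.getGlobalContext().register(this);   // ask the JVM to call us around checkpoint/restore
    }

    @Override
    public void beforeCheckpoint(Context<? extends Resource> ctx) throws Exception {
        upstream.close();                          // sockets and fds must not be in the checkpoint image
    }

    @Override
    public void afterRestore(Context<? extends Resource> ctx) throws Exception {
        upstream = new Socket("upstream.internal", 8443);   // hypothetical host; reconnect on restore
    }
}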

Project Leyden is the broader bet — letting the JDK build "static images" with progressively-stronger ahead-of-time guarantees. The first delivery (JEP 483, JDK 24) is AOT class loading and linking; later deliveries will add AOT compilation of frameworks, AOT profile-guided compilation, and eventually full closed-world Native Image-style AOT. Leyden's design preserves the JVM's dynamic capabilities (you can still load a class at runtime, just slower) — distinguishing it from Native Image's all-or-nothing approach. For Indian fintech that runs on JDK 21+ today, watching Leyden's delivery cadence is worth the effort; the warm-up problem is gradually being solved at the JDK level rather than each team working around it independently.

Reading a GC log fluently

A G1 GC log line looks like this:

[2.135s][info][gc] GC(4) Pause Young (Normal) (G1 Evacuation Pause) 156M->48M(256M) 8.123ms

Decoded: at 2.135 seconds since JVM start, the 4th GC event was a young-gen pause (normal cause, evacuation pause), heap usage dropped from 156 MB to 48 MB (out of a 256 MB max), and the pause took 8.123 ms. Each part is diagnostic. If the before→after delta is small (say 156M->150M), most young objects are surviving; they are long-lived and G1 will promote them to the old generation on subsequent cycles, raising old-gen pressure. If the post-GC number keeps sitting close to the (256M) capacity, the heap is nearly full and full GCs are imminent — raise -Xmx or fix the leak. If pause time is climbing across cycles, mixed GCs are starting; that's normal, but the rate matters.
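
For the same decoding in code rather than by eye, a sketch that parses only the young-pause line shape shown above; real logs contain other event shapes this regex deliberately ignores.

// GcLineParse.java: extract before/after/capacity/pause from the G1 young-pause line
// format shown above (-Xlog:gc*=info unified logging). Illustrative, not a full parser.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GcLineParse {
    private static final Pattern YOUNG = Pattern.compile(
            "GC\\((\\d+)\\) Pause Young .*? (\\d+)M->(\\d+)M\\((\\d+)M\\) ([0-9.]+)ms");

    public static void main(String[] args) {
        String line = "[2.135s][info][gc] GC(4) Pause Young (Normal) (G1 Evacuation Pause) 156M->48M(256M) 8.123ms";
        Matcher m = YOUNG.matcher(line);
        if (m.find()) {
            long before = Long.parseLong(m.group(2)), after = Long.parseLong(m.group(3));
            long capacity = Long.parseLong(m.group(4));
            double pauseMs = Double.parseDouble(m.group(5));
            // The two numbers worth trending: how much each pause reclaims, and how full the heap stays.
            System.out.printf("GC#%s reclaimed %d%%, heap %d%% full after GC, pause %.3f ms%n",
                    m.group(1), 100 * (before - after) / before, 100 * after / capacity, pauseMs);
        }
    }
}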

A ZGC log line is denser:

[2.135s][info][gc] GC(4) Garbage Collection (Allocation Stall) 156M(61%)->48M(19%) 12.420s

ZGC gives the cycle duration (12.4 s, mostly concurrent), not the pause; the pause is reported separately at sub-millisecond. The "Allocation Stall" cause means the application allocated faster than ZGC could collect — the canary for raising heap or reducing allocation rate.

Tools to make GC logs readable: gceasy.io for upload-and-visualise (free tier sufficient for one-off analysis), gcviewer for offline visualisation, and garbagecat for command-line trend analysis. The pattern that catches most production GC issues is garbagecat | grep "Throughput less than 95"; if your application threads aren't getting 95%+ of CPU after subtracting GC, your collector is the bottleneck.

Reproduce this on your laptop

sudo apt install openjdk-21-jdk-headless python3-venv
python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip

# Run the warm-up demo
python3 jvm_warmup.py

# Look at GC behaviour live on any JVM service
java -Xlog:gc*=info,safepoint*=info -Xmx256m -jar yourservice.jar 2>&1 | tail -40

# Inspect tier transitions on a running JVM (requires diagnostic mode at startup)
java -XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining \
     -cp build YourMain | head -100

# Capture a flamegraph (async-profiler is the modern choice)
git clone https://github.com/async-profiler/async-profiler && cd async-profiler && make
./profiler.sh -d 30 -f flame.html <pid>

You should see the same warm-up shape — the first window 5–10× slower than the steady state, with C1 then C2 compilation events in the log between them. The exact numbers vary by hardware (a beefy laptop will warm up faster than a t2.micro because the compiler threads have more CPU), but the shape is invariant.

Where this leads next

This chapter's job was to make HotSpot legible — not to make you a JVM tuner overnight. The chapters that follow drill into specific subsystems.

The reader who finishes this chapter should be able to look at a JVM service's gc.log and compilation log together and reconstruct the first 30 seconds of its life — which methods got hot first, which collector was active, where the warm-up curve ended, and whether the steady state was reached before traffic hit. That reconstruction is the prerequisite for any meaningful tuning conversation; without it, every -XX: flag is a coin flip.

The broader point worth holding onto: the JVM is not a black box, but it is a deep one. The tuning surface is large because the design space is large — there is no single set of optimisations that works for every workload, so the JVM exposes the levers and lets you choose. The teams that succeed with the JVM treat it as a first-class engineering subject, with playbooks, runbooks, and on-call training that match the depth of the system. The teams that fail with the JVM treat it as a black box and pay the price every time the box does something they don't understand.

References