Case: CPU saturation without user load
It is 03:14 IST. Aditi, on call for Flipkart's catalogue-search service, is woken by a page that says every one of the 240 pods is at 92% CPU, sustained for the last forty minutes. She checks the request rate panel on the same dashboard and it shows a flat line at 600 RPS — the overnight floor. Yesterday at 03:14 the same service was at 4% CPU on the same hardware. Nothing deployed in the last twelve hours. Nothing deployed all week, in fact, because Diwali code-freeze started Monday. There are no users awake. There are no batch jobs scheduled. There is no GC pressure visible in the JVM panel. And yet the 16-core c6i.4xlarge instances are running flat-out, burning roughly ₹3,800 per hour in extra compute against zero business value. This chapter is the walkthrough of how that bug was found, what it actually was, and why the four tools the on-caller reached for in sequence — top, perf top, a flame graph, and an eBPF off-CPU trace — are the right four tools in the right order for the entire class of "CPU saturation without user load" bugs.
When CPU saturates without proportional request load, the work is happening inside the process for reasons unrelated to the request stream — a runaway timer, a retry storm against a downed dependency, a tight loop in a poller, a JIT recompilation cascade, or a GC running constantly without freeing memory. The on-caller's job is to climb the diagnostic ladder fast: first confirm it is on-CPU work and not lock contention masquerading as load, then capture a flame graph to attribute the cycles to a function, then explain why that function runs by reading its arguments and call frequency. The case in this chapter was named at the third rung — one boring line in the flame graph — and closed at the fourth, with a two-character config typo.
The first three minutes — confirming the shape of the problem
Aditi's first move is to kubectl exec into one of the loud pods and run top -H to see whether the load is one runaway thread or every thread cooking together. The output is unambiguous within five seconds.
The catalogue-search pod has 16 worker threads handling the request queue and a handful of bookkeeping threads (metrics-publisher, JFR-recorder, GC reaper). Aditi's top -H shows every worker at roughly 5.7% CPU — sixteen workers × 5.7% ≈ 91%, matching the pod's headline number. The bookkeeping threads are at 0.1–0.4% CPU. Why this matters for the diagnostic path: spread load tells you the cycles are baked into whatever the workers pull off their queue, so the request-handler code path or a shared dependency is the suspect. A single hot thread would mean a runaway timer or a stuck poll loop, and the path would have been a gdb -p <tid>/jstack/py-spy dump to grab the one stack and read what it was doing. Different shape, different next step. Do not skip this rung — confirming the shape is twenty seconds of work that prevents twenty minutes of looking in the wrong place.
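If top -H is not available inside the container image, the same shape check can be scripted against /proc. A minimal sketch — not the incident tooling; thread_shape.py is a hypothetical name and it assumes USER_HZ = 100 (confirm with getconf CLK_TCK):

# thread_shape.py — sample /proc/<pid>/task/*/stat twice and report per-thread CPU,
# answering the rung-one question "spread or concentrated?" without an interactive top.
import sys
import time
from pathlib import Path

def thread_ticks(pid: int) -> dict:
    """Return {tid: utime+stime in clock ticks} for every thread of pid."""
    ticks = {}
    for task in Path(f"/proc/{pid}/task").iterdir():
        try:
            raw = (task / "stat").read_text()
        except OSError:
            continue                      # thread exited while we were reading
        fields = raw.rsplit(")", 1)[1].split()
        # after stripping "pid (comm)", utime and stime sit at indices 11 and 12
        ticks[int(task.name)] = int(fields[11]) + int(fields[12])
    return ticks

def shape(pid: int, interval: float = 5.0, hz: int = 100) -> None:
    before = thread_ticks(pid)
    time.sleep(interval)
    after = thread_ticks(pid)
    usage = sorted(((tid, 100.0 * (after.get(tid, t0) - t0) / hz / interval)
                    for tid, t0 in before.items()), key=lambda x: -x[1])
    for tid, cpu in usage[:10]:
        print(f"tid {tid:>7}  {cpu:5.1f}% of one core")
    total = sum(cpu for _, cpu in usage)
    top_thread = usage[0][1] if usage else 0.0
    verdict = ("concentrated: one runaway thread"
               if total and top_thread / total > 0.5
               else "spread: the work is in whatever the workers pull off their queue")
    print(f"# shape: {verdict}")

if __name__ == "__main__":
    shape(int(sys.argv[1]))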
The second rung is perf top -p <pid> — a sampling profiler view of which functions are eating cycles, refreshing every two seconds. It does not require a flame graph render or a JFR dump; it is the live equivalent of "what is this process doing right now". The output Aditi sees:
Samples: 47K of event 'cycles', Event count (approx.): 38,421,556,118
Overhead  Shared Object    Symbol
  31.42%  perf-12389.map   [.] org.apache.lucene.util.fst.FST.findTargetArc
  18.07%  perf-12389.map   [.] org.apache.lucene.util.fst.BytesStore.readByte
  11.93%  perf-12389.map   [.] catalogue.search.QueryPlanner.expandSynonyms
   6.84%  perf-12389.map   [.] catalogue.search.QueryPlanner.normalize
   4.21%  libjvm.so        [.] OopMapSet::find_map_at_offset
   3.76%  perf-12389.map   [.] java.util.HashMap.hash
   2.48%  [kernel]         [k] entry_SYSCALL_64
...
The top-of-list is dominated by org.apache.lucene.util.fst.FST.findTargetArc and BytesStore.readByte — Lucene's finite-state-transducer traversal, which is what every catalogue query touches because the synonym index is stored as an FST. Combined, those two consume half the process CPU. The next two entries are catalogue-search's own query planner. So the cycles are being spent walking the synonym FST for queries that are, ostensibly, not arriving — the request graph says 600 RPS but the FST is being walked far more than 600 times per second. Either the request graph is wrong, or something inside the process is calling into the FST without coming from a user request.
A quick sanity check: Aditi runs bpftrace -e 'tracepoint:syscalls:sys_enter_accept4 /comm == "java"/ { @[pid] = count(); } interval:s:5 { print(@); clear(@); }' and confirms that the kernel sees roughly 600 inbound TCP accepts per second per pod — matching the request graph. So the work is genuinely not driven by user requests. Something inside the JVM is calling findTargetArc on its own initiative.
The flame graph that named the function
The third rung is a flame graph — a 60-second perf record of the JVM with stack traces, rendered through flamegraph.pl. The driver is a Python orchestrator (the same sp-profiler sidecar pattern from the previous chapter), reproduced here in compressed form to make the actual diagnostic visible.
# capture_flame.py - 60-second flame graph of a noisy pod
# Run from the sp-profiler sidecar:
#   kubectl debug -it catalogue-search-7b8-jx2 -n storefront-prod \
#     --image=internal/sp-profiler:v3 --target=app -- \
#     python3 capture_flame.py --duration 60 --pid 12389 --out /tmp/flame
import argparse, re, subprocess
from pathlib import Path


def perf_record(pid: int, duration: int, out_dir: Path) -> Path:
    """Sample the target at 99 Hz with frame-pointer call graphs for `duration` seconds."""
    raw = out_dir / "perf.data"
    cmd = ["perf", "record", "-F", "99", "-p", str(pid),
           "-g", "--call-graph", "fp", "-o", str(raw),
           "--", "sleep", str(duration)]
    print(f"# perf record -F 99 -p {pid} -g for {duration}s")
    subprocess.run(cmd, check=True)
    return raw


def to_collapsed(raw: Path, out_dir: Path) -> Path:
    """perf.data -> perf script text -> one-line-per-stack folded format."""
    script = out_dir / "perf.script"
    folded = out_dir / "perf.folded"
    with script.open("w") as f:
        subprocess.run(["perf", "script", "-i", str(raw)],
                       stdout=f, check=True)
    with script.open() as fin, folded.open("w") as fout:
        subprocess.run(["stackcollapse-perf.pl"],
                       stdin=fin, stdout=fout, check=True)
    return folded


def render_svg(folded: Path, out_dir: Path) -> Path:
    svg = out_dir / "flame.svg"
    with folded.open() as fin, svg.open("w") as fout:
        subprocess.run(["flamegraph.pl", "--colors=java",
                        "--title=catalogue-search 60s"],
                       stdin=fin, stdout=fout, check=True)
    return svg


def find_dominant_stack(folded: Path, threshold: float = 0.10) -> list:
    """Return stacks whose sample share exceeds threshold (e.g. 10%)."""
    pat = re.compile(r"^(.*) (\d+)$")   # folded line: "frame;frame;...;frame <count>"
    rows, total = [], 0
    for line in folded.open():
        m = pat.match(line.rstrip())
        if not m:
            continue
        stack, n = m.group(1), int(m.group(2))
        rows.append((stack, n))
        total += n
    rows.sort(key=lambda r: -r[1])
    return [(s, n, n / total) for s, n in rows if n / total > threshold]


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--pid", type=int, required=True)
    ap.add_argument("--duration", type=int, default=60)
    ap.add_argument("--out", type=Path, required=True)
    a = ap.parse_args()
    a.out.mkdir(parents=True, exist_ok=True)
    raw = perf_record(a.pid, a.duration, a.out)
    folded = to_collapsed(raw, a.out)
    svg = render_svg(folded, a.out)
    print(f"# flame graph: {svg}")
    dom = find_dominant_stack(folded, threshold=0.05)
    print("# dominant stacks (>5% of samples):")
    for stack, n, share in dom[:5]:
        tail = ";".join(stack.split(";")[-3:])   # last few frames, caller -> leaf
        print(f"  {share*100:5.1f}% {n:>6} samples ...;{tail}")


if __name__ == "__main__":
    main()
# Sample run on the loud pod (60s, 99 Hz sampling, 16 cores):
# perf record -F 99 -p 12389 -g for 60s
# flame graph: /tmp/flame/flame.svg
# dominant stacks (>5% of samples):
62.4% 35124 samples ...;CacheWarmer.refresh;QueryPlanner.expandSynonyms;FST.findTargetArc
9.1% 5121 samples ...;CacheWarmer.refresh;QueryPlanner.normalize
6.8% 3823 samples ...;HttpHandler.handle;QueryPlanner.expandSynonyms;FST.findTargetArc
5.4% 3041 samples ...;G1ParEvacuateFollowersClosure::do_void
The flame graph names the runaway: 62.4% of samples are inside CacheWarmer.refresh calling QueryPlanner.expandSynonyms calling FST.findTargetArc. Only 6.8% of samples come from the actual HttpHandler.handle path — which is consistent with 600 RPS at a few milliseconds per query each. The rest of the CPU is the cache warmer, a background thread the team added six months ago to pre-populate the synonym cache so the morning rush would not see cold-cache p99 spikes. That cache warmer was supposed to run once at startup and once at 06:00 IST every day — so the morning rush, not the 03:14 dead window, was its target.
Why a flame graph could find this in seconds when the request graph could not: the request graph counted user-driven HTTP accept4 syscalls. The cache warmer is a background thread inside the same process, scheduled by the JVM's ScheduledExecutorService, doing CPU-bound work that never touches the network. From the outside it is invisible — same process, same memory, same hardware bill. From inside the process via stack-sampling it is the dominant consumer. The lesson generalises: when CPU saturates without proportional ingress, the work is internal, and the only tool that decomposes "internal work" by attribution is a stack-sampling profiler. Top tells you which thread; perf top tells you which symbol; a flame graph tells you the path — which caller produced this leaf. The path is the diagnosis.
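The same caller attribution can be read numerically out of the folded file that capture_flame.py writes. A hedged sketch — the frame names are the ones from this case and would need substituting for any other service:

# attribute_callers.py — total the folded-stack samples for a hot leaf, bucketed by
# which caller owns it. CALLERS and the leaf name below are from this case only.
import sys
from collections import Counter
from pathlib import Path

CALLERS = ("CacheWarmer.refresh", "HttpHandler.handle")

def attribute(folded: Path, leaf: str) -> Counter:
    by_caller = Counter()
    for line in folded.read_text().splitlines():
        stack, _, count = line.rpartition(" ")
        if not count.isdigit() or leaf not in stack:
            continue                           # not a sample of the leaf we care about
        owner = next((c for c in CALLERS if c in stack), "other")
        by_caller[owner] += int(count)
    return by_caller

if __name__ == "__main__":
    shares = attribute(Path(sys.argv[1]), leaf="FST.findTargetArc")
    total = sum(shares.values()) or 1
    for caller, n in shares.most_common():
        print(f"{caller:>22}: {n:>7} samples  ({100 * n / total:4.1f}% of the leaf's time)")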
Why the cache warmer was running at 03:14 — the off-CPU trace and the config typo
A flame graph names the function but does not always explain the why. The cache warmer is supposed to run once a day. Aditi's next move is to read the scheduler's state — has the warmer been firing on a fast cadence, or has one warmer call been stuck for 40 minutes? She runs an off-CPU trace with bpftrace for 30 seconds, picking up every time the cache-warmer thread blocks (sleeps, waits on a condition variable, blocks on I/O):
sudo bpftrace -e '
kprobe:finish_task_switch {
    // arg0 is the task being switched OUT, i.e. the one going off-CPU
    $prev = (struct task_struct *)arg0;
    if (str($prev->comm) == "CacheWarmer") {
        @off[$prev->pid] = nsecs;                      // stamp the moment it blocked
    }
    // the current context is the task being switched IN, i.e. back on-CPU
    if (comm == "CacheWarmer" && @off[tid]) {
        @off_us = hist((nsecs - @off[tid]) / 1000);    // off-CPU interval, microseconds
        delete(@off[tid]);
    }
}
interval:s:30 { print(@off_us); exit(); }'
@off_us:
[0, 1) 412 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1, 2) 193 |@@@@@@@@@@@@@@@@@@@@@@@ |
[2, 4) 21 |@@ |
[4, 8) 3 | |
The cache warmer thread is barely sleeping at all — the longest off-CPU interval in 30 seconds is 4–8 µs, which is just kernel preemption between scheduler ticks. So the warmer is not running once at 06:00 IST and stuck; it is running continuously, returning from one refresh() invocation and being immediately re-scheduled to run another. That is a scheduling bug, not a stuck-thread bug, and it points at the ScheduledExecutorService configuration.
A grep through the deployed config (kubectl get configmap catalogue-search-config -o yaml) finds the line:
warmer.refresh.interval: "0s"
The intended value was 6h — six hours, which is what the team wanted for the morning rush. Six months ago, when the cache warmer was added, the config schema was hand-written and the parser interpreted 0s as "fire as fast as possible". Why this typo survived six months in production: it was added during a hot-fix push at 02:00 IST, the team that added it tested it locally with a 30s value, and the production rollout used a templating system that defaulted unfilled fields to 0. The bug only surfaced now because last night's restart of the pods (a routine kernel-patch rollout) reset the in-memory schedule. Before the restart, the cache warmer had been in a state where its first run at 06:00 IST happened to consume enough CPU to delay the next scheduling decision — masking the bug behind the warmer's own runtime. After the restart, the schedule was clean and the 0s interval kicked in immediately. The bug had been latent in the config but visible only on this restart. Latent config bugs that need a specific runtime path to expose them are a recurring pattern in CPU-saturation incidents — when the on-caller asks "what changed", the honest answer is sometimes "nothing changed, but a precondition that hid the bug was reset".
The fix is a one-line config edit and a rolling restart: warmer.refresh.interval: "6h". Within four minutes of the rollout, every pod's CPU drops from 92% back to 4%. The catalogue-search service is back to its normal overnight floor.
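The durable fix is to make the parser refuse the value in the first place. A sketch of that guard — it assumes nothing about Flipkart's actual config loader beyond the key name from this case, and the one-minute floor is illustrative:

# Reject any warmer interval that effectively means "run continuously".
import re
from datetime import timedelta

_UNITS = {"ms": "milliseconds", "s": "seconds", "m": "minutes", "h": "hours", "d": "days"}

def parse_refresh_interval(raw: str, floor: timedelta = timedelta(minutes=1)) -> timedelta:
    m = re.fullmatch(r"(\d+)(ms|s|m|h|d)", raw.strip())
    if not m:
        raise ValueError(f"warmer.refresh.interval {raw!r} is not a duration like '6h'")
    value = timedelta(**{_UNITS[m.group(2)]: int(m.group(1))})
    if value < floor:
        # "0s" fails loudly here instead of silently meaning "fire as fast as possible"
        raise ValueError(f"warmer.refresh.interval {raw!r} is below the {floor} floor")
    return value

assert parse_refresh_interval("6h") == timedelta(hours=6)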
Total time from page to fix: 47 minutes. Time spent on each rung of the diagnostic ladder: 3 minutes on top -H, 5 minutes on perf top, 12 minutes on the flame graph, 8 minutes on the off-CPU trace and config grep, the rest on writing the postmortem. The ladder works because each rung has a question the previous rung cannot answer. Skip a rung and you either miss the answer (if the question the skipped rung answers is the discriminating one) or you overspend on observability budget (if you reach for an expensive tool to answer a question a cheap one would have answered). The 47-minute resolution time is not because Aditi is faster than other on-callers; it is because the ladder converged in four steps without a backtrack. The on-callers who take four hours on the same incident usually take four hours because they jumped from top -H straight to a heap dump or a deployment rollback, skipped the symbol and path rungs, and ended up debugging the wrong layer.
The general shape of "CPU saturation without user load"
The Flipkart incident is one instance of a recurring class. Across the postmortems published by Razorpay, Zerodha, Hotstar, PhonePe, and Swiggy in the last three years, the same five sources account for the vast majority of "CPU high, traffic flat" pages:
1. A runaway scheduled task — a background job firing far more often than intended (this chapter's case).
2. A retry storm against a downed dependency.
3. A tight loop in a poller — a single thread spinning without ever blocking.
4. A JIT recompilation cascade.
5. GC churn — a collector running continuously without freeing memory.
The diagnostic ladder generalises beyond the Flipkart cache-warmer story. Each rung answers a question the previous one cannot:
- top -H answers "what shape is the load" — spread across workers, or concentrated on one thread? This single piece of information forks the rest of the investigation.
- perf top answers "what symbol is hot" — Lucene FST traversal, HTTP client retries, GC stop-the-world, JIT compiler threads. By the symbol you can usually rule three of the five flavours out.
- A flame graph answers "what path is hot" — same symbol, multiple callers; the path tells you whether expandSynonyms is being called from the request handler or from a background warmer. This is where the Flipkart runaway was named.
- An off-CPU trace plus config inspection answers "why is the path running" — the warmer is firing because its scheduled interval is 0s. This is where the Flipkart case, like most postmortems, ended with a fix; very few CPU-saturation incidents survive past rung four.
The ladder is also the cost-budget. top -H is free. perf top is a few percent of CPU on the target. A flame graph at 99 Hz sampling for 60 seconds is under 1% of one core. An off-CPU trace can spike higher if the thread is blocking thousands of times per second. By going in order, you spend the least observability budget at each step and only escalate if the answer is not yet clear. Reaching for a flame graph as your first move when top -H would have shown you a single hot thread is wasted work.
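The rung-three figure is easy to sanity-check with assumed numbers — 99 Hz per CPU, 16 CPUs, and an assumed handful of microseconds of capture cost per sampled stack (the per-sample cost is the assumption here, not a measured value):

# Back-of-the-envelope cost of a 99 Hz, 16-CPU perf record session.
freq_hz, cpus, per_sample_s = 99, 16, 5e-6     # 5 µs per captured stack is assumed
cpu_seconds_per_second = freq_hz * cpus * per_sample_s
print(f"~{cpu_seconds_per_second * 1e3:.0f} ms of CPU per wall-clock second "
      f"= {cpu_seconds_per_second * 100:.1f}% of one core")   # ~8 ms, ~0.8%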
Common confusions
- "High CPU with no traffic must be a kernel bug." Almost never. Kernel-side CPU saturation shows up in
topas%sy(system CPU) being high; user-space saturation shows up as%us. The Flipkart incident was 88% user, 4% system — clearly a userspace problem before the on-caller looked at a single stack. Always read the user/sys split first; it eliminates the kernel as a suspect in seconds. - "Flat request graphs mean no work." A flat request graph means no external work. Background threads — schedulers, cache warmers, JIT compiler threads, GC threads, metrics publishers, log shippers, auto-discovery probes, Kubernetes liveness checks — all consume CPU without showing up in the request graph. Any production service has dozens of these; the request graph is the wrong instrument for measuring their cost.
- "The fix is to add a circuit breaker." A circuit breaker is the fix for a retry storm (flavour 2 in the taxonomy above). For a runaway scheduled task (flavour 1), a JIT cascade (flavour 4), or a GC churn (flavour 5), a circuit breaker does nothing because there is no external dependency to break the circuit on. Match the fix to the flavour, not to the symptom.
- "
perf topand a flame graph are the same thing." They are different views of the same data.perf topis real-time and aggregates by leaf symbol — it tells youfindTargetArcis hot. A flame graph aggregates by full call stack — it tells youfindTargetArcis hot fromCacheWarmer.refresh, not fromHttpHandler.handle. The path matters because the same leaf can have different callers with different fixes. - "GC at 5% in the flame graph is fine." It depends. Healthy G1 on a steady-state JVM is typically 1–3% of CPU; ZGC is 0.5–2%. Sustained 5% GC with the heap not freeing is a leak in slow motion. Read the GC log alongside the flame graph; the two together tell you whether the GC is doing useful work or running flat-out without making progress.
- "Cache warmers are a best practice, so leave them alone." Cache warmers are a fine pattern when their schedule is bounded and their CPU budget is tracked. An unbounded cache warmer (any
interval: 0config) is a runaway by construction. Treat warmer configs as production code; review them with the same care as request-handler timeouts.
Going deeper
Why the user/sys CPU split is the first instrument
Linux's /proc/stat exposes ten CPU-time buckets per CPU: user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice. The first thing every CPU-saturation diagnostic should read is user vs system — userspace cycles vs kernel cycles. A pod at 92% with 88% user and 4% system is a userspace-code problem; the same pod at 92% with 30% user and 60% system is a kernel-path problem (typically a syscall storm, a network-stack hotspot, or a memory-management hotspot like compaction or page reclaim). The fix is in completely different code in each case. top shows this split as the first line; mpstat -P ALL 1 shows it per CPU. Read it before you read anything else; it eliminates half the diagnostic tree in two seconds.
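The same split can be read programmatically in a few lines. A minimal sketch, with the field order as documented in proc(5):

# user/sys split from the aggregate "cpu" line of /proc/stat, sampled over two seconds.
import time

def cpu_fields() -> list:
    with open("/proc/stat") as f:
        # cpu  user nice system idle iowait irq softirq steal guest guest_nice
        return [int(x) for x in f.readline().split()[1:]]

def user_sys_split(interval: float = 2.0) -> tuple:
    a = cpu_fields()
    time.sleep(interval)
    b = cpu_fields()
    d = [y - x for x, y in zip(a, b)]
    total = sum(d) or 1
    user = 100.0 * (d[0] + d[1]) / total             # user + nice
    system = 100.0 * (d[2] + d[5] + d[6]) / total    # system + irq + softirq
    return user, system

if __name__ == "__main__":
    u, s = user_sys_split()
    hint = "userspace problem" if u > 2 * s else "kernel-path problem — read mpstat next"
    print(f"user {u:.1f}%  sys {s:.1f}%  -> {hint}")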
Off-CPU profiling as the inverse of on-CPU
On-CPU profiling tells you which code is using cycles; off-CPU profiling tells you which code is blocked and for how long. A thread that is "supposed to be working" but is actually blocked 95% of the time will not appear in a flame graph because flame graphs only sample on-CPU threads. The way to find it is bpftrace -e 'kprobe:finish_task_switch { @[kstack, comm] = sum(...); }' or the offcputime tool from BCC, which records every off-CPU interval and its kernel stack. Brendan Gregg's "Off-CPU Flame Graphs" write-up and the BCC offcputime.py tool are the canonical references.
For "CPU saturation without user load" specifically, off-CPU is rung four — used to confirm whether a background thread is running continuously or firing on a fast cadence with brief blocks between calls. The Flipkart incident was the second case; an off-CPU trace would have shown long blocks if the schedule had been correct and the warmer was simply taking forever per call. The reason this distinction matters operationally: a thread that runs forever on one call is a deadlock or an infinite loop, and the fix is in the application code; a thread that fires every few microseconds is a scheduler-config bug, and the fix is in the configuration. Same on-CPU symptom, completely different remediation; only the off-CPU view tells you which one you have.
The Razorpay UPI 09:00 IST CPU-saturation ladder
Razorpay's payments-platform team (Bangalore, 2024 SREcon talk) published the runbook they follow for "CPU high, RPS flat" on the UPI authorization service. The ladder is the same four rungs as in this chapter, but with a Razorpay-specific extension at rung five: an eBPF tracepoint capture on tcp:tcp_retransmit_skb and block:block_rq_complete for 60 seconds, to rule out the case where the work is actually network-driven retries from a stuck NPCI dependency that the request graph is not capturing because the retries happen below the application's metric library.
The team reports that rung five catches roughly one in eight CPU-saturation pages — the case where a downed dependency is producing retry-storm CPU work that looks indistinguishable from a runaway scheduled task at rungs one through four. The lesson: every team should know which rungs catch which fraction of their incident classes; the ladder is the same, but the proportion of incidents caught at each rung is service-specific and should be measured. A team that has never measured this is operating on hope; a team that has measured it can prioritise tooling investment by which rung produces the most signal.
What to do when the cause is at rung five
Roughly one in twenty CPU-saturation incidents survive all four rungs and need rung five — full eBPF capture, application-internal metrics, scheduling-class introspection. At that point the right response is not to keep climbing the ladder solo; it is to escalate to a second on-caller and a service-owner, and to capture a core dump (gcore <pid>) before the workload is restarted so the post-mortem has a reference point.
Restarting the pod will fix the symptom but destroy the evidence; always capture a core or a heap dump before the rolling restart if the diagnostic ladder has not converged. The previous chapter on heap dumps and core dumps covers the capture mechanics; the discipline this chapter contributes is when to reach for them — only after rung four fails to produce a fix.
Reproduce this on your laptop
sudo apt install linux-tools-generic bpftrace bpfcc-tools \
openjdk-21-jre flamegraph
python3 -m venv .venv && source .venv/bin/activate
pip install hdrh py-spy
# Spin up a tight CPU-burning Python loop in one terminal:
python3 -c 'while True: sum(i*i for i in range(1000))' &
PID=$!
# Climb the ladder in another terminal:
top -H -p $PID
sudo perf top -p $PID
py-spy record -o flame.svg --pid $PID --duration 30
sudo bpftrace -e '
kprobe:finish_task_switch {
    $prev = (struct task_struct *)arg0;
    if ($prev->pid == '$PID') { @off[$prev->pid] = nsecs; }   // target goes off-CPU
    if (tid == '$PID' && @off[tid]) {                         // target back on-CPU
        @off_us = hist((nsecs - @off[tid]) / 1000);
        delete(@off[tid]);
    }
}
interval:s:30 { print(@off_us); exit(); }'
kill $PID
The four commands are the four rungs in miniature. The synthetic Python loop is a single hot thread (flavour 3), so top -H will already tell you the shape; perf top will name the symbol; the py-spy flame graph will show the call path; the off-CPU trace will show roughly zero off-CPU time because the loop never blocks. Vary the loop to match other flavours — a time.sleep(0) adds a tiny off-CPU signature, a requests.get(...) to a 502-ing endpoint produces flavour 2.
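If you would rather not stand up a failing HTTP endpoint, a hedged stand-in for flavour 2 is a socket loop against a closed local port — port 9 below is an assumption; any local port with nothing listening behaves the same:

# retry_storm.py — every "request" is refused instantly, and the loop never backs off.
import socket

while True:
    try:
        with socket.create_connection(("127.0.0.1", 9), timeout=0.01) as s:
            s.sendall(b"GET / HTTP/1.0\r\n\r\n")
    except OSError:
        pass   # connection refused immediately; retry at full speed

Unlike the pure-Python loop, most of this variant's cycles go into the kernel's connect path, so the user/sys split from the "Going deeper" section tilts toward %sy — a useful contrast to watch at rung one.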
Where this leads next
This case study is the first of three in section 15.1's case-study triad. The other two follow the same shape — page, ladder, fix, generalisation — for two other recurring incident classes:
- /wiki/case-memory-leak-that-wasnt — the case where the heap looks like it is leaking but the actual problem is a ThreadLocal cache that never evicts, with the diagnostic ladder centred on heap dumps and reference-graph introspection.
- /wiki/case-p99-spike-that-was-a-gc-tuning-flag — the case where the p99 latency cliff turned out to be a single MaxGCPauseMillis flag that someone set to a value too low for the heap size, with the ladder centred on GC logs and ZGC vs G1 trade-offs.
- /wiki/wall-performance-engineering-is-culture — the part-closing wall that ties the three case studies together: the diagnostic ladder is a cultural artefact, not a tool. Teams that write it down in their runbooks converge on incidents in 45 minutes; teams that reinvent it every page take four hours.
The arc across the three cases: rung-three diagnoses (flame graphs, today's case) catch the most common flavour; rung-four diagnoses (heap dumps, the next case) catch the heap-driven flavour; rung-five diagnoses (GC logs and tuning, the third case) catch the runtime-driven flavour. Each case extends the ladder by one rung and one tool. By the end of the section a reader has a complete production-debug toolbox for the CPU-saturation, memory-leak, and tail-latency families that account for the bulk of paging incidents in steady-state services.
References
- Brendan Gregg, Systems Performance (2nd ed., 2020), Chapter 6 — the canonical CPU-profiling chapter, including the on-CPU/off-CPU framework this chapter's ladder formalises.
- Brendan Gregg, "The Flame Graph" — the ACM Queue article that introduced flame graphs and explains why path-aggregated profiles dominate leaf-aggregated profiles for CPU-saturation diagnosis.
- Brendan Gregg, "Off-CPU Flame Graphs" — the off-CPU complement that catches blocked threads invisible to on-CPU profiling.
- Linux perf wiki — the upstream documentation for the perf record, perf script, and perf top invocations used at rungs two and three.
- bpftrace reference guide — the language reference for the off-CPU snippet at rung four.
- Razorpay payments-platform on UPI debugging (SREcon APAC 2024) — the Razorpay team's runbook for CPU-saturation pages, including the rung-five tracepoint extension.
- /wiki/flame-graphs-in-production — the previous chapter, which establishes the flame-graph capture pattern this case study builds on.
- /wiki/tracepoints-and-dynamic-instrumentation — the chapter on the eBPF probe families that power rungs four and five of the ladder.