Wall: CPU is half the story

At 11:47 IST on a Tuesday, the Swiggy checkout service starts returning p99 = 1.8 s. The SLO is 400 ms. The on-call engineer Karan opens the runbook he has rehearsed: pull a flamegraph from the continuous profiling store, find the fattest bar, blame the slowest function. He pulls the profile. The fattest bar is _recv_from_socket at 4.1% of CPU samples. The next is json.loads at 3.6%. Nothing is anywhere close to the 1.4 s of latency he is trying to explain. Total on-CPU time across the whole flamegraph adds up to 280 ms of CPU per request. The request takes 1,800 ms. The other 1,520 ms is not in the flamegraph at all because it was not on the CPU when the profiler sampled — the thread was sitting in epoll_wait, or queued behind a pthread_mutex_t, or stuck in a recvmsg waiting for a slow downstream Postgres query, and the CPU profiler does not see threads that are not running. After eight chapters of perf, flamegraphs, differential diffs, hardware event sampling, and continuous profiling — every one of which Karan now knows how to use — the dashboard he needs at 11:47 IST is a wall-clock profile, an off-CPU flamegraph, a tail-latency histogram with a coordinated-omission correction, and a queue-depth time-series. None of those live in Part 5.

CPU profiling answers "where is compute being spent" with high fidelity. Most p99 incidents in modern web services answer to a different question: "where is wall time being spent". A service that is 80% blocked on Redis, Postgres, or a downstream HTTP call has a CPU profile dominated by the 20% that ran. The off-CPU half — locks, syscalls, scheduler delay, queueing — needs different tools (wall profiles, off-CPU flamegraphs, eBPF, HdrHistograms) and a different mental model. Part 5 ends here so Parts 6, 7, and 8 can begin.

On-CPU samples are blind to the time you are most likely to lose

A sampling CPU profiler (perf, py-spy, async-profiler in cpu mode) interrupts the CPU at a fixed frequency — say 99 Hz — and records the stack of whichever thread happens to be running on that CPU at that moment. Threads that are not running are not sampled. This is the single most important property of the tool, and it is the one most readers do not internalise until a bad incident teaches them.

The arithmetic is direct. A web service with a 200 ms request budget that spends 30 ms on actual computation and 170 ms blocked on a downstream call uses the CPU for 15% of wall time. A 99 Hz CPU profiler running on that thread captures, on average, 99 × 0.030 = 2.97 samples per request. The other 99 × 0.170 = 16.83 per-request "would-be samples" never fire because the CPU is busy with something else (or idle) while the thread is parked. Aggregate across thousands of requests and the flamegraph still works — it correctly shows where the 30 ms went — but the picture it paints is "this service spends time in json.loads and _recv_from_socket", not "this service spends time waiting for Postgres". The latter is invisible because there is nothing to sample when the thread is asleep.
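The per-request arithmetic generalises to any budget and sampling rate; a two-line sketch (the 30 ms / 170 ms figures are the example above):

```python
def expected_samples(rate_hz: float, on_cpu_s: float, blocked_s: float) -> tuple[float, float]:
    """Expected CPU-profiler samples per request vs 'would-be' samples lost while blocked."""
    return rate_hz * on_cpu_s, rate_hz * blocked_s

on, lost = expected_samples(99, 0.030, 0.170)
print(f"on-CPU samples per request: {on:.2f}")   # 2.97
print(f"samples lost to blocking:  {lost:.2f}")  # 16.83
```

The ratio of the two numbers is exactly the wait-vs-compute split the CPU flamegraph cannot show.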

Why this is not a bug in the profiler: the profiler is doing exactly what it advertises — sampling the running CPU. The flaw is in the reader's assumption that "where the CPU spends time" and "where the request spends time" are the same question. They are the same question only when the service is CPU-bound. In a 2026 web stack — gunicorn fronting Postgres + Redis + a few HTTPS upstreams — almost no request is CPU-bound. A 90/10 wait-vs-compute split is the typical case, not the exceptional one.

An honest framing: the CPU profiler is not lying. It is answering a different question. The reader's job is to know which question they are asking.

The mismatch shows up most cruelly during the kind of incident that has a clear external cause. The Hotstar streaming-router team's 2024 IPL-final incident: a downstream catalogue service started returning p99 = 4 s instead of 80 ms. The router's CPU flamegraph during the incident looked nearly identical to the healthy baseline — the same hot paths, the same percentages, slightly higher absolute sample counts. The router was not doing anything different. It was waiting longer on each call. The CPU profile, the diff against last week, the hardware event counts — all unchanged. The on-call engineer spent 22 minutes reading flamegraphs before realising the answer was not there. A 60-second look at the wall-clock profile (where epoll_wait had ballooned from 12% to 71%) would have ended the incident in 3 minutes.

[Figure: CPU samples vs wall-clock samples for one 200 ms request — an 85/15 wait-vs-compute split. A horizontal timeline shows the thread on-CPU for 30 ms and off-CPU (blocked in epoll_wait / recvmsg) for 170 ms. Dots above the timeline mark CPU-profiler sample firings at 99 Hz: 3 samples, all in the compute window, none while the thread is blocked. Dots below mark wall-profiler firings at the same 99 Hz: 17 samples across the whole window — 3 compute, 14 wait — because blocked threads are sampled too. The CPU flamegraph says "json.loads is hot"; the wall flamegraph says "epoll_wait is hot". Both are true. Only the second answers the latency question.]
The same 200 ms request, sampled by two profilers running at the same 99 Hz frequency. The CPU profiler fires only during the 30 ms on-CPU window, producing 3 samples that all land on compute frames. The wall profiler fires across the whole 200 ms window, producing 17 samples — 3 compute, 14 wait — so the off-CPU contribution is finally visible. Illustrative — sample counts are deterministic in this diagram for clarity; in practice they are Poisson-distributed around the expected value.

The fix is not "stop using CPU profilers". CPU profilers are correct, cheap, and the right answer for the part of the problem they cover. The fix is to carry both views, side by side, and to know which view answers which question. CPU when you suspect a hot loop, a regex, a serialisation hot path, a GC cycle, a deserialisation cliff. Wall when you suspect a slow downstream, a lock, a database, a network hop, a scheduler delay.

What "wall time" actually decomposes into

A request's wall-clock latency is a sum, not a single thing. The arithmetic at the top of this chapter — 30 ms compute + 170 ms wait = 200 ms wall — is the simplest possible decomposition. Real services have more terms: scheduler delay, runtime overhead, lock contention, syscall entries, page faults. Each term needs its own measurement.

Once you accept that wall time matters, the next discipline is to know what wall time is made of. A request's wall-clock budget breaks into a small, fixed taxonomy of states. Knowing the taxonomy is what lets you read a wall-clock flamegraph and say "this 60% block is pthread_cond_wait waiting on a mutex" instead of "this 60% block is mysterious wait".

The states a thread can be in, on Linux, with their typical observable signatures:

  1. Running on-CPU (TASK_RUNNING + on a runqueue + currently scheduled) — this is what CPU profilers see. Cost: actual compute.
  2. Runnable but not on-CPU (TASK_RUNNING but waiting for a CPU) — scheduler delay. Counted by the kernel as "schedstat run_delay". Common when CPU is saturated, when CFS quotas are tight in a Kubernetes container, or when isolcpus is misconfigured.
  3. Sleeping on a syscall (TASK_INTERRUPTIBLE / TASK_UNINTERRUPTIBLE) — blocked on I/O, futex, network. The thread is parked until something wakes it. This is by far the largest fraction of wall time in most web services. Subdivisions matter: blocked on disk I/O is different from blocked on a downstream HTTP call.
  4. Stopped or zombie — terminal states, irrelevant for live profiling.
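The taxonomy above is directly observable: the state letter in /proc/<pid>/task/<tid>/stat is R for running-or-runnable, S for interruptible sleep, D for uninterruptible sleep. A minimal sketch — not a profiler, just a way to see the states with your own eyes:

```python
import os
from collections import Counter

def thread_states(pid: int) -> Counter:
    """Count the scheduler state letter of every thread in a process.

    The state letter is the field after the comm field in
    /proc/<pid>/task/<tid>/stat; comm may contain spaces and parentheses,
    so parse from the last ')'.
    """
    states: Counter = Counter()
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/stat") as f:
                data = f.read()
        except OSError:
            continue                      # thread exited mid-scan
        state = data.rsplit(")", 1)[1].split()[0]
        states[state] += 1
    return states

print(thread_states(os.getpid()))         # e.g. Counter({'R': 1}) — we are running
```

Point it at a gunicorn worker during an incident and the S-vs-R ratio is an instant, zero-dependency read on whether the service is wait-bound.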

A wall-clock profiler captures (1)+(2)+(3); a CPU profiler captures only (1). The asymmetry is the point of this whole chapter, and it is the reason the next ~30 chapters of the curriculum exist. The interesting decomposition lives almost entirely inside state (3), and getting visibility into the why of state (3) is what off-CPU flamegraphs (/wiki/off-cpu-flamegraphs-the-other-half) are designed for. They sample at every context-switch-out, recording the stack at the moment the thread blocked, so the resulting flamegraph shows "we spent 1.2 seconds blocked under requests.get → urllib3.connection.HTTPConnection.send" rather than "we spent 1.2 seconds in some wait state".

Why state (2) — runnable but not on-CPU — quietly causes the worst incidents: it is invisible to the CPU profiler (the thread is not running) and almost invisible to the wall profiler (the thread is technically TASK_RUNNING, just not scheduled). The kernel exposes it via /proc/<pid>/schedstat field 2 ("time spent waiting on a runqueue, in ns") and via the sched:sched_wakeup → sched:sched_switch latency on tracepoints. A Kubernetes pod with CPU throttling enabled can spend 30–40% of its wall time in state (2) during throttle windows — the request thread is ready to run, the CPU is available on the host, but the cgroup quota is exhausted so the kernel parks the thread until the next 100 ms slice. CPU profile says "fine". Wall profile says "fine". p99 says "definitely not fine".

# wall_vs_cpu.py — measure how much of a request budget the CPU profiler is blind to.
# Spawns a worker that does a small CPU loop then a long network wait, then runs
# both py-spy in CPU mode and py-spy in wall-clock mode against the same process.
# Compares where the samples land. This is the single most important calibration
# you can do on any service before you start trusting its flamegraphs.

import os
import socket
import subprocess
import sys
import threading
import time
from pathlib import Path

WORKER_DURATION = 45        # seconds — must cover the 2 s warm-up plus two 20 s profile runs
CPU_BURN_MS = 30            # per "request"
NET_WAIT_MS = 170           # per "request" (mocks a slow downstream)

def cpu_burn(ms: int) -> int:
    """Tight loop for ~ms milliseconds; returns work done."""
    end = time.perf_counter_ns() + ms * 1_000_000
    n = 0
    while time.perf_counter_ns() < end:
        n += 1
    return n

def fake_request() -> None:
    """One request: a little compute, then a long blocking wait."""
    cpu_burn(CPU_BURN_MS)
    time.sleep(NET_WAIT_MS / 1000)         # stand-in for a Postgres call

def worker() -> None:
    end = time.perf_counter() + WORKER_DURATION
    while time.perf_counter() < end:
        fake_request()

def run_pyspy(pid: int, mode: str, out: Path, secs: int = 20) -> None:
    """mode='cpu' uses --idle off (CPU only); mode='wall' uses --idle on."""
    cmd = ["py-spy", "record",
           "-p", str(pid), "-d", str(secs),
           "-r", "99", "-o", str(out)]
    if mode == "wall":
        cmd.append("--idle")
    subprocess.run(cmd, check=True, capture_output=True)

def main() -> None:
    t = threading.Thread(target=worker, daemon=True)
    t.start()
    time.sleep(2)                          # let the worker reach steady state

    pid = os.getpid()
    cpu_svg = Path("/tmp/flame_cpu.svg")
    wall_svg = Path("/tmp/flame_wall.svg")
    run_pyspy(pid, "cpu", cpu_svg)
    run_pyspy(pid, "wall", wall_svg)
    print(f"cpu  flamegraph: {cpu_svg}  ({cpu_svg.stat().st_size} bytes)")
    print(f"wall flamegraph: {wall_svg}  ({wall_svg.stat().st_size} bytes)")

if __name__ == "__main__":
    main()
# Sample run on a 4-core c6i.xlarge in ap-south-1:
$ python3 wall_vs_cpu.py
cpu  flamegraph: /tmp/flame_cpu.svg  (24817 bytes)
wall flamegraph: /tmp/flame_wall.svg  (28104 bytes)

# Inspecting the two SVGs (or running py-spy with --format speedscope and grep):
# CPU flamegraph top frames:
#   84.1%  cpu_burn        (the tight loop)
#   12.6%  fake_request
#    2.1%  worker
#    1.2%  <other Python overhead>
#
# Wall flamegraph top frames:
#   85.4%  time.sleep      (the off-CPU wait the CPU profile cannot see)
#   12.7%  cpu_burn
#    1.2%  fake_request
#    0.7%  <other>
#
# The CPU profile says: "your hot path is cpu_burn".
# The wall profile says: "your hot path is time.sleep — the network wait".
# Same process, same time window, completely different diagnoses.

Walk-through. fake_request does 30 ms of compute followed by 170 ms of time.sleep — a 15/85 compute-vs-wait split, which is the realistic shape of a checkout-service request talking to a slow downstream. run_pyspy(..., mode='cpu') runs py-spy without --idle, so it samples only on-CPU threads; run_pyspy(..., mode='wall') runs the same profiler with --idle, which sets py-spy's wall-clock mode and samples every thread regardless of state. The flamegraph headlines are the diagnosis: cpu_burn dominates the CPU view because that is the only time the thread is on-CPU; time.sleep dominates the wall view because that is where the wall-clock time is spent. The numbers match the 15/85 split almost exactly. Run this calibration once on any service you own and the lesson sticks.

The same script with --idle becomes the every-day workhorse: keep wall-clock mode on for production continuous profiling, fall back to pure CPU mode when you specifically want to know "is this hot path CPU-expensive?". The Pyroscope and Datadog continuous profilers in 2026 default to wall-clock; py-spy still defaults to CPU-only, so the --idle flag is the single most important argument the tool takes.

The 15/85 split in the toy example is roughly the shape of a realistic Razorpay payments path: ~30 ms of compute (parse JSON, validate the signature, build the outbound payload for the bank API, decode the response) split across two synchronous downstream hops totalling ~170 ms of network wait. A naive flamegraph audit of this service that ran only py-spy record -p $pid (no --idle) would conclude that the service spends almost all of its time in cpu_burn-equivalent compute, which is technically true within the on-CPU view and dangerously misleading at the wall-clock level. Engineers reading such a flamegraph routinely propose optimisations — caching the JSON parse, inlining the signature check — that produce measurable but tiny gains because the actual latency budget is dominated by the downstream wait the CPU view never showed them. The first thing every team should do after standing up a CPU profiler is to run the wall version next to it; the gap between the two is the budget your CPU optimisations cannot touch.

Why CPU profiles still matter — the half they get right

The point of this chapter is not to abandon CPU profiling. It is to file CPU profiling correctly: as the right tool for one half of the problem.

The CPU half is where the Part-5 chapters keep paying off: hot loops, regex backtracking, serialisation hot paths, GC cycles, deserialisation cliffs — anywhere the request's cost genuinely is compute.

The honest test for whether a CPU profile is the right tool is to ask: what fraction of my request's wall time is spent on-CPU? If the answer is above 50%, CPU profiling is your primary tool and wall profiling is supporting evidence. If the answer is below 30%, wall profiling is your primary tool and CPU profiling is supporting evidence. The Razorpay payments path runs around 35% on-CPU — wall-first. The Zerodha order-matching engine runs around 78% on-CPU — CPU-first. The Hotstar streaming router runs around 8% on-CPU — wall-first by a wide margin. Knowing this number for your own service before the next incident is what separates teams that close incidents in minutes from teams that close them in hours.
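You can get a first-order estimate of this number without any profiler at all: compare process CPU time to wall time over a steady-state window. A standard-library sketch (waity_request is a hypothetical stand-in for your service's request handler; note that process_time counts CPU across all threads in the process, so run this single-threaded or per-worker):

```python
import time

def on_cpu_fraction(workload, *, repeat: int = 5) -> float:
    """Process CPU time divided by wall time across `repeat` runs of `workload`."""
    wall0, cpu0 = time.perf_counter(), time.process_time()
    for _ in range(repeat):
        workload()
    wall = time.perf_counter() - wall0
    cpu = time.process_time() - cpu0
    return cpu / wall

def waity_request():                       # ~15/85 compute-vs-wait stand-in
    t_end = time.perf_counter() + 0.006    # ~6 ms of compute
    while time.perf_counter() < t_end:
        pass
    time.sleep(0.034)                      # ~34 ms blocked, like a downstream call

frac = on_cpu_fraction(waity_request)
print(f"on-CPU fraction: {frac:.0%}")      # well under 50% → wall profiler primary
```

Wrap the measurement around real request handling in a staging pod and you have the primary-profiler decision for your service in one number.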

There is a temptation to optimise for the on-CPU fraction itself — to refactor a wall-bound service into a CPU-bound one by moving downstream calls into the same process, batching them, or replacing them with cached precomputation. Sometimes this is the right move: the IRCTC Tatkal queue's 2024 redesign moved seat-availability checks from a synchronous Postgres call into an in-process LRU populated by a background syncer, lifting the on-CPU fraction from 22% to 58% and dropping p99 from 1.8 s to 320 ms during the 10:00 IST burst. But the move is architectural, not observability-driven; you do it because the downstream is genuinely the bottleneck, not because you want a prettier CPU flamegraph. Treating the on-CPU fraction as a metric to optimise (rather than as a signal that tells you which profiler to read first) is a cargo-cult that ends with teams pretending their service is faster than it is.

[Figure: on-CPU fraction of wall time across representative Indian services, as a horizontal bar chart with a vertical line at the 50% threshold dividing "wall profiler primary" (left) from "CPU profiler primary" (right):
  Hotstar streaming router    8%  — wall primary
  Swiggy checkout            18%  — wall primary
  Razorpay payments          35%  — wall primary
  IRCTC Tatkal queue         42%  — wall primary
  Flipkart catalogue search  64%  — CPU primary
  Zerodha order matcher      78%  — CPU primary
Methodology: per-request CPU time / wall time, measured with py-spy --idle over a 5-minute steady-state window, p50.]
The on-CPU fraction is a property of the workload, not the language. A Python service can be 78% on-CPU (Zerodha's matcher, doing real computation per order) and a Go service can be 8% on-CPU (Hotstar's router, mostly waiting on downstream catalogue calls). Measure your own number once. The right primary profiler follows. Illustrative — numbers are representative of public engineering blog disclosures, not a single audited dataset.

Why this table changes how you stand up monitoring: a wall-primary service gets the wrong on-call alerts when the only profile dashboard is CPU-flamegraphs. The team will repeatedly close "p99 elevated" tickets as "no smoking gun in the flamegraph" — because the smoking gun is in a flamegraph the team does not look at. Stand up the wall-clock dashboard first for any service whose on-CPU fraction is below 50%, and the same alerts close in minutes instead of being closed as inconclusive.

What this means for the rest of the curriculum

Part 5 has equipped you with five techniques: read flamegraphs, generate them with perf, compare them with diffs, sample hardware events with PEBS/IBS, and store them continuously. All five operate on the on-CPU half. The parts that follow each take a deliberate run at the off-CPU half, from a different angle.

Before moving on, a sanity check on what Part 5 actually delivered. Reading a CPU flamegraph is no longer a mystery — you can find the fattest bar, follow the call chain into a hot function, and reason about why it is hot from frame-pointer-walked stacks. Generating one in production is no longer a mystery — perf record -F 99 -g for an ad-hoc capture, py-spy record --idle for the wall variant, the continuous profiling agent shipping pprof every 10 seconds. Comparing two of them is no longer a mystery — diff flamegraphs render the deploy-to-deploy regression in red and blue. Asking microarchitectural questions of one is no longer a mystery — PEBS/IBS overlays cache miss, branch mispredict, and frontend-bound classification onto the same stacks. The skills compose. The skill that does not yet exist in this curriculum is the one that turns "this thread is blocked" from a black box into a queryable event stream — that is Part 6's job.

Part 6 — eBPF. The kernel has perfect visibility into context switches, syscall enters and exits, scheduler events, network drops, and disk-queue depth. It just did not expose it cheaply until eBPF. An off-CPU flamegraph is, mechanically, an eBPF program that hooks sched_switch and records the stack at every block-out. Once you have eBPF, the whole "off-CPU half" becomes as observable as the CPU half — at less than 1% overhead. Part 6 makes the kernel into the second profiler.

Part 7 — latency and tail latency. A flamegraph is one signal; a histogram of per-request wall-clock latency is another. Both are needed. Part 7 covers HdrHistograms, p99/p99.9/p99.99 percentile ladders, coordinated omission (the reason naive histograms underestimate the tail), and the "tail at scale" argument from Dean and Barroso. The bridge from Part 5 is: a flamegraph tells you what costs time; a histogram tells you how the costs are distributed across requests. Most production incidents need both readings together.

Part 8 — queueing theory. When the off-CPU time is "waiting in a queue", queueing theory is the only discipline that gives you a closed-form prediction of when latency cliffs at ρ ≈ 0.85. Wall profiles show you that a thread is blocked; queueing theory tells you whether the block is fundamental (you are saturating a resource) or contingent (someone else's bug). The mental shift is from "trace where the time went" to "model where the time had to go".

Part 9 — parallel scaling. A flamegraph that says 35% of wall time is in pthread_cond_wait is one signal. A scaling curve that flattens at 12 cores instead of 32 is the same signal viewed through Amdahl's lens. Part 9 connects the wall-clock view to the architectural ceiling — the serial fraction that no flamegraph can show you directly but that explains why doubling cores rarely doubles throughput.

Part 13 — language runtime. GC pauses, JIT compilation, escape analysis, and inline caches are wall-clock events that often appear in CPU profiles only as their cleanup work. The full picture — "this 12 ms pause is a G1 mixed-collection that will repeat every 30 seconds" — needs runtime-specific tooling that builds on the wall-clock thinking from this chapter.

The arc through these parts is unified by one mental model: a request's wall-clock budget is a sum across thread states, and every chapter from here forward is a different way of looking at the non-on-CPU states. CPU profiling — every Part-5 tool you just spent eight chapters learning — covers exactly one of those states, well. You needed to learn it well first because the techniques (sampling, flamegraphs, diffs, continuous collection) generalise to every other state once eBPF lets you instrument them. The shape of the tool stays the same; only the trigger changes.

A practical reading of this arc, for the engineer who wants to put the curriculum to work tomorrow: ship a wall-clock continuous profiler alongside the existing CPU one (one flag — --idle for py-spy, wall mode for async-profiler — and a second tag dimension on the ingestion side). Add runqlat from the bcc tools to the per-pod dashboard for scheduler-delay visibility. Wire up an HdrHistogram-backed latency dashboard at p50, p99, p99.9, p99.99 with coordinated-omission-aware tooling. None of this requires waiting for Parts 6–8 to land in your reading queue. The thinking from this chapter — the on-CPU view is half the picture — is enough to motivate the operational changes before the formal machinery arrives.

Edge cases the wall view itself misses

Wall-clock profiling is not a panacea. It has its own blind spots, and reading a wall flamegraph as if it told the whole truth is the same mistake one level up.

The first edge case is micro-blocks. A wall profiler running at 99 Hz samples once every 10 ms. A thread that blocks for 2 ms on a contended lock 50 times per second is on-CPU 90% of the time and off-CPU 10%, but each individual block is shorter than the inter-sample gap. The wall flamegraph captures the aggregate (10% of samples in pthread_mutex_lock) correctly but cannot tell you whether the block was 50× 2 ms or 1× 100 ms — and those two scenarios have different fixes (the first is contention; the second is a deadlock or a slow critical section). For sub-sample-period blocks, you need an event-driven tracer, not a sampler.
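The aggregate-vs-shape distinction is easy to demonstrate with a toy simulation (not a real profiler): a deterministic 1-second timeline sampled every ~10 ms attributes the same total blocked time whether it was spent as 50 short blocks or one long one.

```python
SAMPLE_HZ = 99

def samples_in_blocked(blocked_intervals, duration=1.0, hz=SAMPLE_HZ):
    """Count sampler firings (one every 1/hz seconds) landing inside a blocked interval."""
    hits = 0
    for i in range(int(duration * hz)):
        t = i / hz
        if any(start <= t < start + length for start, length in blocked_intervals):
            hits += 1
    return hits

# Scenario A: 50 micro-blocks of 2 ms, one every 20 ms (100 ms blocked in total)
micro = [(k * 0.020, 0.002) for k in range(50)]
# Scenario B: a single 100 ms block
macro = [(0.45, 0.100)]

print(samples_in_blocked(micro), samples_in_blocked(macro))  # → 10 10: identical aggregates
```

Both scenarios produce the same ~10% of samples in the blocked state, so the flamegraphs are indistinguishable even though the fixes differ.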

The second edge case is kernel-side waits that never schedule the thread out. A thread spinning briefly on an adaptive mutex (the kind used in modern jemalloc or in some Linux futex paths) stays on-CPU during the spin — sometimes for tens of microseconds — before yielding. From the wall profiler's perspective the thread is on-CPU; from the user's perspective the thread is making no progress. PEBS-based memory-stall sampling sees the stall; wall sampling sees a hot CPU function. This is one of the few places where hardware event sampling (/wiki/hardware-event-sampling-pebs-ibs) reads the room better than wall sampling.

The third edge case is time spent in interrupt context, softirqs, and the kernel's own work. A user-space wall profiler attached to a Python process sees only the Python thread's states. The 200 µs per packet that the kernel spent doing softirq RX processing on the same CPU, indirectly slowing your thread's compute — invisible. eBPF-based system-wide profilers see this; user-space wall profilers do not. For services where kernel time is meaningful (high-PPS network paths, heavy disk I/O), user-space wall profiling needs a system-wide sidecar to fill in the gap.

These caveats are not arguments against wall profiling. They are arguments for adding a third and fourth tool — eBPF in Part 6, hardware event sampling within Part 5 — once wall profiling is in place. The general lesson: every profiler has a state space it can see and one it cannot. The diagnostic skill is knowing which states each tool covers and switching tools when the question crosses a boundary.

Going deeper

Why perf record has an off-CPU mode and almost nobody uses it

perf record -e sched:sched_switch --call-graph=fp records a stack at every context switch. Combined with perf report --children or a folded-stack post-process, this is a real off-CPU flamegraph — and it predates eBPF by years. The reason it is rarely used is that the volume of sched_switch events on a busy system is enormous: a 32-core box doing 100k syscalls/sec/core can switch a million times per second, blowing through perf's ring buffer before the user gets a chance to read it. eBPF-based off-CPU profilers solve this by aggregating in-kernel via BPF maps (count by stack-id, not record by event), which reduces the data volume by 100–1000×. The Brendan Gregg offcputime tool in bcc and bpftrace is the canonical example. Once eBPF is normal, perf-based off-CPU sampling is technically possible but operationally pointless.

A useful intermediate technique that bridges the two eras is perf record -F 99 --off-cpu (Linux 6.2+), which uses BPF under the hood to do the in-kernel aggregation while presenting the familiar perf user interface. For teams that have invested heavily in perf workflows but want the off-CPU view without rewriting their tooling around bpftrace, this is the lowest-friction adoption path. Linux 6.2 landed in early 2023 and is the default kernel on Ubuntu 24.04 and RHEL 9.4+, so most production fleets have it available without any additional installation. The output integrates with the same flamegraph generation pipeline (stackcollapse-perf.pl | flamegraph.pl), so existing dashboards keep working.

The async-profiler wall mode and why JVM teams adopted wall first

The JVM ecosystem solved the on-CPU vs wall problem earlier than Python or Go because async-profiler shipped a wall mode in 2018, when wall-clock sampling was still niche on Linux generally. The reason: JVM services were already deeply observability-tooled (JFR, JMX, gradle benchmarks, JMH), and the gap between "JVM CPU% is fine" and "p99 is bad" was visible to every JVM team running any non-trivial backend. async-profiler's -e wall runs the same AsyncGetCallTrace sampler but at a wall-clock frequency on every thread, producing a flamegraph that is directly comparable to its -e cpu output. Java engineers in 2026 default to running both. Python (py-spy --idle) and Go (Datadog's continuous wall profiler, Pyroscope's goroutine and block profile types) caught up later — they are now equivalent in capability, but the cultural muscle of "always run both" is still developing.

The Go ecosystem in particular took a slightly different route: rather than a single "wall" profile, Go's runtime exposes four separate pprof endpoints — CPU, goroutine, block, and mutex. The block and mutex profiles are time-weighted at the runtime layer (the Go scheduler knows exactly how long each goroutine slept), which means they are more accurate for blocked-time attribution than a sampling wall profiler ever can be. The trade-off is that the block profile only captures events the runtime has been told to instrument, controlled by runtime.SetBlockProfileRate. Most production Go services run with the rate at zero by default and turn it on only during incidents, which then defeats the "already had the profile when it happened" property that motivated continuous profiling in the first place. The Pyroscope-Go integration in 2024 fixed this by setting a low non-zero rate (sample 1 in 10,000 blocks) continuously, which approximates a true continuous wall profile at negligible overhead.

Reading wall flamegraphs without falling into the wait trap

A subtle pitfall in wall profiles is that idle threads dominate the picture. A gunicorn worker pool with 32 workers serving 8 concurrent requests has 24 workers sitting in epoll_wait doing nothing. A naive wall flamegraph looks like the entire service is in epoll_wait — because, on a wall-clock basis, it is. The fix is to filter the wall profile to threads that are active in the request lifecycle: tag threads via prctl(PR_SET_NAME) or use the language runtime's request-trace correlation (Datadog's APM, Pyroscope's tag-correlation), and then only render samples whose thread tag matches "active request". The Razorpay payments team's runbook for wall flamegraphs starts with: "filter to thread_name == request_handler before reading anything else". A few minutes of dashboard hygiene up-front saves hours of misdiagnosis later.
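The filtering step is a few lines over collapsed stacks. A sketch, assuming the folded-stack convention ('frame;frame;... COUNT' per line) with the thread name folded in as the root frame, which is roughly what py-spy emits with --format raw plus --threads — the exact frame text varies by tool and version, and the sample lines below are invented for illustration:

```python
def filter_folded(lines, thread_tag):
    """Keep folded-stack lines whose root frame mentions the given thread tag."""
    return [ln for ln in lines if thread_tag in ln.split(";", 1)[0]]

# Hypothetical folded stacks from a wall profile of a 32-worker pool:
folded = [
    'Thread (request_handler);handle;parse_json 12',
    'Thread (request_handler);handle;recv_wait 88',
    'Thread (epoll_idle);epoll_wait 900',
]
for line in filter_folded(folded, "request_handler"):
    print(line)
```

Feed the filtered lines to flamegraph.pl and the idle-worker epoll_wait tower disappears, leaving only the wall time that belongs to live requests.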

A complementary technique, used by the Zerodha matcher team, is to scope the profile to a trace span rather than a thread. When a request traverses three threads — accept loop on thread A, parser on thread B, writer on thread C — filtering by thread name shows only one third of the wall time. Using the OpenTelemetry trace ID propagated via thread-local storage as the filter dimension shows the request's full wall-time budget across all three threads, with the per-thread breakdown still visible. This is the mode Pyroscope's "span profiles" feature targets, and it is the future of wall profiling for distributed-tracing-shaped services. For now, thread-name filtering is the universally supported approximation; trace-ID filtering is the upgrade path.

Reproduce this on your laptop

# Reproduce the CPU-vs-wall calibration on a local Python service
# py-spy attaches over ptrace; with Yama ptrace_scope=1 (the Ubuntu default)
# either run the script under sudo or relax the scope for this session:
echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope
python3 -m venv .venv && source .venv/bin/activate
pip install py-spy
python3 wall_vs_cpu.py
# Compare /tmp/flame_cpu.svg (cpu_burn dominates) with
# /tmp/flame_wall.svg (time.sleep dominates) — same process, same window.

Where the on-CPU fraction comes from architecturally

The on-CPU fraction of a service is not arbitrary. Three architectural decisions dominate it. First, the synchronous-vs-asynchronous I/O choice: a service that calls Postgres synchronously per request waits in recvmsg; one that uses an async pool can pipeline waits but still spends wall time waiting. Second, the downstream count: a service with N synchronous downstream calls per request has roughly N× the wait budget of one with a single call, regardless of how fast each individual call is. Third, the cache hit rate: a hot Redis cache that serves 95% of reads in < 1 ms moves a service's on-CPU fraction much higher because the wait portion shrinks. The Hotstar router's 8% on-CPU is explained by all three: synchronous catalogue calls, two downstream hops per request, and a cache hit rate that is high in steady state but drops during traffic spikes — exactly when p99 cliffs.

A useful corollary is that the on-CPU fraction moves with load. At 10% offered load, a Razorpay payments-API pod might be 50% on-CPU because most downstream calls hit warm caches. At 70% offered load, the same pod drops to 25% on-CPU as cache pressure climbs and downstream hops slow. At 95% offered load, the pod can sit at 12% on-CPU, with the rest of wall time stuck in epoll_wait for downstream Postgres pools that are themselves saturated. The implication: the right primary profiler can change between a quiet Tuesday morning and a Big Billion Days Friday at 16:00. Continuous profilers that store both views side-by-side (/wiki/continuous-profiling-in-production) let you see the shift happen as load climbs, which is itself a leading indicator of an upcoming saturation incident.

Between the CPU profiler's "on-CPU" and the wall profiler's "blocked-on-syscall" lies a quieter cost: scheduler delay. A thread is TASK_RUNNING, the kernel knows it wants to run, but no CPU is available to run it on. On a saturated host, on a Kubernetes pod hitting its CFS quota ceiling, on a NUMA node where the scheduler is rebalancing, this can add tens or hundreds of milliseconds per request to wall time — none of which appears in either flamegraph because the thread is technically neither running nor sleeping. The kernel exposes the cost in /proc/<pid>/schedstat (the second field, run-delay in ns) and via eBPF runqlat from the bcc tools collection. The number is usually under a millisecond on a healthy host; when it climbs into the 50–500 ms range you are seeing CPU throttling or runqueue saturation, and no amount of CPU or wall profiling will diagnose it without this third measurement. The Flipkart catalogue team's 2025 internal SRE handbook makes runqlat a default panel on every pod-level dashboard — alongside CPU% and wall flamegraph — for exactly this reason.
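A minimal sketch of reading that third measurement, assuming the standard three-field schedstat layout (time on CPU in ns, run-delay in ns, number of timeslices); the `pid` parameter and dict keys are names chosen here for illustration:

```python
def parse_schedstat(text):
    """Parse /proc/<pid>/schedstat: 'on_cpu_ns run_delay_ns timeslices'."""
    on_cpu_ns, run_delay_ns, timeslices = (int(x) for x in text.split())
    return {"on_cpu_ms": on_cpu_ns / 1e6,
            "run_delay_ms": run_delay_ns / 1e6,   # the second field
            "timeslices": timeslices}

def run_delay_ms(pid="self"):
    """Cumulative scheduler run-delay for a process; Linux-only."""
    try:
        with open(f"/proc/{pid}/schedstat") as f:
            return parse_schedstat(f.read())["run_delay_ms"]
    except (FileNotFoundError, ValueError):
        return None  # schedstats unavailable on this kernel/platform
```

The value is cumulative since process start, so sample it twice and subtract to get run-delay per interval; a per-request figure falls out by dividing by the request count over the same window.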

A particularly nasty variant of scheduler delay on Kubernetes is the CFS bandwidth bug that periodically resurfaces in different kernel versions: when a pod's CPU limit is set to a fraction (say 1.5 cores), the kernel allocates a 150 ms quota refilled every 100 ms period, and a multi-threaded workload that briefly bursts above the limit can have all its threads parked until the next 100 ms slice — even on hosts with idle CPUs available. The wall profile shows threads in TASK_RUNNING, no syscall hot frame, no obvious culprit. The diagnosis requires reading /sys/fs/cgroup/cpu.stat for nr_throttled and throttled_usec (throttled_time, in nanoseconds, on cgroup v1), which is the only place the cost surfaces. Hotstar, Razorpay, and Zerodha all run dedicated Grafana panels for nr_throttled / nr_periods on every pod, precisely because the alternative — debugging from flamegraphs alone — does not work for this class of incident.
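Computing that ratio is a few lines; the sketch below assumes the cgroup-v2 `key value` line format of cpu.stat, and the function name is ours:

```python
def throttle_ratio(cpu_stat_text):
    """nr_throttled / nr_periods from the contents of a cgroup-v2 cpu.stat."""
    stats = {}
    for line in cpu_stat_text.splitlines():
        key, _, value = line.partition(" ")
        if value.strip():
            stats[key] = int(value)
    periods = stats.get("nr_periods", 0)
    return stats.get("nr_throttled", 0) / periods if periods else 0.0

# Usage on a live pod (cgroup v2):
#   with open("/sys/fs/cgroup/cpu.stat") as f:
#       print(throttle_ratio(f.read()))
```

Like schedstat, the counters are cumulative, so the operationally useful signal is the delta between scrapes; a ratio that climbs above a few percent per interval is the fingerprint of this incident class.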

Where this leads next

The single sentence to take away from Part 5: a CPU flamegraph is a precise answer to "where is compute being spent" and a misleading answer to "where is wall time being spent". Both questions matter. They are not the same question.

This is the closing chapter of Part 5. Every CPU-profiling tool — perf, py-spy, async-profiler, flamegraphs, differential flamegraphs, hardware event sampling, continuous profiling — now lives in your hands as a means to an end, not the end itself. The end is closing latency incidents in minutes. The CPU half is solved. The next four parts close the wall-time half.

Part 6 (eBPF) makes the kernel observable, which finally puts an off-CPU flamegraph on the same operational footing as a CPU one. Part 7 (tail latency) replaces the implicit "the mean is fine" assumption with HdrHistograms, p99.9, and the coordinated-omission correction. Part 8 (queueing theory) gives you closed-form predictions for the latency cliff at ρ ≈ 0.85 — the answer to "why does adding 10% more load melt p99". Each builds on the wall-clock thinking this chapter forces you to adopt.

The single most useful thing you can do tomorrow morning, before reading any further, is to run wall_vs_cpu.py (or its production equivalent) against the most important service you own, write down the on-CPU fraction, and pin it to your team's wiki. The number changes how you read every flamegraph for the rest of your career. A team that knows its on-CPU fraction debugs incidents in minutes. A team that does not, debugs them in hours.

References