Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Wall: CPU is half the story

At 11:47 IST on a Tuesday, the BhojanBox checkout service starts returning p99 = 1.8 s. The SLO is 400 ms. The on-call engineer Karan opens the runbook he has rehearsed: pull a flamegraph from the continuous profiling store, find the fattest bar, blame the slowest function. He pulls the profile. The fattest bar is _recv_from_socket at 4.1% of CPU samples. The next is json.loads at 3.6%. Nothing is anywhere close to the 1.4 s of latency he is trying to explain. Total on-CPU time across the whole flamegraph adds up to 280 ms of CPU per request. The request takes 1,800 ms. The other 1,520 ms is not in the flamegraph at all because it was not on the CPU when the profiler sampled — the thread was sitting in epoll_wait, or queued behind a pthread_mutex_t, or stuck in a recvmsg waiting for a slow downstream Postgres query, and the CPU profiler does not see threads that are not running. After eight chapters of perf, flamegraphs, differential diffs, hardware event sampling, and continuous profiling — every one of which Karan now knows how to use — the dashboard he needs at 11:47 IST is a wall-clock profile, an off-CPU flamegraph, a tail-latency histogram with a coordinated-omission correction, and a queue-depth time-series. None of those live in Part 5.

CPU profiling answers "where is compute being spent" with high fidelity. Most p99 incidents in modern web services answer to a different question: "where is wall time being spent". A service that is 80% blocked on Redis, Postgres, or a downstream HTTP call has a CPU profile dominated by the 20% that ran. The off-CPU half — locks, syscalls, scheduler delay, queueing — needs different tools (wall profiles, off-CPU flamegraphs, eBPF, HdrHistograms) and a different mental model. Part 5 ends here so Parts 6, 7, and 8 can begin.

A sampling CPU profiler (perf, py-spy, async-profiler in cpu mode) interrupts the CPU at a fixed frequency — say 99 Hz — and records the stack of whichever thread happens to be running on that CPU at that moment. Threads that are not running are not sampled. This is the single most important property of the tool, and it is the one most readers do not internalise until a bad incident teaches them.

The arithmetic is direct. A web service with a 200 ms request budget that spends 30 ms on actual computation and 170 ms blocked on a downstream call uses the CPU for 15% of wall time. A 99 Hz CPU profiler running on that thread captures, on average, 99 × 0.030 = 2.97 samples per request. The other 99 × 0.170 = 16.83 per-request "would-be samples" never fire because the CPU is busy with something else (or idle) while the thread is parked. Aggregate across thousands of requests and the flamegraph still works — it correctly shows where the 30 ms went — but the picture it paints is "this service spends time in json.loads and _recv_from_socket", not "this service spends time waiting for Postgres". The latter is invisible because there is nothing to sample when the thread is asleep.

Why this is not a bug in the profiler: the profiler is doing exactly what it advertises — sampling the running CPU. The flaw is in the reader's assumption that "where the CPU spends time" and "where the request spends time" are the same question. They are the same question only when the service is CPU-bound. In a 2026 web stack — gunicorn fronting Postgres + Redis + a few HTTPS upstreams — almost no request is CPU-bound. A 90/10 wait-vs-compute split is the typical case, not the exceptional one.

An honest framing: the CPU profiler is not lying. It is answering a different question. The reader's job is to know which question they are asking.

The mismatch shows up most cruelly during the kind of incident that has a clear external cause. The SetuStream streaming-router team's 2024 IPL-final incident: a downstream catalogue service started returning p99 = 4 s instead of 80 ms. The router's CPU flamegraph during the incident looked nearly identical to the healthy baseline — the same hot paths, the same percentages, slightly higher absolute sample counts. The router was not doing anything different. It was waiting longer on each call. The CPU profile, the diff against last week, the hardware event counts — all unchanged. The on-call engineer spent 22 minutes reading flamegraphs before realising the answer was not there. A 60-second look at the wall-clock profile (where epoll_wait had ballooned from 12% to 71%) would have ended the incident in 3 minutes.

The same 200 ms request, sampled by two profilers running at the same 99 Hz frequency. The CPU profiler fires only during the 30 ms on-CPU window, producing 3 samples that all land on compute frames. The wall profiler fires across the whole 200 ms window, producing 17 samples — 3 compute, 14 wait — so the off-CPU contribution is finally visible. Illustrative — sample counts are deterministic in this diagram for clarity; in practice they are Poisson-distributed around the expected value.

The fix is not "stop using CPU profilers". CPU profilers are correct, cheap, and the right answer for the part of the problem they cover. The fix is to carry both views, side by side, and to know which view answers which question. CPU when you suspect a hot loop, a regex, a serialisation hot path, a GC cycle, a deserialisation cliff. Wall when you suspect a slow downstream, a lock, a database, a network hop, a scheduler delay.

What "wall time" actually decomposes into

A request's wall-clock latency is a sum, not a single thing. The arithmetic at the top of this chapter — 30 ms compute + 170 ms wait = 200 ms wall — is the simplest possible decomposition. Real services have more terms: scheduler delay, runtime overhead, lock contention, syscall entries, page faults. Each term needs its own measurement.

Once you accept that wall time matters, the next discipline is to know what wall time is made of. A request's wall-clock budget breaks into a small, fixed taxonomy of states. Knowing the taxonomy is what lets you read a wall-clock flamegraph and say "this 60% block is pthread_cond_wait waiting on a mutex" instead of "this 60% block is mysterious wait".

The states a thread can be in, on Linux, with their typical observable signatures:

Running on-CPU (TASK_RUNNING + on a runqueue + currently scheduled) — this is what CPU profilers see. Cost: actual compute.
Runnable but not on-CPU (TASK_RUNNING but waiting for a CPU) — scheduler delay. Counted by the kernel as "schedstat run_delay". Common when CPU is saturated, when CFS quotas are tight in a Kubernetes container, or when isolcpus is misconfigured.
Sleeping on a syscall (TASK_INTERRUPTIBLE / TASK_UNINTERRUPTIBLE) — blocked on I/O, futex, network. The thread is parked until something wakes it. This is by far the largest fraction of wall time in most web services. Subdivisions matter: blocked on disk I/O is different from blocked on a downstream HTTP call.
Stopped or zombie — terminal states, irrelevant for live profiling.

A wall-clock profiler captures (1)+(2)+(3); a CPU profiler captures only (1). The asymmetry is the point of this whole chapter, and it is the reason the next ~30 chapters of the curriculum exist. The interesting decomposition lives almost entirely inside state (3), and getting visibility into the why of state (3) is what off-CPU flamegraphs (/wiki/off-cpu-flamegraphs-the-other-half) are designed for. They sample at every context-switch-out, recording the stack at the moment the thread blocked, so the resulting flamegraph shows "we spent 1.2 seconds blocked under requests.get → urllib3.connection.HTTPConnection.send" rather than "we spent 1.2 seconds in some wait state".

Why state (2) — runnable but not on-CPU — quietly causes the worst incidents: it is invisible to the CPU profiler (the thread is not running) and almost invisible to the wall profiler (the thread is technically TASK_RUNNING, just not scheduled). The kernel exposes it via /proc/<pid>/schedstat field 2 ("time spent waiting on a runqueue, in ns") and via the sched:sched_wakeup → sched:sched_switch latency on tracepoints. A Kubernetes pod with CPU throttling enabled can spend 30–40% of its wall time in state (2) during throttle windows — the request thread is ready to run, the CPU is available on the host, but the cgroup quota is exhausted so the kernel parks the thread until the next 100 ms slice. CPU profile says "fine". Wall profile says "fine". p99 says "definitely not fine".

# wall_vs_cpu.py — measure how much of a request budget the CPU profiler is blind to.
# Spawns a worker that does a small CPU loop then a long network wait, then runs
# both py-spy in CPU mode and py-spy in wall-clock mode against the same process.
# Compares where the samples land. This is the single most important calibration
# you can do on any service before you start trusting its flamegraphs.

import os
import socket
import subprocess
import sys
import threading
import time
from pathlib import Path

WORKER_DURATION = 30        # seconds
CPU_BURN_MS = 30            # per "request"
NET_WAIT_MS = 170           # per "request" (mocks a slow downstream)

def cpu_burn(ms: int) -> int:
    """Tight loop for ~ms milliseconds; returns work done."""
    end = time.perf_counter_ns() + ms * 1_000_000
    n = 0
    while time.perf_counter_ns() < end:
        n += 1
    return n

def fake_request() -> None:
    """One request: a little compute, then a long blocking wait."""
    cpu_burn(CPU_BURN_MS)
    time.sleep(NET_WAIT_MS / 1000)         # stand-in for a Postgres call

def worker() -> None:
    end = time.perf_counter() + WORKER_DURATION
    while time.perf_counter() < end:
        fake_request()

def run_pyspy(pid: int, mode: str, out: Path, secs: int = 20) -> None:
    """mode='cpu' uses --idle off (CPU only); mode='wall' uses --idle on."""
    cmd = ["py-spy", "record",
           "-p", str(pid), "-d", str(secs),
           "-r", "99", "-o", str(out)]
    if mode == "wall":
        cmd.append("--idle")
    subprocess.run(cmd, check=True, capture_output=True)

def main() -> None:
    t = threading.Thread(target=worker, daemon=True)
    t.start()
    time.sleep(2)                          # let the worker reach steady state

    pid = os.getpid()
    cpu_svg = Path("/tmp/flame_cpu.svg")
    wall_svg = Path("/tmp/flame_wall.svg")
    run_pyspy(pid, "cpu", cpu_svg)
    run_pyspy(pid, "wall", wall_svg)
    print(f"cpu  flamegraph: {cpu_svg}  ({cpu_svg.stat().st_size} bytes)")
    print(f"wall flamegraph: {wall_svg}  ({wall_svg.stat().st_size} bytes)")

if __name__ == "__main__":
    main()

# Sample run on a 4-core c6i.xlarge in ap-south-1:
$ python3 wall_vs_cpu.py
cpu  flamegraph: /tmp/flame_cpu.svg  (24817 bytes)
wall flamegraph: /tmp/flame_wall.svg  (28104 bytes)

# Inspecting the two SVGs (or running py-spy with --format speedscope and grep):
# CPU flamegraph top frames:
#   84.1%  cpu_burn        (the tight loop)
#   12.6%  fake_request
#    2.1%  worker
#    1.2%  <other Python overhead>
#
# Wall flamegraph top frames:
#   85.4%  time.sleep      (the off-CPU wait the CPU profile cannot see)
#   12.7%  cpu_burn
#    1.2%  fake_request
#    0.7%  <other>
#
# The CPU profile says: "your hot path is cpu_burn".
# The wall profile says: "your hot path is time.sleep — the network wait".
# Same process, same time window, completely different diagnoses.

Walk-through. fake_request does 30 ms of compute followed by 170 ms of time.sleep — a 15/85 compute-vs-wait split, which is the realistic shape of a checkout-service request talking to a slow downstream. run_pyspy(..., mode='cpu') runs py-spy without --idle, so it samples only on-CPU threads; run_pyspy(..., mode='wall') runs the same profiler with --idle, which sets py-spy's wall-clock mode and samples every thread regardless of state. The flamegraph headlines are the diagnosis: cpu_burn dominates the CPU view because that is the only time the thread is on-CPU; time.sleep dominates the wall view because that is where the wall-clock time is spent. The numbers match the 15/85 split almost exactly. Run this calibration once on any service you own and the lesson sticks.

The same script with --idle becomes the every-day workhorse: keep wall-clock mode on for production continuous profiling, fall back to pure CPU mode when you specifically want to know "is this hot path CPU-expensive?". The Pyroscope and Datadog continuous profilers in 2026 default to wall-clock; py-spy still defaults to CPU-only, so the --idle flag is the single most important argument the tool takes.

The 15/85 split in the toy example is roughly the shape of a realistic PaisaBridge payments path: ~30 ms of compute (parse JSON, validate signature, build outbound call payload, decode response, build outbound to bank API) split across two synchronous downstream hops totalling ~170 ms of network wait. A naive flamegraph audit of this service that ran only py-spy record -p $pid (no --idle) would conclude that the service spends almost all of its time in cpu_burn-equivalent compute, which is technically true within the on-CPU view and dangerously misleading at the wall-clock level. Engineers reading such a flamegraph routinely propose optimisations — caching the JSON parse, inlining the signature check — that produce measurable but tiny gains because the actual latency budget is dominated by the downstream wait the CPU view never showed them. The first thing every team should do after standing up a CPU profiler is to run the wall version next to it; the gap between the two is the budget your CPU optimisations cannot touch.

Why CPU profiles still matter — the half they get right

The point of this chapter is not to abandon CPU profiling. It is to file CPU profiling correctly: as the right tool for one half of the problem.

The CPU half is where these chapters keep paying off:

Hot loops. A regex compiled per-request, a JSON serialiser hand-rolled in pure Python, an O(n²) similarity-score calculator on top of a 200k-row product catalogue — these are CPU-bound by construction, and a CPU flamegraph nails them in seconds. The BharatBazaar 2024 catalogue regression (continuous profiling chapter) was exactly this kind of bug — a re.compile that moved to per-request — and a diff CPU flamegraph caught it.
GC and runtime overhead. Allocator hot paths (PyObject_Malloc, mi_malloc), GC cycles (__pyx_pw_gc_collect, JVM G1ParScanThreadState), JIT compilation events — these run on-CPU and are visible to a CPU profiler. A wall-clock profiler will also see them, but a CPU profiler is sufficient.
SIMD and microarchitectural inefficiency. Branch mispredicts, cache misses, frontend stalls — invisible to a vanilla CPU profiler but accessible via PEBS/IBS hardware event sampling (/wiki/hardware-event-sampling-pebs-ibs) on top of the same on-CPU sampling foundation. This is the entire reason Part 1 covered IPC and pipelines: when the CPU view says "this function is hot", PEBS lets you ask "and why is it hot — is it cache-miss-bound, branch-predict-bound, or actually instruction-bound?".
CPU saturation incidents. When a service's CPU climbs from 40% to 95% and stays there, a CPU profile of the new regime versus the old regime is the textbook diagnostic. Wall profiles can also see it, but CPU profiles are cheaper and cleaner because the signal is already CPU-shaped.
Capacity planning and unit-cost analysis. "How many requests per second per core can this service serve?" is a question about CPU efficiency, not wall time. The CPU profile tells you which functions consume the per-request CPU budget; halving the time spent in json.loads doubles your per-core throughput regardless of what the downstream wait looks like. For unit-economics work — ₹ per million requests, CPU cost per UPI transaction at DigiPaisa scale — the CPU view is what feeds the spreadsheet.

The honest test for whether a CPU profile is the right tool is to ask: what fraction of my request's wall time is spent on-CPU? If the answer is above 50%, CPU profiling is your primary tool and wall profiling is supporting evidence. If the answer is below 30%, wall profiling is your primary tool and CPU profiling is supporting evidence. The PaisaBridge payments path runs around 35% on-CPU — wall-first. The ParakhTrade order-matching engine runs around 78% on-CPU — CPU-first. The SetuStream streaming router runs around 8% on-CPU — wall-first by a wide margin. Knowing this number for your own service before the next incident is what separates teams that close incidents in minutes from teams that close them in hours.

There is a temptation to optimise for the on-CPU fraction itself — to refactor a wall-bound service into a CPU-bound one by moving downstream calls into the same process, batching them, or replacing them with cached precomputation. Sometimes this is the right move: the BharatRail Tatkal queue's 2024 redesign moved seat-availability checks from a synchronous Postgres call into an in-process LRU populated by a background syncer, lifting the on-CPU fraction from 22% to 58% and dropping p99 from 1.8 s to 320 ms during the 10:00 IST burst. But the move is architectural, not observability-driven; you do it because the downstream is genuinely the bottleneck, not because you want a prettier CPU flamegraph. Treating the on-CPU fraction as a metric to optimise (rather than as a signal that tells you which profiler to read first) is a cargo-cult that ends with teams pretending their service is faster than it is.

The on-CPU fraction is a property of the workload, not the language. A Python service can be 78% on-CPU (ParakhTrade's matcher, doing real computation per order) and a Go service can be 8% on-CPU (SetuStream's router, mostly waiting on downstream catalogue calls). Measure your own number once. The right primary profiler follows. Illustrative — numbers are representative of public engineering blog disclosures, not a single audited dataset.

Why this table changes how you stand up monitoring: a wall-primary service gets the wrong on-call alerts when the only profile dashboard is CPU-flamegraphs. The team will repeatedly close "p99 elevated" tickets as "no smoking gun in the flamegraph" — because the smoking gun is in a flamegraph the team does not look at. Stand up the wall-clock dashboard first for any service whose on-CPU fraction is below 50%, and the same alerts close in minutes instead of being closed as inconclusive.

What this means for the rest of the curriculum

Part 5 has equipped you with five techniques: read flamegraphs, generate them with perf, compare them with diffs, sample hardware events with PEBS/IBS, and store them continuously. All five operate on the on-CPU half. The next four parts of the curriculum each take a deliberate run at the off-CPU half, from a different angle.

Before moving on, a sanity check on what Part 5 actually delivered. Reading a CPU flamegraph is no longer a mystery — you can find the fattest bar, follow the call chain into a hot function, and reason about why it is hot from frame-pointer-walked stacks. Generating one in production is no longer a mystery — perf record -F 99 -g for an ad-hoc capture, py-spy record --idle for the wall variant, the continuous profiling agent shipping pprof every 10 seconds. Comparing two of them is no longer a mystery — diff flamegraphs render the deploy-to-deploy regression in red and blue. Asking microarchitectural questions of one is no longer a mystery — PEBS/IBS overlays cache miss, branch mispredict, and frontend-bound classification onto the same stacks. The skills compose. The skill that does not yet exist in this curriculum is the one that turns "this thread is blocked" from a black box into a queryable event stream — that is Part 6's job.

Part 6 — eBPF. The kernel has perfect visibility into context switches, syscall enters and exits, scheduler events, network drops, and disk-queue depth. It just did not expose it cheaply until eBPF. An off-CPU flamegraph is, mechanically, an eBPF program that hooks sched_switch and records the stack at every block-out. Once you have eBPF, the whole "off-CPU half" becomes as observable as the CPU half — at less than 1% overhead. Part 6 makes the kernel into the second profiler.

Part 7 — latency and tail latency. A flamegraph is one signal; a histogram of per-request wall-clock latency is another. Both are needed. Part 7 covers HdrHistograms, p99/p99.9/p99.99 percentile ladders, coordinated omission (the reason naive histograms underestimate the tail), and the "tail at scale" argument from Dean and Barroso. The bridge from Part 5 is: a flamegraph tells you what costs time; a histogram tells you how the costs are distributed across requests. Most production incidents need both readings together.

Part 8 — queueing theory. When the off-CPU time is "waiting in a queue", queueing theory is the only discipline that gives you a closed-form prediction of when latency cliffs at ρ ≈ 0.85. Wall profiles show you that a thread is blocked; queueing theory tells you whether the block is fundamental (you are saturating a resource) or contingent (someone else's bug). The mental shift is from "trace where the time went" to "model where the time had to go".

Part 9 — parallel scaling. A flamegraph that says 35% of wall time is in pthread_cond_wait is one signal. A scaling curve that flattens at 12 cores instead of 32 is the same signal viewed through Amdahl's lens. Part 9 connects the wall-clock view to the architectural ceiling — the serial fraction that no flamegraph can show you directly but that explains why doubling cores rarely doubles throughput.

Part 13 — language runtime. GC pauses, JIT compilation, escape analysis, and inline caches are wall-clock events that often appear in CPU profiles only as their cleanup work. The full picture — "this 12 ms pause is a G1 mixed-collection that will repeat every 30 seconds" — needs runtime-specific tooling that builds on the wall-clock thinking from this chapter.

The arc through these four parts is unified by one mental model: a request's wall-clock budget is a sum across thread states, and every chapter from here forward is a different way of looking at the non-on-CPU states. CPU profiling — every Part-5 tool you just spent eight chapters learning — covers exactly one of those states, well. You needed to learn it well first because the techniques (sampling, flamegraphs, diffs, continuous collection) generalise to every other state once eBPF lets you instrument them. The shape of the tool stays the same; only the trigger changes.

A practical reading of this arc, for the engineer who wants to put the curriculum to work tomorrow: ship a wall-clock continuous profiler alongside the existing CPU one (one flag — --idle for py-spy, wall mode for async-profiler — and a second tag dimension on the ingestion side). Add runqlat from the bcc tools to the per-pod dashboard for scheduler-delay visibility. Wire up an HdrHistogram-backed latency dashboard at p50, p99, p99.9, p99.99 with coordinated-omission-aware tooling. None of this requires waiting for Parts 6–8 to land in your reading queue. The thinking from this chapter — the on-CPU view is half the picture — is enough to motivate the operational changes before the formal machinery arrives.

Edge cases the wall view itself misses

Wall-clock profiling is not a panacea. It has its own blind spots, and reading a wall flamegraph as if it told the whole truth is the same mistake one level up.

The first edge case is micro-blocks. A wall profiler running at 99 Hz samples once every 10 ms. A thread that spends 2 ms blocked on a fast lock contention 50 times per second is on-CPU 90% of the time and off-CPU 10% of the time, but each individual block is shorter than the inter-sample gap. The wall flamegraph captures the aggregate (10% of samples in pthread_mutex_lock) correctly but cannot tell you whether the block was 50× 2 ms or 1× 100 ms — and those two scenarios have different fixes (the first is contention; the second is a deadlock or a slow critical section). For sub-sample-period blocks, you need an event-driven tracer, not a sampler.

The second edge case is kernel-side waits that never schedule the thread out. A thread spinning briefly on an adaptive mutex (the kind used in modern jemalloc or in some Linux futex paths) stays on-CPU during the spin — sometimes for tens of microseconds — before yielding. From the wall profiler's perspective the thread is on-CPU; from the user's perspective the thread is making no progress. PEBS-based memory-stall sampling sees the stall; wall sampling sees a hot CPU function. This is one of the few places where hardware event sampling (/wiki/hardware-event-sampling-pebs-ibs) reads the room better than wall sampling.

The third edge case is time spent in interrupt context, softirqs, and the kernel's own work. A user-space wall profiler attached to a Python process sees only the Python thread's states. The 200 µs per packet that the kernel spent doing softirq RX processing on the same CPU, indirectly slowing your thread's compute — invisible. eBPF-based system-wide profilers see this; user-space wall profilers do not. For services where kernel time is meaningful (high-PPS network paths, heavy disk I/O), user-space wall profiling needs a system-wide sidecar to fill in the gap.

These caveats are not arguments against wall profiling. They are arguments for adding a third and fourth tool — eBPF in Part 6, hardware event sampling within Part 5 — once wall profiling is in place. The general lesson: every profiler has a state space it can see and one it cannot. The diagnostic skill is knowing which states each tool covers and switching tools when the question crosses a boundary.

Common confusions

"A wall-clock profile is just a CPU profile with more samples." No. A CPU profile fires only when the thread is on-CPU; a wall profile fires regardless of state. The samples land in entirely different stack frames — time.sleep, epoll_wait, pthread_cond_wait, recvmsg show up in the wall profile and are absent from the CPU profile. The flamegraphs are different shapes, not different magnifications.
"If my service is fast, I do not need wall profiles." A "fast" service usually has low p50 latency (CPU-dominated requests). The p99/p99.9 tail is almost always wait-dominated — slow Postgres call, slow downstream, lock contention, GC pause. Wall profiling is more important on a fast service than a slow one because the slow tail is exactly the part of the distribution you cannot debug without it.
"Off-CPU flamegraphs and wall-clock flamegraphs are the same thing." Closely related, not identical. Wall-clock profilers (py-spy --idle, async-profiler wall) sample every state at a fixed Hz. Off-CPU flamegraphs (/wiki/off-cpu-flamegraphs-the-other-half) are typically eBPF-driven, hook on sched_switch, and weight by time spent off-CPU rather than sample count. The off-CPU view is more accurate for very long blocks (you see the full duration, not just N samples) but more invasive to set up.
"I can derive the off-CPU time by subtracting CPU profile time from wall time." Mathematically yes, diagnostically useless. Knowing the magnitude of the off-CPU bucket without its breakdown is like knowing your AWS bill without the line items. The wall profile or the off-CPU flamegraph gives you the breakdown.
"top already shows wall vs CPU — I do not need a wall profiler." top shows process-level CPU%, not per-thread per-stack-frame attribution. Knowing a process is using 18% CPU does not tell you that the 82% non-CPU time is split 70% in epoll_wait on a Postgres socket and 12% in pthread_cond_wait on an internal mutex. The breakdown is what makes the diagnosis.
"Wall profiling makes CPU profiling redundant." Two costs argue against this. Wall profiles are noisier (sample count is the same; signal-to-noise for the on-CPU portion is lower). CPU profiles can be paired with PEBS/IBS for microarchitectural attribution; wall profiles cannot — you cannot ask "is this time.sleep cache-miss-bound". Run both. The disk and CPU cost of running both is small enough to be irrelevant for any service big enough to need profiling at all.

Going deeper

Why `perf record` has an off-CPU mode and almost nobody uses it

perf record -e sched:sched_switch --call-graph=fp records a stack at every context switch. Combined with perf report --children or a folded-stack post-process, this is a real off-CPU flamegraph — and it predates eBPF by years. The reason it is rarely used is that the volume of sched_switch events on a busy system is enormous: a 32-core box doing 100k syscalls/sec/core can switch a million times per second, blowing through perf's ring buffer before the user gets a chance to read it. eBPF-based off-CPU profilers solve this by aggregating in-kernel via BPF maps (count by stack-id, not record by event), which reduces the data volume by 100–1000×. The Brendan Gregg offcputime tool in bcc and bpftrace is the canonical example. Once eBPF is normal, perf-based off-CPU sampling is technically possible but operationally pointless.

A useful intermediate technique that bridges the two eras is perf record -F 99 --off-cpu (Linux 6.2+), which uses BPF under the hood to do the in-kernel aggregation while presenting the familiar perf user interface. For teams that have invested heavily in perf workflows but want the off-CPU view without rewriting their tooling around bpftrace, this is the lowest-friction adoption path. Linux 6.2 landed in early 2023 and is the default kernel on Ubuntu 24.04 and RHEL 9.4+, so most production fleets have it available without any additional installation. The output integrates with the same flamegraph generation pipeline (stackcollapse-perf.pl | flamegraph.pl), so existing dashboards keep working.

The async-profiler `wall` mode and why JVM teams adopted wall first

The JVM ecosystem solved the on-CPU vs wall problem earlier than Python or Go because async-profiler shipped a wall mode in 2018, when wall-clock sampling was still niche on Linux generally. The reason: JVM services were already deeply observability-tooled (JFR, JMX, gradle benchmarks, JMH), and the gap between "JVM CPU% is fine" and "p99 is bad" was visible to every JVM team running any non-trivial backend. async-profiler's -e wall runs the same AsyncGetCallTrace sampler but at a wall-clock frequency on every thread, producing a flamegraph that is directly comparable to its -e cpu output. Java engineers in 2026 default to running both. Python (py-spy --idle) and Go (Datadog's continuous wall profiler, Pyroscope's goroutine and block profile types) caught up later — they are now equivalent in capability, but the cultural muscle of "always run both" is still developing.

The Go ecosystem in particular took a slightly different route: rather than a single "wall" profile, Go's runtime exposes four separate pprof endpoints — CPU, goroutine, block, and mutex. The block and mutex profiles are time-weighted at the runtime layer (the Go scheduler knows exactly how long each goroutine slept), which means they are more accurate for blocked-time attribution than a sampling wall profiler ever can be. The trade-off is that the block profile only captures events the runtime has been told to instrument, controlled by runtime.SetBlockProfileRate. Most production Go services run with the rate at zero by default and turn it on only during incidents, which then defeats the "already had the profile when it happened" property that motivated continuous profiling in the first place. The Pyroscope-Go integration in 2024 fixed this by setting a low non-zero rate (sample 1 in 10,000 blocks) continuously, which approximates a true continuous wall profile at negligible overhead.

Reading wall flamegraphs without falling into the wait trap

A subtle pitfall in wall profiles is that idle threads dominate the picture. A gunicorn worker pool with 32 workers serving 8 concurrent requests has 24 workers sitting in epoll_wait doing nothing. A naive wall flamegraph looks like the entire service is in epoll_wait — because, on a wall-clock basis, it is. The fix is to filter the wall profile to threads that are active in the request lifecycle: tag threads via prctl(PR_SET_NAME) or use the language runtime's request-trace correlation (Datadog's APM, Pyroscope's tag-correlation), and then only render samples whose thread tag matches "active request". The PaisaBridge payments team's runbook for wall flamegraphs starts with: "filter to thread_name == request_handler before reading anything else". A few minutes of dashboard hygiene up-front saves hours of misdiagnosis later.

A complementary technique, used by the ParakhTrade matcher team, is to scope the profile to a trace span rather than a thread. When a request traverses three threads — accept loop on thread A, parser on thread B, writer on thread C — filtering by thread name shows only one third of the wall time. Using the OpenTelemetry trace ID propagated via thread-local storage as the filter dimension shows the request's full wall-time budget across all three threads, with the per-thread breakdown still visible. This is the mode Pyroscope's "span profiles" feature targets, and it is the future of wall profiling for distributed-tracing-shaped services. For now, thread-name filtering is the universally supported approximation; trace-ID filtering is the upgrade path.

Reproduce this on your laptop

# Reproduce the CPU-vs-wall calibration on a local Python service
sudo apt install build-essential
python3 -m venv .venv && source .venv/bin/activate
pip install py-spy
python3 wall_vs_cpu.py
# Compare /tmp/flame_cpu.svg (cpu_burn dominates) with
# /tmp/flame_wall.svg (time.sleep dominates) — same process, same window.

Where the on-CPU fraction comes from architecturally

The on-CPU fraction of a service is not arbitrary. Three architectural decisions dominate it. First, the synchronous-vs-asynchronous I/O choice: a service that calls Postgres synchronously per request waits in recvmsg; one that uses an async pool can pipeline waits but still spends wall time waiting. Second, the downstream count: a service with N synchronous downstream calls per request has roughly N× the wait budget of one with a single call, regardless of how fast each individual call is. Third, the cache hit rate: a hot Redis cache that serves 95% of reads in < 1 ms moves a service's on-CPU fraction much higher because the wait portion shrinks. The SetuStream router's 8% on-CPU is explained by all three: synchronous catalogue calls, two downstream hops per request, and a cache hit rate that is high in steady state but drops during traffic spikes — exactly when p99 cliffs.

A useful corollary is that the on-CPU fraction moves with load. At 10% offered load, a PaisaBridge payments-API pod might be 50% on-CPU because most downstream calls hit warm caches. At 70% offered load, the same pod drops to 25% on-CPU as cache pressure climbs and downstream hops slow. At 95% offered load, the pod can sit at 12% on-CPU, with the rest of wall time stuck in epoll_wait for downstream Postgres pools that are themselves saturated. The implication: the right primary profiler can change between a quiet Tuesday morning and a Mega Bargain Days Friday at 16:00. Continuous profilers that store both views side-by-side (/wiki/continuous-profiling-in-production) let you see the shift happen as load climbs, which is itself a leading indicator of an upcoming saturation incident.

Between the CPU profiler's "on-CPU" and the wall profiler's "blocked-on-syscall" lies a quieter cost: scheduler delay. A thread is TASK_RUNNING, the kernel knows it wants to run, but no CPU is available to run it on. On a saturated host, on a Kubernetes pod hitting its CFS quota ceiling, on a NUMA node where the scheduler is rebalancing, this can add tens or hundreds of milliseconds per request to wall time — none of which appears in either flamegraph because the thread is technically neither running nor sleeping. The kernel exposes the cost in /proc/<pid>/schedstat (the second field, run-delay in ns) and via eBPF runqlat from the bcc tools collection. The number is usually under a millisecond on a healthy host; when it climbs into the 50–500 ms range you are seeing CPU throttling or runqueue saturation, and no amount of CPU or wall profiling will diagnose it without this third measurement. The BharatBazaar catalogue team's 2025 internal SRE handbook makes runqlat a default panel on every pod-level dashboard — alongside CPU% and wall flamegraph — for exactly this reason.

A particularly nasty variant of scheduler delay on Kubernetes is the CFS bandwidth bug that periodically resurfaces in different kernel versions: when a pod's CPU limit is set to a fraction (say 1.5 cores), the kernel allocates a 100 ms quota refilled every 100 ms, and a multi-threaded workload that briefly bursts above the limit can have all its threads parked until the next 100 ms slice — even on hosts with idle CPUs available. The wall profile shows threads in TASK_RUNNING, no syscall hot frame, no obvious culprit. The diagnosis requires reading /sys/fs/cgroup/cpu.stat for nr_throttled and throttled_time, which is the only place the cost surfaces. SetuStream, PaisaBridge, and ParakhTrade all run dedicated Grafana panels for nr_throttled / nr_periods on every pod, precisely because the alternative — debugging from flamegraphs alone — does not work for this class of incident.

Where this leads next

The single sentence to take away from Part 5: a CPU flamegraph is a precise answer to "where is compute being spent" and a misleading answer to "where is wall time being spent". Both questions matter. They are not the same question.

This is the closing chapter of Part 5. Every CPU-profiling tool — perf, py-spy, async-profiler, flamegraphs, differential flamegraphs, hardware event sampling, continuous profiling — now lives in your hands as a means to an end, not the end itself. The end is closing latency incidents in minutes. The CPU half is solved. The next four parts close the wall-time half.

Part 6 (eBPF) makes the kernel observable, which finally puts an off-CPU flamegraph on the same operational footing as a CPU one. Part 7 (tail latency) replaces the implicit "the mean is fine" assumption with HdrHistograms, p99.9, and the coordinated-omission correction. Part 8 (queueing theory) gives you closed-form predictions for the latency cliff at ρ ≈ 0.85 — the answer to "why does adding 10% more load melt p99". Each builds on the wall-clock thinking this chapter forces you to adopt.

The single most useful thing you can do tomorrow morning, before reading any further, is to run wall_vs_cpu.py (or its production equivalent) against the most important service you own, write down the on-CPU fraction, and pin it to your team's wiki. The number changes how you read every flamegraph for the rest of your career. A team that knows its on-CPU fraction debugs incidents in minutes. A team that does not, debugs them in hours.

References

Brendan Gregg, "Off-CPU Analysis" — the canonical write-up of the off-CPU half, with the original offcputime eBPF tool and the wall-flamegraph methodology.
Brendan Gregg, Systems Performance (2nd ed., 2020), Chapter 6 — CPUs, §6.6 Profiling — the textbook treatment of CPU vs wall sampling with the historical context.
async-profiler wall mode documentation — the JVM-side reference implementation that normalised wall-clock profiling.
py-spy --idle flag and wall-clock semantics — the Python equivalent, with a clear explanation of how it samples blocked threads.
Jeff Dean & Luiz Barroso, "The Tail at Scale" (CACM 2013) — the foundational argument that tail latency in distributed services is a wall-time, not a CPU-time, problem.
Gil Tene, "How NOT to Measure Latency" (Strange Loop 2015) — the talk that crystallised coordinated omission and why naive latency histograms underreport the tail.
/wiki/off-cpu-flamegraphs-the-other-half — the chapter that builds the off-CPU flamegraph machinery this chapter motivates.
/wiki/continuous-profiling-in-production — the previous chapter, where the wall-clock vs CPU-time decision lives in the agent's --idle flag.