Off-CPU flame graphs (the other half)

Karan is on call for Swiggy's order-placement service at 13:42 IST — peak lunch hour, 2.4M orders in flight across the country. The dashboard shows p99 on POST /order/create at 1.2 seconds, well past the 400 ms SLO. He logs into a pod, the CPU on every replica reads 6%, the runqueue is empty, and top shows the Python workers in the S (interruptible sleep) state most of the time. He runs py-spy record -o /tmp/flame.svg --pid 12 --duration 30 out of habit. The flame graph that comes back is dominated by _PyEval_EvalFrameDefault and epoll_wait — the threads that did run during those 30 seconds, looking healthy. The actual problem is invisible: the threads handling the slow 1% of requests are not on the CPU at all. They are blocked, somewhere, waiting. Karan needs a flame graph of where threads spend their time off the CPU. That is a different graph, built from a different signal, and this chapter is how you get it.

A normal flame graph samples the CPU, so it only shows threads that are running. When a service is slow because of locks, syscalls, I/O, or upstream dependencies, the bottleneck is the time threads spend off the CPU — and the on-CPU flame graph is empty exactly where you most need data. An off-CPU flame graph samples the scheduler (the sched_switch tracepoint), records each thread's stack at the moment it goes to sleep, and weights by the wall-clock duration of the sleep. Together, the two graphs account for every second of every thread's life.

The half of the picture you have been missing

Every thread in a modern OS is in exactly one of two regimes at any instant: it is on-CPU (the scheduler has picked it; it is executing instructions), or it is off-CPU (it is sleeping, blocked, waiting). The CPU sampler — perf record, py-spy record, async-profiler — only sees the first regime. Every sample it captures, by construction, is from a thread that was on the CPU at the sample instant. A thread that was blocked for the entire 30-second capture contributes zero samples and is invisible.

For a CPU-bound service this is fine — the on-CPU graph covers ~95% of the relevant time. For an I/O-bound service it is catastrophic. Consider Karan's Swiggy worker: it spends 950 ms of every 1000 ms request blocked on a Redis call, a Postgres query, or a downstream restaurants-service HTTP call. The CPU samples capture the 50 ms of actual computation and tell Karan that JSON serialisation is hot. The 950 ms of blocking — the actual bottleneck — is silently dropped on the floor.

Why the on-CPU sampler structurally cannot see the blocked time: the kernel's perf infrastructure fires a signal on a hardware performance counter overflow (typically cycles or task-clock). Hardware counters do not increment while a thread is off the runqueue — the CPU is running someone else's instructions, accounted to a different PID. So the moment a thread blocks, it stops generating sample events. The structural blindness is not a bug; it is what "sampling the CPU" means.

The signal you actually want lives elsewhere. The Linux kernel exposes scheduler events as tracepoints — sched:sched_switch fires every time the scheduler picks a different task, sched:sched_wakeup fires every time a sleeping task is made runnable. These tracepoints carry the PID of the outgoing and incoming tasks and a timestamp. If you record the outgoing task's stack at every sched_switch, then later observe when that PID is next picked by sched_switch again (with the wake-up implicitly captured by the time delta), you have a perfectly resolved off-CPU interval: a stack, a duration, and the PID/TGID that owns it. Aggregate millions of these intervals by stack, weight each entry by its duration, and you have the input for an off-CPU flame graph.
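The pairing logic, stripped of eBPF, is small enough to sketch offline. This is an illustrative Python stand-in (the event tuples are hypothetical simplifications of the tracepoint payload; in the real tool this aggregation happens inside the kernel):

```python
from collections import defaultdict

def pair_offcpu(events):
    """Pair sched_switch events into per-stack off-CPU totals.

    `events` is a time-ordered list of (timestamp_ns, prev_pid, next_pid,
    prev_stack) tuples, one per sched_switch -- a simplified stand-in
    for the tracepoint's payload.
    """
    sleep_start = {}           # pid -> (t_off, stack at switch-out)
    totals = defaultdict(int)  # stack -> cumulative off-CPU ns
    for t, prev_pid, next_pid, prev_stack in events:
        sleep_start[prev_pid] = (t, prev_stack)   # prev goes off-CPU here
        if next_pid in sleep_start:               # next's sleep just ended
            t_off, stack = sleep_start.pop(next_pid)
            totals[stack] += t - t_off            # weight by sleep duration
    return dict(totals)

# Toy trace: pid 7 blocks twice; pid 9 is the idle task in between.
events = [
    (100, 7, 9, "handler;psycopg2;recv"),
    (350, 9, 7, "idle"),
    (400, 7, 9, "handler;redis;recv"),
    (520, 9, 7, "idle"),
]
totals = pair_offcpu(events)
```

The `handler;psycopg2;recv` stack accumulates 250 ns (the interval from 100 to 350); aggregated over millions of real intervals, these totals are exactly the folded input a flame graph needs.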

This recording-and-pairing is exactly what eBPF was built for. Before eBPF, capturing every sched_switch event meant streaming a multi-million-events-per-second firehose through perf record, paying the perf-buffer copy cost on every switch, and post-processing offline. eBPF moved the aggregation into the kernel: the bpftrace program below holds the per-PID state map and the per-stack histogram in BPF maps, never copies events to userspace, and presents userspace with a final aggregate. The overhead drops from "unusable in production" to "1–2% on a 16-core machine" — the same transformation eBPF brought to networking, security, and tracing more generally.

[Figure: A thread's life — on-CPU and off-CPU intervals over one second (illustrative). A horizontal timeline of one thread: three small on-CPU compute slices (140 ms, 14% total) interleaved with long off-CPU intervals — a 280 ms Postgres query, a 180 ms Redis HGETALL — totalling 860 ms (86%). The on-CPU flame graph samples only the 140 ms of compute and looks healthy; the off-CPU flame graph samples the scheduler, weights by sleep time, and shows the Postgres and Redis plateaus that are the actual bottleneck.]
One Swiggy worker thread, one second. The CPU sampler captures the three small on-CPU slices and reports a healthy-looking flame graph. The off-CPU sampler captures the two long blocked intervals and reports `psycopg2.execute → libpq → recv` and `redis.Redis.hgetall → recv` as the dominant plateaus. Both views are needed; either alone lies.

The fix is structural. To see the off-CPU time, you cannot sample the CPU — there is nothing to sample. You sample the scheduler instead. Every time the kernel takes a thread off the CPU it fires the sched_switch tracepoint; when the thread is eventually scheduled back on, another sched_switch fires with that thread as the incoming task (sched_wakeup, in between, only marks the thread runnable — the gap between wakeup and switch-in is run-queue wait). The pair of sched_switch events brackets the off-CPU interval: subtract the timestamps, you get the duration the thread slept; capture the sleeping thread's stack trace, you get where in the code it went to sleep. Aggregate across thousands of these intervals, fold by stack, render as a flame graph — and you have the off-CPU graph. The X-axis is no longer "fraction of samples", it is "fraction of total off-CPU time"; the unit changed from samples-counted to nanoseconds-slept, but everything else about reading the graph is identical.

Sampling the scheduler — a page of bpftrace

The on-CPU flame graph requires a hardware counter and perf record. The off-CPU flame graph requires a kernel tracepoint and a tiny eBPF program that subtracts timestamps. The eBPF program is what makes off-CPU profiling cheap enough to run in production: instead of streaming every sched_switch event to userspace (millions per second on a busy box), the eBPF program aggregates inside the kernel into a hash map indexed by stack ID, with the value being the cumulative off-CPU nanoseconds. Userspace reads the map once at the end, dumps the per-stack totals, and that's the input to flamegraph.pl. The overhead is typically under 2% on a 16-core machine — measurable but not prohibitive.

Brendan Gregg's offcputime-bpfcc (part of the BCC tools suite) and the bpftrace program below both implement exactly this. The Python driver wraps bpftrace, captures its output, folds it, renders the SVG, and prints the top blocking stacks for an alert payload. Realistic and runnable on any Linux 4.9+ kernel:

# offcpu_flame.py — capture an off-CPU flame graph for a Swiggy worker.
# Drives bpftrace; folds; renders with flamegraph.pl; prints top blockers.
#
# Why bpftrace, not py-spy --idle: py-spy --idle samples GIL-blocked
# Python threads but cannot see kernel-level blocking (futex on a C
# library mutex, recv on a socket, page-fault wait). bpftrace samples
# the kernel scheduler tracepoint and sees every off-CPU event regardless
# of whether the blocked frame is Python, C extension, or kernel.

import re
import subprocess
import sys
import time
from pathlib import Path

# The bpftrace program (the canonical offcputime pattern): on every
# context switch, record a timestamp for the outgoing thread.
# finish_task_switch runs in the context of the *incoming* thread, so
# when a thread with a recorded timestamp is switched back in, its sleep
# just ended: accumulate the delta into a per-stack histogram. Filter to
# off-CPU intervals between 1 ms and 60 s (sub-1ms is noise; >60s is a
# stuck thread, separate problem).
BT_PROG = r"""
#include <linux/sched.h>

kprobe:finish_task_switch
{
    // Outgoing thread: save the switch-out timestamp, keyed by kernel
    // thread id (task_struct->pid matches bpftrace's `tid` builtin).
    $prev = (struct task_struct *)arg0;
    @start[$prev->pid] = nsecs;

    // Incoming thread: if it has a saved timestamp, its off-CPU interval
    // just ended. A blocked thread's stack is frozen while it sleeps, so
    // the stacks captured here are the ones it slept with.
    $last = @start[tid];
    if ($last != 0) {
        delete(@start[tid]);
        if (cgroup == cgroupid("/sys/fs/cgroup/swiggy.slice")) {
            $delta = nsecs - $last;
            if ($delta > 1000000 && $delta < 60000000000) {
                @offcpu[kstack(perf), ustack(perf)] = sum($delta);
            }
        }
    }
}

interval:s:30 { exit(); }
"""

def run_bpftrace(target_pid: int, seconds: int = 30) -> str:
    """Run bpftrace for `seconds`, return its stdout."""
    prog = BT_PROG.replace("swiggy.slice", f"system.slice/pod-{target_pid}.scope")
    prog = prog.replace("interval:s:30", f"interval:s:{seconds}")  # honour `seconds`
    t0 = time.perf_counter()
    res = subprocess.run(
        ["sudo", "bpftrace", "-e", prog],
        capture_output=True, text=True, timeout=seconds + 10)
    print(f"[bpftrace] wall={time.perf_counter()-t0:.1f}s "
          f"events={res.stdout.count(chr(10))} rc={res.returncode}")
    if res.returncode != 0:
        sys.stderr.write(res.stderr)
        sys.exit(1)
    return res.stdout

# bpftrace prints @offcpu[stack...]: <ns>. Reformat to flamegraph.pl folded:
#   frame_root;frame_a;frame_b <count>
STACK_RE = re.compile(r"@offcpu\[\s*([\s\S]*?)\s*\]:\s*(\d+)", re.MULTILINE)

def fold(bt_output: str) -> str:
    """Convert bpftrace's @offcpu[stack]: ns map to folded flamegraph format."""
    folded = []
    for stack_text, ns_str in STACK_RE.findall(bt_output):
        # Each frame is on its own line in bpftrace's output; a lone ","
        # separates multiple stack keys. Drop unresolved hex frames, then
        # reverse so the root frame is first (flamegraph.pl convention).
        frames = [f.strip() for f in stack_text.splitlines() if f.strip()]
        frames = [f for f in frames if f != "," and not f.startswith("0x")]
        if not frames:
            continue
        ns = int(ns_str)
        if ns < 1_000_000:    # < 1ms — drop
            continue
        folded.append(";".join(reversed(frames)) + f" {ns // 1000}")
    return "\n".join(folded)

def render(folded: str, out_svg: Path) -> None:
    """Pipe the folded stacks through flamegraph.pl, save SVG."""
    p = subprocess.run(
        ["flamegraph.pl", "--bgcolors=blue", "--title=Off-CPU time (microseconds)",
         "--countname=us", "--width=1200"],
        input=folded, capture_output=True, text=True, check=True)
    out_svg.write_text(p.stdout)

def top_blockers(folded: str, k: int = 6) -> list[tuple[str, int]]:
    """Aggregate by leaf frame (the call that put the thread to sleep)."""
    leaves: dict[str, int] = {}
    for line in folded.splitlines():
        stack, count = line.rsplit(" ", 1)
        leaf = stack.split(";")[-1]
        leaves[leaf] = leaves.get(leaf, 0) + int(count)
    return sorted(leaves.items(), key=lambda kv: -kv[1])[:k]

if __name__ == "__main__":
    target = int(sys.argv[1]) if len(sys.argv) > 1 else 12345
    raw = run_bpftrace(target, seconds=30)
    folded = fold(raw)
    out = Path("/tmp/swiggy_offcpu.svg")
    render(folded, out)
    total_us = sum(int(line.rsplit(" ", 1)[1]) for line in folded.splitlines())
    print(f"\n[off-CPU total]  {total_us/1_000_000:.2f}s across all sampled threads")
    print("[top blocking leaves]   us           %     frame")
    for leaf, us in top_blockers(folded):
        print(f"  {us:>12,d}   {100*us/total_us:>5.1f}%   {leaf}")
# Sample run on a c6i.4xlarge (16 vCPU, kernel 6.6, bpftrace 0.20),
# target = a Python worker handling 1500 req/s of POST /order/create:

[bpftrace] wall=30.4s events=8412 rc=0

[off-CPU total]  47.31s across all sampled threads
[top blocking leaves]   us           %     frame
    19,840,210   42.0%   recv (psycopg2 → libpq → kernel)
     8,902,144   18.8%   recv (redis-py → hiredis → kernel)
     6,205,876   13.1%   futex_wait (gil_acquire)
     4,710,902    9.9%   epoll_wait (asyncio event loop idle)
     3,201,448    6.8%   recv (httpx → ssl_read → kernel)
     1,847,200    3.9%   page_fault_kernel (anon mmap, first-touch)

Walk-through. @start[$prev->pid] = nsecs: when a thread leaves the CPU, save the timestamp keyed by its thread id. The same probe runs in the context of the thread being switched in, so when a previously sleeping thread resumes, the probe finds its saved timestamp, computes the delta, and accumulates it into @offcpu keyed by the thread's kernel and user stacks — safe to capture at resume time, because a blocked thread's stack does not change while it sleeps. interval:s:30 { exit(); } terminates the program after 30 seconds; bpftrace prints the maps on exit. The ns < 1_000_000 check in the Python folder re-applies the 1 ms floor, dropping scheduler-noise context switches (rebalancing, IRQs, cpu_idle blips) that are uninteresting. flamegraph.pl --bgcolors=blue: convention is on-CPU graphs are warm/orange, off-CPU graphs are cool/blue, so a glance at the colour tells you which view you're reading. The leaf-frame aggregation at the bottom is what produces the alert payload: recv from psycopg2 is 42% of all off-CPU time across the worker fleet, so the diagnosis is "Postgres queries are the long pole" — the optimisation is connection pool sizing, slow-query analysis, or read-replica routing, not anything in the Python code.

Why the leaf-frame aggregation works for off-CPU but is dangerous for on-CPU: in an on-CPU graph, the leaf is whatever instruction the CPU was executing at sample time, which can be any inner loop deep in a library; aggregating by leaf without context loses the call chain that made the leaf hot. In an off-CPU graph, the leaf is always the syscall or futex call that put the thread to sleep — it is a small closed set (recv, read, write, epoll_wait, futex_wait, poll, select, nanosleep, accept, page_fault_kernel). Aggregating by these gives a meaningful taxonomy of why threads were blocked: network I/O, lock contention, scheduler idle, page-fault wait. The full flame graph then tells you which code path led to each kind of block. The leaf summary is the executive headline; the flame graph is the appendix.

A practical caution: bpftrace maps default to 4096 entries, and a busy multi-tenant box can blow past that — entries get dropped, samples get lost, the histogram is biased. Raise the limit via the environment variable for any production capture (sudo BPFTRACE_MAP_KEYS_MAX=65536 bpftrace -e '...'); newer bpftrace releases also accept an in-program config block. The other capture-stage failure mode is missing user-space symbols: bpftrace's ustack(perf) needs the target's symbol table and, for reliable unwinding, frame pointers. Strip the binary or build without frame pointers and the user-space frames render as [unknown] or hex addresses; the kernel frames stay legible because the kernel's symbol table is in /proc/kallsyms. For Python, py-spy is sometimes the simpler tool here: its --idle mode samples blocked Python threads and resolves Python frames natively, at the cost of not seeing the kernel-level blocking detail.

One more capture-stage subtlety: the finish_task_switch kprobe runs in the context of the task being switched in — the one whose sleep just ended — which is why the stack is captured at resume time: a blocked thread's stack is frozen for the duration of its sleep, so the stack observed at wake-up is exactly the stack it went to sleep with. An older recipe used the sched:sched_switch tracepoint instead, which fires in the outgoing task's context during the switch and captures a slightly different stack (the scheduler's own frames are visible). Both work; the kprobe form is preferred because it gives cleaner application-level stacks. If bpftrace cannot attach the kprobe (on some kernels the symbol is compiler-mangled, e.g. finish_task_switch.isra.0), fall back to recording the switch-out half with the tracepoint: tracepoint:sched:sched_switch / args->prev_pid != 0 / { @start[args->prev_pid] = nsecs; } — same effect, slightly noisier output.

What an off-CPU graph looks like and how to read it

The reading discipline is the same as for on-CPU graphs (read the previous chapter on plateaus), but the interpretation is different. Three rules of thumb let you go from "blue rectangles" to "this is the bottleneck" in under two minutes.

Rule 1: every column ends in a kernel-mode frame. On an on-CPU graph the leaf is whatever was executing — a numpy BLAS call, a JIT-compiled Java method, a tight Rust loop. On an off-CPU graph the leaf is always a syscall or kernel sleep primitive: __schedule, do_nanosleep, futex_wait_queue_me, tcp_recvmsg, xfs_file_read_iter. If a column does not end in kernel space, something is wrong with the unwinder — the kernel stack got truncated or the symbol table is missing. The expected shape of the top edge is the closed set: ~80% of all blocked time across most services lands on tcp_recvmsg (network I/O), futex_wait_queue_me (lock contention), or __schedule (idle / sleeping).
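Rule 1 is mechanically checkable before you even open the SVG. A small sketch of such a check against the folded stacks, with a hypothetical (deliberately non-exhaustive) allow-list of kernel sleep leaves:

```python
# Hypothetical allow-list of kernel sleep primitives; extend per workload.
KERNEL_SLEEP_LEAVES = (
    "__schedule", "do_nanosleep", "futex_wait",
    "tcp_recvmsg", "ep_poll", "xfs_file_read_iter",
)

def check_columns(folded: str) -> list[str]:
    """Return stacks whose leaf is NOT a known kernel sleep primitive.

    Input is flamegraph.pl folded format: "root;...;leaf <weight>".
    A non-empty result usually means truncated kernel stacks or missing
    symbols, not a genuine application-level leaf.
    """
    suspect = []
    for line in folded.splitlines():
        stack, _, _weight = line.rpartition(" ")
        leaf = stack.split(";")[-1]
        if not any(leaf.startswith(k) for k in KERNEL_SLEEP_LEAVES):
            suspect.append(stack)
    return suspect

folded = "app;libpq;tcp_recvmsg 100\napp;[unknown] 40"
bad = check_columns(folded)   # flags the column with the broken unwind
```

Running this on a capture before reading it turns "something is wrong with the unwinder" from a surprise into a one-line pre-flight report.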

Rule 2: the user-space parent of the kernel leaf is the diagnosis. tcp_recvmsg alone tells you "the thread was waiting on the network", which you already knew. The frame below tcp_recvmsg is the library that called recv: libpq (Postgres), hiredis (Redis), libssl (TLS), libcurl (HTTP). The frame below that is your application code: psycopg2.execute, redis.Redis.hgetall, httpx.AsyncClient.get. So the diagnosis ladder is: top frame = "I/O wait", second frame = "Postgres", third frame = "the specific call site". The flame graph compresses all three into one column; reading the column top-to-bottom gives you the kind of block, the protocol, and the call site in three glances.

A small taxonomy worth memorising. futex_wait_queue_me under pthread_mutex_lock is application lock contention; under take_gil or PyThread_acquire_lock it is GIL contention; under runtime.gopark it is Go channel/mutex blocking. __schedule under do_nanosleep or hrtimer_nanosleep is an explicit sleep — usually a backoff in a retry loop, occasionally a time.sleep someone forgot in production. __schedule under do_swap_page is a page-fault wait while the kernel reads from disk — symptom of memory pressure or a mmap'd file not yet warm in page cache. inet_csk_accept → inet_csk_wait_for_connect is a thread parked waiting for new connections — benign on an idle listener, suspicious if clients are simultaneously timing out (then the connections are being dropped before they reach the accept queue; check the SYN backlog and net.core.somaxconn). Each maps to a different fix: lock sharding, GIL-free C extensions, removing the sleep, more RAM or mlock, backlog tuning. A few hours spent learning the kernel-leaf ↔ fix mapping pays back many times in production.
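The taxonomy is regular enough to encode. A sketch of a classifier over root-first stacks, using a hypothetical pattern table that mirrors the mapping above (the patterns are illustrative, not exhaustive):

```python
# Hypothetical (leaf patterns, parent patterns, diagnosis) table.
# Order matters: more specific parent patterns come first.
TAXONOMY = [
    (("futex_wait",), ("pthread_mutex_lock",), "application lock contention"),
    (("futex_wait",), ("take_gil", "PyThread_acquire_lock"), "GIL contention"),
    (("futex_wait",), ("runtime.gopark",), "Go channel/mutex blocking"),
    (("__schedule",), ("do_nanosleep", "hrtimer_nanosleep"), "explicit sleep"),
    (("__schedule",), ("do_swap_page",), "page-fault wait (memory pressure)"),
    (("tcp_recvmsg",), (), "network read wait"),
]

def classify(stack: list[str]) -> str:
    """Classify a root-first stack by its kernel leaf and nearby parents."""
    leaf = stack[-1]
    for leaf_pats, parent_pats, diagnosis in TAXONOMY:
        if any(leaf.startswith(p) for p in leaf_pats):
            # Empty parent list means the leaf alone is diagnostic.
            if not parent_pats or any(
                    f.startswith(p) for f in stack for p in parent_pats):
                return diagnosis
    return "unclassified"
```

Feeding every folded stack through classify and summing weights per diagnosis gives the executive-headline table (network vs lock vs sleep vs paging) without opening the SVG at all.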

[Figure: Off-CPU flame graph for the Swiggy order worker (illustrative — not measured data). A blue-tinted flame graph: gunicorn_worker_loop spans 100% at the base, request_handler above it, then plateaus for psycopg2.execute (42%), redis.hgetall (19%), gil_acquire (13%), asyncio idle (10%), and httpx (7%), each rising through library frames (libpq, hiredis, libssl) to a kernel leaf (tcp_recvmsg, futex_wait_queue_me, sys_epoll_wait). An annotation marks the user-parent-of-kernel-leaf rule: libpq → "Postgres is the long pole".]
Every column ends in kernel space (dashed). The user-parent of the kernel leaf is the actionable diagnosis: `libpq` → Postgres, `hiredis` → Redis, `futex_wait` → GIL contention, `epoll_wait` → idle event loop. Reading top-down, three frames per column tells you the kind of block, the protocol, and the call site. The 13% on `gil_acquire/futex_wait` is what would otherwise look invisible: GIL contention manifests as off-CPU time inside futex, not as on-CPU time anywhere.

Rule 3: idle time is a feature, not a bug — but only up to a point. A worker that handles 1000 req/s with each request spending 200 ms of wall time will show 200 thread-seconds of off-CPU time per wall second of capture. If the off-CPU graph also shows a fat epoll_wait → ep_poll → __schedule plateau, that is the event loop correctly sleeping when there is nothing to do — it is not a bottleneck. The discriminator is whether the idle plateau is blocking forward progress: if request latency is high and the loop is idle at the same time, the idle is a problem (work is queued but not picked up); if request latency is fine and the loop is idle, the idle is the right behaviour. Reading off-CPU graphs without knowing the request-rate / latency context produces false alarms; always pair the graph with the dashboard view of the same window.

There is a useful sanity-check formula. For a service handling R requests per second on T worker threads, with per-request on-CPU time C seconds, the expected off-CPU time per second of wall-clock is roughly T − R × C: each of the T threads contributes a full thread-second, R × C of the total is spent computing, and everything else must be off-CPU (idle plus blocked). The median per-request wall time W does not enter the budget directly, but R × W tells you how many requests are in flight and therefore how many threads the service needs.

If the measured off-CPU total is much larger than the expected number, threads are blocking on something other than the steady-state I/O budget — typically a lock or a slow downstream. If it is much smaller, threads are starving for work, possibly because the request queue has gone empty (under-provisioned upstream, dropped traffic).

For Karan's Swiggy capture: 16 threads × 30 seconds = 480 thread-seconds total; measured off-CPU was 47.3 thread-seconds, and there were ~45,000 requests handled. With per-request on-CPU around 1 ms, the on-CPU budget is 45 thread-seconds; the remaining ~388 thread-seconds were normal epoll_wait idle, most of which never reached the histogram — an interval still open when the capture ends is never paired, so long idle sleeps are simply dropped, and the completed ones show up as the 9.9% epoll_wait plateau. The 47 thread-seconds the graph did attribute were the actionable blocked time. Always do this back-of-envelope before reading; it tells you whether the graph is showing the bottleneck or the noise floor.

Why the formula matters: an off-CPU graph reports the absolute time spent blocked, but "absolute time" is meaningless without the budget context. 47 seconds of tcp_recvmsg sounds catastrophic until you realise it is across 16 threads over 30 seconds, where the no-bug baseline would be roughly the same — at which point only the distribution shape (which plateau dominates) matters, not the absolute total. Engineers who skip the budget check chase phantom regressions; engineers who do it learn to focus on plateau ranking, not raw seconds.
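The budget check is three multiplications; wrapping it in a helper makes it a habit. A minimal sketch (the function name and numbers are from the worked Swiggy example above, not from any tool):

```python
def offcpu_budget(threads: int, capture_s: float, req_per_s: float,
                  oncpu_per_req_s: float) -> dict[str, float]:
    """Back-of-envelope off-CPU budget for a capture window.

    Every thread contributes capture_s seconds of wall time; the on-CPU
    share is req_per_s * oncpu_per_req_s per wall second. Whatever is
    left is the no-bug baseline of off-CPU (idle plus blocked) time.
    """
    total = threads * capture_s
    oncpu = req_per_s * oncpu_per_req_s * capture_s
    return {"total_thread_s": total, "oncpu_s": oncpu,
            "baseline_offcpu_s": total - oncpu}

# Karan's capture: 16 threads, 30 s, 1500 req/s, ~1 ms on-CPU per request.
budget = offcpu_budget(16, 30, 1500, 0.001)
```

With these inputs the budget is 480 total thread-seconds, 45 on-CPU, 435 baseline off-CPU — so a measured 47.3 s of attributed blocked time is the interesting slice, not a catastrophe, and the plateau ranking is what matters.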

Wakeup flame graphs — finding who unblocks whom

Off-CPU graphs tell you "thread X spent 280 ms blocked on a Postgres recv". They do not tell you "Postgres took 280 ms because the connection-pool semaphore was held by thread Y, who was blocked on a slow query". For that you need the wakeup graph: a flame graph where each off-CPU interval is attributed not just to where it slept, but to which thread woke it up. Constructed from the sched_wakeup tracepoint paired with sched_switch, the wakeup graph captures the producer-consumer relationships across threads.

The construction: when sched_switch puts thread X to sleep, capture X's stack and timestamp. When sched_wakeup later targets X, capture the waker's stack at that instant. The pair gives you "X was blocked here; thread Y woke X up from there". Aggregate across thousands of these pairs, group by waker stack, and you have a flame graph of "what unblocking work happened during this period". The classic insight: in a service hung on lock contention, the on-CPU graph is empty, the off-CPU graph shows everyone in futex_wait, and the wakeup graph shows one specific thread doing all the unlock() calls — the lock is held by exactly that thread, and it is the bottleneck.
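The construction can be sketched offline, too. This is an illustrative Python reconstruction of what wakeuptime-style tools aggregate in-kernel — the event tuples are hypothetical, and it attributes the interval from switch-out to wakeup (the small runnable-to-running tail is ignored for simplicity):

```python
def pair_wakeups(events):
    """Attribute each off-CPU interval to the thread that ended it.

    `events` is a time-ordered list of either
        ("switch", t_ns, pid, sleeper_stack)   -- pid goes off-CPU here
        ("wakeup", t_ns, waker_stack, target)  -- target made runnable
    Returns {(sleeper_stack, waker_stack): cumulative blocked ns}.
    """
    asleep = {}                      # pid -> (t_off, sleeper stack)
    pairs: dict[tuple, int] = {}
    for ev in events:
        if ev[0] == "switch":
            _, t, pid, stack = ev
            asleep[pid] = (t, stack)
        else:
            _, t, waker_stack, target = ev
            if target in asleep:     # close the interval, credit the waker
                t_off, sleeper_stack = asleep.pop(target)
                key = (sleeper_stack, waker_stack)
                pairs[key] = pairs.get(key, 0) + (t - t_off)
    return pairs

events = [
    ("switch", 100, 7, "handler;lock;futex_wait"),
    ("wakeup", 400, "worker;unlock", 7),
]
pairs = pair_wakeups(events)
```

Grouping the result by waker_stack is the wakeup flame graph: one dominant waker stack across many sleeper stacks is the "one thread holds the lock everyone waits on" signature.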

The Hotstar reliability team used a wakeup flame graph in 2024 to find a regression where the order-confirmation thread was getting woken up 50 ms late on every request during the IPL final. The on-CPU graph was healthy — the thread used its CPU well when it ran. The off-CPU graph showed a fat epoll_wait → ep_poll, which looked like normal idleness. The wakeup graph showed the wakers were all coming from a single Redis-pubsub listener thread that was processing a 200-message backlog before fanning out to confirmations. The fix was to give the listener its own dedicated thread and a pre-allocated work queue, dropping the wake latency from p99 = 50 ms to p99 = 1.5 ms. None of the other graphs showed the dependency; only the wakeup graph did.

The cost of the wakeup graph is roughly 2× the off-CPU graph in tracing overhead — both sched_switch and sched_wakeup are dense events on a busy machine. Brendan Gregg's wakeuptime-bpfcc tool implements the capture; the rendering is flamegraph.pl with no special flags. Use it when the off-CPU graph shows a lot of futex / epoll time and the question "who was supposed to unblock these threads" is unanswered.

A subtler use of the wakeup graph is to detect wakeup storms — events where a single broadcast wakes up dozens of threads to compete for one piece of work, only one of which can win. The classic example is the "thundering herd" pattern on accept() in pre-Linux-3.9 kernels: a new connection wakes every accepting thread, all rush to accept, one gets the connection, the rest go back to sleep. The wakeup graph shows this as a single waker source with hundreds of woken pairs, all leading back to a sleep → wakeup → sleep cycle. Modern kernels solved most of these (SO_REUSEPORT, EPOLLEXCLUSIVE), but application-level wakeup storms still happen — condition_variable.notify_all() in C++, Object.notifyAll() in Java, broadcast on a Python asyncio.Condition. The wakeup graph fingers them by showing one thread doing the broadcast and many threads receiving it, where most of the receivers immediately go back to sleep without doing useful work.

The Cleartrip search team caught a wakeup storm in 2024 where every cache invalidation broadcasted to 200 worker threads, of which one would actually re-fetch and the other 199 would re-sleep — the storm itself cost 6% of total CPU in scheduler overhead. The wakeup graph made the diagnosis a 90-second exercise; an on-CPU profile alone showed only the small fraction of CPU consumed in __schedule and missed the cause entirely.

Going deeper

Profile-guided lock optimisation — the closed loop with off-CPU

The off-CPU graph is the measurement end of a closed-loop optimisation discipline. Brendan Gregg's "Locking, Lockless, Wait-free" series, the LLVM contention-profile work, and the Java Flight Recorder lock-contention view all use the same pattern: capture an off-CPU graph, identify the futex/lock with the most blocked time, drill into the user-parent to find the contention site, then either shard the lock (per-CPU, per-shard, hierarchical), reduce the critical section, or replace the lock with a lock-free structure. The Razorpay payments team uses this loop weekly: the off-CPU graph for payments-router showed 38% of off-CPU time in pthread_mutex_lock from a global rate-limiter; sharding the limiter to per-merchant counters dropped lock contention to 4% and improved p99 from 240 ms to 95 ms. None of the on-CPU profiling, the dashboards, or the application logs would have surfaced the global-mutex bottleneck — only the off-CPU graph did, because the slow path was blocked time, not running time.

Continuous off-CPU profiling at scale

Capturing off-CPU graphs ad-hoc with bpftrace is fine for incident response but expensive at fleet scale. Continuous off-CPU profiling — running the eBPF agent permanently on every host, sampling at 1 Hz, uploading per-stack histograms every minute — is what Pyroscope and Polar Signals' open-source Parca project enable. The trade-off: you pay 1–2% steady-state overhead but get queryable off-CPU data for every host, every minute, indefinitely. The Flipkart Big Billion Days team runs continuous off-CPU profiling on the catalogue tier; the historical archive lets them answer "what did the off-CPU graph look like at 14:32:18 on the second sale day?" three weeks later, when the postmortem actually happens. The alternative — finding a still-running pod that exhibits the bug, attaching bpftrace, and hoping the bug recurs — is the classic distributed-systems failure of being unable to reproduce a transient.

Reproduce this on your laptop

# Reproduce on Linux 4.9+
sudo apt install bpftrace linux-tools-common
git clone https://github.com/brendangregg/FlameGraph
export PATH=$PWD/FlameGraph:$PATH
python3 -m venv .venv && source .venv/bin/activate
pip install psycopg2-binary
sudo python3 offcpu_flame.py $(pgrep -f gunicorn | head -1)
xdg-open /tmp/swiggy_offcpu.svg

Sampling off-CPU at very high context-switch rates

Above ~500k context-switches per second per CPU — heavily-virtualised hosts, services with millions of fine-grained goroutines, or aggressive SCHED_FIFO preemption — capturing the stack on every sched_switch becomes a measurable fraction of total CPU. The eBPF program runs on every switch, walks the stack (30–80 frames), hashes it, and increments a map entry — each step is a few hundred nanoseconds on modern hardware, so 500k × 500 ns = 250 ms of overhead per CPU per second, or 25% of one core. At that point off-CPU profiling itself becomes a perturbation source. The escape hatch is sampled off-CPU profiling: instead of capturing every sched_switch, capture a stratified sample (one in every 10 or 100 switches). This trades resolution for overhead and is the right setting for very high-context-switch workloads. Not every tool exposes this directly: offcputime-bpfcc's -m minimum-block-time filter discards the cheapest intervals in-kernel, which removes most of the map-update volume (though not the stack-walk cost); Polar Signals' agent samples adaptively, dropping its capture rate when measured CPU overhead exceeds a configured threshold.
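The overhead arithmetic is worth having as a one-liner. A sketch reproducing the worked numbers above (the per-event cost is the few-hundred-nanoseconds estimate from the text, not a measurement):

```python
def trace_overhead(switch_rate_hz: float, per_event_ns: float,
                   sample_every: int = 1) -> float:
    """Fraction of one CPU consumed by the tracing probe itself.

    switch_rate_hz: context switches per second on that CPU.
    per_event_ns:   probe cost per captured event (stack walk + hash + map).
    sample_every:   capture 1 in N switches (stratified sampling).
    """
    return (switch_rate_hz / sample_every) * per_event_ns * 1e-9

# The worked numbers from above: 500k switches/s at ~500 ns per event.
full = trace_overhead(500_000, 500)                       # 25% of one core
sampled = trace_overhead(500_000, 500, sample_every=100)  # 0.25%
```

Run it against your own box's `vmstat` context-switch rate before attaching a full-rate capture; if the result exceeds a couple of percent, switch to the sampled mode.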

Differential off-CPU graphs — what changed in the blocked time

The on-CPU differential pattern from the previous chapter (red = got hotter, blue = got colder) extends directly to off-CPU graphs. Capture two off-CPU profiles — one before a deploy, one after — fold both, merge the per-stack counts with difffolded.pl from the FlameGraph repo, and render the result with flamegraph.pl. Frames where blocked time grew render red; frames where blocked time shrank render blue. The interpretation differs from on-CPU diffs: a red frame on an off-CPU diff means "threads spent more time blocked here after the deploy", which usually points at a regression in a downstream dependency, a new lock, or a connection-pool change — not at the application code itself. The Razorpay team caught a 2024 regression where a Redis client library upgrade switched from a connection pool of 64 to a pool of 16 by default; on-CPU graphs showed nothing, the off-CPU diff showed a fat red recv plateau on the Redis path because connections were now serialising. The fix was a one-line config change; the diagnosis time was four minutes from "p99 went up after the deploy" to "the Redis pool default changed". A pure on-CPU diff would have shown nothing red at all, because the regression added blocked time, not running time.

Where this leads next

Off-CPU profiling closes the loop opened in the previous chapter on flame graphs. With both on-CPU and off-CPU views, every thread's wall second is accounted for: either the thread was running (on-CPU graph) or it was blocked (off-CPU graph), and you have visibility into where in the code each regime spent its time. The pair is the foundation for everything in the next few chapters.

Continuous profiling in production (/wiki/continuous-profiling-in-production) describes the operational pattern of running both on-CPU and off-CPU agents permanently, with a 1–2% overhead budget, so you have queryable history. The pattern is what makes incident-time profiling shift from "SSH in, attach, hope" to "select the time range, render the graph". Off-CPU graphs are particularly hard to capture ad-hoc on transient bugs, so continuous capture is where they shine.

Lock contention deep-dive (/wiki/lock-contention-deep-dive) is the chapter for when the off-CPU graph fingers futex_wait as the dominant leaf and the question becomes "which lock, which holder, what to do". The wakeup-graph variant introduced here is the bridge into that chapter; the lock-specific tools (mutrace, perf lock, JFR's lock view) layer on top.

Async / coroutine profiling (/wiki/async-coroutine-profiling) handles the case where threads are not the unit of off-CPU time — Tokio tasks, Go goroutines, Python asyncio tasks, Java virtual threads. The basic eBPF on sched_switch sees only the OS thread; the runtime's userspace scheduler hides the per-task off-CPU time. Off-CPU profiling for async runtimes needs runtime-specific instrumentation, and that chapter walks through the recipes for each major one.

The mental shift to take into those chapters: most production performance problems are off-CPU problems. CPU-bound bottlenecks are the easy case — the on-CPU graph names them, the fix is in the code, the deploy resolves it. Off-CPU bottlenecks — locks, downstream services, page faults, scheduler issues — are where the long tail of incidents actually lives, and an engineer who can read the off-CPU graph fluently has a strict superset of the diagnostic vocabulary of one who only reads the on-CPU graph.

References