Wall: I/O is where systems actually block
Kiran runs the order-history API at a Bengaluru fintech. The endpoint paginates a customer's last 90 days of trades; on a c6i.4xlarge it serves 4,200 req/s with a p99 of 38 ms. He follows Part 9 to the letter — measures Amdahl's serial fraction (3%), confirms there's no false-sharing in the JSON encoder, isolates two P-cores, sets GOMAXPROCS to match. Throughput climbs to 4,800 req/s. p99 climbs to 41 ms. He doubles the instance to a c6i.8xlarge: 4,900 req/s, p99 = 43 ms. The CPU dashboard reads 22% utilisation. Eight cores idle, throughput flat, latency creeping up. The flamegraph shows 71% in runtime.gopark — Go's name for "this goroutine is parked, waiting for something else". That something else is a Postgres SELECT that takes 6 ms median, 28 ms p99, and the connection pool has 32 slots.
This is the wall where Part 9 ends. Every ceiling we named — serial fraction, coherence, bandwidth, heterogeneity — assumed the bottleneck was inside the box. For most production services in 2026 it is not. The bottleneck is on the other side of a syscall: a disk seeking, a TCP socket waiting on an ACK, a database holding a row lock, an inference accelerator dispatch. None of it shows up in top. None of it appears on the on-CPU flamegraph. The CPU is idle, but not free — it is parked waiting for a wakeup, and the budget you spent on more cores buys nothing because the next core would also park.
Most production services are not CPU-bound; they are I/O-bound, and the difference is invisible to the tools that made you good at CPU optimisation. On-CPU profilers, top-style utilisation dashboards, and Amdahl's-law speedup curves all describe what threads do when they are running — not what they do when they are blocked. Diagnosing I/O-bound services requires off-CPU profiling (perf sched, bpftrace on sched_switch, off-CPU flamegraphs), wait-state accounting (D-state in ps, iowait% in top, runqueue latency), and a different mental model where adding cores is the wrong answer.
The on-CPU lie — why your profiler says nothing is wrong
Every flamegraph tool you have used — perf record, py-spy, pprof, Java Flight Recorder, Async Profiler in default mode — is a sampling on-CPU profiler. The sampler fires N times per second (99 Hz is the conventional choice via perf record -F 99; perf's own default is 4000 Hz) and asks: "which function is currently running on this CPU?". The answer goes into a histogram. Stack frames that appear in lots of samples are wide; stack frames that don't are narrow.
The tool has no opinion about a thread that is not running. A goroutine parked in runtime.gopark, a Java thread blocked in Object.wait, a Python coroutine suspended at an await asyncio.sleep, a kernel thread sleeping in io_schedule — none of them are sampled. They contribute zero to the flamegraph. The CPU sample lands on whatever thread happens to be running at that instant, which on an idle CPU is usually the kernel's idle loop (mwait_idle_with_hints) or, more commonly, nothing at all because the sample is dropped on idle CPUs.
The result: a service that spends 80% of its wall-time blocked on a database read shows a flamegraph dominated by the 20% of its time it spends actually computing. The wide frames are real CPU work — JSON encoding, TLS handshakes, hash computation — but they are not the bottleneck. The bottleneck is in the silence between the samples.
The cure is off-CPU profiling. Instead of sampling running threads, off-CPU profilers trace the scheduler: they timestamp the sched_switch that takes a thread off the CPU, timestamp the switch that puts it back on, and measure the time between. Each off-CPU interval is attributed to the stack at the point the thread blocked — which is what you actually want to see for an I/O-bound service. Why off-CPU profiling is harder than on-CPU profiling: every thread switch is an event, not a sample. A busy server can do hundreds of thousands of context switches per second; recording all of them produces gigabytes of trace per minute. The standard fix is filtering at the kernel level via eBPF — only record switches longer than 1 ms, only for the target process, only when the wait reason matches a list. Brendan Gregg's offcputime (packaged as offcputime-bpfcc) does exactly this and is the production tool for this category.
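A minimal sketch of that kernel-side filtering, in the same embed-bpftrace-in-Python style as the harness in the next section: it watches a single thread (you pass the kernel tid as argv[1], an assumption made for illustration), discards any off-CPU interval shorter than 1 ms inside the kernel, and prints a histogram of the survivors. The script name, the 1 ms threshold, and the 10-second run are arbitrary choices, not a standard tool.
# offcpu_filtered.py — hedged sketch: record only off-CPU intervals > 1 ms for one thread
# Run: sudo python3 offcpu_filtered.py <tid>   (bpftrace requires root)
import subprocess, sys
TARGET_TID = int(sys.argv[1])   # kernel tid of the thread to watch (illustrative)
MIN_BLOCK_NS = 1_000_000        # drop anything shorter than 1 ms before it leaves the kernel
prog = f"""
tracepoint:sched:sched_switch {{
    // thread switched out: remember when it left the CPU
    if (args->prev_pid == {TARGET_TID}) {{ @off[args->prev_pid] = nsecs; }}
    // thread switched back in: keep the interval only if it is long enough
    if (args->next_pid == {TARGET_TID} && @off[args->next_pid] != 0) {{
        $dur = nsecs - @off[args->next_pid];
        delete(@off[args->next_pid]);
        if ($dur >= {MIN_BLOCK_NS}) {{ @blocked_ms = hist($dur / 1000000); }}
    }}
}}
interval:s:10 {{ exit(); }}
"""
subprocess.run(["bpftrace", "-e", prog])  # prints the @blocked_ms histogram on exit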
Measuring it on your own service — Python harness with off-CPU breakdown
The harness below builds a deliberately I/O-bound Python service (sleeps to simulate database calls, real CPU work for parsing/encoding), runs an offcputime trace via bpftrace, and produces a side-by-side comparison of on-CPU vs off-CPU time per call site. The point is to see — on your own laptop — that the on-CPU view is misleading and that the off-CPU view explains where the wall-time actually went.
# io_wait_breakdown.py — on-CPU vs off-CPU side by side for an I/O-bound service
# Run: python3 io_wait_breakdown.py
# Requires: Linux with bpftrace installed, root access for eBPF.
import os, time, subprocess, threading
# A request handler that is structurally I/O-bound: 4 ms of CPU work, then
# 25 ms of "database" wait, then 3 ms of CPU work. This is the shape of a
# typical paginated read endpoint at any Indian fintech.
def parse_request():
# Real CPU work — JSON parse, validation. ~4 ms on a P-core.
s = sum(i * i for i in range(40_000))
return s
def fetch_from_db():
# The "database wait". On a real service this is recv() on a socket,
# the goroutine parking, the kernel io_schedule(). Here we use sleep()
# which goes through nanosleep -> hrtimer_nanosleep -> schedule().
time.sleep(0.025)
def encode_response(_):
# Final CPU work — JSON encode, TLS write. ~3 ms on a P-core.
return sum(i % 7 for i in range(30_000))
def handle_one_request():
s = parse_request()
fetch_from_db()
return encode_response(s)
# Launch a worker thread that handles requests in a tight loop.
stop = threading.Event()
def worker():
while not stop.is_set():
handle_one_request()
t = threading.Thread(target=worker, daemon=True)
t.start()
pid = os.getpid()
print(f"Worker pid {pid} — running for 6 seconds")
# 1) On-CPU summary via perf stat — the conventional utilisation view
on_cpu = subprocess.run(
["perf", "stat", "-e", "task-clock,cycles,context-switches,cpu-migrations",
"-p", str(pid), "sleep", "3"],
capture_output=True, text=True
)
print("=== on-CPU summary (perf stat) ===")
print(on_cpu.stderr)
# 2) Off-CPU profile via bpftrace — time every interval the process's threads
# spend off-CPU (a real tool would also capture the stack at each block point).
bpftrace_prog = f"""
tracepoint:sched:sched_switch /pid == {pid}/ {{ @start[tid] = nsecs; }}
tracepoint:sched:sched_wakeup /@start[args->pid] != 0/ {{
@offcpu_us[args->comm] = sum((nsecs - @start[args->pid]) / 1000);
delete(@start[args->pid]);
}}
interval:s:3 {{ exit(); }}
"""
off = subprocess.run(["bpftrace", "-e", bpftrace_prog],
capture_output=True, text=True)
print("=== off-CPU breakdown (bpftrace) ===")
print(off.stdout or off.stderr)
stop.set(); t.join(timeout=1)
# Sample run on a 13th-gen Core i7 (Bengaluru workstation), kernel 6.5
Worker pid 184273 — running for 6 seconds
=== on-CPU summary (perf stat) ===
Performance counter stats for process id '184273':
618.42 msec task-clock # 0.206 CPUs utilized
2,134,901,233 cycles # 3.452 GHz
1,038 context-switches # 1.679 K/sec
3 cpu-migrations # 4.851 /sec
3.001274821 seconds time elapsed
=== off-CPU breakdown (bpftrace) ===
Attaching 3 probes...
@offcpu_us[python3]: 2386412
Walk through. task-clock reports 618 ms over a 3-second interval — 0.206 CPUs utilized. The on-CPU view says the worker is using ~21% of one CPU. A naive reading: "low utilisation, the box is fine, you can pack 4× more workers". This is the on-CPU lie. context-switches: 1,038 over 3 s (≈350/s) is the early warning sign — a CPU-bound worker would do ~10–50 ctxsw/s from preemption alone, not ~350/s. Every time.sleep(0.025) costs one switch out plus one switch back in, so a voluntary-switch rate in the hundreds per second is the signature of a worker that blocks on every request. The off-CPU breakdown reports 2,386,412 µs (≈2.39 s) of off-CPU time over 3 s — almost exactly the ~80% of wall-time the on-CPU view did not see. The bpftrace program attributes every off-CPU interval to the comm (process name) and sums; in a real service you would also collect the kernel + user stack at the block point to see which call site is blocking, but even the aggregate number is enough to refute the "it's CPU-bound, add cores" diagnosis.
Why context switches are the cheap signal before you reach for off-CPU profiling: every block-on-I/O is one switch out, one switch back in. If pidstat -w 1 -p <pid> shows hundreds of switches per second per worker thread, the worker is blocking that often. A CPU-bound worker shows tens of switches per second (kernel preemption only). The ratio between expected request rate and observed switch rate tells you, before you even run bpftrace, whether the service is structurally I/O-bound. For Kiran's order-history API at 4,200 req/s with each request making one DB call, the expected switch rate is ~8,400/s; if pidstat shows that number, the I/O hypothesis is confirmed and the off-CPU profile just tells you which DB call.
# pidstat: the cheap pre-bpftrace check
$ pidstat -w 1 -p $(pgrep -f order-history)
14:07:01 UID PID cswch/s nvcswch/s Command
14:07:02 1000 184273 8412.00 12.00 order-history
14:07:03 1000 184273 8398.00 11.00 order-history
14:07:04 1000 184273 8421.00 14.00 order-history
8,400 voluntary context switches per second from a single process. Voluntary (cswch/s) means the thread blocked itself — almost always on I/O. Non-voluntary (nvcswch/s) means the kernel preempted it for time-slicing. The ratio (700:1 voluntary-to-involuntary) is the off-CPU story before you trace a single line.
Where the wall-time is — five categories of blocking
I/O bottlenecks are not one phenomenon; they are five, and each shows up differently in the off-CPU profile and demands a different fix. Knowing which category you are in saves you weeks of trying the wrong solution.
Storage I/O (disk, SSD, network-attached block device). The thread calls read() or pread() against a file descriptor, the kernel issues a block layer request, the device DMAs the data into the page cache, the kernel wakes the thread. The block point is io_schedule in the kernel; the symptom in top is wa (iowait) percentage. Latency: 100 µs for an NVMe SSD, 5 ms for a spinning disk, 1–10 ms for EBS gp3 with a tail to 100 ms. This is what Part 10 will expand on, with iostat, fio, queue-depth tuning, and io_uring as the modern API.
Network I/O (TCP socket, gRPC, HTTP). The thread calls recv() or its async equivalent and parks until a packet arrives. The block point is sk_wait_data or epoll-related kernel functions. Latency: 50 µs intra-AZ, 1–5 ms cross-AZ, 50–500 ms cross-region. The diagnostic is ss -tin (per-socket retransmits, RTT) or bpftrace on tcp_recvmsg. Most "slow API" symptoms in microservice fleets are this category — not the called service being slow, but the network round trip plus a queue at the called service.
Database round trips (Postgres, MySQL, Redis, DynamoDB, MongoDB). A composite of network I/O and the remote service's own work. The thread issues a query, parks until the response, and the wall-time is network_rtt + db_query_time + network_rtt. The db query time itself decomposes further into lock waits, index seeks, sort/aggregate work, and result serialisation — all of which the calling thread sees as one opaque latency number. The fix surface is huge (connection pool size, prepared statements, batching, read replicas, query plan changes); identifying which part of the DB latency dominates requires server-side profiling.
Inter-process synchronisation (mutex, condvar, semaphore, channel). The thread tries to acquire a lock another thread holds, or waits on a condition variable. The block point is futex_wait (Linux's userspace synchronisation primitive). Latency: from microseconds (uncontended hot lock) to milliseconds (contended) to seconds (priority-inversion or held-by-blocked-thread). This is the only category that is internal to the process — no external service, no kernel I/O — and the fix is usually a redesign of the locking strategy, not a knob.
Voluntary scheduler yields (sleep, timer, GC pause). The thread asks to be paused — either for a fixed duration (time.sleep, setTimeout), waiting on a timer, or because the runtime is doing GC and stopped the world. The block point is hrtimer_nanosleep or runtime-specific (e.g., Go's runtime.gcStart). Latency: whatever the program asked for, plus scheduler granularity (~50 µs on Linux). Most legitimate sleeps in production are GC pauses or rate-limiter back-pressure; spurious sleeps in a request hot path are usually a bug (someone wrote sleep(0.1) to "fix" a race condition and forgot to remove it).
Why this categorisation matters in production: every category has a different fix, and applying the wrong one is the most expensive mistake in performance work. If your service is storage-I/O-bound, increasing the connection pool does nothing — your bottleneck is below the database. If your service is mutex-bound, adding read replicas does nothing — the contention is inside one process. If your service is network-bound on cross-AZ traffic, adding more cores in the same AZ does nothing — the round trip is the floor. Off-CPU profiling tells you the category; the right fix follows from there.
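A rough way to get a first-cut category without eBPF is to sample each thread's kernel wait channel from /proc. The sketch below polls /proc/<pid>/task/*/wchan at roughly 100 Hz for five seconds and buckets the block sites into the five categories; the wchan symbol names and the prefix-to-category mapping are illustrative assumptions and vary by kernel version, so treat unmapped names as a prompt to look them up, not as noise.
# wchan_sampler.py — no-eBPF sketch: bucket a process's blocked threads by kernel wait channel
# Run: python3 wchan_sampler.py <pid>   (your own processes; others may need root)
import sys, time, glob
from collections import Counter
pid = sys.argv[1]
CATEGORY = {                         # prefix -> category; illustrative, kernel-version dependent
    "io_schedule": "storage I/O",
    "folio_wait_bit": "storage I/O (page cache)",
    "sk_wait_data": "network I/O (socket recv)",
    "futex_wait": "synchronisation (futex)",
    "do_nanosleep": "timer / sleep",
    "hrtimer_nanosleep": "timer / sleep",
    "ep_poll": "event-loop poll",
}
counts = Counter()
deadline = time.time() + 5.0
while time.time() < deadline:
    for path in glob.glob(f"/proc/{pid}/task/*/wchan"):
        try:
            wchan = open(path).read().strip()
        except OSError:
            continue                 # thread exited between glob and read
        if wchan in ("", "0"):       # "0" means running or runnable, not blocked
            counts["(on-CPU or runnable)"] += 1
            continue
        label = next((cat for prefix, cat in CATEGORY.items() if wchan.startswith(prefix)),
                     f"other: {wchan}")
        counts[label] += 1
    time.sleep(0.01)                 # ~100 Hz sampling
for label, n in counts.most_common():
    print(f"{n:6d}  {label}")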
The capacity ceiling — Little's Law gives you the number
The reason adding cores stops helping at the I/O wall is not vague — it is Little's Law, applied honestly. For a stable system: L = λ × W, where L is the average number of in-flight requests, λ is the arrival rate, and W is the average wall-time per request. Rearrange for the throughput ceiling: λ_max = L_max / W. Your service's maximum throughput is the maximum number of requests you can have in flight, divided by how long each one takes.
For a CPU-bound service, L_max is roughly the core count (a request occupies a core for its CPU time, so 32 cores can hold ~32 in-flight requests). For an I/O-bound service, L_max is the concurrency limit of the bottleneck resource — the database connection pool size, the per-pod connection cap to the downstream, the OS file descriptor limit, the kernel epoll set size. Cores are irrelevant once the bottleneck resource is the limit.
# littles_law_ceiling.py — what throughput can my I/O-bound service actually hit?
# Run: python3 littles_law_ceiling.py
# Production parameters from a typical Indian fintech read endpoint.
# Each request makes one DB round-trip, the rest is local CPU.
SCENARIOS = [
{"name": "current", "pool": 32, "db_p50_ms": 6, "db_p99_ms": 28, "local_cpu_ms": 4},
{"name": "pool=128", "pool": 128, "db_p50_ms": 6, "db_p99_ms": 28, "local_cpu_ms": 4},
{"name": "+read replica", "pool": 128, "db_p50_ms": 3, "db_p99_ms": 12, "local_cpu_ms": 4},
{"name": "+batched query", "pool": 128, "db_p50_ms": 3, "db_p99_ms": 12, "local_cpu_ms": 4, "batch": 4},
]
for s in SCENARIOS:
batch = s.get("batch", 1)
# Average wall-time per request, in seconds.
W = (s["db_p50_ms"] / batch + s["local_cpu_ms"]) / 1000.0
L_max = s["pool"] # concurrency limit
lam_max = L_max / W # Little's Law throughput ceiling, req/s
p99_W = (s["db_p99_ms"] / batch + s["local_cpu_ms"]) / 1000.0
print(f"{s['name']:>22} L={L_max:4d} W={W*1000:5.1f}ms lam_max={lam_max:7.0f} req/s p99_W={p99_W*1000:5.1f}ms")
# Sample run — the four scenarios for Kiran's order-history API
current L= 32 W= 10.0ms lam_max= 3200 req/s p99_W= 32.0ms
pool=128 L= 128 W= 10.0ms lam_max= 12800 req/s p99_W= 32.0ms
+read replica L= 128 W= 7.0ms lam_max= 18286 req/s p99_W= 16.0ms
+batched query L= 128 W= 4.8ms lam_max= 26947 req/s p99_W= 7.0ms
Walk through. The current configuration ceiling is 3,200 req/s, exactly because 32 connections × (1 / 10 ms) = 3,200. Kiran observed 4,800 req/s in production, which exceeds the calculated ceiling — meaning the average wall-time he is paying is closer to 6.7 ms, not 10 ms (probably because the DB p50 is below 6 ms when the pool is not contended). The ceiling rises sharply when you raise the pool: 32 → 128 connections is a 4× increase in concurrency for zero hardware cost. Adding a read replica drops W, which multiplies the ceiling again — note that this is the only category of fix that changes wall-time per request. Batching combines both effects, dropping the per-record wall-time by amortising the round trip across 4 records. Why Little's Law works for I/O-bound services and not just queueing-textbook problems: the law is a steady-state accounting identity — total in-flight = arrival rate × time per request — that holds for any stable system regardless of the service-time distribution. The only assumption is that the system is not blowing up (arrivals don't permanently exceed completions). For a service with a hard concurrency cap (connection pool, semaphore, fd limit), L is bounded by the cap, so λ is bounded by cap / W. This is a calculation, not an estimate.
The number you should walk away with: for an I/O-bound service, the throughput ceiling is the bottleneck-resource concurrency divided by the per-request wall-time at that resource. Cores do not appear in the formula. Memory does not appear. The whole CPU-side optimisation toolkit from earlier in Part 9 changes none of the variables. This is why the wall is named the way it is — Part 9's mental tools end here, Part 10's mental tools start here.
Common confusions
- "High
iowait%means the disk is the bottleneck." Not always.iowaitis the percentage of time the CPU was idle while there was at least one outstanding I/O request. On a multi-core box with one core blocked on disk and seven idle,iowaitreads ~12.5% even though only one core has anything to do with I/O. A better signal is the off-CPU profile of your specific service, not the system-wide iowait number. The system-level number is also lossy across kernel versions — its accounting changed in 5.x. - "Low CPU utilisation means the box has spare capacity." Only if the workload is CPU-bound. For an I/O-bound service, the cores are idle because the threads are blocked, not because there is no demand. Adding workers will not help — they will block on the same downstream resource. The capacity ceiling is the downstream's throughput, not the local CPU.
- "Async / non-blocking I/O eliminates the wait." No — it changes who waits. The wait is still there in wall-time; it is just not associated with a kernel thread. An
await asyncio.sleep(25e-3)is exactly as long in wall-time as atime.sleep(0.025)on the call path; the difference is that other coroutines can run during the wait. Async wins when you have many concurrent I/Os to overlap; it does nothing for a single critical-path call. - "Off-CPU profiling and on-CPU profiling are the same thing with different sampling rates." Different mechanisms entirely. On-CPU is sampling-based (fire every N ms, record what's running). Off-CPU is event-based (record every block/wake transition). Off-CPU produces a stack count weighted by time waited, not by sample frequency. The two profiles answer different questions — "what is the CPU doing?" vs "what is the wall-time spent on?".
- "
top's %CPU column shows what each process is doing." It shows the percent of one CPU that the process used in the sample interval. A process at 100% might be fully CPU-bound on one core, or it might be one thread of a 32-thread service that has 31 other threads blocked on I/O —topcannot tell you which without-Hfor per-thread view, and even that doesn't show off-CPU time. - "Profiling I/O-bound services is harder, so most teams skip it." Yes, that is the actual production reality. Most SRE teams reach for
top,htop, and on-CPU flamegraphs first because those tools are familiar; they reach for off-CPU profiling only after weeks of "this is weird" debugging. The lesson Part 9 closes on: when the on-CPU view shows utilisation under 50% and throughput is stuck, switch tools — don't add cores.
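To make the async point concrete, here is a minimal sketch, pure Python and nothing measured from a real service: one awaited 25 ms call costs the same wall-time as its blocking equivalent, while 32 concurrent waits overlap and finish in roughly one round-trip time.
# async_wait_demo.py — "async changes who waits": overlap helps throughput, not single-call latency
# Run: python3 async_wait_demo.py
import asyncio, time
async def fake_db_call():
    await asyncio.sleep(0.025)               # stands in for one 25 ms DB round trip
async def main():
    t0 = time.perf_counter()
    await fake_db_call()                     # single critical-path call: ~25 ms, async or not
    single_ms = (time.perf_counter() - t0) * 1000
    t0 = time.perf_counter()
    await asyncio.gather(*(fake_db_call() for _ in range(32)))   # 32 overlapped waits
    overlapped_ms = (time.perf_counter() - t0) * 1000
    print(f"one call:           {single_ms:6.1f} ms")
    print(f"32 calls, gathered: {overlapped_ms:6.1f} ms (nowhere near 32 x 25 ms)")
asyncio.run(main())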
Going deeper
The off-CPU flamegraph and how to read it
Brendan Gregg's off-CPU flamegraph is the canonical visualisation for this category of problem. The construction: trace the kernel's sched_switch tracepoint, record the user + kernel stack at each switch-out, measure the time until that thread is switched back in, and aggregate stacks by total off-CPU duration. The output looks like a normal flamegraph but the x-axis is wall-time spent blocked rather than CPU samples taken.
The reading is the inverse of an on-CPU flamegraph. A wide bar in an on-CPU flamegraph is a hot spot; a wide bar in an off-CPU flamegraph is a cold spot — a place where threads sit and wait. The hot spot for fixing performance is the same in both cases (where the wide bar is), but the fix is different: for on-CPU you optimise the code at the wide bar, for off-CPU you reduce the wait or increase parallelism so other work can happen during the wait.
The production tool is bcc/tools/offcputime (packaged as offcputime-bpfcc on Debian and Ubuntu). A typical run: offcputime-bpfcc -f -p <pid> 30 > offcpu.folded; ./flamegraph.pl --color=io --title="off-cpu" offcpu.folded > offcpu.svg. The -f flag emits folded stacks that flamegraph.pl reads directly, and --color=io uses the IO colour palette (blues) instead of the CPU palette (warm colours) so you don't visually confuse the two profile types.
When off-CPU is the wrong tool — pure CPU-bound services
Not every service is I/O-bound. A video transcoder, a cryptographic verifier, a JSON parser benchmark, a tight numerical loop — these are genuinely CPU-bound and on-CPU profiling is the correct tool. The diagnostic that decides which: look at pidstat -w for voluntary context switches per second per thread. Why this single number tells you which profile to run: a CPU-bound thread leaves the CPU only when the kernel preempts it (at worst ~250/s on a contended runqueue with a ~4 ms timeslice, far fewer on an idle core) and almost never blocks voluntarily; an I/O-bound thread blocks on every external call (hundreds to thousands of voluntary switches per second). The ratio of voluntary to involuntary switches is the cheapest possible classifier — if cswch/s > 100 and dominates nvcswch/s, the thread is I/O-bound and on-CPU profiling will mislead you.
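A sketch of that classifier reading the counters straight from /proc instead of pidstat. Note the fields are cumulative since thread start, so sample twice and take the difference if you want a per-second rate; the 10:1 threshold below is an arbitrary illustration, not a standard cutoff.
# ctxsw_ratio.py — voluntary vs non-voluntary context switches per thread, from /proc
# Run: python3 ctxsw_ratio.py <pid>
import sys, glob
pid = sys.argv[1]
for status_path in sorted(glob.glob(f"/proc/{pid}/task/*/status")):
    tid = status_path.split("/")[-2]
    vol = nonvol = 0
    for line in open(status_path):
        if line.startswith("voluntary_ctxt_switches"):
            vol = int(line.split()[1])
        elif line.startswith("nonvoluntary_ctxt_switches"):
            nonvol = int(line.split()[1])
    # crude verdict: a heavily voluntary switcher is blocking itself on something external
    verdict = "likely I/O-bound" if vol > 10 * max(nonvol, 1) else "CPU-bound or mixed"
    print(f"tid {tid:>7}  voluntary={vol:<10} nonvoluntary={nonvol:<8} -> {verdict}")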
The real answer for many production services is "both": a service is CPU-bound and I/O-bound, and the breakdown changes by workload. A search service might be CPU-bound during query parsing and ranking, then I/O-bound while fetching documents from a backing store. Run both profiles, treat them as complementary views of the same wall-time.
Async runtimes and the off-CPU illusion
Go, Node.js, Python's asyncio, Rust's tokio, Java's project Loom — every modern runtime has some flavour of "user-space scheduler that multiplexes M coroutines onto N OS threads". For these runtimes, the standard kernel off-CPU profiler tells a misleading story: the kernel sees N OS threads, none of which block much because the runtime scheduler keeps them busy with whichever coroutine is ready. The coroutines that are blocked on I/O don't show up in the kernel's view because they are not real threads.
The diagnostic shifts: for Go you use runtime/pprof's block profile (enable it with runtime.SetBlockProfileRate, then inspect with go tool pprof block.pb), which records goroutine wait times in user space. For Node, the async_hooks API, which lets you time async resource lifetimes. For Python asyncio, a snapshot of asyncio.all_tasks() taken during the wait. The conceptual model stays the same — wall-time spent waiting matters, not just CPU time — but the tool has to live inside the runtime's scheduler, not the kernel's.
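For the asyncio case, a minimal sketch of what such a snapshot looks like; the fake_db coroutine and task names are invented for illustration, and a real service would trigger the dump from a signal handler or a debug endpoint rather than inline.
# asyncio_wait_snapshot.py — runtime-side view: where is every task suspended right now?
# Run: python3 asyncio_wait_snapshot.py
import asyncio
async def fake_db(delay):
    await asyncio.sleep(delay)                    # stands in for an awaited DB round trip
async def snapshot():
    for task in asyncio.all_tasks():
        if task is asyncio.current_task():
            continue                              # skip the task doing the dumping
        frames = task.get_stack(limit=1)          # one frame from the suspension point
        if frames:
            code = frames[0].f_code
            where = f"{code.co_name} ({code.co_filename}:{frames[0].f_lineno})"
        else:
            where = "finished"
        print(f"{task.get_name():<6} suspended in {where}")
async def main():
    tasks = [asyncio.create_task(fake_db(0.5), name=f"db-{i}") for i in range(3)]
    await asyncio.sleep(0.05)                     # let the db tasks park on their awaits
    await snapshot()
    for t in tasks:
        t.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
asyncio.run(main())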
A subtle production trap: the runtime's wait profile only sees the waits the runtime knows about. A C extension that calls read() directly via ctypes, or a JNI call that blocks the OS thread, looks like normal CPU work to the runtime profiler — and looks like a blocked thread to the kernel profiler. You need both views to catch every category.
Indian production patterns — three I/O-bound services and how they were diagnosed
Razorpay's UPI callback service had the canonical "low CPU, high latency" symptom during the Diwali payment surge — 18% CPU utilisation across the fleet, p99 climbing from 180 ms to 720 ms. Adding instances did nothing. Off-CPU profiling pointed to recvfrom on the connection back to NPCI's UPI switch — the network round-trip from ap-south-1 to NPCI's data centre in Hyderabad was the floor. The fix was not in Razorpay's code at all: open more concurrent connections to NPCI per pod (raise the per-pod connection cap from 32 to 256) so the per-request wait time was the same but more requests overlapped. p99 returned to 220 ms with the same CPU footprint.
Hotstar's IPL stats API showed iowait=8% system-wide on the data tier during the IPL final, which the dashboard team initially read as "fine, plenty of headroom". The off-CPU profile of the actual service told a different story: 73% of wall-time in pread64 against the SSDs holding the player-stats database. The system-wide iowait was diluted by 11 other cores doing other work; the one core handling stats was effectively pegged on I/O. The fix was switching from gp3 EBS to local NVMe, dropping the per-read latency from 1.8 ms to 110 µs. The story repeats across many "low iowait, high latency" incidents — the system-wide number cannot replace per-service measurement.
Zerodha Kite's order-status endpoint showed the opposite pattern: high CPU (78%), high p99 (95 ms), and an off-CPU profile that was almost flat — there was nothing to optimise off-CPU because the threads were genuinely running, not blocked. The fix lived on the on-CPU side (the JSON encoder was un-vectorised, and a pre-compiled msgpack encoder cut CPU per request by 40%). The lesson: the off-CPU profile is the right starting point because if there is no off-CPU bottleneck, you've just confirmed it's an on-CPU problem and can move on. Off-CPU profiling is not always where the answer is; it is reliably the place to first check, especially when low CPU + high latency is the combination.
Why Part 10 needs its own treatment — I/O is a curriculum, not a chapter
The next part (chapters 68-78) takes I/O performance as the subject. Each blocking category from this chapter expands into multiple chapters: storage I/O alone needs the block layer, scheduler classes (deadline, mq-deadline, bfq, none), queue depths, io_uring semantics, page cache effects, fsync behaviour, and per-device tail latencies. Network I/O needs the kernel TCP stack, congestion control choices (cubic, BBR), and the new generation of zero-copy interfaces (XDP, AF_XDP). Database round trips need a connection-pool design, batching, and read-replica-vs-primary placement.
The wall this chapter establishes is the entry condition: you have proven the on-CPU view is wrong, you have proven the off-CPU view points at I/O, and you are now ready to dig into the I/O subsystem itself. Without that proof, every fix in Part 10 is a guess — you might tune io_uring queue depth on a service that is actually mutex-bound on a userspace lock, and the tuning will produce no change you can measure.
Where this leads next
The next chapter (/wiki/strong-vs-weak-scaling) is the formal closing of Part 9 — the strong-vs-weak-scaling vocabulary that names the difference between "make a fixed problem run faster on more cores" (strong, governed by Amdahl) and "make a bigger problem run in the same time on more cores" (weak, governed by Gustafson). For services that hit the I/O wall, neither scaling regime applies in the way Part 9 assumed; the next part will reframe scaling around the I/O-side ceiling.
Part 10 (its toolchain anchored by /wiki/usdt-and-uprobes-userspace-ebpf) takes each blocking category from this chapter and gives it the depth treatment: the storage stack, the network stack, the database round-trip protocol, and the synchronisation primitives. Read this chapter before that part; the framing here — five categories, each with a different fix — is the spine the next part hangs detail on.
Two operational habits this chapter adds to the Part 9 toolkit. First, always run pidstat -w 1 against the suspect process before reaching for any profiler — voluntary context switches per second is the single cheapest classifier of CPU-bound vs I/O-bound, and it takes 5 seconds to get an answer. Second, assume off-CPU profiling is needed for any service with utilisation under 60% and rising p99 — the only services where the on-CPU profile suffices are the genuinely compute-bound ones, and those are a minority of production fleets in 2026.
Reproducibility footer
# Reproduce this on your laptop, ~5 minutes
sudo apt install linux-tools-common linux-tools-generic bpftrace bpfcc-tools
python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip # no third-party packages needed
sudo python3 io_wait_breakdown.py # bpftrace requires root for eBPF
# To compare on-CPU and off-CPU views explicitly:
sudo offcputime-bpfcc -p $(pgrep -f io_wait_breakdown) 5 > offcpu.txt
sudo perf record -F 99 -p $(pgrep -f io_wait_breakdown) -g -- sleep 5
sudo perf script | head -50
References
- Brendan Gregg, "Off-CPU Analysis" (2016, updated 2021) — the canonical introduction to off-CPU profiling, the off-CPU flamegraph, and the production tooling.
- Brendan Gregg, Systems Performance (2nd ed., 2020), chapter 5 — Applications, and chapter 6 — CPUs — the full treatment of on-CPU vs off-CPU profiling and the wait-state taxonomy.
- Brendan Gregg, BPF Performance Tools (2019), chapter 5 — the bcc/bpftrace recipes for offcputime, wakeuptime, runqlat, and the rest of the off-CPU toolchain.
- Linux kernel documentation — scheduler tracepoints (sched_switch, sched_wakeup) — the underlying kernel events every off-CPU profiler hooks.
- Gil Tene, "How NOT to Measure Latency" (2015) — the talk that establishes why coordinated omission and on-CPU bias produce the same class of misleading conclusion.
- Dean & Barroso, "The Tail at Scale" (CACM 2013) — the production framing for why low average + high tail is the I/O-bound service signature at scale.
- /wiki/wall-cpu-is-half-the-story — the companion wall chapter from earlier in the curriculum; together they bracket the on-CPU vs off-CPU framing for the whole curriculum.
- /wiki/use-method-utilization-saturation-errors — the USE method, which separates utilisation, saturation, and errors per resource — exactly the discrimination this chapter argues for.