Disk I/O observability: iostat, biolatency, and seeing the device

At 14:32 IST on a Tuesday, Aditi gets a page: Razorpay's payment-status API p99 has crossed 400 ms. Application logs are clean. CPU is at 22%. She SSHs into the primary Postgres node, runs iostat -xz 1, and sees nvme0n1 sitting at %util = 98% with await = 47 ms and r_await = 51 ms. The disk is the suspect, but %util lies on NVMe and await is an average — neither tells her whether one query is unlucky or every query is suffering. She runs biolatency-bpfcc -D 30, and the histogram shows two peaks: one at 200 µs (the healthy normal) and one at 64 ms (the murder weapon). The bimodality is the diagnosis. The fix turns out to be a runaway analytics query saturating the device queue, not a hardware fault.

This chapter is about the two layers of disk-I/O observability — the per-second device summary that iostat gives you, and the per-request latency distribution that biolatency exposes via eBPF. They answer different questions, they fail in different ways, and reading either one without the other is how engineers misdiagnose disk problems for an entire afternoon.

iostat -x shows you the device's average behaviour over a window — IOPS, throughput, queue depth, average latency, utilisation. biolatency shows you the per-request latency distribution as a histogram, which is what you actually need to debug tail latency. The trap most engineers fall into is trusting %util on NVMe (where it saturates at 100% long before the device does) and trusting await as a proxy for tail latency (it is the mean, and the mean hides the tail). Use both — iostat to spot the device, biolatency to characterise the pain.

What iostat actually shows — and what it hides

iostat -xz 1 is the first command you run when something feels disk-shaped. The -x gives you the extended per-device columns; -z hides idle devices; the 1 polls once per second. Here is real output from a Postgres replica during a checkpoint storm:

Device  r/s    w/s    rkB/s     wkB/s    rrqm/s wrqm/s %rrqm %wrqm
nvme0n1 1842   3104   29472     412160   0      198    0.00  6.00
        r_await w_await aqu-sz rareq-sz wareq-sz svctm  %util
        0.42    14.80   46.21  16.0     132.8    0.20   98.40

Every column is a clue, and every column has a way of being misleading. Why r_await and w_await tell different stories: r_await is 0.42 ms — reads are landing on cached blocks or fast NVMe paths and returning quickly. w_await is 14.80 ms — writes are queued behind a checkpoint flush, and the device is committing dirty pages in batches. Aggregating them into one "disk is slow" verdict would miss that the read path is healthy and the write path is the actual bottleneck.

[Figure: Anatomy of an iostat -x line, with each column annotated. The columns are grouped by what they measure: request rate (r/s, w/s), throughput (rkB/s, wkB/s), merging (rrqm/s, wrqm/s, %rrqm), request size (rareq-sz, wareq-sz, in KB per submitted I/O), latency (r_await, w_await, in ms — averages only, which hide the tail), queue (aqu-sz, average in-flight requests), and utilisation (%util). The two columns to mistrust: %util is derived from "fraction of time at least one request is in flight" — on a single-queue HDD 100% meant saturated, but a 32-queue NVMe hits 100% at a small fraction of its real capacity; await is the mean — if 99% of I/O is 200 µs and 1% is 80 ms, await ≈ 1 ms and the tail vanishes.]
The columns are grouped by what they measure. The two highlighted columns — `await` and `%util` — are the ones that mislead the most. The rest are mostly trustworthy.

The aqu-sz column (average queue size) is a stealth diagnostic: by Little's Law, aqu-sz = throughput × await, so a queue of 46 with throughput 4946 IOPS implies await ≈ 9.3 ms — which roughly matches the weighted average of r_await and w_await shown above. When the arithmetic doesn't reconcile, you are looking at a measurement window that contained a regime change (e.g. a checkpoint started halfway through), and you should shorten the polling interval. The Little's Law arithmetic is also the cleanest way to spot when the polling tool itself is lying — every commercial monitoring agent we have seen has a window-aggregation bug somewhere, and the surest test is "do the three columns reconcile against Little's Law?". If they don't, distrust the tool, not the device.
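A minimal sketch of that sanity check, using the sample output above (the numbers are hard-coded for illustration; in practice you would feed in whatever iostat just printed):

# littles_law_check.py — reconcile iostat's aqu-sz, IOPS, and await via Little's Law.
# If the three numbers do not agree, suspect a regime change inside the polling window
# (or a buggy aggregation in whatever tool produced them).
r_per_s, w_per_s = 1842, 3104            # r/s and w/s from the iostat sample above
r_await_ms, w_await_ms = 0.42, 14.80     # per-op average latencies
aqu_sz_reported = 46.21

iops = r_per_s + w_per_s
# Weighted average latency across reads and writes, converted to seconds.
await_s = (r_per_s * r_await_ms + w_per_s * w_await_ms) / iops / 1000

# Little's Law: average in-flight requests = arrival rate x average time in system.
aqu_sz_expected = iops * await_s

print(f"IOPS               : {iops}")
print(f"weighted await     : {await_s * 1000:.2f} ms")
print(f"aqu-sz from Little : {aqu_sz_expected:.1f}")
print(f"aqu-sz reported    : {aqu_sz_reported}")
print("reconciles" if abs(aqu_sz_expected - aqu_sz_reported) / aqu_sz_reported < 0.2
      else "does NOT reconcile — shorten the polling interval")

Run against the sample line above, the expected aqu-sz comes out at roughly 46.7 against the reported 46.21 — close enough to trust the window.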

A subtlety on the difference between svctm and await: svctm is the average service time if the requests were serial — derived from %util / IOPS. await is the actual end-to-end latency including queueing. The ratio await / svctm approximates the average queue depth seen by an arriving request, which is a more honest measure of "how much queueing is happening" than aqu-sz alone. Modern iostat versions warn that svctm is no longer meaningful on multi-queue devices — and they are right, because the formula assumes serial service which NVMe explicitly violates. Treat svctm as legacy; trust await.

The rareq-sz and wareq-sz columns matter more than engineers usually credit. A device's IOPS and throughput are coupled — a drive that does 100k IOPS at 4 KB does not do 100k IOPS at 1 MB; it falls to ~4k IOPS because the bandwidth ceiling kicks in. Why request size is a first-class metric: when the workload shifts from 4 KB random reads to 128 KB sequential reads, the IOPS number can fall by 30× even though the device is doing more work. If you alert on "IOPS dropped" you will get paged for a workload change that is the device serving more bytes, not less. Always interpret IOPS in the context of rareq-sz/wareq-sz.
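A back-of-the-envelope sketch of that coupling. The two ceilings are illustrative numbers chosen to match the 100k-IOPS / ~4k-IOPS example above, not a specific drive's spec sheet:

# iops_vs_request_size.py — a device delivers min(IOPS ceiling, bandwidth ceiling / size).
# The same drive "loses" IOPS as request size grows, even though it moves more bytes.
BANDWIDTH_CEILING_MBPS = 4096      # illustrative sequential-bandwidth limit
IOPS_CEILING = 100_000             # illustrative small-block random-read limit

for size_kb in (4, 16, 64, 128, 512, 1024):
    bandwidth_limited_iops = BANDWIDTH_CEILING_MBPS * 1024 / size_kb
    achievable_iops = min(IOPS_CEILING, bandwidth_limited_iops)
    throughput_mbps = achievable_iops * size_kb / 1024
    print(f"{size_kb:>5} KB requests: {achievable_iops:>9,.0f} IOPS "
          f"-> {throughput_mbps:>6,.0f} MB/s")

At 4 KB the drive is IOPS-limited (100k IOPS, ~390 MB/s); at 1 MB it is bandwidth-limited (~4k IOPS, ~4 GB/s). The IOPS column collapsed by 25× while the device did ten times more useful work.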

The %util column is the most dangerous number on the line. On a single-spindle HDD, %util = 100% literally means the head was busy the entire interval and the device cannot do more work — the diagnostic intent is correct. On modern NVMe with 32 hardware queues, %util is computed from "any request in flight", so %util = 100% means there was always at least one of 32 possible requests outstanding — which can happen at 3% of the device's actual saturation. Treat %util as a liveness signal on NVMe, not a saturation signal. The real saturation indicator is whether aqu-sz is approaching the device's queue depth and await is climbing.

The rrqm/s and wrqm/s columns (request merges per second) are the column most engineers ignore and shouldn't. The kernel I/O scheduler merges adjacent requests into single larger ones — two adjacent 4 KB reads become one 8 KB read — to reduce the per-request overhead at the device. A healthy sequential workload shows large merge counts: wrqm/s = 800 means 800 small writes per second got coalesced into fewer large ones. A workload that should be sequential but shows wrqm/s = 0 is doing something wrong — perhaps the application is using O_DIRECT (which bypasses the merge) or the I/O scheduler is none (which doesn't merge). The merge count is a free diagnostic for "is my application's I/O pattern aligned with what I think I'm doing?".

biolatency — the histogram is the truth

iostat's averages compress the very thing tail-latency debugging needs to see — the distribution. biolatency, from the BCC tools collection, hooks the kernel's block I/O subsystem with eBPF and emits a per-request latency histogram. This is the tool that lets you tell "every read is slow" apart from "most reads are fast and a few are catastrophic", which is the difference between a hardware fault and a contention problem.

# bio_observe.py — invoke biolatency-bpfcc, parse the histogram, classify the workload.
# Why Python is the wrapper: biolatency emits ASCII histograms, not JSON. We parse them
# into structured percentiles, then make a verdict. Run as root (eBPF needs CAP_BPF).
#
# Setup:
#   sudo apt-get install bpfcc-tools linux-headers-$(uname -r)
#   pip install hdrh                    # for the percentile reduction
#   sudo python3 bio_observe.py /dev/nvme0n1 30

import re, subprocess, sys
from hdrh.histogram import HdrHistogram

DEV = sys.argv[1] if len(sys.argv) > 1 else "/dev/nvme0n1"
WINDOW_S = int(sys.argv[2]) if len(sys.argv) > 2 else 30

# -D = per-disk, -m = milliseconds, -F = also break out by I/O flags (read/write/sync).
out = subprocess.run(
    ["biolatency-bpfcc", "-D", "-m", "-F", str(WINDOW_S), "1"],
    capture_output=True, text=True, check=True,
).stdout

# Each histogram bucket prints as: "    256 -> 511        : 12     |***  |"
BUCKET_RE = re.compile(r"^\s*(\d+)\s*->\s*(\d+)\s*:\s*(\d+)")

# Build an HdrHistogram per (device, op_flag) so we can compute real percentiles.
hists, current_key = {}, None
for line in out.splitlines():
    if line.startswith("disk ="):
        current_key = line.strip()
        hists[current_key] = HdrHistogram(1, 600_000_000, 3)  # 1 µs to 10 min, 3 sig figs
        continue
    m = BUCKET_RE.match(line)
    if m and current_key:
        lo_us, hi_us, count = int(m.group(1)) * 1000, int(m.group(2)) * 1000, int(m.group(3))
        # Place each count at the bucket midpoint — slight pessimism, fine for 3-digit precision.
        midpoint = (lo_us + hi_us) // 2
        for _ in range(count):
            hists[current_key].record_value(midpoint)

print(f"\nbiolatency report — device {DEV}, {WINDOW_S}s window\n")
for key, h in hists.items():
    if h.get_total_count() == 0:
        continue
    p50, p99, p999 = (h.get_value_at_percentile(p) / 1000 for p in (50, 99, 99.9))
    pmax = h.get_max_value() / 1000
    n = h.get_total_count()
    # Bimodality test: is p99 more than 50× p50? Then the workload has a tail problem.
    bimodal = p99 > 50 * p50
    verdict = "BIMODAL — investigate contention" if bimodal else "unimodal"
    print(f"{key}  n={n:>6}  p50={p50:>6.2f}ms  p99={p99:>7.2f}ms  "
          f"p99.9={p999:>7.2f}ms  max={pmax:>7.2f}ms  [{verdict}]")

A real run on a Postgres node during the same checkpoint window:

biolatency report — device /dev/nvme0n1, 30s window

disk = nvme0n1, flags = Read     n= 31204  p50=  0.18ms  p99=   1.40ms  p99.9=   3.60ms  max=  12.40ms  [unimodal]
disk = nvme0n1, flags = Write    n= 87420  p50=  0.62ms  p99=  44.00ms  p99.9=  88.00ms  max= 124.00ms  [BIMODAL — investigate contention]
disk = nvme0n1, flags = Sync     n=  2104  p50=  4.20ms  p99=  68.00ms  p99.9= 116.00ms  max= 168.00ms  [BIMODAL — investigate contention]

The walkthrough that matters:

The -F flag splits the histogram by operation type. Reads, writes, and fsync()-marked syncs land in separate buckets. Without -F you would see one merged histogram and miss that reads are healthy while writes and syncs are the problem.

HdrHistogram reconstructs real percentiles. biolatency's text histogram has logarithmic buckets — useful for visualising, useless for "what is my p99.9?". By replaying the bucket counts into an HdrHistogram with 3-significant-figure resolution, you get percentiles you can put on a dashboard without lying. Why HdrHistogram instead of just averaging the buckets: an HdrHistogram preserves the shape of the distribution at log-linear resolution. A single average over wide buckets gives you back roughly the mean — exactly the number await already gave you in iostat. The whole point of moving to histograms is that you get the shape, not just the centre.
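A tiny worked example of what the average throws away — the same 99% / 1% split as the figure earlier in the chapter (a toy calculation, not output from the tools):

# mean_vs_tail.py — 99% of I/Os at 0.2 ms, 1% at 80 ms: the mean says "fine",
# while one request in a hundred is catastrophic.
fast_count, fast_ms = 99_000, 0.2
slow_count, slow_ms = 1_000, 80.0

total = fast_count + slow_count
mean = (fast_count * fast_ms + slow_count * slow_ms) / total

print(f"mean (what await reports): {mean:.2f} ms")      # ~1.0 ms — looks healthy
print(f"latency of the slow 1%   : {slow_ms:.1f} ms")   # the actual user pain
print(f"share of slow requests   : {slow_count / total:.1%}")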

The bimodality check is the core diagnostic. When p99 > 50 × p50, the distribution has a heavy tail — most requests are fast, a small fraction are catastrophic. Bimodality is the signature of contention (queueing, lock waits, GC, throttling) rather than slowness (a uniformly slow device shows a unimodal distribution at the slow latency). The verdict line tells the operator which class of problem to chase.

The Indian-context production reality. In a Razorpay payment-success flow, a single fsync() on the WAL lands on the critical path of every committed transaction. A bimodal Sync distribution with p99.9 = 116 ms means roughly 1 in 1000 payment confirmations spends 116 ms or more waiting on a single disk sync — the entire latency budget of the payment-success SLO, blown by disk tail latency alone. The histogram tells you not only that there is a problem, but what fraction of users are hit and how badly.

A useful refinement: biolatency -Q includes the time each request spent queued inside the kernel before it was issued to the device, so comparing a -Q run against a plain run splits the total block-layer latency into queue_time + service_time — and the two have completely different fixes. High service time means the device is the bottleneck — buy faster storage. High queue time with low service time means the kernel queue is the bottleneck — increase nr_requests, switch I/O schedulers, or reduce the offered load. Most engineers see "high latency" and assume hardware; the queue-vs-service split tells you which half is actually slow.
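When the split points at the queue, the knobs to inspect live under /sys/block/<dev>/queue. A minimal read-only sketch — the device name is an assumption; pass the device iostat pointed at:

# queue_knobs.py — print the block-queue settings that govern kernel-side queueing.
# Read-only; the default device name (nvme0n1) is only an example.
import sys
from pathlib import Path

dev = sys.argv[1] if len(sys.argv) > 1 else "nvme0n1"
queue_dir = Path("/sys/block") / dev / "queue"

for knob in ("scheduler", "nr_requests", "max_sectors_kb", "read_ahead_kb"):
    path = queue_dir / knob
    value = path.read_text().strip() if path.exists() else "(not exposed on this kernel)"
    print(f"{knob:>15}: {value}")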

The other thing the histogram exposes that no average can: modality over time. If you run biolatency for 5 separate 30-second windows across a 5-minute incident, you can watch the bimodal hump emerge, peak, and recede. The shape of that arc tells you whether the contention is steady (a constantly-running batch job), bursty (a periodic flush), or transient (a one-off scan). Each shape implies a different remediation cadence — steady contention is fixed by isolation, bursty by smoothing, transient by tolerance. The iostat 1-second numbers cannot show you this arc because they don't carry distribution information; the per-window histogram is the only view that does.

The Brendan Gregg observation that motivated biolatency's design is worth restating: in a 30-second window with 100k I/Os, the average latency contains less than 5 bits of information about the workload — barely enough to say "fast", "medium", or "slow". The histogram carries a count per bucket — on the order of bucket_count × log2(max bucket count) bits — orders of magnitude more, and that is the information needed to debug tail latency. Moving from averages to histograms is a one-time engineering investment that pays off on every incident afterwards. The teams that have done it stop having "the disk seems slow but I can't pin it down" conversations entirely.

Reading the two tools together — the diagnostic ladder

iostat and biolatency are not redundant; they answer two different questions. iostat answers "is the device active and how active?" with one number per second. biolatency answers "what is the shape of the per-request latency distribution?" Combine them in a fixed sequence and you go from "something is slow" to "this exact pattern is the cause" in under five minutes.

[Figure: The five-step disk-I/O diagnostic ladder — climb in order. 1. iostat -xz 1 — is the device active? which device? read-heavy or write-heavy? 2. Check aqu-sz and await against the device's known limits — aqu-sz approaching the device queue depth plus climbing await means saturation, not just liveness. 3. biolatency-bpfcc -D -m -F 30 1 — the per-request latency histogram, split by op type (Read / Write / Sync). 4. Bimodal, p99 > 50× p50? That is contention, not slow hardware; unimodal slow means device or capacity. 5. Drop to bpftrace / biosnoop for per-process and per-file attribution.]
Each rung answers one question. Skipping a rung — going straight from `iostat` to `bpftrace`, say — wastes 80% of your time on the wrong process because you never pinned down which device or which op type is the actual culprit.

Step 1 — iostat -xz 1 — picks the device. On a multi-disk system (root on nvme0n1, data on nvme1n1, log on nvme2n1), running diagnostics on the wrong device is the most common time-waster. Pick the device with the suspicious aqu-sz or %util first, then drill in.

Step 2 — interpret the queue depth against the device's known capacity. A consumer NVMe SSD has hardware queue depth 32 and software queue depth 1023; a server NVMe (Samsung PM1733, Intel D7-P5520) has hardware queue depth 64–128. If aqu-sz = 12 on a drive whose await is flat and low, that is fine; on a drive whose r_await is climbing past 5 ms, that queue is the cost. The arithmetic from Little's Law links them: await = aqu-sz / throughput, so doubling the queue at constant throughput doubles the latency, and doubling the throughput at constant queue depth halves it.

Step 3 — biolatency -D -m -F 30 1 — gives you the distribution. The -D splits by device (matching what step 1 picked). The -m gives millisecond buckets (the default is microsecond buckets — better suited to very fast devices, but it produces many more lines). The -F splits by op type, which is the single most useful flag and the one engineers always forget on the first invocation. Why splitting by op type matters: writes go through the kernel's writeback path and accumulate behind dirty pages; reads go through the read-ahead path and hit the page cache. Sync I/O (fsync) waits for the device's volatile write cache to flush. These three paths have completely different latency distributions, and conflating them — by running biolatency without -F — averages a healthy fast distribution with a sick slow one and gives you a meaningless bimodal histogram that tells you nothing.

Step 4 — apply the bimodality test. A p99 / p50 ratio above ~50× is the signature of contention. The bimodal histogram looks like two humps on a log-x axis: one at the unloaded latency (the device when it has the queue to itself) and one at the queue-saturated latency (the device behind 30 other requests); which hump is taller depends on how many requests lose the queue race. Unimodal-but-slow looks like one hump at a high latency — the device itself is the bottleneck. The two failure modes need different fixes.

Step 5 — for bimodal latency, drop to biosnoop -Q (a per-request trace with queueing time shown separately) or a bpftrace one-liner such as bpftrace -e 'tracepoint:block:block_rq_issue { @[comm] = hist(args->bytes); }' — a block-I/O size histogram keyed by the issuing process — to attribute the slow requests to processes, files, or block ranges (for buffered writes the issuer is often a kernel writeback thread, so the attribution is most reliable for reads and direct I/O). This is where you discover that a runaway analytics query is doing 700 MB of random reads on the OLTP NVMe, or that the antivirus daemon is fsync()-ing a 2 GB log file every 30 seconds.

The discipline of climbing the ladder in order matters more than the specific tools. The temptation under incident pressure is to skip rungs — to jump straight to bpftrace because it feels powerful, or to skip biolatency and try iotop because it has a friendlier UI. Skipping rungs trades thoroughness for speed and almost always loses; the engineers who debug disk problems fastest are the ones who refuse to skip. Each rung answers a question the next rung depends on, and answering them out of order produces the "I checked everything but I still don't know what's wrong" feeling that wastes the most time during a live incident. When you find yourself confused, drop back to rung 1 and start over — the four-minute walk through the ladder is much cheaper than another hour of guessing.

A war story from this exact ladder: at Hotstar during the 2024 IPL playoffs, the live-streaming origin's manifest-write latency p99 jumped to 1.8 s during the first innings break. iostat showed nvme0n1 at %util = 100% — alarming but expected on hot media. biolatency -F showed Reads at p99 = 0.4 ms (healthy), Writes at p99 = 18 ms (loaded but fine), and Syncs at p99 = 1700 ms (catastrophic and bimodal). biosnoop -Q attributed the slow syncs to a logging daemon that had been recently upgraded and was now O_DIRECT-ing every line. The fix took 90 seconds — one config flag rolled back. The 90 seconds was the 5th rung of the ladder; the prior four rungs took 4 minutes.

A second story — same ladder, different ending. At a Bengaluru-based fintech, the on-call SRE Kiran got paged at 02:14 IST for "Postgres p99 over 600 ms". Step 1, iostat, showed nvme1n1 (the data volume) idle and nvme0n1 (the WAL volume) at aqu-sz = 8, await = 12 ms, %util = 88%. Step 2 — comparing aqu-sz = 8 to the device's queue depth of 64 — said the device was nowhere near saturation. Step 3, biolatency -F, showed Sync writes with a clean unimodal distribution at p99 = 11 ms. No bimodality, no contention. Step 4 prompted the question: if the disk is fine, why is Postgres slow? Kiran checked pg_stat_activity and found 800 connections all waiting on the same row lock — the I/O wait reported by Postgres was queueing inside the database, not at the block layer. The lesson: iostat and biolatency are necessary but not sufficient; sometimes the disk-shaped problem is not a disk problem at all, and the diagnostic ladder's job is to exonerate the disk as fast as it convicts it.

When iostat lies — the cgroup, virtio, and overlayfs traps

iostat reads kernel statistics from /proc/diskstats, which counts events at the kernel block layer. That is one specific layer, and three common production environments add layers above it that iostat cannot see.

In a containerised environment with cgroup I/O throttling (Kubernetes with io.max limits), the throttling happens above the block layer — the kernel artificially delays I/O requests before they reach the block layer at all. iostat shows a perfectly idle device while the application's read() and write() calls are blocking for 100 ms each, because the throttle is invisible to the block layer. The diagnostic clue is application-reported I/O wait that does not show up in iostat; the fix is to read the cgroup's own accounting — /sys/fs/cgroup/<cgroup>/io.max for the configured limits, io.stat for the actual per-device byte and I/O counts, and io.pressure for how long tasks in the cgroup stalled waiting for I/O.
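A minimal read-only sketch of that check, assuming cgroup v2 with PSI enabled and a cgroup path passed on the command line (both assumptions — adjust for your node):

# cgroup_io_check.py — show a cgroup's configured I/O limits next to its actual
# I/O counters and its I/O stall time. Assumes cgroup v2; the default path is
# only an example.
import sys
from pathlib import Path

cgroup = Path(sys.argv[1] if len(sys.argv) > 1 else "/sys/fs/cgroup/system.slice")

def read(name):
    path = cgroup / name
    if not path.exists():
        return "(file missing — not cgroup v2, or the io controller is not enabled)"
    return path.read_text().strip() or "(empty — no limits set)"

print("io.max      (configured limits):")
print(read("io.max"))
print("io.stat     (actual bytes / IOs per device):")
print(read("io.stat"))
print("io.pressure (stall time — throttling shows up here even when iostat is quiet):")
print(read("io.pressure"))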

In a virtualised environment (KVM, Xen, AWS EC2 with EBS), the guest's iostat shows the virtual device — a virtio-blk queue — not the underlying physical disk. Virtio adds a hop through the hypervisor, and that hop has its own queue. The guest can see aqu-sz = 4 and await = 12 ms while the host sees the same logical I/O at aqu-sz = 32 and await = 4 ms, because the bottleneck is the virtio queue between them. Reconciling guest and host iostat is the only honest way to debug virtual-device performance.

In an overlayfs or other stacked-filesystem environment (Docker's default overlay2 storage driver), the file-level operations the application sees do not map 1:1 to the block-level operations iostat reports. A single write() to a file that still lives in a lower overlayfs layer triggers a copy-up — the whole file is copied into the upper layer before the write can proceed — multiplying the apparent I/O. Why this matters in container deployments: a developer's "the application is reading 10 MB" can become 200 MB at the block layer because of overlayfs copy-ups, and iostat will faithfully report the 200 MB without any hint about why. The fix is to use a bind mount or volume mount for I/O-heavy paths, bypassing the overlay; the diagnostic is to compare application-level bytes (from strace or application logs) against block-level bytes (from iostat or /proc/diskstats) and look for the multiplier.

The failure modes — what "the disk looks slow" actually means

Engineers reach for iostat when something feels disk-shaped, but "disk-shaped" is four different problems. Each one shows a different fingerprint across the iostat and biolatency outputs, and each demands a different fix. The fingerprint table below is the lookup the diagnostic ladder produces.

Failure-mode fingerprints — how each of the four failure modes appears across the iostat columns (aqu-sz, await, %util) and the biolatency histogram shape:

Mode                  aqu-sz        await                        %util     biolatency shape              Fix
1. Saturated device   ≈ device QD   high & flat                  100%      unimodal slow                 scale up
2. Queue contention   spiking       spiking                      ~100%     bimodal (p99 > 50× p50)       isolate writers
3. fsync storm        low           low for r/w, high for sync   moderate  Sync hump only                batch fsyncs
4. Noisy neighbour    erratic       erratic                      jumpy     heavy tail, no clean modes    file the ticket

QD = device queue depth (32 on consumer NVMe, 64–128 on server NVMe). Mode 4 mostly bites on cloud volumes (gp3, Premium SSD) when the underlying physical device is shared with another tenant.
Each row is a different bug; each column is a different output panel. Pattern-match the row that fits the symptoms before you start running fixes.

Mode 1 — saturated device. The workload is asking for more than the device can give. iostat shows aqu-sz pinned at the device's queue depth, await flat at the saturated value, %util at 100%. biolatency shows a unimodal histogram at the saturated latency (e.g. p50 = p99 = 8 ms). Fix: scale up — bigger device, more devices, or move some of the workload off. There is no clever software fix because the device is doing its best.

The hardest part of mode 1 is admitting it. Engineers will spend two weeks tuning kernel parameters, swapping I/O schedulers, and tweaking application-level batch sizes before accepting that they have hit a hardware ceiling. The biolatency unimodal-at-saturation signature is the diagnostic that ends the speculation: if the histogram is one tight peak at the device's known saturated latency, the conversation is no longer "what should we tune?" but "how do we cost the upgrade?". Skipping the bargaining stage saves weeks.

Mode 2 — queue contention. Two or more workloads share the device, one of them is bursty. iostat's aqu-sz and await spike together, %util is near 100% during bursts. biolatency is bimodal: a fast hump (the requests that won the queue race) and a slow hump (the requests that lost it). Why bimodality is the contention signature: when N workloads share one device, the request-arrival timing is the difference between landing at queue position 0 (fast) and queue position N-1 (slow). The two are separated by an order of magnitude or more, so the histogram has two distinct peaks. A single workload — even at saturation — does not produce bimodality, because every request sees roughly the same competition. Fix: isolate the contending workloads onto different devices, throttle the burst with cgroup I/O limits, or change the I/O scheduler to one that arbitrates more fairly across writers (bfq for some workloads, mq-deadline for others).

Mode 3 — fsync storm. A logging daemon, a database commit, or a misconfigured journal flushes the device's volatile write cache too often. iostat looks mostly healthy — r_await and w_await are normal, %util is moderate. biolatency -F is the only tool that catches it: the Sync row's p99 is 100× the Read row's p99. Fix: batch the fsync() calls (group commit in databases, O_DSYNC instead of fsync per write), or move the fsync-heavy workload to a device with a non-volatile write cache (NVMe with PLP — Power-Loss Protection — drives reduce fsync latency by 20–50× because they can ack the write from DRAM). PhonePe's database team discovered this when they migrated from PLP-equipped Samsung PM1733 drives to a cheaper consumer NVMe SKU and watched their commit-latency p99 go from 800 µs to 38 ms — the underlying NAND was identical, but the missing supercap meant every fsync had to wait for an actual flash program cycle. The cost difference between PLP and non-PLP NVMe is small; the latency difference is two orders of magnitude. Always check the spec sheet for "PLP" or "supercap" when buying NVMe for a fsync-bound workload, and validate post-install with biolatency -F showing Sync p99 in the sub-millisecond range.

Mode 4 — noisy neighbour. This one is cloud-specific. Your workload is moderate, the device should be fast, but iostat columns oscillate randomly — await is 2 ms one second and 50 ms the next, %util jumps between 30 and 100. biolatency shows a heavy tail without clean modes. The underlying physical device is shared with another tenant who is hammering it. There is no software fix; the only options are: file a support ticket, switch to a dedicated-tenancy SKU (io2 Block Express on AWS, Ultra Disk on Azure), or accept the variance and engineer around it with redundancy.

The fingerprint pattern-matching is also why a junior on-call sometimes calls a Mode-2 contention problem a "saturated device" and orders a hardware upgrade that does not help. The %util = 100% reading is identical between modes 1 and 2, but the biolatency histogram tells the two apart in seconds. A team that runs biolatency as a default second step — not an escalation — catches the misdiagnosis before the procurement order goes out. At Zerodha, a 2024 incident on the order-matching cluster looked like a saturated NVMe; the team almost ordered four io2 Block Express volumes at ₹85,000 a month each before someone ran biolatency -F and saw the bimodality, which led to a 15-minute fix isolating an analytics replica from the OLTP path. The histogram is the cheap insurance against an expensive wrong answer.

Common confusions

Going deeper

Per-disk vs per-partition vs per-LV — picking the right granularity

iostat shows one row per device by default, but your data may live on a partition, an LVM logical volume, or an md-raid array. iostat -xN adds device-mapper names, which is what you want when the actual data path is data_vg/postgres_lv mapped over four NVMe drives. Without -N you see four NVMe rows and have to mentally aggregate.

For Postgres tablespaces or MySQL innodb_data_file_path spanning multiple devices, the right view is the LV row — that is the throughput your database sees. The underlying NVMe rows tell you whether the LV's bandwidth is limited by one slow drive or distributed evenly. A common surprise: a healthy-looking LV throughput hides a single drive doing 90% of the work because the data was loaded into the LV before stripe alignment was fixed.

The other granularity trap is per-partition vs per-disk on the same physical device. iostat -p ALL adds partition rows, which matters when one partition (say nvme0n1p1, the WAL partition) is the actual hot spot but the device-level row averages it with three quiet partitions. The PostgreSQL convention of putting the data directory and the WAL on separate partitions exists precisely so that this attribution is possible from iostat alone, without dropping to eBPF.

biolatency's eBPF cost, and the family — biosnoop, biotop, blktrace

biolatency attaches kprobes at blk_account_io_start and blk_account_io_done (or the equivalent tracepoints on newer kernels). Each event costs roughly 100–200 ns on modern hardware — fine for typical workloads, painful for I/O-heavy ones. Why kprobe overhead matters here specifically: the block layer is one of the hottest kernel paths on a database server. At 200k IOPS, two kprobes per request (start + done) is 400,000 events per second; at 200 ns each that is roughly 80 ms of CPU per second — 8% of one core, or about 2% of each of the 4 cores servicing I/O completions on a typical 16-core database host — spent on instrumentation alone. Acceptable for a 30-second debug window; unacceptable as a permanent metric agent. The lower-overhead alternative is the kernel's own /sys/block/<dev>/stat file (read every second), which gives you the same data iostat uses but doesn't add per-request hooks. For permanent dashboards, parse /sys/block/<dev>/stat; reach for biolatency only when something looks wrong.
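A minimal sketch of that alternative — two reads of /sys/block/<dev>/stat, one second apart, turned into the same rates iostat reports (the device name and interval are assumptions):

# blockstat_sample.py — derive IOPS, throughput, await, and %util from two reads of
# /sys/block/<dev>/stat, the same counters iostat itself uses. No per-request hooks.
import sys, time

dev = sys.argv[1] if len(sys.argv) > 1 else "nvme0n1"
INTERVAL_S = 1.0

def read_stat(device):
    # Fields (Documentation/block/stat.rst): reads completed, reads merged,
    # sectors read, ms reading, writes completed, writes merged, sectors written,
    # ms writing, in-flight, ms doing I/O, weighted ms doing I/O, ...
    with open(f"/sys/block/{device}/stat") as f:
        return [int(x) for x in f.read().split()]

a = read_stat(dev)
time.sleep(INTERVAL_S)
b = read_stat(dev)
d = [y - x for x, y in zip(a, b)]

r_ios, w_ios = d[0], d[4]
r_kb, w_kb = d[2] * 512 / 1024, d[6] * 512 / 1024      # sectors are 512 bytes
r_await = d[3] / r_ios if r_ios else 0.0                # ms per completed read
w_await = d[7] / w_ios if w_ios else 0.0                # ms per completed write
util_pct = 100.0 * d[9] / (INTERVAL_S * 1000)           # ms busy / ms elapsed

print(f"{dev}: r/s={r_ios / INTERVAL_S:.0f} w/s={w_ios / INTERVAL_S:.0f} "
      f"rkB/s={r_kb / INTERVAL_S:.0f} wkB/s={w_kb / INTERVAL_S:.0f} "
      f"r_await={r_await:.2f}ms w_await={w_await:.2f}ms %util={util_pct:.1f}")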

biolatency aggregates; its siblings attribute. biosnoop -Q traces every block I/O with PID, comm, latency, and queueing time separated from service time — the per-request granularity you need to spot a single 800 ms outlier hidden in a 200 µs distribution. biotop is top for I/O: top processes by I/O bytes and IOPS, refreshed every second. blktrace (older, kernel-builtin) gives you packet-capture-level visibility into every block event: queue, merge, insert, issue, complete — the right tool when you suspect the kernel I/O scheduler is making the wrong decisions about merging or reordering. The escalation hierarchy: biolatency to characterise the distribution, biotop to find the noisy process, biosnoop to see the per-request pattern of that process, blktrace if you need to understand a scheduler subtlety. On most production incidents you stop at biosnoop; blktrace is reserved for the rare "I think the scheduler is wrong" investigation.

The relationship to fio — measuring what the device can do

iostat and biolatency measure what your workload is actually doing. fio measures what the device can do — synthetic I/O patterns that establish ceilings. The diagnostic flow when you suspect hardware is: characterise the workload with iostat + biolatency, then run fio with the same I/O pattern (same block size, queue depth, read/write mix) and see whether the device delivers what its spec promises. A healthy NVMe should give you 500k+ random 4 KB read IOPS at queue depth 32; if fio measures 90k, you have a hardware or firmware problem the workload couldn't have caused.

The Indian-context note: if you are on EBS gp3 (the AWS general-purpose SSD widely used by Indian SaaS companies because of price), a volume ships with a 3,000 IOPS / 125 MiB/s baseline, and you can provision up to 16,000 IOPS and 1,000 MiB/s per volume at extra cost — whatever you provisioned is the hard ceiling. A fio run that flatlines at exactly the provisioned IOPS on gp3 is not telling you the device is broken; it is telling you the cloud-imposed cap is the ceiling. Always compare fio numbers against the purchased IOPS, not the raw NVMe spec, on cloud storage. The same logic applies to GCP's pd-balanced and Azure's Premium SSD v2 — the fio ceiling matches the SKU's quota, not the physical device. Engineers migrating from on-prem to cloud often discover this the hard way when their carefully-tuned database, which used to do 200k IOPS on bare-metal NVMe, settles into 16k on the cloud-equivalent and the application's commit latency triples. The root cause is not the cloud being slow; it is the SKU being underpaid-for, and iostat will show %util = 100% and await = 25 ms on what looks like a perfectly healthy device because the cap is upstream of the device.

Per-process attribution — pidstat, /proc/PID/io, and the limits of fio

Before reaching for eBPF, two file-system-based tools cover the per-process attribution case for free. /proc/<pid>/io exposes read_bytes, write_bytes, and cancelled_write_bytes per process — the latter being the bytes a process wrote to the page cache that were later evicted without being flushed, a useful signal for "is this process generating real disk I/O or just dirtying memory?". pidstat -d 1 polls /proc and prints per-process I/O rates every second.

The combined workflow when iostat shows the device but you do not yet know the culprit process: run pidstat -d 1 5 for a 5-second window, look for the process at the top of the kB_wr/s or kB_rd/s columns, and you have the suspect without paying the eBPF cost. Why this is often enough: most disk-I/O problems in production are caused by a small number of guilty processes — a misbehaving cron, a backup, an analytics query, a leaking log writer. Per-process accounting via /proc catches all of these at zero cost. eBPF is the correct tool when the problem is fine-grained (per-file, per-syscall, per-block-range) or when the throughput numbers don't add up across processes — both are the minority of cases.

The gotcha with /proc/<pid>/io: read_bytes and write_bytes are block-layer bytes, not application-layer bytes, so a process that reads 1 GB from a file already cached in the page cache shows read_bytes = 0. This is the right number for "what is this process costing the disk" but the wrong number for "what is this process actually reading". For application-level I/O accounting, diff the rchar and wchar counters in the same file, or run strace -c -e trace=read,write,pread64,pwrite64 -p <pid> for a short window. The two views together — block bytes and application bytes — let you compute the page-cache hit rate per process, which is sometimes the actual answer to "why is this process slow?".
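A minimal sketch of that combined view, assuming a PID on the command line and a five-second window (and permission to read the target's /proc/<pid>/io, which normally means the same user or root):

# proc_io_sample.py — sample /proc/<pid>/io over a window and report both the
# application-level byte rate (rchar/wchar) and the block-level byte rate
# (read_bytes/write_bytes); the gap between them is the page cache at work.
# Usage: sudo python3 proc_io_sample.py <pid>
import sys, time

pid = sys.argv[1]
WINDOW_S = 5.0

def snapshot(p):
    fields = {}
    with open(f"/proc/{p}/io") as f:
        for line in f:
            key, value = line.split(":")
            fields[key.strip()] = int(value)
    return fields

a = snapshot(pid)
time.sleep(WINDOW_S)
b = snapshot(pid)

def rate(key):
    return (b[key] - a[key]) / WINDOW_S

app_read, app_write = rate("rchar"), rate("wchar")
blk_read, blk_write = rate("read_bytes"), rate("write_bytes")

print(f"pid {pid} over {WINDOW_S:.0f}s:")
print(f"  app-level   read {app_read / 1e6:8.2f} MB/s   write {app_write / 1e6:8.2f} MB/s")
print(f"  block-level read {blk_read / 1e6:8.2f} MB/s   write {blk_write / 1e6:8.2f} MB/s")
if app_read > 0:
    hit_rate = 100.0 * (1 - min(blk_read, app_read) / app_read)
    print(f"  read page-cache hit rate ≈ {hit_rate:.1f}%")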

A 2 AM cheat sheet for the on-call SRE: open three terminals on the suspect host, run iostat -xz 1 in the first, pidstat -d 1 in the second, and sudo biolatency-bpfcc -D -m -F 30 1 in the third. While biolatency collects, eyeball iostat for the device with the climbing aqu-sz and pidstat for the process at the top of kB_wr/s. By the time biolatency prints its first histogram, you usually already know the answer and the histogram is the confirmation. The four patterns to memorise: aqu-sz near the device queue depth means saturation; aqu-sz not matching throughput × await means a regime change inside the polling window; bimodal biolatency with p99 > 50× p50 means contention; a Sync row much worse than Read/Write means an fsync storm. These four cover roughly 80% of disk-I/O incidents without looking anything up.

Sampling intervals and the regime-change trap

The polling interval you give iostat shapes what you can see, and the wrong choice will hide the very anomaly you're trying to catch. A 1-second window can show a clean snapshot of a steady-state workload but completely miss a 200 ms latency spike that happened inside the second; the 800 ms of normal I/O around the spike averages it down to a barely-elevated await. A 10-second window smooths out the per-second jitter but averages across regime changes — a checkpoint that started at second 4 and ended at second 7 of a 10-second window appears as a uniform "moderately busy" result rather than the sharp transient it was. The right interval is workload-dependent: 1-second polling for incident debugging when you want to see transients, 10-second polling for capacity-planning measurements where you want the trend, 60-second polling for permanent dashboards where storage cost matters.

The deeper problem is what the time-series literature calls regime change: when the underlying system shifts behaviour partway through a measurement window, the average across that window is a meaningless number — neither the old behaviour nor the new one. A Postgres checkpoint is the canonical regime change. Before the checkpoint, the device is doing maybe 200 IOPS of WAL writes; during the checkpoint, it is doing 30,000 IOPS of dirty-page flushing; after, it returns to 200 IOPS. An iostat -xz 30 window that contains the checkpoint shows ~10,000 IOPS — neither steady state nor peak, and misleading for capacity planning. The fix is either to shorten the interval until each window is steady-state, or to log the per-second time series and compute percentiles offline (the same logic that drives the move from await to biolatency).
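A toy illustration of that flattening, using the checkpoint numbers above (the 10-second checkpoint duration is an assumption, chosen to match the ~10,000 IOPS figure):

# regime_change.py — show how a 30 s averaging window flattens a checkpoint burst
# into a number that describes neither the steady state nor the peak.
steady_iops, burst_iops = 200, 30_000
burst_seconds, window_seconds = 10, 30

# Per-second time series for one 30 s window containing the checkpoint.
series = [steady_iops] * (window_seconds - burst_seconds) + [burst_iops] * burst_seconds

window_average = sum(series) / window_seconds
print(f"steady state : {steady_iops:>6} IOPS")
print(f"checkpoint   : {burst_iops:>6} IOPS for {burst_seconds} s")
print(f"30 s average : {window_average:>6.0f} IOPS  <- what iostat -xz 30 reports")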

Reproduce this on your laptop

# Reproduce this on your laptop (Linux only — eBPF is a Linux feature)
sudo apt-get install sysstat bpfcc-tools linux-headers-$(uname -r) fio
python3 -m venv .venv && source .venv/bin/activate
pip install hdrh
# Step 1 — generate a baseline workload (2 GB random-read file, queue depth 16).
#          Point --filename at a real filesystem — --direct=1 is not supported on tmpfs:
fio --name=baseline --rw=randread --bs=4k --iodepth=16 --size=2G --time_based --runtime=30 \
    --filename=/tmp/iotest --direct=1 &
# Step 2 — watch iostat in another terminal:
iostat -xz 1
# Step 3 — capture the latency histogram:
sudo python3 bio_observe.py /dev/nvme0n1 30

Where this leads next

Disk-I/O observability is the measurement half of the storage performance story; without it, every "the disk seems slow" conversation devolves into guesswork and the wrong fix. The next chapters in Part 10 cover the mechanisms — how the kernel's I/O paths are actually built, why blocking I/O wastes CPU on syscall-bound workloads, and how io_uring rebuilds the syscall layer to amortise that cost.

Once you can read iostat and biolatency, the next question is what is on the other side of those numbers — the page cache, the I/O scheduler, the device's own internal queues. The diagnostic ladder in this chapter ends at "the disk is slow because process X is doing pattern Y"; the chapters that follow tell you what to do about it.

A useful exercise once you have read the next few chapters is to come back to this one and reinterpret the iostat columns through the lens of what you have learned. aqu-sz is the materialisation of the kernel's request queue plus the device's hardware queues — once you have seen those queues in the io_uring and block-multiqueue architecture, the column means more. wrqm/s is the I/O scheduler doing its merging work — once you have seen the scheduler choices (mq-deadline, bfq, none), the column tells you which scheduler you are running. The two-tool combination of iostat and biolatency is the tip of an instrument that includes the entire kernel block layer; the deeper you understand that layer, the more diagnostic value the same two tools deliver. The tools do not change; your reading of them does.

References