Disk performance: IOPS, throughput, latency

Aditi runs the nightly reconciliation batch for a Bengaluru lending startup. The job reads 220 GB of payment events from an EBS gp3 volume, joins them against a customer table, and writes 14 GB of corrected ledger entries back. On the current volume the job takes 38 minutes. The CFO wants it under 25 minutes so that the morning regulator file is generated before the 09:00 IST cutoff. Aditi reads the AWS docs, sees that gp3 lets her dial throughput up to 1,000 MB/s, and pays for the upgrade. The next night the job takes 41 minutes. Throughput utilisation on the volume sits at 18% the entire run. The CFO is annoyed; the dashboard says the disk has plenty of headroom; nothing in the application code changed.

The dashboard is not lying — it is just answering the wrong question. Disk performance is not one number. It is at least three numbers (IOPS, throughput, latency), and which one bottlenecks you depends on the shape of your workload, not the rated capacity of the device. Aditi's batch issues lots of small random reads against an OLTP-shaped index — it is IOPS-bound, and her gp3 still ships at the same default 3,000 IOPS regardless of the throughput tier she pays for. Throughput is irrelevant because she never had a throughput problem; she has a "too many small reads" problem.

Disk performance is three numbers held in tension — IOPS (operations per second), throughput (bytes per second), and latency (time per operation) — and they trade off against each other in ways your dashboard rarely shows. Which number bottlenecks you depends on your I/O size and access pattern: small random reads run out of IOPS, large sequential reads run out of throughput, and queue depth determines whether you see the device's raw latency or its saturated tail. Diagnosing disk problems means measuring the right number with the right tool (fio with the right --bs, --iodepth, --rw knobs) before paying for the wrong upgrade.

Three numbers, not one — and the geometry that links them

Every disk device — spinning HDD, SATA SSD, NVMe SSD, EBS gp3, GCP pd-balanced — is rated by three numbers that the manufacturer and your cloud bill both quote. IOPS is the maximum number of distinct I/O operations the device can complete per second. Throughput is the maximum number of bytes the device can move per second. Latency is the time from when an I/O request is submitted to when its completion is acknowledged. The three are not independent: a single I/O moves block_size bytes and takes latency seconds, so for a stream of sequential, single-threaded I/Os you get throughput = IOPS × block_size and IOPS = 1 / latency.

This is the hidden geometry: a device rated at 16,000 IOPS and 250 MB/s is implicitly telling you it expects 16-KB block sizes (16,000 × 16 KB ≈ 250 MB/s). Run a workload with 4-KB blocks and you cap at 250 MB/s ÷ 4 KB = 62,500 IOPS — but the device hits its 16,000 IOPS ceiling first, so you get 16,000 × 4 KB = 64 MB/s and the throughput meter reads 25%. Run a workload with 1-MB blocks and you cap at 16,000 × 1 MB = 16 GB/s in IOPS terms — but the device hits its 250 MB/s throughput ceiling first, so you get 250 IOPS and the IOPS meter reads 1.5%. The same device looks under-utilised on whichever metric you are not bottlenecked on.
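
The two-ceiling rule is compact enough to compute. A minimal sketch using the rated numbers above, in the same decimal units as the prose:

# two_ceilings.py — which ceiling binds at each block size
IOPS_LIMIT = 16_000        # rated operations per second
BW_LIMIT = 250_000_000     # rated bytes per second (250 MB/s)

def achieved(bs):
    iops = min(IOPS_LIMIT, BW_LIMIT / bs)   # the lower ceiling wins
    return iops, iops * bs / 1_000_000      # (IOPS, MB/s)

for bs in (4_000, 16_000, 64_000, 256_000, 1_000_000):
    iops, mb = achieved(bs)
    bound = "IOPS-bound" if iops == IOPS_LIMIT else "throughput-bound"
    print(f"{bs // 1000:>5} KB: {iops:>7.0f} IOPS, {mb:>5.0f} MB/s  ({bound})")

Running it reproduces the prose: 4 KB blocks land at 64 MB/s with the IOPS meter pinned, 1 MB blocks land at 250 IOPS with the throughput meter pinned, and 16 KB sits right at the knee.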

[Figure: A device's two ceilings, and where workloads land. Achieved MB/s (y-axis) vs block size, 4 KB to 1 MB, log scale (x-axis). An IOPS-limited diagonal (16k IOPS × bs) rises to meet the throughput-limited horizontal (250 MB/s) at the 16 KB design-point knee. Workload markers: OLTP small-random at 4 KB → 64 MB/s, IOPS-bound; log-structured commit at 16 KB → 250 MB/s, at the knee; batch sequential scan at 256 KB → 250 MB/s, throughput-bound.]
The same gp3 volume seen as a function of block size. Below 16 KB the device is IOPS-limited and throughput is far below the rated 250 MB/s. Above 16 KB the device is throughput-limited and IOPS is far below the rated 16,000. Three workloads sit at three different places on the curve; only the one at the design-point knee uses the device fully. Illustrative — geometry holds for any rate-and-bandwidth-bounded device.

Latency is the third axis and it interacts with both. Each I/O takes some minimum time t_min set by physics — for an NVMe SSD that is the time for the controller to look up the logical-to-physical mapping, issue the NAND read, and DMA the result back, typically 50–110 µs. Why latency floors and IOPS ceilings are the same number divided differently: with queue depth 1 (one I/O outstanding at a time), throughput collapses to block_size / latency because nothing else is in flight. To hit a device's rated IOPS you need queue depth high enough that the device controller is always busy — typically iodepth ≈ rated_IOPS × t_min. An NVMe SSD rated at 600,000 IOPS with t_min = 100 µs needs iodepth ≈ 60 to saturate. Run it at iodepth = 1 and you get 10,000 IOPS, not 600,000 — you are paying for hardware you cannot reach.

The three numbers are also coupled by Little's Law applied to the I/O queue: concurrent_IOs = IOPS × latency. If a device sustains 16,000 IOPS at 1 ms average latency, there are 16 I/Os in flight on average. To hit the same IOPS at lower latency you need fewer in-flight I/Os; to hit it at higher latency you need more. This is why the same device looks "fast" at light load and "slow" at high load — the latency grew because the queue grew, not because the device changed.
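
All three relationships are one line of arithmetic each. A sketch with the numbers from the last two paragraphs:

# littles_law.py — the three numbers are two-of-three: fix any two, the third follows
def depth_to_saturate(rated_iops, t_min_s):
    # Little's law: concurrency = rate × latency
    return rated_iops * t_min_s

def iops_at_depth(depth, t_min_s):
    # below saturation, each queue slot completes one I/O every t_min seconds
    return depth / t_min_s

print(depth_to_saturate(600_000, 100e-6))   # 60.0    — iodepth needed to reach the rating
print(iops_at_depth(1, 100e-6))             # 10000.0 — the same device at iodepth=1
print(16_000 * 1e-3)                        # 16.0    — I/Os in flight at 16k IOPS, 1 ms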

Measuring on your own laptop — fio driven from Python

The right way to measure a disk is fio (Flexible I/O Tester), the canonical tool for this category since the early 2000s. The wrong way is to run dd if=/dev/zero of=test bs=1M count=1024 and divide bytes by seconds — unless you pass oflag=direct or conv=fdatasync, dd writes through the page cache, the kernel returns instantly while writeback continues in the background, and the number you measure is the speed of memcpy, not the disk. Even a cache-honest dd measures only one shape: single-threaded 1 MB sequential writes. The Python harness below drives fio four ways (small random read, small random write, large sequential read, mixed OLTP) and parses the JSON output to produce a four-row IOPS/throughput/latency table.

# disk_three_numbers.py — measure IOPS, throughput, and latency the right way
# Run: python3 disk_three_numbers.py /tmp/fio_test_file
# Requires: fio (apt install fio); a 2 GB scratch file path on the device under test.
import json, subprocess, sys, os

target = sys.argv[1] if len(sys.argv) > 1 else "/tmp/fio_test_file"
size = "2G"   # 2 GB working set — well above page cache for most laptops

# Four canonical I/O shapes. Each is one fio job that runs for 20 seconds.
SCENARIOS = [
    # name,           bs,    rw,         iodepth, comment
    ("4K random read",   "4k",   "randread",  32, "OLTP point-read shape"),
    ("4K random write",  "4k",   "randwrite", 32, "OLTP point-write shape"),
    ("256K seq read",    "256k", "read",       8, "batch scan shape"),
    ("OLTP mix 70/30",   "8k",   "randrw",    32, "Postgres OLTP shape"),
]

def run_fio(name, bs, rw, iodepth):
    cmd = ["fio", "--name=t", f"--filename={target}", f"--size={size}",
           f"--bs={bs}", f"--rw={rw}", f"--iodepth={iodepth}",
           "--ioengine=libaio", "--direct=1", "--time_based=1",
           "--runtime=20", "--output-format=json"]
    if rw == "randrw":
        cmd.append("--rwmixread=70")
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    j = json.loads(out.stdout)["jobs"][0]
    # fio reports separate stats per direction; combine read + write.
    iops = j["read"]["iops"] + j["write"]["iops"]
    bw_mb = (j["read"]["bw_bytes"] + j["write"]["bw_bytes"]) / (1024 * 1024)
    # clat percentiles are reported in ns; fio omits the percentile dict for a
    # direction with no samples, so guard every lookup.
    def pct(side, key):
        return side.get("clat_ns", {}).get("percentile", {}).get(key, 0)
    p50_us = max(pct(j["read"], "50.000000"), pct(j["write"], "50.000000")) / 1000
    p99_us = max(pct(j["read"], "99.000000"), pct(j["write"], "99.000000")) / 1000
    return iops, bw_mb, p50_us, p99_us

print(f"{'workload':>22}  {'IOPS':>8}  {'MB/s':>7}  {'p50 µs':>8}  {'p99 µs':>8}")
print("-" * 64)
for name, bs, rw, qd in SCENARIOS:
    iops, mb, p50, p99 = run_fio(name, bs, rw, qd)
    print(f"{name:>22}  {iops:8.0f}  {mb:7.1f}  {p50:8.0f}  {p99:8.0f}")

# Cleanup
if os.path.exists(target):
    os.remove(target)
# Sample run on a Samsung 990 Pro NVMe SSD attached to a 13th-gen Core i7
# (Bengaluru workstation, kernel 6.5, ext4, no encryption)
              workload      IOPS     MB/s    p50 µs    p99 µs
----------------------------------------------------------------
        4K random read    412318    1610.6        72       180
       4K random write    298104    1164.5       104       290
         256K seq read     12840    3210.1      2480      3970
       OLTP mix 70/30     186420    1456.4       150       420

Walk through. The 4 KB random read row hits 412,318 IOPS at 72 µs p50 — the device is delivering the IOPS the spec sheet promises, but only because iodepth=32 keeps the NAND channels busy. Drop to iodepth=1 (rerun the harness to verify) and the same device gives ~13,000 IOPS — fewer than 4% of the rating — because the device's parallelism is unused. The 256 KB sequential row hits 3,210 MB/s at 12,840 IOPS — the throughput is now the ceiling and IOPS dropped 32× because each I/O moves 64× more data. Why p99 is 3,970 µs for the sequential read but 180 µs for the small random read: each 256 KB I/O internally fans out to 64 separate 4 KB page reads in the SSD controller, so the per-I/O latency is the time to complete the slowest of those 64 internal operations. With 8 outstanding 256 KB I/Os and 64-way internal fan-out per I/O, the controller is processing 512 internal operations concurrently — beyond the device's NAND parallelism budget, so individual I/Os queue inside the controller. The latency you see is queueing delay, not flash physics.

# Quick reproduction at iodepth=1 to see the latency-collapse story
$ fio --name=t --filename=/tmp/x --size=2G --bs=4k --rw=randread \
      --iodepth=1 --ioengine=libaio --direct=1 --time_based=1 --runtime=10
  read: IOPS=12.8k, BW=50.0MiB/s (52.4MB/s)(500MiB/10001msec)
   clat (usec): min=58, max=412, avg=76.84, stdev=8.92

12,800 IOPS at queue depth 1 vs 412,000 IOPS at queue depth 32. Same device, same access pattern, 32× difference from the queueing parameter alone. This is the single biggest mistake in disk benchmarking — measuring at queue depth 1 because that "feels normal" and reporting that the device is slow. Production workloads have many concurrent threads / async tasks, each contributing one in-flight I/O, so the device sees a deep queue. You must measure at the queue depth your application generates, not the queue depth that's convenient to script.

The latency-vs-utilisation curve — where the knee lives

The number that surprises people is not how fast a disk runs at 50% utilisation; it is how badly it degrades past 80%. Disk latency vs utilisation is the same hockey-stick curve as M/M/1 queueing — at low utilisation latency is roughly the device's service time, and as utilisation ρ approaches 1 latency goes to infinity as 1 / (1 - ρ). The practical effect on an NVMe device whose service time is 100 µs: at ρ = 0.5 you see ~200 µs latency. At ρ = 0.8 you see ~500 µs. At ρ = 0.95 you see ~2 ms. At ρ = 0.99 you see ~10 ms. The cliff is real, it is geometric, and it is the reason the operational rule "size to 80% utilisation" exists.
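
The hockey stick is one formula. A sketch of the M/M/1 mean-latency curve for the 100 µs service time above:

# mm1_latency.py — M/M/1 mean latency vs utilisation, 100 µs service time
SERVICE_US = 100
for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
    latency_us = SERVICE_US / (1 - rho)   # W = S / (1 - ρ)
    print(f"ρ = {rho:4.2f}: mean latency ≈ {latency_us:>7.0f} µs")

The output reproduces the four data points in the paragraph above; the last step, from ρ = 0.95 to ρ = 0.99, quintuples latency for a 4% gain in utilisation.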

# disk_latency_curve.py — sweep load against the device, plot p50 vs p99
# Run: python3 disk_latency_curve.py /tmp/fio_test_file
import json, subprocess, sys

target = sys.argv[1] if len(sys.argv) > 1 else "/tmp/fio_test_file"
# Sweep iodepth from 1 to 128 — each run measures the device at that concurrency
DEPTHS = [1, 2, 4, 8, 16, 32, 64, 128]
print(f"{'iodepth':>8}  {'IOPS':>8}  {'p50 µs':>8}  {'p99 µs':>8}  {'p99.9 µs':>10}")
print("-" * 56)
for qd in DEPTHS:
    cmd = ["fio", "--name=lat", f"--filename={target}", "--size=2G",
           "--bs=4k", "--rw=randread", f"--iodepth={qd}",
           "--ioengine=libaio", "--direct=1", "--time_based=1",
           "--runtime=15", "--output-format=json"]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    j = json.loads(out.stdout)["jobs"][0]["read"]
    iops = j["iops"]
    p50 = j["clat_ns"]["percentile"]["50.000000"] / 1000
    p99 = j["clat_ns"]["percentile"]["99.000000"] / 1000
    p999 = j["clat_ns"]["percentile"]["99.900000"] / 1000
    print(f"{qd:>8}  {iops:8.0f}  {p50:8.0f}  {p99:8.0f}  {p999:10.0f}")
# Sample run on the same Samsung 990 Pro NVMe — 4 KB random read at varying load
 iodepth      IOPS    p50 µs    p99 µs    p99.9 µs
--------------------------------------------------------
       1     12810        70       110         180
       2     24320        78       130         230
       4     47410        82       150         320
       8     91200        85       180         480
      16    178400        88       210         720
      32    412300        72       180         900
      64    438100       145       380        2400
     128    441200       290       780        6800

Walk through. From iodepth 1 to 32 the IOPS scales almost linearly — 12,810 → 412,300 is a 32× rise for a 32× rise in concurrency, exactly what Little's Law predicts when the device is below its capacity ceiling. From iodepth 32 to 128 the IOPS rises only 7% — the device has saturated. Past saturation, the additional I/Os do not get serviced faster; they queue inside the device's submission queue and inside the kernel's dispatch queue, and the latency you observe is mostly queue waiting time. Across the full sweep p99 grows from 110 µs to 780 µs (7×) and p99.9 grows from 180 µs to 6,800 µs (38×), with almost all of that growth coming after the knee. The tail moves faster than the median because tail latency is dominated by queueing, and past saturation extra concurrency buys queue depth instead of IOPS. The median itself is pinned by Little's Law: at saturation the device completes I/Os at a fixed rate, so average latency is qd / IOPS; at iodepth 128 that is 128 / 441,200 ≈ 290 µs, exactly the measured p50. The unlucky 1-in-1,000 I/O additionally sits behind a controller hiccup (an internal GC pause, a NAND program colliding with the read), which is why p99.9 reaches 6,800 µs and the ratio of p99.9 to p50 grows from ~2.5× at iodepth 1 to ~23× at saturation.

The operational consequence: the disk's IOPS rating is the wrong number to provision against if you care about latency. If your service has a p99 SLO of 200 µs against a device whose rated IOPS is 600,000, you cannot run the device at 600,000 IOPS — you can run it at maybe 200,000 IOPS and stay below the latency cliff. The "useful capacity" of a disk under a latency SLO is typically 30–50% of its rated capacity; capacity-planning models that ignore this overstate usable throughput by 2–3×. Aditi's reconciliation batch does not have a strict latency SLO so she can run the disk at saturation; Razorpay's payments path absolutely does, and they target 60% disk utilisation for that reason.

Where the workload shape comes from — five canonical patterns

Devices have one capacity curve; workloads have many shapes, and the shape is what lands you on a particular point of the curve. Five patterns show up in production often enough to memorise.

OLTP point-read (small random read). A Razorpay payment lookup by payment_id traverses a B-tree index — the tree's internal pages plus the leaf page plus the heap row, four to six 8 KB random reads per query. At 100,000 queries per second, that is ~500,000 IOPS of random 8 KB reads against the index files. Block size is fixed by the database page size; queue depth is whatever the connection pool is. The bottleneck number to track is IOPS at p99 latency under the working query depth, not throughput. Throughput will read 5–15% of the device's rating; that is fine.

OLTP point-write (small random write). A Zerodha order insert writes one row to the orders table and one row to the WAL. The WAL write is sequential (all WAL writes append to the current segment), but the heap write is random — and an fsync at commit forces the device to flush both. Block size: the WAL block size (8 KB Postgres default) plus the heap page. The bottleneck is fsync latency — how long the device takes to acknowledge a flush — not raw IOPS or throughput. NVMe SSDs with PLP (power-loss protection) ack fsync from the controller's NV-cache in ~30 µs; consumer SSDs without PLP ack only after the actual NAND program, which takes ~600 µs–2 ms. The 30× difference shows up as a 30× difference in commit throughput.

Batch sequential scan. Aditi's reconciliation batch reads 220 GB sequentially. Block size is whatever the application requests (typically 64 KB or larger for batch tools); queue depth grows with the application's I/O parallelism. The bottleneck is throughput, and a sequential pattern gives the device its best case — the controller can prefetch, the NAND channels stripe, the page cache absorbs read-ahead. You will see throughput close to the device's rating. If you don't, the bottleneck is somewhere upstream (CPU, network, downstream sink) rather than the disk.

Log-structured commit. Kafka writing to a topic, ScyllaDB's commit log, Postgres WAL — all are sequential writes at block sizes between 4 KB and 1 MB. The "log" pattern is the friendliest to disks because it gives the controller maximum sequentiality. The bottleneck is typically the fsync cadence: a log that fsyncs every record sees its commit rate collapse to 1 / fsync_latency, while a log that batches multiple records per fsync (group commit) recovers throughput. Tuning the batch window is the lever.
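
The group-commit arithmetic in one small model; the 1 ms fsync figure is an assumption, taken from consumer-SSD territory on the fsync ladder later in this chapter:

# group_commit.py — commit throughput vs records batched per fsync
FSYNC_US = 1_000          # assumed fsync latency: 1 ms
for batch in (1, 10, 100, 1_000):
    commits_per_s = batch / (FSYNC_US / 1e6)   # one fsync per batch
    print(f"{batch:>5} records/fsync: {commits_per_s:>9,.0f} commits/s")

The trade is latency: each commit now waits up to one batch window before it is durable, which is why every group-commit implementation exposes the window as a tunable.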

Mixed read/write OLTP. Most Indian fintech workloads are roughly 70/30 read/write: 70% point-reads, 30% point-writes. The mixed pattern interacts badly with the device — writes invalidate the read cache, reads stall behind writes inside the controller, and write amplification can push the effective write IOPS down by 2–10× depending on the SSD's spare-area headroom. The bottleneck is the lower of the read-IOPS and write-IOPS ceilings, weighted by mix. The numerical surprise: if your read p99 is 180 µs in a pure-read benchmark and 420 µs in a 70/30 mix on the same device, the writes are eating ~240 µs of read latency budget through the controller's queue — not visible on any throughput graph.

[Figure: Where each I/O pattern sits on the IOPS × throughput plane. IOPS (log scale, y-axis) vs throughput in MB/s (log scale, x-axis), with the NVMe device capacity envelope drawn as a diagonal. Markers: OLTP point-read (412k IOPS, 1.6 GB/s), OLTP point-write (298k IOPS, 1.2 GB/s), batch sequential scan (12.8k IOPS, 3.2 GB/s), log commit (~80k IOPS, ~600 MB/s), mixed 70/30 OLTP (186k IOPS, 1.5 GB/s).]
Five workload shapes plotted on the IOPS × throughput plane for the same NVMe device. Each shape touches the device envelope at a different point; pattern shapes capacity, capacity does not shape pattern. Numbers from the harness in the previous section.

The diagnostic discipline this enables: when a disk is slow, classify the workload first. If it is small-random-read-dominated, look at IOPS and queue depth. If it is large-sequential-read-dominated, look at throughput and the upstream consumer. If it is OLTP commit-dominated, look at fsync latency and the device's NV-cache. The first 5 minutes of a disk-perf incident should be iostat -xm 1 to identify the pattern, not staring at the throughput dashboard.

Going deeper

The fsync ladder — why "write" latency means nothing without a flush

A write() syscall returns when bytes are in the page cache, not when they are on disk. For durability — the property that the data survives a power cycle — you must call fsync(), which forces the kernel to flush the dirty pages to the device and waits for the device to acknowledge the flush has reached non-volatile storage. The fsync latency is what determines the throughput of every transactional system on top of the device.

The fsync ladder, fastest to slowest: NVMe SSD with PLP (Intel Optane, Samsung PM-series datacentre SKUs) — the controller acks from its battery-backed cache in 20–40 µs. EBS io2 Block Express — sub-millisecond fsync via the service's own PLP equivalent, 200–500 µs. NVMe SSD without PLP (Samsung 990 Pro, WD Black) — the controller must complete the actual NAND program before acking, 200 µs–2 ms. SATA SSD — the same NAND-program wait plus the SATA protocol overhead, 400 µs–5 ms. EBS gp3 — a network round trip to the storage backend plus the remote write, 1–3 ms. Spinning HDD — a half-rotation seek plus the platter write, 4–12 ms.
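
To find where your own device sits on the ladder, time the flush directly. A minimal stdlib sketch; the scratch path is an assumption, so point it at the device under test:

# fsync_probe.py — measure fsync latency on the device under test
import os, time

path = "/tmp/fsync_probe"   # assumption: a scratch file on the target device
fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
samples = []
for _ in range(200):
    os.pwrite(fd, b"x" * 4096, 0)                  # dirty one 4 KB block
    t0 = time.perf_counter()
    os.fsync(fd)                                   # block until the device acks non-volatile
    samples.append((time.perf_counter() - t0) * 1e6)
os.close(fd)
os.remove(path)
samples.sort()
print(f"fsync p50 {samples[99]:.0f} µs, p99 {samples[197]:.0f} µs")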

This is why fsync latency, not raw IOPS, is what separates database benchmark results — the gap between PostgreSQL on a consumer SSD without PLP and the same database design on a PLP-equipped enterprise SSD is 10–30× in commit throughput, with no software change. Why this matters for Indian-cloud-cost decisions: provisioning a single io2 Block Express volume at $0.125/GB-month for a 100 GB Postgres write-heavy workload costs ₹1,000/month. Provisioning a comparable gp3 with the maximum IOPS allocation costs ₹400/month. The throughput difference under fsync-heavy load is 5×. Most teams default to gp3 because it is the documented baseline; for high-write-IOPS workloads the io2 numbers justify the bill within one billing cycle.

iostat -xm 1 — the columns that actually matter

iostat -xm 1 is the per-second device-level dashboard you should leave running in a tmux pane during any incident that touches storage. The columns most people glance at (%util, tps) are the wrong ones; the columns that diagnose the workload shape are:

r/s, w/s: the true device IOPS, split by direction.
rareq-sz, wareq-sz: average request size in KB, the block-size fingerprint that tells you which ceiling applies.
aqu-sz: average queue depth, the number of I/Os in flight or waiting.
r_await, w_await: per-I/O completion latency in ms, queueing time included.

The combination of aqu-sz and await is the bottleneck classifier: low queue depth + low latency = under-utilised. High queue depth + low latency = at the design point. High queue depth + high latency = saturated, queueing dominates. Low queue depth + high latency = device is slow per-I/O (often the wrong device for the workload, e.g. spinning rust under random reads).
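
That quadrant rule is mechanical enough to codify. A sketch of the decision function; the thresholds are assumptions to calibrate against your device's measured service time:

# iostat_classify.py — the aqu-sz × await quadrant from the paragraph above
def classify(aqu_sz, await_ms, service_ms=0.1):
    deep = aqu_sz > 4                  # assumed threshold for "high queue depth"
    slow = await_ms > 5 * service_ms   # assumed threshold for "high latency"
    if not deep and not slow: return "under-utilised"
    if deep and not slow:     return "at the design point"
    if deep and slow:         return "saturated — queueing dominates"
    return "slow per-I/O — likely the wrong device for the workload"

print(classify(aqu_sz=32, await_ms=0.3))    # at the design point
print(classify(aqu_sz=120, await_ms=8.0))   # saturated — queueing dominates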

The cloud-disk twist — burst credits and rate limiting

Cloud block storage devices (EBS gp3, GCP pd-balanced, Azure Premium SSD) are not raw disks — they are network-attached virtual devices with rate limiters. AWS gp3 lets you provision baseline IOPS up to 16,000 and baseline throughput up to 1,000 MB/s and delivers them flat; its predecessor gp2 ties IOPS to volume size and runs a burst bucket, with credits accumulating during low usage and spent during bursts. The "I am bottlenecked on the disk" diagnosis on cloud changes shape: you might be hitting a rate limit, not a physical limit.

The signal: iostat shows the volume happily processing requests, but await jumps to 50–500 ms periodically, and aqu-sz climbs into the hundreds. This is not the SSD being slow — this is the EBS rate limiter throttling new requests because you exceeded the provisioned IOPS budget. AWS exposes the BurstBalance CloudWatch metric for the bucket-based types (gp2, st1, sc1), where a balance draining towards zero means throttling is imminent, and newer volume status metrics such as VolumeIOPSExceededCheck that go non-zero when a gp3 or io2 volume hits its provisioned ceiling. When the metrics say throttled, no amount of tuning the kernel I/O stack will help. The fix is paying for more provisioned IOPS, switching to a larger volume tier, or striping multiple volumes (RAID-0 / LVM) to multiply the per-volume limit.
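
Pulling the burst balance from the CLI takes one call; the volume ID and time window below are placeholders:

# Is the volume throttled? A draining BurstBalance means yes (gp2/st1/sc1).
aws cloudwatch get-metric-statistics \
    --namespace AWS/EBS --metric-name BurstBalance \
    --dimensions Name=VolumeId,Value=vol-0123456789abcdef0 \
    --start-time 2026-02-03T00:00:00Z --end-time 2026-02-03T06:00:00Z \
    --period 300 --statistics Minimum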

This is also why "fast SSD on cloud" benchmarks can vary 10× between identical instances — on the bucket-based volume types, burst credits depend on the volume's prior load, and an instance freshly launched has a full burst balance while an instance that has been running for hours has whatever it has accumulated. Always run the benchmark for at least 30 minutes to drain burst credits and measure the steady-state baseline. Sub-1-minute benchmarks on cloud volumes are the canonical lie.

Page cache, write-back, and the "phantom IOPS" problem

Linux interposes the page cache between every read() / write() syscall and the device. Reads served from the page cache never touch the disk and never appear in iostat; writes are buffered in the page cache and flushed asynchronously by the kernel's flush workers. The consequence: the IOPS the device sees can be wildly different from the IOPS your application issues, and the relationship is workload-dependent.

For a read-mostly workload with a hot working set that fits in RAM (a Postgres OLTP database where shared_buffers plus the page cache holds the hot indexes), the device sees ~5% of the application's read IOPS. Application reports 200,000 reads/s; iostat shows 10,000 IOPS at the device; the other 95% are page-cache hits. This is the best case — the page cache is doing its job.

For a write-heavy workload with bursts (a Kafka broker accepting 50,000 messages/s), the device sees write bursts every dirty_writeback_centisecs (default 5 seconds) of size dirty_ratio × RAM (default 20%). On a 64 GB box that is up to 12.8 GB written in one burst — the device sees 0 IOPS for 4.9 seconds, then 12.8 GB / (device throughput) seconds of saturation. iostat averaged over 1 second shows the bursts; iostat averaged over 10 seconds shows a smooth load that is misleading. The fix is sysctl vm.dirty_background_ratio = 5 and vm.dirty_ratio = 10 to flush more frequently and avoid the "save-it-up-then-explode" pattern.
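
You can watch the save-it-up-then-explode cycle live, and apply the flush-earlier settings, with two commands; the sysctl values are the ones suggested above:

# Watch dirty pages pile up and flush, alongside iostat in another pane:
watch -n1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'
# Flush earlier and in smaller bursts:
sudo sysctl vm.dirty_background_ratio=5 vm.dirty_ratio=10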

For applications that bypass the page cache (O_DIRECT, common in databases, the topic of ch.71), the application's IOPS equals the device's IOPS. There is no caching, no buffering, no asynchronous flush — the read returns when the device returns and the write is durable when the device acks (with O_DSYNC or explicit fsync). This is the honest case: what iostat shows is what the application is doing. Most database tuning guides recommend O_DIRECT plus the application's own buffer pool precisely because it removes the kernel's intermediation and makes the disk numbers map cleanly to application behaviour.
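
What O_DIRECT demands in exchange for this honesty is alignment: buffer, offset, and length must all be block-aligned, which is why naive read() code breaks the moment the flag is added. A minimal Linux-only sketch; the path is an assumption, any file on the device under test works:

# o_direct_read.py — one honest 4 KB device read, bypassing the page cache (Linux)
import os, mmap

fd = os.open("/tmp/fio_test_file", os.O_RDONLY | os.O_DIRECT)
buf = mmap.mmap(-1, 4096)     # anonymous mmap is page-aligned, which O_DIRECT requires
n = os.preadv(fd, [buf], 0)   # offset (0) and length (4096) are block-aligned too
os.close(fd)
print(f"read {n} bytes straight from the device")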

Why io_uring changes the IOPS-vs-throughput picture (preview of ch.71)

The traditional Linux I/O paths (read/write blocking, libaio async) all impose syscall overhead per operation — at high IOPS rates this overhead becomes the bottleneck. A device that can sustain 600,000 IOPS at the hardware level might cap out at 200,000 IOPS through the kernel because the syscall + context-switch cost per I/O eats the rest. io_uring (kernel 5.1+, mature in 6.x) reduces this by batching submission and completion through ring buffers in shared memory, dropping syscall overhead from per-I/O to per-batch.

For an IOPS-bound workload the gain is large — fio with --ioengine=io_uring typically reports 1.5–2.5× more IOPS than --ioengine=libaio on the same NVMe device, just because the syscall path is fatter on libaio. For a throughput-bound workload the gain is smaller (5–15%) because each I/O moves enough data that the per-I/O overhead is already amortised. This is the next chapter's territory; the lesson here is that "the disk's IOPS rating" assumes the kernel I/O path can sustain it, which is not always true on older kernels or older I/O engines.
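
Verifying the engine gap on your own device is a two-command experiment, same shape and depth with only the engine changed:

# Same 4 KB random-read shape at the same depth; compare the IOPS lines:
fio --name=a --filename=/tmp/fio_test_file --size=2G --bs=4k --rw=randread \
    --iodepth=32 --direct=1 --time_based=1 --runtime=15 --ioengine=libaio
fio --name=b --filename=/tmp/fio_test_file --size=2G --bs=4k --rw=randread \
    --iodepth=32 --direct=1 --time_based=1 --runtime=15 --ioengine=io_uring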

Where this leads next

The next chapter (/wiki/ssd-vs-hdd-vs-nvme-vs-persistent-memory) takes the three-numbers framework here and applies it to the four media types you actually buy in 2026 — spinning rust, SATA SSD, NVMe SSD, and persistent memory. Each has a different IOPS/throughput/latency profile and the choice depends on your workload's shape, exactly the classification you learned to do in this chapter.

The chapter after that (/wiki/filesystem-overhead) covers the layer between your read() syscall and the device — how ext4 / XFS / Btrfs add latency through journaling, allocation, and metadata operations. The numbers you measured with fio here are the device numbers; the numbers your application sees are the device numbers plus the filesystem tax, which is sometimes 2× or more on small-file workloads.

Two operational habits this chapter adds to your toolkit. First, always identify your workload's I/O size before tuning — run iostat -xm 1 and look at rareq-sz / wareq-sz. If you cannot tell whether you are IOPS-bound or throughput-bound, every fix is a guess. Second, always benchmark at the queue depth your application generates — measuring at iodepth=1 for a 32-thread service produces numbers that are off by 10–30×, and the misleading numbers always go in the direction of "the disk is fine".

A third habit, harder to internalise but worth the effort: distrust any single-number disk benchmark. The marketing number ("1.2 million IOPS!") is the peak achievable under a contrived load — typically all reads, ideal queue depth, ideal block size, fresh device with full burst credits. Your workload is none of those. The honest comparison between two devices is a four-cell table — small-random IOPS at production queue depth, large-sequential throughput, mixed-OLTP IOPS, and fsync latency. Each cell involves a different fio configuration; no single benchmark answers the procurement question. Vendors who supply only the peak number are making a sales claim, not an engineering claim. The Aditi-with-the-wrong-upgrade story repeats endlessly because the procurement decision was made on the marketing number rather than the workload-shaped measurement.
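
The four cells map to four fio invocations; the first three reuse the harness scenarios above, and the fsync cell uses fio's --fsync flag, which issues a flush after every N writes:

# One fio run per cell. "…" stands for the common flags from the harness:
# --filename, --size=2G, --ioengine=libaio, --direct=1, --time_based=1, --runtime=20
fio --name=c1 --bs=4k   --rw=randread --iodepth=32 …                  # small-random IOPS
fio --name=c2 --bs=256k --rw=read     --iodepth=8  …                  # sequential MB/s
fio --name=c3 --bs=8k   --rw=randrw   --rwmixread=70 --iodepth=32 …   # mixed-OLTP IOPS
fio --name=c4 --bs=4k   --rw=write    --iodepth=1  --fsync=1 …        # fsync latency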

The fintech production lesson worth one extra paragraph: at Razorpay's transaction-processing tier, the operations team maintains a per-service "I/O fingerprint" — the four-cell table above measured against the production workload trace replayed through fio --read_iolog. When a service starts breaching its latency SLO and the suspect is storage, the on-call compares the current iostat output against the fingerprint instead of guessing. The fingerprint takes 30 minutes to capture per service and saves multi-hour incidents about once a quarter. The fingerprint discipline is what separates teams who can answer "is the disk the problem?" in 5 minutes from teams who spend the entire incident bisecting.

The contrast with the IRCTC Tatkal pattern is instructive. IRCTC's booking system at the 10:00 IST Tatkal opening sees 18 million sessions arrive in 90 seconds — a write-heavy spike that overwhelms the standard fingerprint because the workload shape during the spike is fundamentally different from steady state. The fingerprint approach must be extended with a "spike fingerprint" captured at peak load, otherwise the off-peak measurements look healthy and the spike measurements look catastrophic and the team has no shared vocabulary for the gap. Indian production systems with sharp daily spikes (Tatkal, IPL toss, NSE 09:15 market open, mutual-fund 15:00 cutoff) need at least two fingerprints per critical service, not one. The cost is one more half-hour benchmark per service per quarter; the value is the ability to distinguish "the disk got slower" from "the workload got harder", which is the question every storage-related incident eventually reduces to.

Reproducibility footer

# Reproduce this on your laptop, ~10 minutes total
sudo apt install fio sysstat
# fio is the workhorse; the Python harnesses use only the stdlib — no venv needed
# 1) The four-shape fingerprint:
python3 disk_three_numbers.py /tmp/fio_test_file
# 2) The latency-vs-utilisation sweep:
python3 disk_latency_curve.py /tmp/fio_test_file
# Watch the device live in another pane:
iostat -xm 1
# To compare against the cloud-disk rate-limit story, do the above on
# an EBS gp3 volume and again on a local NVMe instance store.
# A 30-minute baseline run drains EBS burst credits and shows steady-state
# numbers rather than the burst-inflated peaks new instances start with.
