O_DIRECT, async I/O, io_uring

Rahul runs the storage tier for a Bengaluru bank's UPI reconciliation pipeline. The Postgres write nodes sit on Samsung PM9A3 NVMe drives whose datasheet promises 412,000 random write IOPS. With ext4 default mount options his Postgres benchmark hits 38,000 IOPS — exactly where the previous chapter (/wiki/filesystem-overhead) said it would. He turns on direct I/O (debug_io_direct, a developer knob since Postgres 16) and the same workload climbs to 61,000 IOPS. He switches to the io_uring backend (io_method = io_uring, the async I/O engine that shipped in Postgres 18) and the number jumps to 248,000 IOPS — still 40% below the device ceiling, but 6.5× more than where he started, with no hardware change.

The 6.5× did not come from a clever algorithm. It came from removing three things the kernel was doing on his behalf without asking: copying every byte through the page cache, blocking the calling thread on every I/O, and crossing the user-kernel boundary once per submitted request. O_DIRECT removes the first; libaio removes the second; io_uring removes the third. Each layer recovers a different chunk of the device's headline number, and the order matters.

The kernel's default I/O path was designed for slow disks and one-I/O-at-a-time programs. Modern NVMe issues 200K+ IOPS per device, which means the syscall overhead, the page-cache copy, and the per-request context switch are now the bottleneck — not the device. O_DIRECT skips the page cache, async I/O lets one thread keep many requests in flight, and io_uring lets you submit batches of requests without a syscall per request. The combination is what every modern OLTP database, log-structured store, and high-IOPS storage engine uses to extract the device's spec sheet.

What sits between pread() and the device — and what each layer costs

A vanilla blocking pread(fd, buf, 4096, off) on an ext4 file traverses the same nine-layer stack the previous chapter walked through, but with three specific costs that matter for high-IOPS workloads. Each cost is a syscall fee, a memory bandwidth fee, or a context-switch fee. Pricing them out is what tells you which API to reach for.

Three paths from one application thread to the device

  blocking pread()           libaio + O_DIRECT          io_uring + O_DIRECT
  1 thread, 1 in-flight      1 thread, N in-flight      1 thread, 1024+ in-flight
  -----------------------    -----------------------    -----------------------------
  syscall pread (~80 ns)     io_submit(64 ops)          write SQ entries (no syscall)
  page cache lookup          no page cache copy         io_uring_enter() (or SQPOLL)
  block layer dispatch       block layer queues 64      kernel reads SQ ring
  schedule out (sleep)       return immediately         block layer dispatches
  IRQ → wakeup               io_getevents(wait=N)       CQ ring filled by kernel
  copy to user buf           DMA into user buf          DMA to registered buf
  return to userspace        harvest completions        read CQ (no syscall)
  -----------------------    -----------------------    -----------------------------
  ~3 µs/op overhead          ~2 syscalls / 64 ops       0–1 syscall / N ops
Three I/O paths from a single application thread. The blocking pread path issues one syscall per I/O and blocks the thread; libaio with O_DIRECT batches 64 ops into 2 syscalls and skips the page cache; io_uring with shared-memory ring buffers can run 1024+ ops with zero or one syscall per batch. Illustrative — exact syscall counts depend on configuration (SQPOLL, fixed-buffers, batch size).

Read the bottom of each column. The blocking path costs about 3 µs of pure overhead per 4 KB I/O — a syscall (~80 ns), a page cache lookup (~200 ns), a sleep + wake (~1.5 µs of context switch), and a copy to the user buffer (~1 µs to move 4 KB through the cache hierarchy). At 200K IOPS that's 600 ms of overhead per second per thread, and a single thread can issue at most 333K I/Os per second even on infinite hardware. Why this floor matters for high-IOPS hardware: a Samsung PM9A3 spec'd at 412K IOPS will, through blocking pread, cap at roughly 333K IOPS per thread because the per-syscall overhead alone consumes the available CPU. Adding more threads helps until the IRQ rate (one per completion) saturates a CPU core handling interrupts; on a 16-core box this happens around 600K total IOPS. The device is faster than the kernel's default API can drive it.
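A back-of-envelope version of that budget, using the per-component estimates above as labelled assumptions (they are this chapter's round numbers, not measurements):

# overhead_budget.py — the blocking-pread overhead budget from the paragraph above.
# All four component costs are the chapter's rough estimates, not measurements.
SYSCALL_NS      = 80     # one pread() user->kernel->user round trip
CACHE_LOOKUP_NS = 200    # page cache lookup
CTX_SWITCH_NS   = 1500   # schedule out on sleep + IRQ-driven wakeup
COPY_NS         = 1000   # copy 4 KB from page cache into the user buffer

overhead_ns = SYSCALL_NS + CACHE_LOOKUP_NS + CTX_SWITCH_NS + COPY_NS
ceiling = 1e9 / overhead_ns     # single-thread IOPS even on a zero-latency device
print(f"~{overhead_ns} ns/op overhead -> {ceiling:,.0f} IOPS single-thread ceiling")
# ~2780 ns/op -> ~360K IOPS; the chapter rounds this to ~3 µs and 333K.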

The middle column collapses 64 I/Os into 2 syscalls (io_submit + io_getevents), removes the page cache copy (O_DIRECT does DMA straight to the user buffer), and eliminates the per-op sleep (the thread polls or waits in io_getevents for batched completions). The right column collapses everything further: the application writes submission entries directly into a memory region shared with the kernel, and the kernel writes completions into a different shared region. With IORING_SETUP_SQPOLL enabled, a kernel thread polls the submission ring continuously, and the application thread issues zero syscalls in the steady state.

Measuring the gap from a Python harness

The honest way to compare these three paths is to drive fio (which supports all three engines) from a Python script and tabulate the resulting IOPS, throughput, p99 latency, and CPU-cycles-per-I/O. The CPU number matters as much as IOPS — io_uring is not just faster, it is dramatically cheaper per I/O, which is what enables a single thread to drive 250K IOPS on commodity hardware.

Three caveats before reading the numbers. First, the harness uses fio because it drives all three engines through one comparable interface; writing the equivalent in pure Python would require three different libraries (os.pread, libaio via ctypes, a liburing binding) and the implementation-quality differences would dominate the engine comparison. Second, the device must be preconditioned — a fresh NVMe drive shows artificially high IOPS for the first few minutes because the FTL has all-fresh erase blocks. Run the workload for 5 minutes before the measurement window, as in the sketch below. Third, the test must be run on /dev/nvme0n1 directly, bypassing any filesystem; otherwise filesystem overhead (covered in the previous chapter) confounds the engine comparison.
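The preconditioning pass can be scripted the same way as the harness; this sketch assumes the same fio binary and a scratch device whose contents may be destroyed — the 300-second random-write warm-up is a common convention, not a vendor requirement:

# precondition.py — run 5 minutes of 4K random writes so the FTL reaches its
# steady state before the measurement window. DESTROYS data on the device.
import subprocess, sys

DEV = sys.argv[1] if len(sys.argv) > 1 else "/dev/nvme0n1"
subprocess.run(
    ["fio", "--name=precondition", f"--filename={DEV}", "--bs=4k",
     "--rw=randwrite", "--iodepth=64", "--ioengine=io_uring", "--direct=1",
     "--time_based=1", "--runtime=300", "--output-format=json"],
    check=True, capture_output=True, text=True)
print("preconditioned — start the measurement run")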

# io_engines.py — compare blocking, libaio, and io_uring paths against the same device
# Run: sudo python3 io_engines.py /dev/nvme0n1
# Requires: fio >= 3.36 (for io_uring), Python 3.11+, sudo for raw device, perf optional.
import json, subprocess, sys

DEV = sys.argv[1] if len(sys.argv) > 1 else "/dev/nvme0n1"
SIZE = "4G"
RUNTIME = 30

# Five engine + queue-depth combinations that probe the path step by step:
SCENARIOS = [
    ("psync-qd1",       "psync",    1,    False, False),  # vanilla blocking, baseline
    ("libaio-qd64",     "libaio",   64,   True,  False),  # async via io_submit, O_DIRECT
    ("io_uring-qd64",   "io_uring", 64,   True,  False),  # io_uring shared rings, O_DIRECT
    ("io_uring-qd256",  "io_uring", 256,  True,  False),  # deeper queue
    ("io_uring-sqpoll", "io_uring", 256,  True,  True),   # kernel SQPOLL — zero syscalls
]

def run_fio(name, engine, qd, direct, sqpoll):
    cmd = ["fio", f"--name={name}", f"--filename={DEV}", f"--size={SIZE}",
           "--bs=4k", "--rw=randread", f"--iodepth={qd}",
           f"--ioengine={engine}", "--time_based=1", f"--runtime={RUNTIME}",
           "--group_reporting=1", "--output-format=json"]
    if direct:
        cmd.append("--direct=1")
    if sqpoll:
        cmd += ["--sqthread_poll=1"]  # kernel-side SQ polling; IOPOLL (hipri) is a separate mode, not used here
    r = subprocess.run(cmd, capture_output=True, text=True, check=True)
    j = json.loads(r.stdout)["jobs"][0]
    side = "read"
    return {
        "iops":    j[side]["iops"],
        "mb_s":    j[side]["bw_bytes"] / (1024 * 1024),
        "p50_us":  j[side]["clat_ns"]["percentile"]["50.000000"]   / 1000,
        "p99_us":  j[side]["clat_ns"]["percentile"]["99.000000"]   / 1000,
        "p999_us": j[side]["clat_ns"]["percentile"]["99.900000"]   / 1000,
        "cpu_u":   j["usr_cpu"], "cpu_s": j["sys_cpu"],
    }

print(f"{'engine/qd':>20} {'IOPS':>9} {'MB/s':>8} {'p50µs':>7} {'p99µs':>7} {'p999µs':>7} {'cpu%':>6}")
print("-" * 76)
for name, eng, qd, direct, sqp in SCENARIOS:
    try:
        r = run_fio(name, eng, qd, direct, sqp)
        cpu = r["cpu_u"] + r["cpu_s"]
        cycles_per_io = (cpu / 100.0) * 3.0e9 / r["iops"] if r["iops"] > 0 else 0  # assumes ~3 GHz cores
        print(f"{name:>20} {r['iops']:9.0f} {r['mb_s']:8.1f} "
              f"{r['p50_us']:7.0f} {r['p99_us']:7.0f} {r['p999_us']:7.0f} {cpu:6.1f}  "
              f"({cycles_per_io:.0f} cycles/IO)")
    except subprocess.CalledProcessError as e:
        print(f"{name:>20} ERROR: {e.stderr[:60]}")
# Sample run on Samsung PM9A3 (3.84 TB, PCIe Gen4) on a Bengaluru lab box,
# 16-core EPYC 7313, kernel 6.5, fio 3.36, single fio job.

           engine/qd      IOPS     MB/s   p50µs   p99µs  p999µs   cpu%
----------------------------------------------------------------------------
           psync-qd1     74200    289.8      11      28      62   89.4  (36145 cycles/IO)
         libaio-qd64    312500   1220.7      72     310     780   62.1  (5962 cycles/IO)
       io_uring-qd64    354800   1386.3      64     280     710   38.5  (3255 cycles/IO)
      io_uring-qd256    408400   1595.3     312    1100    2400   29.2  (2145 cycles/IO)
     io_uring-sqpoll    411900   1609.0     308    1080    2350    8.7  (634 cycles/IO)

Walk through the rows. psync-qd1 is the floor. A single thread issuing one blocking pread at a time hits 74,200 IOPS — 18% of the device ceiling — and burns 89% of one CPU core to do it. The ~36,000 cycles per I/O (about 12 µs of CPU at 3 GHz) is almost entirely syscall + page cache + context-switch + copy overhead; the device itself needs only about 11 µs per read, and the kernel wraps all that accounting around it.

libaio-qd64 jumps to 312,500 IOPS — 76% of the device ceiling — with O_DIRECT. The page cache copy is gone (DMA writes the device's response directly into the application's buffer), the syscall amortisation is 64-way (io_submit of 64 ops, io_getevents of 64 completions), and the thread no longer sleeps. CPU drops to 62%, cycles-per-I/O drops 6× to about 6,000. This is what every Postgres-on-libaio deployment looked like from 2010 to 2020.

io_uring-qd64 adds another 13% (354,800 IOPS) at 38% CPU. The shared-memory rings remove the syscall on the submission side (the kernel reads the SQ ring at its own pace) and on the completion side (the application reads the CQ ring directly). At qd=64 the saving is modest because the 2 syscalls in libaio are already amortised over 64 ops. Why deeper queues compound the io_uring win: at qd=256 each io_submit in libaio still completes synchronously to confirm acceptance, so even though 256 ops are batched, the syscall holds the thread until the kernel has copied the iocb structures and validated them. With io_uring the application keeps writing SQ entries while the kernel processes earlier ones, so the throughput curve scales with queue depth almost linearly until the device saturates. At qd=256 io_uring hits 408K IOPS — within 1% of the device's spec — while libaio plateaus around 340K because the per-io_submit cost grows with the batch size.

io_uring-qd256 reaches 408,400 IOPS — within 1% of the spec. The CPU cost continues to drop: 29% of one core, 215 cycles per I/O. The latency p50 climbs to 312 µs because deeper queues mean longer queueing delay at the device, but the p99 is still 1,100 µs — well within most database SLOs.

io_uring-sqpoll is the magic number. With IORING_SETUP_SQPOLL enabled, a kernel thread polls the submission ring continuously, so the application issues zero syscalls in the steady state. CPU on the application thread drops to 8.7% — essentially the cost of writing SQ entries and reading CQ entries from shared memory. Total system CPU is the same (the SQPOLL kernel thread is using a core), but the latency is no longer paying any syscall round-trip.

The takeaway is not that io_uring is "better" — it is that the kernel's default I/O API is from the era when 7,200 RPM disks did 200 IOPS, and high-IOPS NVMe needs an API that doesn't pay the per-I/O syscall. That API arrived in kernel 5.1 (May 2019) and matured around kernel 5.10–5.15. By 2026 every serious storage engine — Postgres 18, MySQL 8.4, ScyllaDB, FoundationDB, ClickHouse, RocksDB — has either an io_uring backend in production or one in beta.

A useful sanity check before optimising: run the harness above, then run iostat -x 1 against the same device while the benchmark is in flight. The aqu-sz column shows the average queue depth the device actually sees. If your application has iodepth=256 configured but aqu-sz reads 7.4, the queue depth never materialised — somewhere between your application and the device, something is serialising. The most common culprits are filesystem extent locks (one per file, contended on shared files), the block-layer scheduler (mq-deadline re-orders requests but caps in-flight per-queue), and buggy O_DIRECT alignment (the kernel falls back to buffered I/O silently if the alignment check fails on a partial write). The harness numbers match the spec sheet only when aqu-sz is within 80% of the configured iodepth.
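The aqu-sz figure comes from /proc/diskstats (field 14, the weighted milliseconds of queue time), so the same check can live inside your own tooling; a minimal sketch, assuming the standard diskstats layout:

# aquz.py — derive iostat's aqu-sz from /proc/diskstats without iostat.
# aqu-sz = delta(weighted ms doing I/O, field 14) / interval — the same
# derivation iostat uses. Run alongside the benchmark: python3 aquz.py nvme0n1
import sys, time

def weighted_ms(dev):
    for line in open("/proc/diskstats"):
        f = line.split()
        if f[2] == dev:
            return int(f[13])      # 14th field, 1-indexed: weighted ms in queue
    raise SystemExit(f"{dev} not in /proc/diskstats")

dev = sys.argv[1]
prev = weighted_ms(dev)
while True:
    time.sleep(1)
    cur = weighted_ms(dev)
    print(f"aqu-sz ~ {(cur - prev) / 1000.0:6.1f}")  # ms of queue time per 1000 ms
    prev = cur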

A second sanity check: compare your harness numbers against the device's published spec sheet. Samsung publishes detailed performance specs for every PM-series drive — sequential read, sequential write, random read at qd=1 and qd=32, random write at qd=1 and qd=32. If your harness shows numbers more than 5% below the spec at the corresponding queue depth, something in your stack is wrong — most likely the device is in a power-saving state (nvme set-feature /dev/nvme0 -f 0x0c -v 0 to disable APST), or a different process on the same machine is contending for the device, or the device firmware is out of date. Spec-sheet numbers are reproducible on dedicated hardware in 2026; "my benchmark didn't quite hit the spec" is a bug, not an excuse.

How O_DIRECT actually changes the kernel path

O_DIRECT is a flag to open(2). It tells the kernel: when you read or write this file, do not buffer it through the page cache. Send the I/O straight from my user-space buffer to the device.

The benefits are concrete:

- No copy through the page cache — the DMA engine moves data directly between the device and your buffer, saving one full memory-bandwidth traversal per I/O.
- No page-cache pollution — a large backup or table scan no longer evicts the working set your hot queries depend on.
- More predictable latency — no readahead you didn't ask for, no writeback storm scheduled between you and the device.

The costs are also concrete:

- Strict alignment rules — buffer address, file offset, and transfer length must all be multiples of the device's logical block size (512 B or 4 KB); violate them and you get EINVAL or, worse, a silent fallback to buffered I/O (more below). The sketch after this list shows the contract in stdlib Python.
- No read-ahead, no write-behind — the kernel's sequential-scan acceleration and write coalescing are gone; whatever you still need, you rebuild.
- You own caching now — every serious O_DIRECT user (Postgres, MySQL, RocksDB) pairs it with an application-level buffer pool, because re-reading from NAND what you read a millisecond ago is absurd.
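The alignment contract is mechanical enough to demonstrate with nothing but the standard library. A minimal sketch — mmap is used purely because anonymous mappings are page-aligned, and a 4 KB logical block size is assumed (check /sys/block/<dev>/queue/logical_block_size):

# odirect_read.py — the O_DIRECT alignment contract in stdlib Python.
# mmap is used only because it returns page-aligned memory; a 4 KB logical
# block size is assumed. Linux-only (os.O_DIRECT).
import mmap, os, sys

PATH  = sys.argv[1]
BLOCK = 4096

fd  = os.open(PATH, os.O_RDONLY | os.O_DIRECT)
buf = mmap.mmap(-1, BLOCK)              # anonymous mapping: page-aligned buffer

# All three must be block-aligned: buffer address, file offset, length.
n = os.preadv(fd, [buf], 0 * BLOCK)     # offset 0, length 4096
print(f"read {n} bytes, first 16: {buf[:16].hex()}")
os.close(fd)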

The mental model: O_DIRECT is the storage equivalent of bypassing the OS network stack with kernel-bypass NIC drivers (DPDK). You're saying "I know what I'm doing; let me talk directly to the device". You inherit the responsibility for everything the kernel was doing for you.

A subtler trap: O_DIRECT semantics are filesystem-dependent. On ext4 and XFS, an O_DIRECT request the filesystem cannot service directly (misaligned buffers or offsets, holes, certain extent conversions) either fails with EINVAL or falls back to buffered I/O silently, depending on the case — when it falls back, your program "works" but you've quietly lost the page-cache-bypass guarantee. On Btrfs and ZFS, O_DIRECT is partially or fully ignored depending on the dataset configuration (Btrfs honours it for nodatacow files; ZFS until very recently treated it as a hint, not a directive). On tmpfs, O_DIRECT returns EINVAL because there is no underlying device to bypass to. Always verify O_DIRECT actually took effect — the cleanest check is cat /proc/<pid>/io before and after the workload: confirm read_bytes/write_bytes moves in step with rchar/wchar (page-cached I/O inflates rchar without inflating read_bytes).
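That /proc check is easy to script; a sketch against a hypothetical io_test.dat on a real filesystem (the file name and path are illustrative):

# odirect_check.py — confirm O_DIRECT took effect via /proc/self/io.
# rchar counts all read() bytes; read_bytes counts bytes that hit storage.
# Buffered reads of cached data inflate rchar without moving read_bytes;
# a working O_DIRECT read moves both by the same amount.
import mmap, os

def proc_io():
    return dict(line.split(": ") for line in
                open("/proc/self/io").read().splitlines())

fd  = os.open("io_test.dat", os.O_RDONLY | os.O_DIRECT)  # hypothetical test file
buf = mmap.mmap(-1, 4096)
before = proc_io()
os.preadv(fd, [buf], 0)
after = proc_io()
delta = {k: int(after[k]) - int(before[k]) for k in ("rchar", "read_bytes")}
print(delta)   # O_DIRECT in effect => read_bytes ~= rchar (~4096 each)
os.close(fd)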

NVMe parallelism — why queue depth is the lever

NVMe is not a sequential pipe. The protocol defines up to 65,535 hardware submission queues, each capable of 65,535 outstanding commands. A modern enterprise NVMe drive exposes 64–128 queues (one per host CPU is the typical mapping); a consumer NVMe drive exposes 8–32. The drive's controller services these queues in parallel — it has multiple internal channels to the NAND, multiple flash translation layer (FTL) cores, and an internal scheduler that interleaves commands across them.

This is why queue depth matters so much for NVMe and so little for HDDs. An HDD has one physical actuator; offering it 64 outstanding requests just lets the drive's firmware re-sort them by sector address (NCQ) — useful, but capped at the actuator's seek rate. An NVMe drive has dozens of internal parallelism units; offering it 64 outstanding requests lets it service them concurrently across NAND channels. The throughput curve for an NVMe device is roughly linear in queue depth from qd=1 to qd=32, then flattens as you saturate the controller's internal parallelism. A 4 KB random read at qd=1 takes ~80 µs; at qd=32 the same read takes ~250 µs but you do 32 of them per 250 µs = 128K IOPS. The latency-throughput tradeoff is the decision in storage performance.
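The arithmetic in that paragraph is Little's law — throughput = queue depth / latency — and it is worth three lines of checking before any benchmark. The numbers below are the chapter's illustrations:

# littles_law.py — throughput = queue depth / latency, the identity behind
# every IOPS-vs-qd curve.
for qd, lat_us in [(1, 80), (32, 250), (256, 312)]:
    print(f"qd={qd:4d} lat={lat_us:4d}us -> {qd / (lat_us * 1e-6):10,.0f} IOPS")
# qd=1 -> 12,500; qd=32 -> 128,000; qd=256 -> 820,513 — but the device
# saturates first, so the measured number caps at the drive's ~412K ceiling.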

[Figure: IOPS and p99 latency vs queue depth, Samsung PM9A3 PCIe Gen4, 4 KB random read, single thread, io_uring + O_DIRECT. The IOPS curve (solid, left axis) rises from ~75K at qd=1 through ~200K to ~400K, with a knee at qd≈32 where spec IOPS is reached and p99 is still healthy; the p99 curve (dashed, right axis) climbs from ~280 µs through ~1,100 µs to ~2,400 µs at qd=256. Beyond qd=32 you trade tail latency for diminishing throughput gains.]
The IOPS curve flattens around qd=32; the p99 latency curve has its own knee around qd=32 too, then climbs almost linearly. This shape is generic across enterprise NVMe — the absolute numbers differ by drive. Illustrative — exact knee location depends on the drive's controller and the access pattern.

Why your application's queue depth must match the device's: if you run blocking pread from one thread, your application's queue depth is exactly 1 — regardless of how many hardware queues the device offers. The device sits 95% idle while your thread waits. The correct match is application_queue_depth ≈ device_optimal_queue_depth × number_of_threads. For a Samsung PM9A3 (optimal qd ≈ 32) on a 16-core box, the right total queue depth is around 512 — achieved as 16 threads with qd=32, or 1 thread with qd=512 via io_uring, or anything in between. Most "my NVMe is slow" investigations end at "you're driving qd=4 against a device that wants qd=128".

The Linux block layer exposes the device's queue parallelism through /sys/block/nvme0n1/queue/nr_requests (typical default 1023) and the multi-queue scheduler (/sys/block/nvme0n1/queue/scheduler, typically none for NVMe — the device's own scheduler is better than anything the kernel can do). For high-IOPS workloads, also set /sys/block/nvme0n1/queue/rq_affinity to 2 (route completions to the CPU that submitted them) and pin your I/O thread to a specific CPU — this keeps the IRQ, the completion processing, and the application thread on the same core, which avoids cross-core cache traffic. On a 16-core box this configuration alone often improves latency p99 by 15–25% with no other change.

The corresponding NVMe-side knobs live in /sys/class/nvme/nvme0/. transport_tos controls IP differentiated-services bits (relevant only for NVMe-over-TCP). queue_count reports how many hardware queues the driver opened; for a 16-CPU box you typically want this equal to the CPU count so each CPU has its own queue with no cross-CPU sharing. Confirm with cat /proc/interrupts | grep nvme — you should see one MSI-X line per queue, each pinned to a distinct CPU. If multiple CPUs share an IRQ, the IRQ becomes a serialisation point under high IOPS; the fix is irqbalance configuration or a manual echo to /proc/irq/<N>/smp_affinity to spread them.
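The knobs in the last two paragraphs are plain files, so the audit is scriptable; a read-only sketch (the paths are the standard sysfs/procfs locations; values differ per machine):

# nvme_audit.py — read-only dump of the queue/IRQ knobs discussed above.
# Usage: python3 nvme_audit.py nvme0n1
import sys
from pathlib import Path

dev = sys.argv[1] if len(sys.argv) > 1 else "nvme0n1"
q = Path(f"/sys/block/{dev}/queue")
for knob in ("nr_requests", "scheduler", "rq_affinity", "logical_block_size"):
    p = q / knob
    print(f"{knob:20s} {p.read_text().strip() if p.exists() else 'n/a'}")

# One MSI-X line per hardware queue; note which queue name each IRQ serves.
for line in open("/proc/interrupts"):
    if "nvme" in line:
        cells = line.split()
        print(cells[0], cells[-1])      # IRQ number, queue name (e.g. nvme0q3)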

io_uring's three killer features

io_uring is more than "async I/O done right". Three features distinguish it from libaio and from every previous async I/O design — and together they explain why a Postgres or RocksDB on io_uring extracts the device's spec while the same workload on libaio leaves 30–40% on the table.

The history matters here: Linux has had a native async I/O interface (io_setup/io_submit — the libaio API) since kernel 2.5 (2002), but it was widely considered broken. io_submit blocked unexpectedly on metadata operations, O_DIRECT was required for it to work asynchronously at all, and the API surface was missing critical operations like fsync and accept. By 2018 every major database that wanted async I/O on Linux had built its own thread-pool emulation on top of synchronous syscalls — Cassandra had one, MySQL had one, Postgres deliberately stayed synchronous. io_uring was the first design that let the database delete that emulation layer.

Submission/Completion Queues are shared memory. The application and the kernel see the same ring buffers. The application appends submission entries (SQE) to the SQ ring tail and bumps the tail counter; the kernel reads from the SQ ring head and bumps that counter. Completions flow the other way through the CQ ring. No syscall is needed to produce or consume entries — only to wake the kernel when it is sleeping. Why this matters for database write paths: a transaction commit batches WAL writes. With libaio the commit path issues io_submit(N) to send N writes and io_getevents(N) to wait for them — that's two syscalls regardless of N. With io_uring the commit path writes N SQEs to the ring (no syscall), calls io_uring_enter once to wake the kernel (one syscall, possibly skipped with SQPOLL), and reads N CQEs from the CQ (no syscall). For a 256-write transaction the syscall count drops from 2 to 0 or 1 — and the syscall is the dominant cost at this batch size.
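The head/tail discipline is easier to see in a toy model than in the ABI. The sketch below is not io_uring's memory layout — it is the single-producer/single-consumer ring arithmetic that both the SQ and CQ rings follow, minus the memory barriers and shared mappings:

# ring_model.py — toy single-producer/single-consumer ring, the discipline
# both the SQ and CQ rings follow. Real io_uring adds memory barriers and a
# kernel-shared mapping; the index arithmetic is the part shown here.
class Ring:
    def __init__(self, entries: int):
        assert entries & (entries - 1) == 0, "power of two, so masking works"
        self.slots = [None] * entries
        self.mask = entries - 1
        self.head = 0          # consumer bumps head
        self.tail = 0          # producer bumps tail

    def push(self, item) -> bool:          # producer side (app writes SQEs)
        if self.tail - self.head == len(self.slots):
            return False                   # ring full
        self.slots[self.tail & self.mask] = item
        self.tail += 1                     # publish: consumer sees new tail
        return True

    def pop(self):                         # consumer side (kernel reads SQEs)
        if self.head == self.tail:
            return None                    # ring empty
        item = self.slots[self.head & self.mask]
        self.head += 1
        return item

sq = Ring(8)
for i in range(3):
    sq.push(("read", 4096 * i))            # three SQEs queued, zero syscalls
print([sq.pop() for _ in range(3)])        # kernel-side harvest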

SQPOLL: zero-syscall steady state. With IORING_SETUP_SQPOLL, the kernel spawns a thread that polls the SQ ring continuously. The application writes SQEs and never has to call io_uring_enter. The cost is one CPU core dedicated to polling (the SQPOLL thread idles when the ring is empty for sq_thread_idle ms, then sleeps until the next submission). The benefit is that high-throughput databases can issue I/O at memory-bandwidth rates without touching the syscall layer at all.

Registered buffers and registered files. io_uring_register lets the application pre-register a set of user-space buffers and file descriptors with the kernel. Subsequent SQEs reference them by index, not by pointer. The kernel skips the per-I/O work of pinning user-space pages (otherwise needed to prevent the buffer from being swapped during DMA) and the per-I/O fd-to-file-struct lookup. This is what pushes cycles-per-I/O down toward the few-hundred range — without registration, the per-I/O setup adds roughly another 1,000 cycles even with shared rings.

The combination of these three features is what gives io_uring its 4× advantage over libaio. None of them is conceptually new — Solaris aio_*, Windows IOCP, and even early Linux aio proposals had pieces of each. What was new in 2019 was getting all three into the mainline Linux kernel with a stable ABI and a sane userspace library (liburing, written by Jens Axboe alongside the kernel side).

A fourth feature worth knowing: multi-shot operations. A recv SQE can be made multi-shot (io_uring_prep_recv_multishot in liburing, the IORING_RECV_MULTISHOT flag underneath), which keeps producing completions until cancelled. For an HTTP server that wants to read N requests off a connection, one multi-shot recv replaces N separate prep_recv calls — the kernel posts a CQE every time data arrives, and the application processes them as they come. This is the feature that makes io_uring's networking story competitive with epoll for connection-heavy workloads (think: a Cloudflare edge node serving 50,000 concurrent TLS connections from one CPU).

Where the abstractions still leak

Even with O_DIRECT + io_uring + registered buffers, four things can still surprise you in production. None of them is the kernel's fault, but each can erase the throughput gains from the engine choice.

The device's Write Boost cache. Most consumer SSDs and even some enterprise drives have an SLC-cache region (faster but smaller) and a TLC/QLC region (slower, larger). Writes hit the SLC cache at full advertised speed; once the SLC cache fills, writes spill to TLC at 30–60% of the cache speed. A benchmark that runs for 10 seconds sees the SLC speed; a backfill job that runs for 30 minutes sees the post-cache speed. For a Samsung 990 Pro (consumer drive), SLC cache is ~150 GB and post-cache write speed drops from 7 GB/s to 1.6 GB/s. For a Samsung PM9A3 (enterprise, no SLC cache), the speed is constant at 4 GB/s regardless of duration. Always benchmark for at least 10× your SLC cache size before trusting the throughput number.

The filesystem's lock granularity. Even with O_DIRECT, ext4's per-inode i_rwsem serialises overlapping writes to the same file. For a Postgres data file at qd=256, this is fine — concurrent writes go to different file offsets and don't overlap. For a write-ahead log file with many concurrent transaction commits writing the tail, the lock is contended. XFS uses range locks instead of full-file locks, which scales better for this pattern. Postgres spreads WAL insertion across multiple in-memory insertion locks, which softens the contention, but the write-out still targets a single file; on ext4, WAL throughput can cap around 200K IOPS regardless of device speed.

The CPU's frequency scaling. A polling thread (SQPOLL or IOPOLL) prevents the CPU from entering deeper C-states, but it can still drop frequency under power-management policies. On a workload that bursts (idle for 100 ms, then 200K IOPS for 50 ms), the first burst can run at the reduced frequency before the governor ramps back up — adding 100 µs to the burst's first-quarter latency. The fix is cpupower frequency-set -g performance, plus pinning /sys/devices/system/cpu/intel_pstate/min_perf_pct to 100 on intel_pstate systems. Razorpay's payment-acceptance tier ships with this exact config because the scaling effect was visible in their p99.9 — invisible at p99, visible only at p99.9, but visible.

The kernel's automatic I/O throttling under memory pressure. Even with O_DIRECT, the kernel reserves the right to throttle I/O when memory pressure rises (the blk-throttle cgroup controller, swap activity, or pre-emptive memory reclaim). Under cgroup v2 the throttling is configurable per-container; under cgroup v1 it can be invisible. The symptom is a sudden p99.9 jump from 1 ms to 50 ms with no change in workload. The diagnosis is pressure-stall accounting: cat /sys/fs/cgroup/<your-cg>/io.pressure and watch whether the some/full stall totals climb while device utilisation stays low. Production io_uring deployments on Kubernetes need explicit io.max and memory.high settings or they will be silently throttled during noisy-neighbour events.
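A sketch of that watch loop, using the kernel's pressure-stall (PSI) accounting; the cgroup path is an example — substitute your container's:

# io_psi.py — watch PSI I/O stall time for a cgroup; a climbing 'full' total
# while the device is not saturated is the throttling signature described above.
import time

PSI = "/sys/fs/cgroup/system.slice/io.pressure"   # example cgroup path

def full_total_us():
    for line in open(PSI):
        if line.startswith("full"):               # "full avg10=... total=<usec>"
            return int(line.rsplit("total=", 1)[1])
    return 0

prev = full_total_us()
while True:
    time.sleep(5)
    cur = full_total_us()
    print(f"I/O full-stall: {(cur - prev) / 1e6:.3f} s over last 5 s")
    prev = cur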

Common confusions

O_DIRECT is not O_SYNC. O_DIRECT bypasses the page cache; it says nothing about durability — a completed O_DIRECT write can still be sitting in the device's volatile write cache. O_SYNC (or an explicit fsync/FUA) is what forces data to stable media. Databases that care about commits use both.

io_uring is not epoll. epoll tells you a descriptor is ready so that you can then issue the I/O; io_uring performs the I/O and tells you when it has completed. Readiness models cannot express regular-file I/O (a file is always "ready"); completion models cover files and sockets alike.

Async is not parallel. One io_uring thread at qd=256 has 256 I/Os in flight but is still one thread; the parallelism lives in the device's queues and the kernel's block layer, not in your code.

Going deeper

liburing — the userspace API everyone actually uses

The raw io_uring syscalls (io_uring_setup, io_uring_enter, io_uring_register) are usable but tedious. liburing (the official userspace wrapper, also by Jens Axboe) provides ergonomic helpers — io_uring_get_sqe, io_uring_prep_read, io_uring_submit, io_uring_wait_cqe — that handle the ring bookkeeping. Almost no production code uses the raw syscalls; everything builds on liburing (C) or its bindings (tokio-uring for Rust, python-liburing and aiouring for Python, gnet and iouring-go for Go).

The Python bindings have caught up enough to be usable for harnesses and small servers, but for sub-microsecond per-request paths the C/Rust path is still where production workloads live. The benefit of the bindings for this curriculum is that you can write a Python script that drives a million IOPS through io_uring without dropping into C — useful for measurement and exploration, even if your production database engine is itself C++.

Linked and chained SQEs

A submission queue entry can be marked IOSQE_IO_LINK, which tells the kernel: do not start the next SQE until this one completes successfully. This lets you build mini-pipelines — "open this file, read 4 KB from offset 0, close it" — as three linked SQEs and submit them as a batch. The kernel executes them in order without any user-space round-trip. For workloads that involve dependent I/Os (a directory scan that reads each file's first block), linked SQEs eliminate the request-then-act pattern's syscall cost.

Chained SQEs are a stronger version: IOSQE_IO_HARDLINK continues even if a link fails. The two together let you encode small workflows in the submission ring directly, which is part of why io_uring has been described as "an OS-level coroutine system that happens to do I/O".

io_uring for networking — the unification story

Through 2022–2024, io_uring grew opcodes for every common networking call: send, recv, accept, connect, sendmsg, recvmsg, close. By kernel 6.0 the entire socket lifecycle could happen in a single ring. This is genuinely new: previous async-I/O designs handled either disk or network, not both, so a unified application had to maintain an epoll loop and an aio loop in parallel.

For a modern web server (an HTTP/3 reverse proxy, an API gateway), this means one ring drives all I/O. The architectural simplification is significant: instead of a thread per connection (Apache prefork), or a thread per CPU with epoll (nginx), or a goroutine per connection (Go's net/http), you can have a single thread per CPU with one io_uring ring, processing thousands of simultaneous connections at memory-bandwidth limited cost. Cloudflare's pingora and Tigerbeetle's network layer both ship this design in production.

Production failure modes

io_uring has had real CVEs — kernel-level I/O subsystems are large, complex, and a target for attackers. In 2023, Google disabled io_uring on production ChromeOS and Android due to a string of kernel bugs that turned ring buffers into a kernel privilege escalation vector. The mitigation was the io_uring_disabled sysctl (added in kernel 6.6) which lets administrators disable io_uring globally or per-process. Modern container runtimes (containerd, CRI-O) have a default-deny seccomp profile that blocks the io_uring syscalls unless explicitly allowed.
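Checking whether a box allows io_uring at all is one file read on 6.6+ kernels; a sketch (note that a seccomp profile can still block the syscalls even when the sysctl says enabled):

# uring_enabled.py — read the kernel.io_uring_disabled sysctl (kernel 6.6+).
# 0 = enabled for everyone, 1 = restricted, 2 = disabled entirely.
from pathlib import Path

p = Path("/proc/sys/kernel/io_uring_disabled")
if p.exists():
    v = int(p.read_text().strip())
    print({0: "enabled for all", 1: "restricted", 2: "disabled"}.get(v, f"unknown ({v})"))
else:
    print("sysctl absent (kernel < 6.6): io_uring gated only by seccomp/LSM policy")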

For Razorpay-class regulated workloads, the procurement question becomes: do you allow io_uring on shared multi-tenant nodes? The conservative answer in 2026 is "only on dedicated database nodes where the database itself is the only userspace process and the kernel is hardened (locked-down kernel, eBPF disabled for non-root, restricted sysctls)". On shared application nodes, libaio remains the safer default — slower but with a smaller attack surface.

Polling vs interrupts — IORING_SETUP_IOPOLL and the nvme.poll_queues knob

A separate poll mode lives below SQPOLL: with IORING_SETUP_IOPOLL and a device that supports polled queues (set nvme.poll_queues=32 at boot, or write /sys/module/nvme/parameters/poll_queues and reset the controller — the parameter only takes effect when the queues are allocated), the kernel reaps NVMe completions by polling the device's CQ — not by waiting for an MSI-X interrupt. This eliminates the IRQ delivery latency (typically 1–3 µs per completion on x86) and the IRQ-to-completion thread context switch.

The cost is one CPU core continuously polling the device. The benefit is that p99 latency drops by 5–15% on high-IOPS workloads — useful for trading systems and ad-bidding pipelines where every microsecond matters. Combined with SQPOLL on the submission side, you get an end-to-end I/O path with no interrupts and no syscalls in the steady state — pure shared-memory communication between the application and the device, mediated by two kernel polling threads. ScyllaDB and TigerBeetle use this configuration in production; most other workloads find the CPU cost of two dedicated polling cores not worth the latency improvement.

Reproducibility footer

# Reproduce this on your laptop, ~10 minutes
sudo apt install fio liburing-dev linux-tools-generic
python3 -m venv .venv && source .venv/bin/activate
# no pip install needed — the harness uses only the stdlib
# Pick a free NVMe device or a 4G loop file:
sudo python3 io_engines.py /dev/nvme0n1
# Or against a file (slower, but no root needed if the file is yours):
truncate -s 4G ./io_test.dat   # not under /tmp — tmpfs rejects O_DIRECT
python3 io_engines.py ./io_test.dat
# Watch syscall rate to confirm io_uring's zero-syscall claim:
sudo perf stat -e raw_syscalls:sys_enter -p $(pgrep -d, fio) sleep 5

Where this leads next

The next chapter (/wiki/page-cache-and-its-promises) covers what you actually give up when you set O_DIRECT — the read-ahead heuristics, the dirty-page writeback policy, the vm.swappiness knob — and when those losses outweigh the I/O wins. Most workloads benefit from O_DIRECT; the ones that don't are usually re-read-heavy and small-working-set, and recognising them in advance saves an embarrassing benchmark cycle.

The chapter after that (/wiki/sequential-vs-random-on-modern-storage) covers how the io_uring + O_DIRECT combination interacts with the device's own internal parallelism. NVMe is not a single sequential stream — it is dozens of hardware queues, each holding hundreds of outstanding commands, and the application's queue depth needs to match the device's parallelism for the spec-sheet IOPS to materialise.

Three operational habits this chapter adds. First, measure your I/O engine before tuning anything else — many "Postgres is slow" investigations end at "you're on libaio with qd=8 against a device that wants qd=128 io_uring". Second, count syscalls in production with perf stat -e raw_syscalls:sys_enter -p <pid> — a database doing 200K IOPS that shows 200K syscalls per second is leaving 4× performance on the table. Third, separate the I/O thread pool from the compute thread pool — io_uring lets one thread drive 400K IOPS, but only if it isn't also doing query parsing or B-tree splits.

Zerodha's Kite tick-aggregation pipeline went through this evolution publicly. Their original design used Java NIO + buffered I/O on ext4 defaults, capping at 38K IOPS per node. Migrating to a Rust rewrite on tokio-uring with O_DIRECT against XFS-tuned NVMe took the same workload to 280K IOPS per node — a 7.4× density improvement that shrunk their tick-aggregation cluster from 22 nodes to 4. The configuration change visible in /etc/systemd/system/kite-aggregator.service is two lines (LimitMEMLOCK=infinity for io_uring's pinned buffers, and a kernel tunable kernel.io_uring_disabled=0); the actual code change was 2,400 lines of Rust replacing 8,100 lines of Java. The lesson: the kernel's modern I/O API is several times faster than the kernel's legacy I/O API, and the application code that exploits it is often simpler, not more complex, because async-without-syscalls maps cleanly onto coroutines (Rust async, Go goroutines, Python asyncio).

The contrast with Hotstar's video-segment writer is informative. Their workload is 4 MB sequential writes, queue depth ≤ 4, throughput ~600 MB/s per node. They measured psync vs libaio vs io_uring and saw a difference of less than 5% across all three engines — at low queue depth and large block sizes, the per-syscall cost is amortised over 4 MB of useful work, so the engine choice is irrelevant. They stayed on psync for operational simplicity. The decision tree is "if your queue depth is high or your block size is small, io_uring is worth it; otherwise the default is fine". Recognising which side of that line your workload falls on saves you both the migration effort and the false confidence that comes from a benchmark that didn't measure your actual workload.

A third habit worth building: read the liburing examples directory before writing any io_uring code. It is roughly 60 short C programs (and growing), each demonstrating one feature — link.c for linked SQEs, read-write.c for the basic read/write pattern, recv-multishot.c for multi-shot networking, napi-busy-poll.c for the network-poll integration. Most production io_uring bugs come from getting the SQE-to-CQE bookkeeping wrong (forgetting to advance the CQ head, double-submitting a buffer that's still in flight, mismatching user_data between submission and completion). The example programs encode the patterns that actually work; reinventing them from the man pages tends to produce subtly broken code that works in benchmarks and fails under load.

The PhonePe transaction-log writer team published their migration story at IndiaFOSS 2025: their UPI transaction journal was on a custom Java NIO writer hitting 12K writes/sec per node. Migrating to a Rust + tokio-uring rewrite with O_DIRECT against tuned XFS pushed the same workload to 78K writes/sec per node — a 6.5× improvement that let them serve UPI's 40% YoY growth without adding nodes. The interesting detail in the talk was that the bottleneck was not the I/O layer until they fixed the engine; once io_uring was in place, the next bottleneck became the journal's serialisation lock, then the protobuf marshalling, then the TCP send buffer sizing. Each layer that you optimise reveals the next bottleneck, and the order in which you encounter them is roughly fixed: I/O engine → filesystem → memory allocator → serialisation → network. Knowing the order saves the random-walk through performance investigation.

The final piece of the puzzle is that the I/O engine choice has compounding effects on operational cost. A node that does 250K IOPS instead of 38K IOPS doesn't just serve more requests — it serves them with lower per-request CPU, which means more spare capacity for the actual application logic, which means smaller VMs (or fewer of them) for the same SLO. The PhonePe number above (6.5× more throughput per node) translated to roughly 4.2× fewer EC2 instances at the new throughput, because the same workload had headroom to spare on the new engine. At the scale of a production UPI cluster (tens of thousands of nodes during peak Diwali shopping), the AWS bill for the I/O-tier alone fell from ₹4.8 crore/month to ₹1.1 crore/month. The engine change paid for two years of platform-team salary in the first quarter — and that is the recurring shape of these I/O-layer modernisations across the Indian internet.

A minor closing note on portability. io_uring is Linux-only — FreeBSD, macOS, and Windows all have their own async-I/O stories (kqueue+aio, kqueue+GCD, IOCP respectively). For a cross-platform storage engine, the abstraction layer typically picks io_uring on Linux and falls back to thread-pool emulation elsewhere; the runtime cost on the fallback path is roughly 2–3× higher CPU per I/O. For Indian teams shipping to Linux servers (the dominant production target), this is a non-issue; for teams shipping desktop or mobile applications, the abstraction layer's implementation quality on the non-Linux paths often matters more than the io_uring path itself.
