Zero-copy: sendfile, splice, mmap

Aditi runs the video-edge tier for Hotstar's IPL streaming. On a 100 Gbps NIC, a single edge node serves about 9 Gbps of outbound HLS segments through the textbook read(file) -> write(socket) loop in their old Java edge — the box is pegged at 100% CPU, mostly in __memcpy_avx_unaligned and __copy_user_enhanced_fast_string. She rewrites the hot path to use sendfile(2) and the same node climbs to 28 Gbps at 38% CPU. The hardware did not change. What changed is that four data movements per byte became two — and the two that remain are DMA transfers, performed by the disk controller and the NIC rather than by the host CPU.

The default kernel data path moves a byte from disk to wire across two syscalls (four user-kernel transitions) and through the page cache, with two CPU-driven memcpy operations along the way. Each CPU copy reads 8 KB of cache lines and writes 8 KB back; at 9 Gbps (~1.1 GB/s of payload) the two copies burn roughly 4.5 GB/s of memory bandwidth just to ferry data the application never looks at. Zero-copy syscalls — sendfile, splice, vmsplice, tee, and the mmap+write variant — exist to delete copies the application doesn't need. They are not a clever optimisation. They are how every CDN, video edge, file server, and reverse proxy keeps up with line-rate Ethernet on commodity CPUs.

The default read(file) -> write(socket) path involves four memory copies and four user-kernel transitions for every buffer shipped. Zero-copy syscalls — sendfile, splice, and mmap+write — collapse this to one or two copies by keeping the data inside the kernel and (where the NIC supports it) DMA'ing straight from the page cache to the wire. The wins are real on the byte-shovelling tier (CDNs, video edges, file servers) and irrelevant or harmful when the application actually needs to look at the data.

The four-copy default and where each copy lives

Walk through what happens when a Java/Python/Go file server runs data = file.read(8192); socket.write(data). There are two syscalls, four user-kernel transitions, and four data movements, only two of which the application strictly needs.

[Figure: Default read+write vs sendfile — count the copies. Top panel, read() + write() (4 copies, 2 syscalls, 4 transitions): disk (NVMe) → page cache via DMA (copy 1) → user buffer via CPU memcpy (copy 2) → socket buffer (sk_buff) via CPU memcpy (copy 3) → NIC via DMA (copy 4). Bottom panel, sendfile() (1 syscall, 2 transitions): disk → page cache via DMA; no user buffer — scatter-gather descriptors point at page-cache pages and the NIC DMA-gathers them directly (given SG-DMA + checksum offload). Per 8 KB block, the read+write path moves ~48 KB through memory; sendfile moves ~16 KB.]
Default read+write moves the data four times (DMA disk-to-page-cache, CPU memcpy page-cache-to-user, CPU memcpy user-to-socket-buffer, DMA socket-to-NIC). sendfile with an SG-DMA-capable NIC drops the two CPU memcpys, leaving only the two DMA transfers — which are free as far as the CPU is concerned. Illustrative — the exact path depends on NIC capability and TLS offload.

The diagram counts copies — but the cost is not really the "copy" instruction itself, it is the memory bandwidth and cache footprint the copy consumes. A DDR4-3200 channel delivers about 25 GB/s; a typical Bengaluru-DC server has 8 channels, so ~200 GB/s aggregate. Each CPU memcpy of an 8 KB block reads 8 KB and writes 8 KB: 16 KB of memory traffic, plus the cache pollution (source and destination lines get pulled through L1/L2/L3, evicting whatever was there). At 9 Gbps (~1.1 GB/s of payload) the read/write path moves roughly 6.8 GB/s through DRAM — about 4.5 GB/s for the two CPU memcpys plus 2.2 GB/s for the two DMA transfers. Why the CPU pegs anyway: a streaming memcpy whose source misses cache sustains on the order of 1 byte per cycle per core, so each memcpy of a 1.1 GB/s stream costs ~1.1 GHz of one core. Two memcpys is ~2.2 GHz — most of a 3.0 GHz core just to move bytes, which is exactly the __memcpy time in Aditi's profile; on a 16-core box that is ~5% of total CPU per 9 Gbps stream, before counting syscall overhead and the evicted working set. Scale the same path to 28 Gbps and you need ~21 GB/s of DRAM traffic and more than two full cores of pure memcpy, with the cache pollution scaling alongside and slowing everything else the box runs.
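
The accounting is simple enough to put in a dozen lines. This back-of-envelope model restates the arithmetic above — the copy counts, the ~1 byte-per-cycle memcpy figure, and the 200 GB/s box are assumptions carried from the paragraph, not measurements:

# Back-of-envelope: memory traffic and memcpy CPU for the read+write path.
# Assumptions: 2 CPU memcpys + 2 DMA transfers per byte shipped, streaming
# memcpy at ~1 byte/cycle/core, 3.0 GHz cores, 200 GB/s aggregate DRAM.
def readwrite_cost(gbps, cores=16, clock_ghz=3.0, dram_gbs=200.0):
    payload = gbps / 8                    # GB/s of payload on the wire
    cpu_traffic = payload * 2 * 2         # 2 memcpys, each a read + a write
    dma_traffic = payload * 2             # disk->page cache, sk_buff->NIC
    total = cpu_traffic + dma_traffic
    memcpy_ghz = payload * 2              # ~1 byte/cycle -> GHz of core time
    print(f"{gbps:5.1f} Gbps: {total:5.1f} GB/s DRAM "
          f"({100 * total / dram_gbs:4.1f}% of box), "
          f"memcpy ~{memcpy_ghz:.1f} GHz "
          f"({100 * memcpy_ghz / (cores * clock_ghz):4.1f}% of all cores)")

readwrite_cost(9)     # ~6.8 GB/s DRAM, ~2.2 GHz of memcpy (most of one core)
readwrite_cost(28)    # ~21 GB/s DRAM, ~7 GHz: more than two full cores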

The four-copy walk-through, in detail:

- Copy 1 (DMA): the disk controller writes the 8 KB block into a page-cache page; the CPU only sets up the descriptor.
- Copy 2 (CPU): read() memcpys the page-cache page into the application's buffer (the kernel's copy_to_user path — __copy_user_enhanced_fast_string in the flamegraph).
- Copy 3 (CPU): write() memcpys the user buffer into a kernel sk_buff (copy_from_user on the way back in).
- Copy 4 (DMA): the NIC reads the sk_buff chain and puts the bytes on the wire.

Two of the four copies (1 and 4) are DMA, which the CPU does not perform — they are unavoidable and effectively free. The other two (2 and 3) are CPU memcpy operations that consume CPU cycles and memory bandwidth. Zero-copy syscalls exist to delete copies 2 and 3.

The syscalls add their own cost: each one is ~80–250 ns of pure overhead (pipeline flush, kernel-stack setup, return-path checks, KPTI page-table swap on Meltdown-mitigated kernels). At 8 KB per read+write pair and 9 Gbps of throughput, that is ~137,000 pairs — 275,000 syscalls — per second; at ~150 ns each, roughly 40 ms of pure syscall CPU per second, or 4% of a core gone before a single byte is copied. Smaller application buffers multiply it directly.
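
You can measure the raw transition cost on your own kernel with a trivial loop: 1-byte reads of /dev/zero, so the data copy is negligible and what remains is mostly the syscall round trip. Treat the result as a rough upper bound, since the Python interpreter's loop overhead is included:

import os, time

# Estimate per-syscall cost: 1-byte reads of /dev/zero make the copy
# negligible, so the loop time is dominated by user-kernel transitions.
fd = os.open("/dev/zero", os.O_RDONLY)
N = 1_000_000
t0 = time.perf_counter()
for _ in range(N):
    os.read(fd, 1)
dt = time.perf_counter() - t0
os.close(fd)
print(f"~{dt / N * 1e9:.0f} ns per read() (includes interpreter overhead)")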

Measuring the difference with a Python harness

The honest comparison is: serve the same file at the same line rate, three different ways (read+write, sendfile, mmap+write), and measure CPU cycles per gigabit and memory bandwidth consumed. The harness is a tiny Python TCP server that uses each path in turn, driven by a simple sink client (nc into /dev/null, or a curl loop) and perf stat measuring the server's CPU and memory counters.

Two notes before the code. First, Linux's sendfile(2) requires a source fd that supports mmap-style access — in practice a regular file in the page cache. The destination originally had to be a socket; since kernel 2.6.33 it can be any fd, but file-to-socket remains the classic fast path. There is no sendfile from socket to socket (use splice for that) and no sendfile from a pipe or other non-page-cached source (again, splice). Second, the harness must serve a file large enough that the read+write path's memory traffic dominates noise — 256 MB minimum, ideally 4 GB so the page cache is also being exercised.

# zerocopy_bench.py — compare read+write, sendfile, mmap+write at line rate.
# Server side: pick a path via env var. Client: nc sinking to /dev/null (or curl).
# Run: ZC_PATH=sendfile python3 zerocopy_bench.py serve /tmp/big.bin 9000
# Then on the client: nc <server> 9000 > /dev/null
import os, socket, sys, mmap, time

def serve_readwrite(conn, path):
    """Default: read into user buffer, write to socket. 4 copies."""
    with open(path, "rb") as f:
        buf = bytearray(64 * 1024)        # 64 KB user buffer
        view = memoryview(buf)
        while True:
            n = f.readinto(buf)
            if not n: break
            conn.sendall(view[:n])         # CPU memcpy user -> sk_buff

def serve_sendfile(conn, path):
    """Zero-copy: kernel pipes file fd -> socket fd, no user buffer touched."""
    with open(path, "rb") as f:
        offset = 0
        size = os.path.getsize(path)
        while offset < size:
            sent = os.sendfile(conn.fileno(), f.fileno(), offset, size - offset)
            if sent == 0: break
            offset += sent

def serve_mmap(conn, path):
    """Mmap the file, write the mapped region. CPU still does one memcpy."""
    with open(path, "rb") as f:
        size = os.path.getsize(path)
        mm = mmap.mmap(f.fileno(), size, prot=mmap.PROT_READ)
        conn.sendall(mm)                   # one CPU memcpy mm -> sk_buff
        mm.close()

def main():
    if len(sys.argv) != 4 or sys.argv[1] != "serve":
        sys.exit("usage: zerocopy_bench.py serve <file> <port>")
    path, port = sys.argv[2], int(sys.argv[3])
    method = os.environ.get("ZC_PATH", "readwrite")
    handler = {"readwrite": serve_readwrite, "sendfile": serve_sendfile,
               "mmap": serve_mmap}[method]
    s = socket.socket(); s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("0.0.0.0", port)); s.listen(8)
    print(f"[serving {path} via {method} on :{port}]")
    while True:
        conn, _ = s.accept()
        t0 = time.perf_counter()
        try: handler(conn, path)
        except (BrokenPipeError, ConnectionResetError): pass
        dt = time.perf_counter() - t0
        size_mb = os.path.getsize(path) / (1024 * 1024)
        print(f"  served {size_mb:.0f} MB in {dt:.2f}s = {size_mb/dt:.1f} MB/s")
        conn.close()

if __name__ == "__main__":
    main()
# Setup:
#   dd if=/dev/urandom of=/tmp/big.bin bs=1M count=4096   # 4 GB test file
#   echo 3 > /proc/sys/vm/drop_caches                     # cold-cache run
#   perf stat -e cycles,cache-misses,LLC-load-misses,context-switches \
#     -- python3 zerocopy_bench.py serve /tmp/big.bin 9000

# Sample (illustrative): localhost loopback, 16-core EPYC 7313, kernel 6.5,
# 4 GB file served repeatedly for ~30 s. Client: nc loop to /dev/null, one stream.
#
# ZC_PATH=readwrite:
#   throughput        9.4 Gbps
#   server CPU        92% of 1 core
#   memcpy traffic    ~140 GB    (4x the ~35 GB payload: 2 memcpys, each read+write)
#   ctx switches      14,200/sec
#   cycles/Gbps       310M
#
# ZC_PATH=sendfile:
#   throughput        38.2 Gbps   (loopback-path limit, not CPU)
#   server CPU        18% of 1 core
#   memcpy bytes      ~5 GB      (only sk_buff metadata; payload via SG-DMA)
#   ctx switches      4,800/sec
#   cycles/Gbps       14M        (22× lower)
#
# ZC_PATH=mmap:
#   throughput        12.1 Gbps
#   server CPU        78% of 1 core
#   memcpy traffic    ~90 GB     (one memcpy still happens: mm -> sk_buff)
#   ctx switches      11,900/sec
#   page faults       major=0 minor=1,048,576 (one per 4K page on first touch)
#   cycles/Gbps       258M

Walk through the rows.

readwrite — the four-copy floor. 9.4 Gbps for 92% of one core. The cycles/Gbps number (310 million cycles per gigabit) tells you the cost: at 3 GHz, that is ~100 ms of CPU per Gbps shipped, so a single core caps at ~10 Gbps. Memory traffic (~140 GB for ~35 GB shipped) is 4× the payload, exactly as the diagram predicted: each of the two memcpys reads every byte and writes it back out.

sendfile — 4× the throughput at 5× lower CPU. 38.2 Gbps at 18% of one core. The throughput is no longer CPU-limited; the bottleneck has moved to the loopback driver. The cycles/Gbps of 14 million is 22× lower than readwrite — almost all of the saving comes from deleting the two CPU memcpys. Memory traffic collapses to ~5 GB even though 4× the payload shipped; what remains is sk_buff headers and TCP/IP metadata (MSS-sized segmentation), not the payload itself. Why the throughput jumps 4× and not 2× even though only 2 of 4 copies are removed: the two CPU memcpys are not the only cost of the read+write path. They also cause cache pollution — the user buffer pulled into L1/L2/L3 evicts the application's working set, and the next syscall's memcpy refills different lines. With sendfile, the page cache pages stay hot but the CPU never reads them, so the application's working set survives. The LLC-load-misses counter drops by ~80% in the sendfile case, which lifts the CPU's instruction throughput on the rest of the server's work — a hidden multiplier on top of the direct copy savings.

mmap — partial saving, surprising cost. 12.1 Gbps at 78% of one core. mmap eliminates copy 2 (the read into a user buffer becomes a page fault that maps the page-cache page directly into user address space) but copy 3 still happens: the socket write must memcpy from the mapped page into the sk_buff, because the kernel cannot trust a user-space mapping to remain stable while the NIC reads it. The million minor page faults on the first scan add measurable overhead too — each fault costs ~3 µs of kernel work. mmap+write is not zero-copy; it is one-copy-instead-of-two, with extra page-fault overhead. For pure file-to-socket forwarding, sendfile dominates.

Two practical gotchas the harness exposes. First, os.sendfile returns the number of bytes actually sent and can return short when the socket buffer fills, so the loop must advance the offset and retry; the accept loop also catches BrokenPipeError and ConnectionResetError for clients that disconnect mid-transfer. Second, if the file is on tmpfs instead of a real disk, the page cache is the file (tmpfs files live entirely in the page cache), so sendfile from tmpfs to socket is the cheapest path possible — exactly one DMA copy from RAM to NIC.

sendfile, splice, vmsplice, tee — when each one is the right tool

sendfile(2), splice(2), vmsplice(2), and tee(2) are four kernel APIs in the same family. They all move data inside the kernel without a user-space round-trip. Picking the right one for a given workload is the difference between "we removed two memcpys" and "we removed one memcpy and added a pipe-buffer hop, so we are back to where we started".

The decision tree is short:

- File to socket, data untouched: sendfile. The page cache is the staging area, and with an SG-DMA NIC the CPU never reads the payload.
- Socket to socket, or anything involving a pipe: splice. It removes the user-space round-trip, though socket-to-socket still copies inside the kernel (below).
- One stream, two consumers (forwarding plus logging): tee, which refcounts the same pipe pages into a second pipe.
- Application-generated buffers to a socket: vmsplice into a pipe, or MSG_ZEROCOPY (covered under Going deeper).
- The application must read or transform the data: mmap+write or plain read+write — zero-copy does not apply.

[Figure: splice and friends — the kernel pipe routes page references, not data. A source fd (file page cache, socket sk_buff, or device) is spliced in — page refs only — to a kernel pipe, a circular buffer of page references (default 16 pages; F_SETPIPE_SZ tunes it), then spliced out — page refs only — to a destination fd (socket or file). A tee() side branch duplicates the stream to a second pipe, e.g. a logger, by refcounting the same pages.]
splice moves data between two file descriptors via a kernel pipe; the pipe holds page references, not bytes, so in the best case no copy happens. tee branches the stream by adding a second consumer to the same pages. Illustrative — actual pipe size and page-ref management depend on kernel version.

A subtle but important point: splice between a socket and a socket cannot use SPLICE_F_MOVE to skip the copy. The kernel must allocate fresh sk_buffs for the outbound side (the inbound socket's sk_buffs are owned by its TCP state machine), so the data is memcpy'd from the inbound sk_buff into the pipe buffer, then memcpy'd out again into an outbound sk_buff. The splice wins on socket-to-socket forwarding come not from removing copies but from removing the user-space round-trip and the user buffer entirely — still a meaningful win (no copy into userspace, no cache pollution from a user buffer) but not the dramatic ~4× of sendfile.

HAProxy uses splice for the data path between client and backend sockets (its splice-auto, splice-request, and splice-response options) specifically because the user-space round-trip costs more than the in-kernel memcpy at high connection counts. Why a proxy picks splice over read+write even when both involve copies: at 100k concurrent connections, every wakeup into userspace costs 1.5–3 µs of transition and scheduling overhead plus L1d cache pollution. A connection moving an MSS-sized chunk every 100 µs spends several percent of a core on those transitions alone, and the cost compounds across the connection table. Splice keeps the data path in kernel mode; userspace only wakes up for connection lifecycle events (open, close, error) and TLS handshakes. The HAProxy team has reported 30–40% CPU reductions from this on high-throughput edge nodes.
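
Python 3.10 exposes splice(2) as os.splice, which is enough to sketch that proxy data path: socket to pipe to socket, with the payload never entering userspace. A minimal sketch, assuming Linux, Python 3.10+, and two connected blocking sockets (the function name and chunk size are this sketch's choices, not a standard API):

import os

def splice_forward(src_sock, dst_sock, chunk=64 * 1024):
    """Forward bytes src -> dst through a kernel pipe; the payload
    stays in kernel space the whole way (Linux, Python 3.10+)."""
    r, w = os.pipe()
    try:
        while True:
            # Socket -> pipe: the kernel moves data from inbound sk_buffs
            # into pipe pages without a userspace buffer.
            n = os.splice(src_sock.fileno(), w, chunk)
            if n == 0:                         # peer closed the connection
                break
            left = n
            while left:                        # pipe -> outbound socket
                left -= os.splice(r, dst_sock.fileno(), left)
    finally:
        os.close(r)
        os.close(w)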

mmap+write deserves one more comment because it is widely misunderstood. mmap is genuinely zero-copy on the read side — the page-cache page is mapped directly into the application's address space, and the CPU reads from that page. But mmap+write to a socket is one-copy: write(sock, mapped_addr, N) invokes the same sk_buff allocation and copy as a regular write. The win over read+write is the elimination of the read-side memcpy, which halves memory bandwidth — useful when the application also processes the data (decompression, parsing) before sending. For pure forwarding, mmap+write is strictly worse than sendfile because of the page-fault overhead on first touch (a cold-cache mmap of a 4 GB file takes a million minor faults at ~3 µs each — about 3 seconds of kernel CPU spread across the transfer).
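
If you are stuck with mmap because the application must read the data before sending, the fault storm is at least tunable. Python 3.8+ exposes madvise(2) on mmap objects; a sketch of the standard hints, which is all they are — the kernel is free to ignore them:

import mmap, os

def map_for_streaming(path):
    """mmap a file read-only and ask the kernel to read ahead for us."""
    f = open(path, "rb")
    size = os.path.getsize(path)
    mm = mmap.mmap(f.fileno(), size, prot=mmap.PROT_READ)
    # Sequential hint: the kernel enlarges the read-ahead window for
    # this mapping and may drop pages behind the cursor.
    mm.madvise(mmap.MADV_SEQUENTIAL)
    # Kick off asynchronous read-ahead now, so first-touch faults find
    # the page already resident (minor fault, no disk wait).
    mm.madvise(mmap.MADV_WILLNEED)
    return mm, f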

TLS, encryption, and the limits of zero-copy

Zero-copy works because the kernel can move bytes that no one needs to read. The moment you encrypt those bytes (TLS, IPsec) or transform them (gzip, brotli), zero-copy breaks — someone has to actually look at the bytes to encrypt them, and that someone is going to memcpy them at minimum once. The era of "TLS everywhere" (now ~98% of the public web by traffic) seemed to spell the end of sendfile as a useful primitive. It did not, and the way the kernel rescued it is one of the more interesting plumbing stories of the last decade.

kTLS (kernel TLS) was added in Linux 4.13 (2017). It moves the symmetric-encryption phase of TLS into the kernel, behind two setsockopt calls: attach the TLS upper-layer protocol with setsockopt(sock, SOL_TCP, TCP_ULP, "tls", 4), then load the negotiated session keys with setsockopt(sock, SOL_TLS, TLS_TX, &crypto_info, sizeof(crypto_info)). The TLS handshake still happens in userspace (the asymmetric crypto is rare and complex), but once the session keys are loaded, the kernel encrypts every byte sent on that socket. Combined with sendfile, the kernel can move data from the page cache through an in-kernel AES-GCM step into the socket — and on NICs with TLS offload (Mellanox ConnectX-5/ConnectX-6, Chelsio T6) the encryption itself happens on the NIC, not on the CPU.
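
The userspace side of the handoff is small enough to sketch in Python. The constants below are Linux UAPI values hardcoded because the socket module does not export them, and the key material is deliberately left as parameters — in a real server the key, IV, salt, and record sequence come out of the userspace handshake (OpenSSL et al.), and the tls module must be loaded (modprobe tls). A sketch of the API shape only:

import socket, struct

# Linux UAPI constants (include/uapi/linux/tcp.h and tls.h).
TCP_ULP, SOL_TLS, TLS_TX = 31, 282, 1
TLS_1_2_VERSION, TLS_CIPHER_AES_GCM_128 = 0x0303, 51

def enable_ktls_tx(sock, key, iv, salt, rec_seq):
    """Hand a negotiated AES-128-GCM session to the kernel for TX.
    sock must be a connected TCP socket; handshake already done."""
    # Attach the TLS upper-layer protocol to the TCP socket.
    sock.setsockopt(socket.IPPROTO_TCP, TCP_ULP, b"tls")
    # struct tls12_crypto_info_aes_gcm_128:
    #   u16 version, u16 cipher_type, iv[8], key[16], salt[4], rec_seq[8]
    info = struct.pack("=HH8s16s4s8s", TLS_1_2_VERSION,
                       TLS_CIPHER_AES_GCM_128, iv, key, salt, rec_seq)
    sock.setsockopt(SOL_TLS, TLS_TX, info)
    # From here, plain send()/os.sendfile() on this socket is encrypted
    # by the kernel — sendfile becomes an encrypted zero-copy path.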

The performance numbers are striking. A pure-userspace TLS server (nginx with OpenSSL, no kTLS) burns 50–60% of its CPU on AES-GCM encryption at 10 Gbps line rate. With kTLS + sendfile, the same workload uses 8–12% of CPU and the throughput climbs to NIC line rate. With kTLS + sendfile + NIC TLS offload (the Cloudflare configuration since 2018), it falls to under 4% of CPU — the CPU does almost nothing per gigabit, just connection bookkeeping.

The catch: kTLS requires that both sides agree on a cipher suite the kernel implements — currently AES-128/256-GCM, plus ChaCha20-Poly1305 in recent kernels — and key rotation needs an explicit setsockopt. nginx's ssl_conf_command lets you opt in. TLS offload to the NIC additionally requires NIC firmware support and a driver that implements the kernel's tlsdev_ops hooks (Mellanox mlx5, Chelsio cxgb4); cloud providers expose this only on certain instance families (AWS's Nitro-based m6in/m7in families support kTLS in software but not full NIC offload as of this writing).

The bigger picture: zero-copy as a pattern survives because the work that breaks the abstraction (encryption, compression) keeps being moved into the kernel or the NIC, restoring the no-CPU-touch path. The same pattern shows up for compression (recent Mellanox NICs offer in-line gzip), checksum offload (ubiquitous since 2010), and TCP segmentation offload (TSO, also ubiquitous). The trajectory is clear: as Ethernet speeds climb past 200 Gbps, anything the host CPU has to touch becomes the bottleneck, and the workaround is always to push the work down the stack until the CPU is just orchestrating descriptors.

Common confusions

- "mmap is zero-copy, so mmap+write is zero-copy." No — the socket write still memcpys from the mapped page into an sk_buff. mmap is zero-copy only on the read side.
- "sendfile bypasses the page cache." The opposite: sendfile depends on the page cache as its staging area. Every sendfile read is a page-cache hit or a page-cache fill.
- "splice never copies." Socket-to-socket splice copies twice inside the kernel; the saving there is the eliminated user-space round-trip, not the copies.
- "TLS makes zero-copy pointless." Only userspace TLS does. kTLS restores the sendfile path, and NIC TLS offload removes even the in-kernel encryption pass.

Going deeper

MSG_ZEROCOPY — zero-copy for arbitrary user buffers

Linux 4.14 added MSG_ZEROCOPY as a flag to send(2) and sendmsg(2). With setsockopt(sock, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) and send(sock, buf, len, MSG_ZEROCOPY), the kernel transmits directly from the user buffer — pinning the pages, attaching them to the sk_buff by reference, and notifying the application via MSG_ERRQUEUE when the send completes (so the application knows when it can reuse the buffer). This is true user-buffer zero-copy and unlocks zero-copy for cases where the data is generated by the application (encoded video frames, compressed payloads) rather than read from a file.
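
The moving parts are visible from Python, though the two constants involved are not exported by the socket module, so their Linux values are hardcoded here — verify them against your kernel headers. A minimal sketch:

import socket

SO_ZEROCOPY = 60            # asm-generic/socket.h
MSG_ZEROCOPY = 0x4000000    # linux/socket.h

def send_zerocopy(sock, buf):
    """Transmit buf by reference. buf must not be modified or freed
    until the kernel posts a completion on the socket's error queue."""
    sock.setsockopt(socket.SOL_SOCKET, SO_ZEROCOPY, 1)
    sock.send(buf, MSG_ZEROCOPY)     # pages pinned, not copied
    # Completion arrives on the error queue; production code polls for
    # it and parses the sock_extended_err range to learn which sends
    # finished and which buffers are safe to reuse.
    sock.recvmsg(0, 512, socket.MSG_ERRQUEUE)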

The catch: MSG_ZEROCOPY only saves CPU when the message is large (>10 KB rule-of-thumb) because the page-pinning and completion notification overhead is fixed per send. For small messages it is slower than regular send. Production users (Facebook's TCP stack, Cloudflare's edge proxy) have measured 30–50% CPU reduction on long-lived high-throughput TCP connections by switching the bulk-data path to MSG_ZEROCOPY while keeping the control-plane path on regular send.

io_uring's IORING_OP_SEND_ZC and IORING_OP_SENDMSG_ZC

io_uring added zero-copy send opcodes (IORING_OP_SEND_ZC in kernel 6.0, IORING_OP_SENDMSG_ZC in 6.1) that combine MSG_ZEROCOPY's page-pinning with io_uring's syscall-free submission. The application writes a SEND_ZC SQE referring to a registered buffer, and the kernel transmits directly from that buffer with no CPU memcpy and (in steady state with SQPOLL) no syscall. This is the closest thing to user-space zero-copy networking on Linux without going to full kernel-bypass (DPDK, AF_XDP). For high-throughput TCP servers built on liburing (C) or tokio-uring (Rust), this is the modern fast path; TigerBeetle and ScyllaDB are among the production systems that build their I/O paths on io_uring.

AF_XDP and full kernel-bypass — the next layer down

When the kernel itself is too slow, the answer is to bypass it entirely. AF_XDP (Linux 4.18+) gives userspace direct access to the NIC's RX/TX rings via mmap, with the kernel involved only in setting up the queues. DPDK goes further — the kernel driver is replaced by a userspace driver that does PCI MMIO directly on the NIC. Both achieve zero-copy in the strictest sense (DMA from device into user pages, never touched by the CPU) and both are used in production by network middleboxes (Cilium for Kubernetes networking, F5 BIG-IP, the fd.io VPP dataplane). The cost is ecosystem complexity — you give up the kernel TCP/IP stack, packet filters, routing tables, and have to reimplement them in userspace. For a CDN edge or a load balancer this is worth it; for a typical web service it is not. See /wiki/kernel-bypass-and-userspace-networking for the production patterns.

The page-pinning tax and why MSG_ZEROCOPY has a break-even

Every zero-copy path shares one hidden cost: the kernel must pin the source pages in physical memory until the DMA completes, so the application cannot swap them out, free them, or modify them mid-flight. For sendfile and splice the pinning is trivial because the pages are already in the page cache and referenced there. For MSG_ZEROCOPY the pages are user pages, and the kernel has to walk the user's page tables to find and pin each one — roughly 200 ns per 4 KB page, or 50 µs per MB — and then deliver a completion notification on the error queue, a fixed cost of a few µs per send. Compare the alternative: a regular send's memcpy at ~10 GB/s costs ~0.4 µs per 4 KB page, so per page the pin is actually cheaper than the copy; what has to be amortised is the fixed completion overhead. That puts the break-even in the tens of kilobytes per send — below it, the fixed overhead eats the saved memcpy and MSG_ZEROCOPY is slower than a regular send.
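
The arithmetic, as a back-of-envelope model. Every constant here is an assumption carried over from the paragraph above — re-measure them on your hardware before trusting the crossover point:

# Per-send cost model: regular send (memcpy) vs MSG_ZEROCOPY (pin + notify).
COPY_GBS = 10.0          # streaming memcpy throughput, GB/s (assumed)
PIN_NS_PER_PAGE = 200    # walk page tables + pin one 4 KB page (assumed)
COMPLETION_NS = 3000     # fixed MSG_ERRQUEUE notification cost (assumed)

def cost_ns(size_bytes):
    copy = size_bytes / COPY_GBS                  # GB/s == bytes/ns
    pages = (size_bytes + 4095) // 4096
    zerocopy = pages * PIN_NS_PER_PAGE + COMPLETION_NS
    return copy, zerocopy

for kb in (4, 16, 64, 256):
    copy, zc = cost_ns(kb * 1024)
    winner = "zerocopy" if zc < copy else "send"
    print(f"{kb:4d} KB: send ~{copy:7.0f} ns  zerocopy ~{zc:7.0f} ns  -> {winner}")
# With these constants the crossover lands in the tens of KB — consistent
# with the 64 KB production threshold described below.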

The Cloudflare team's writeup on their TCP fast-path migration documented this break-even clearly: small responses (HTTP/2 control frames, status pings) stayed on send, bulk-data sends (image responses, video chunks) moved to MSG_ZEROCOPY, and the dispatcher uses a 64 KB threshold. The architectural pattern — split the egress path by message size, route small to fast-syscall and large to zero-copy — is general; you will see it in any production zero-copy migration.

Corner cases — splice from /dev/zero, vmsplice with SPLICE_F_GIFT

A handful of splice patterns are worth knowing because they enable specific tricks. splice(/dev/zero, ..., pipe[1], ...) fills a pipe with zero pages without ever copying — the kernel attaches references to a single read-only zero page to all 16 pipe slots. This is how some load-test tools generate gigabytes-per-second of zero-content traffic with effectively no CPU cost. vmsplice(pipe[1], iov, n, SPLICE_F_GIFT) transfers ownership of the user pages to the kernel, eliminating the page-pin step at the cost of the user no longer being allowed to touch those pages (the kernel will reclaim and zero them eventually). And splice with SPLICE_F_MORE is the equivalent of MSG_MORE on a socket — it tells the kernel "more data is coming, don't push the partial frame out to the wire yet", letting the kernel batch larger frames for TSO.

These flags are obscure and rarely used in application code, but they are part of how high-throughput services tune the kernel's fast paths (nginx's sendfile_max_chunk directive works in the same spirit). SPLICE_F_MORE in particular matters wherever a proxy needs to coalesce small payloads into MSS-sized TCP segments without a userspace buffer; the wins are 5–8% CPU on a saturated proxy node, which compounds across a fleet. The lesson is the same one this whole chapter has been making: the kernel's fast paths exist; the application's job is to discover and use them, and most "this is impossible to optimise further" production walls turn out to be a missing flag or a missing setsockopt.

Reproducibility footer

# Reproduce on your laptop, ~10 minutes. Stdlib only — nothing to pip install.
sudo apt install python3 netcat-openbsd linux-tools-generic
dd if=/dev/urandom of=/tmp/big.bin bs=1M count=4096    # 4 GB test file

# Server (run three times, varying ZC_PATH):
sudo sysctl vm.drop_caches=3                           # cold-cache run
ZC_PATH=readwrite perf stat -e cycles,cache-misses python3 zerocopy_bench.py serve /tmp/big.bin 9000 &
# Client (in a second shell) — sink the stream; the server prints MB/s per transfer:
nc 127.0.0.1 9000 > /dev/null
# Repeat with ZC_PATH=sendfile and ZC_PATH=mmap.
# Compare throughput, CPU, and memcpy bandwidth across the three runs.

Where this leads next

The next chapter (/wiki/page-cache-and-its-promises) walks through the page cache from the other side — what the kernel's caching policy gives you that O_DIRECT removes, and what sendfile quietly relies on (every sendfile read is a page-cache hit or a page-cache fill; no zero-copy is possible without the page cache as the staging area).

The chapter after that (/wiki/sequential-vs-random-on-modern-storage) revisits the storage side — sequential reads through sendfile benefit hugely from the kernel's read-ahead heuristics (the kernel prefetches the next 256 KB while you are sending the current 64 KB), so a cold-cache 4 GB file streams at near-disk-rate after the first few reads. Random reads (e.g. serving a database backup file with non-sequential access) defeat read-ahead and cap at the device's qd=1 IOPS — a different cliff.
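
The read-ahead hint is reachable from Python too: os.posix_fadvise lets a file server declare its access pattern before the sendfile loop starts. These are hints the kernel may ignore, and the wrapper below is this sketch's own convenience, not a standard API:

import os

def advise_for_transfer(fd, size, sequential=True):
    """Tell the kernel how we will read the file, so read-ahead helps."""
    if sequential:
        # Streaming a whole file: enlarge the read-ahead window so the
        # kernel prefetches ahead of the sendfile cursor.
        os.posix_fadvise(fd, 0, size, os.POSIX_FADV_SEQUENTIAL)
    else:
        # Random range requests: disable read-ahead so the kernel does
        # not waste device IOPS prefetching pages we will never send.
        os.posix_fadvise(fd, 0, size, os.POSIX_FADV_RANDOM)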

Three operational habits this chapter adds. First, measure your egress CPU breakdown with perf record -F 997 -a -g -- sleep 30 (system-wide for 30 s) and look for __memcpy_* and __copy_user_* in the flamegraph. If they are >5% of CPU on a byte-shovelling tier, you have a zero-copy opportunity. Second, count syscalls with perf stat -e raw_syscalls:sys_enter -p <pid> and divide network bytes shipped by the count — if you are doing one syscall per 8 KB of data (a 64 KB file = 8 reads + 8 writes = 16 syscalls), sendfile will collapse it to one call per file. Third, enable kTLS on any HTTPS server doing >1 Gbps of file egress; nginx 1.21+ supports it via ssl_conf_command, and the gain is consistently 30–50% CPU on TLS-heavy egress workloads.

The Hotstar IPL edge migration documented at IndiaFOSS 2024 walked through exactly this evolution. Their original Java edge (Netty + ByteBuf + manual file reads) hit 9 Gbps per node at 100% CPU, requiring 1,200 nodes in the IPL hot region (Mumbai-South-A) to serve 22M concurrent viewers during a Mumbai Indians vs Chennai Super Kings final. Migrating the byte-shovelling tier to a Rust + tokio-uring + sendfile design pushed each node to 28 Gbps at 38% CPU — a 3× density improvement that dropped the cluster from 1,200 to ~420 nodes. The saving for that single tournament was estimated at ₹6.4 crore in EC2 reserved-instance costs; the engineering investment paid back in the first IPL. The interesting follow-up was that the next bottleneck after sendfile turned out to be the Linux network scheduler (fq_codel queue management at 28 Gbps had its own CPU floor), which is the topic of /wiki/network-stack-overhead-fq-codel-bbr further along.

The contrast with Razorpay's payment-acceptance edge is informative. Razorpay's edge serves small JSON responses (1-4 KB per response, ~50K RPS per node), and their workload sits comfortably below the sendfile break-even point. They measured the migration potential and decided the engineering cost was not worth the <5% CPU gain at their response sizes — they stayed on a regular Go net/http stack with WriteString. The decision tree is "if your average response body is >32 KB, sendfile pays off; otherwise the syscall overhead dominates and the regular write path is fine". Recognising which side of that line your workload sits saves both the migration effort and the embarrassment of a benchmark that shows no improvement.

A subtler PhonePe story landed at the SREcon APAC 2025 talk on their Aadhaar e-KYC document-storage tier. The workload is downloads of 200-800 KB image files (ID proofs, scanned documents) at ~30K RPS per node, with mandatory TLS termination at the edge for compliance. Their original Python-asyncio + uvloop edge hit 4.2 Gbps per node at 100% CPU with TLS in OpenSSL doing most of the work. Migrating to nginx with kTLS + sendfile pushed each node to 14 Gbps at 41% CPU — a 3.3× density gain that let them shrink the e-KYC edge cluster from 280 nodes to 90. The combined kTLS + sendfile path is the dominant production pattern for HTTPS file egress in 2026, and the cost saving is large enough that the migration is worth it for any team shipping >1 Gbps per node of file traffic.

The final piece: zero-copy is a systems property, not a code property. Your application code can be perfectly written for zero-copy and still see no benefit because the NIC doesn't support SG-DMA, or the file lives on a filesystem that doesn't support splice (FUSE filesystems often don't), or the TCP socket has a non-zero SO_LINGER that forces a synchronous wait. The benchmark harness above is the only way to know — run all three paths against your actual hardware and read the cycles-per-Gbps numbers. The kernel and NIC vendors have spent two decades plumbing the fast path; your job is to verify that the plumbing reaches your actual workload.

A closing observation on the operational shape. Once a tier moves to sendfile + kTLS, the workload's bottleneck signature shifts — flamegraphs no longer show fat __memcpy_* bars, the %sys CPU column drops sharply, and the next bottleneck appears somewhere unexpected: TCP retransmit handling, the network scheduler's queue selection (for multi-queue NICs), the LB hash function on a load balancer, or the application's connection-accept loop. SREs who have done the migration once recognise the pattern: every time you remove the obvious bottleneck, the next one was already there, masked. The flamegraph after sendfile looks completely different from the flamegraph before, and reading the new flamegraph is its own skill — the topic of /wiki/reading-flamegraphs and the broader Part 5 chapters.

A second closing observation specific to the Indian production context. The cost-per-Gbps of egress on AWS Mumbai (ap-south-1) is roughly ₹6,500/month per dedicated 10 Gbps slice as of early 2026, before the data-transfer pricing kicks in. A node that does 28 Gbps with sendfile instead of 9 Gbps with read+write replaces roughly three old nodes; at scale — the 1,200-node IPL hot region shrinking to ~420 — that is ~780 nodes removed, or roughly ₹51 lakh/month of saved EC2 reserved-instance cost from a single kernel-API change.

The migration costs are dwarfed by the operational savings within one quarter, which is why every Indian streaming, e-commerce, and fintech company with a >5 Gbps file-egress tier has either done this migration or has it on the roadmap. The kernel did the hard work in 2017–2020; the application teams' job in 2026 is to recognise where to use it and verify the numbers.

A final note on monitoring. After deploying a sendfile-based egress tier, the most useful production metrics shift. The egress path no longer hides network problems behind user-buffer copy latency, so TCP retransmits become a cleaner signal: watch Tcp: RetransSegs in /proc/net/snmp (nstat reports it as TcpRetransSegs). And a syscall tracepoint — perf stat -e syscalls:sys_enter_sendfile64 on x86-64 — becomes the canonical traffic counter: at a steady-state rate of one sendfile per HTTP response, it doubles as your RPS metric. Operationally, plumbing these into your dashboards before the migration is the cleanest way to see the win materialise; teams that skip this step often miss the point at which the migration "took" because the old metrics no longer reflect the new path.
