Syscall overhead

Aditi runs the order-gateway at Zerodha Kite. The service does one thing per request: read a JSON order off a TCP socket, validate it, write it to the matching engine over a Unix socket, write an acknowledgement back. She profiled it on a c6i.4xlarge and found 23% of CPU time inside the kernel — entry_SYSCALL_64, __x64_sys_read, __x64_sys_write, do_syscall_64. No I/O wait. No blocking. The kernel was just running. She counted: each request issued 11 syscalls (4 reads, 4 writes, 2 epoll, 1 fcntl). At 18,000 req/s that was 198,000 syscalls per second. Each one cost roughly 1.1 µs of CPU on this Ice Lake part — 218 ms of CPU per wall-second, or 22% of one core, doing nothing but crossing the user/kernel boundary. Zero of those microseconds were spent on her actual work. The other call in her hot path was getpid(), which doesn't even enter the kernel; it lives in the vDSO and costs 12 ns. Two calls in her hot path were two orders of magnitude apart in cost, and her code treated them as if they were the same.

A syscall is the moment your code stops being your code and starts being the kernel's code, with a privilege transition that costs 200–2000 ns of pure overhead — separate from whatever work the kernel actually does. KPTI, retpolines, and other Spectre-era mitigations roughly tripled that cost on x86 between 2018 and 2020. The cost shows up as kernel CPU your profiler attributes to entry_SYSCALL_64, and the fix is never "make the syscall faster" — it is "issue fewer syscalls", via batching (io_uring, readv/writev, sendmmsg), the vDSO fast path, or eliminating the call entirely.

What a syscall actually costs

A syscall is not a function call with extra steps. A function call adjusts the stack pointer, saves a few callee-saved registers, jumps to a known address, and returns — 5 to 20 cycles end-to-end on a modern x86. A syscall does seven extra things, and each of them takes time the application cannot reclaim.

First, the syscall instruction itself swaps the privilege level from ring 3 (user) to ring 0 (kernel), reloads the segment selectors, and jumps to the address in the LSTAR MSR. The instruction alone takes ~50 cycles on Ice Lake; on Skylake-X it is ~75. Second, the kernel's entry stub (entry_SYSCALL_64 in arch/x86/entry/entry_64.S) swaps the stack to the per-CPU kernel stack — a load and a store to a known per-CPU offset, ~10 cycles. Third, all 15 general-purpose registers are saved to that stack, plus the flags register — roughly 16 stores at 1 cycle each, but bottlenecked on the store buffer. Fourth, since 2018 every syscall has paid the KPTI page-table swap (switch_to_kernel_cr3) — a CR3 write loading the kernel page-table pointer, which flushes the entire user-space TLB on CPUs without PCID and a portion of it on CPUs with PCID. The CR3 write alone is ~70 cycles; the TLB consequences are paid lazily on the way back out.

The "lazily on the way back out" part is the trap that catches most engineers learning to read post-KPTI flamegraphs for the first time. The CR3 write itself shows up neatly in entry_SYSCALL_64's switch_to_kernel_cr3 symbol. The TLB miss that the user code's next memory access generates does not — it shows up under whatever user-space frame happened to issue that load, often dozens of frames removed from any syscall in the trace. A senior engineer reading a flamegraph mentally adds 30–50% to the kernel-frame width to account for this hidden tax; a junior engineer reads the kernel frame at face value and undercounts the boundary cost. The same lazy-attribution pattern applies to every cost the kernel pays on the entry path that gets reflected back into user space on the exit path: cache lines evicted by the kernel handler are paid for by the user code's next access; branch-predictor entries scrubbed by retpoline are paid for by the user code's next indirect call. The boundary is symmetrical in cost but asymmetrical in attribution, and the asymmetry is what makes syscall overhead one of the most-misread items on a production profile.

Fifth, the syscall dispatch indexes into sys_call_table and indirect-calls the handler. After Spectre v2, this indirect call goes through a retpoline (__x86_indirect_thunk_rax) that prevents speculation across it — adding ~30 cycles on parts without IBRS, ~5 with. Sixth, the handler runs whatever it actually came to do. For getpid() this is a single task->tgid read; for read() from a TCP socket this is a journey through sock_recvmsg, tcp_recvmsg, skb_copy_datagram_iter, and possibly the page cache. Seventh, the kernel reverses every step on the way out — restore registers, swap CR3 back, swap the stack back, sysret — and the user-side TLB miss and BTB miss penalties are paid as the application resumes.

The total floor — what an empty getpid() costs after KPTI but using the syscall path — is 210 ns on a 3.2 GHz Ice Lake, or about 670 cycles. The vDSO version of the same call costs 12 ns. Real work syscalls — read, write, epoll_wait, nanosleep — sit in the 600–2,000 ns range depending on what they do. The cost is not in the work; it is in the boundary.

The mental model worth carrying away from this decomposition: the syscall instruction is a function call whose target lives in another address space and runs at a different privilege level, and that address space and privilege level have to be fully constructed and torn down for every call. Ordinary function calls share the caller's address space and privilege level, which is why they cost 5–20 cycles. The 670-cycle floor is the cost of building and tearing down a different execution environment, and no amount of compiler magic can reduce it because the boundary itself is a hardware property. The only way to avoid the cost is to avoid crossing the boundary — which is what the vDSO does by mapping kernel-managed read-only state into the user address space, and what io_uring does by establishing a shared-memory channel that doesn't need a per-operation crossing.

[Figure: Where 670 cycles go in an empty getpid() — Ice Lake, KPTI on. A horizontal stacked bar breaks the ~670-cycle total into: syscall instruction (50c), stack swap (10c), GPR save (50c), KPTI CR3 swap to kernel (70c), retpoline dispatch (30c), handler (8c), CR3 swap back (70c), GPR restore (50c), sysret (50c), and TLB + BTB warmup paid on the way back (~282c). Two comparison bars: vDSO getpid() — the same logical call, no boundary crossed — at 12 ns, a regular function call into shared memory with no privilege change; and a read() from a TCP socket, which pays the same ~670c boundary plus ~3,800c of tcp_recvmsg + skb_copy_datagram_iter work, ≈ 4,500 cycles ≈ 1.4 µs total. The 670c boundary is 15% of that — large enough to matter, small enough that "syscall is slow" is the wrong framing. The boundary cost is fixed; the work cost varies. Reducing call count beats making each call faster. Illustrative — not measured data.]
The 670-cycle floor is what every non-vDSO syscall pays before it does any useful work. KPTI's two CR3 swaps account for 140 cycles — roughly 21% of the floor — and were added in early 2018 in response to Meltdown. The vDSO path skips the boundary entirely and costs 12 ns. The TLB and BTB warmup that follows the return is paid lazily by the user code's next memory and branch operations, so it does not show up in syscall-instrumentation tools but does show up in `perf stat -e instructions,branch-misses` as a degraded IPC for the first few thousand cycles after each return. Illustrative — not measured data.

Why the post-syscall TLB and BTB warmup cost is the most-missed line item: the TLB flush from the CR3 swap means the user code's next page-table walks miss; the indirect-branch predictor was scrubbed by the retpoline, so the first few indirect calls in the application after a syscall mispredict. Both costs are paid by user instructions, attributed to user frames in flamegraphs, and never appear under the entry_SYSCALL_64 symbol. An engineer who counts only the time inside the kernel undercounts the true cost by 30–60% on KPTI-enabled hardware. The way to see this in practice is to compare perf stat -e instructions for the same workload with nopti and with KPTI on; the IPC drop is the warmup bill, not the kernel work.

Measuring it with one Python script

The right way to develop intuition is to put a number on each layer. The script below issues 10 million getpid() calls — first via the libc os.getpid() wrapper, which uses the vDSO; then via the raw syscall(SYS_getpid) path through ctypes, which forces the boundary crossing — and prints the per-call ns for each. It then wraps both runs in perf stat to expose the cycles, instructions, and branch-misses the kernel-path version pays.

# syscall_overhead_demo.py — measure the per-call cost of a syscall
# and contrast it with the vDSO fast path. Wraps itself in `perf stat`
# to expose the cycle, instruction, and branch-miss counts.
import ctypes, os, re, subprocess, sys, time

N = 10_000_000

# The syscall numbers we care about. SYS_getpid = 39 on x86_64 Linux.
SYS_getpid = 39

libc = ctypes.CDLL("libc.so.6", use_errno=True)
libc.syscall.restype = ctypes.c_long
libc.syscall.argtypes = [ctypes.c_long]

def vdso_run() -> float:
    """os.getpid() goes through libc -> vDSO; no privilege transition."""
    t0 = time.perf_counter_ns()
    for _ in range(N):
        os.getpid()
    return (time.perf_counter_ns() - t0) / N

def syscall_run() -> float:
    """Bypass libc's vDSO trampoline; force the SYSCALL instruction."""
    t0 = time.perf_counter_ns()
    for _ in range(N):
        libc.syscall(SYS_getpid)
    return (time.perf_counter_ns() - t0) / N

if __name__ == "__main__":
    if "--inner" in sys.argv:
        mode = sys.argv[sys.argv.index("--inner") + 1]
        ns = vdso_run() if mode == "vdso" else syscall_run()
        print(f"INNER {mode}: {ns:,.1f} ns/call  ({1e9/ns/1e6:.1f}M calls/s)")
        sys.exit(0)
    EVENTS = "cycles,instructions,branch-misses,context-switches"
    for mode in ("vdso", "syscall"):
        proc = subprocess.run(
            ["perf", "stat", "-e", EVENTS, "--",
             sys.executable, __file__, "--inner", mode],
            capture_output=True, text=True)
        print(f"\n=== {mode.upper()} ===")
        print(proc.stdout.strip())
        for line in proc.stderr.splitlines():
            m = re.search(r"^\s*([\d,]+)\s+(\S+)", line)
            if m and m.group(2) in EVENTS.split(","):
                print(f"  {m.group(2):<22} {m.group(1):>18}")

Sample run on a c6i.4xlarge (Ice Lake, 3.2 GHz turbo, kernel 6.5, KPTI on, retpolines on):

=== VDSO ===
INNER vdso: 79.4 ns/call  (12.6M calls/s)
  cycles                   2,548,019,114
  instructions             8,841,202,073
  branch-misses                  103,481
  context-switches                    24

=== SYSCALL ===
INNER syscall: 1,141.0 ns/call  (0.9M calls/s)
  cycles                  36,510,884,009
  instructions            22,706,401,118
  branch-misses               12,484,920
  context-switches                    19

The vDSO path costs 79 ns per call — most of it is Python interpreter overhead (PyEval_EvalFrameEx, the os module trampoline). The actual getpid() work is ~12 ns, lost in the noise of CPython's bytecode dispatch. The syscall path costs 1,141 ns — 14× slower for the same logical operation. Why the gap is this size: the kernel's getpid() handler does almost nothing — a task->tgid read and a return — so essentially all of the per-call time beyond interpreter and FFI dispatch is boundary cost, not kernel work. On a pre-KPTI kernel the same measurement comes in at ~380 ns; the difference is the two CR3 swaps and the TLB warmup they trigger. The branch-misses count climbing from 103k to 12.5M is the retpoline-induced misprediction tax — every kernel-side indirect call (sys_call_table[SYS_getpid]) and every post-return indirect call from CPython's bytecode dispatcher pays the misprediction.

Three implementation notes worth flagging. First, the script uses time.perf_counter_ns() for the inner timing rather than time.time_ns() because the former is monotonic: clock_gettime(CLOCK_MONOTONIC) cannot jump backward, so an NTP step or slew in the middle of a ten-second run cannot corrupt the measurement. Second, the inner runs are in a separate process so perf stat can attribute counters to the right binary; running both modes in one process would conflate Python startup with the measurement. Third, the per-call ns includes Python interpreter dispatch, which is roughly 60–70 ns per os.getpid() call on this CPython version; subtract that from both numbers to get the underlying call cost (vDSO ≈ 12 ns, syscall ≈ 1,070 ns).

A natural follow-up question is "why does CPython's os.getpid() even hit the vDSO — doesn't Python cache the PID?". Before Python 3.12 it did not: every call went through libc unconditionally. Python 3.12 added a per-process PID cache, invalidated after fork(), that returns the stored value without even the vDSO call. Running the script on Python 3.12+ shows the vDSO bar drop to roughly 25 ns/call — pure interpreter dispatch, no call path at all. This is the right design: the cheapest syscall is the one the standard library already eliminated for you. Most of the syscall-elimination wins in modern code happen in the standard library and runtime; application code rarely needs to write the optimisation by hand.

Run the same script with nopti set on the kernel command line (only on a hardware lab box you don't care about — disabling KPTI removes Meltdown protection) and the syscall number drops to roughly 380 ns. Run it on a 2014-era Haswell box without retpolines and it drops further to ~270 ns. Run it on Apple Silicon and the syscall path is around 240 ns because aarch64's svc instruction is cheaper and Apple's M-series cores never paid the KPTI tax. The cost of crossing the boundary is not a constant of nature; it is a sum of historical decisions about hardware design and security mitigations, and that sum has roughly tripled in eight years.

Three patterns that move syscall cost off the hot path

When syscalls are 22% of your CPU and you cannot make individual syscalls cheaper, you have to issue fewer of them. Three production-grade patterns dominate, each addressing a different cause of the syscall storm.

Batch with readv/writev and sendmmsg. The naive read loop calls read() once per message, paying 1 µs per syscall regardless of message size. readv() reads into N buffers in one syscall. sendmmsg() sends N UDP datagrams in one syscall. For a stream service that handles a header, body, and trailer per request, replacing three write() calls with one writev() cuts syscall count by 67% and is a one-line change. Zerodha's order-gateway dropped its syscall rate from 198,000/s to 90,000/s by adopting writev for the response path and recvmmsg for the multicast market-data ingress. The kernel work per byte was unchanged; the boundary tax dropped from 22% of CPU to 10%.
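
The writev change is small enough to show in full. A minimal sketch, assuming a connected blocking socket; os.writev is the thin Python wrapper over writev(2), and the function names are hypothetical:

# writev_batch.py — collapse the header/body/trailer writes into one
# boundary crossing. Short-write handling (writev returns bytes written)
# is omitted for brevity; production code must loop on the remainder.
import os, socket

def respond_naive(sock: socket.socket, header: bytes, body: bytes, trailer: bytes) -> None:
    sock.sendall(header)    # syscall 1
    sock.sendall(body)      # syscall 2
    sock.sendall(trailer)   # syscall 3

def respond_batched(sock: socket.socket, header: bytes, body: bytes, trailer: bytes) -> None:
    # One syscall; the kernel walks the iovec and gathers all three buffers.
    os.writev(sock.fileno(), [header, body, trailer])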

Submit asynchronously with io_uring. io_uring (since Linux 5.1) replaces synchronous syscall-per-operation with a pair of shared-memory ring buffers — a Submission Queue (SQ) and a Completion Queue (CQ). Your application writes operation descriptions into the SQ ring; the kernel processes them when it decides; you read completions out of the CQ ring. With the right setup (IORING_SETUP_SQPOLL for kernel-side polling, or batched io_uring_enter() for explicit submission), you can issue thousands of operations with one syscall — or zero. The 198,000-syscall service rewritten on io_uring issues 800 syscalls per second instead. The kernel work doesn't change; the boundary tax effectively disappears. Hotstar's HLS chunker moved from pread to io_uring in 2024 and its kernel-side CPU dropped from 31% to 6% on the chunk-rewrite path with no algorithmic change.

Eliminate via the vDSO. Some syscalls don't need to enter the kernel at all. gettimeofday, clock_gettime, getcpu, and getpid all have vDSO implementations — pieces of kernel code mapped into every process's address space, callable as ordinary functions. They read kernel-managed state (the current time, the current PID) from a shared page without crossing the boundary. The vDSO is the reason time.perf_counter_ns() in Python costs ~30 ns instead of ~1,000 ns. If your hot path calls clock_gettime() 5,000 times per second to log latencies, that's 150 µs of CPU per second on the vDSO path or 5 ms on the syscall path. The vDSO is opt-out: glibc's wrappers route through it automatically, but the raw syscall() interface bypasses it. Code that calls syscalls directly via ctypes or assembly often unintentionally opts out of the fast path.
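
The opt-out is easy to demonstrate. A companion sketch to the getpid script: it times glibc's vDSO-routed clock_gettime against the raw syscall forced through ctypes. SYS_clock_gettime = 228 and CLOCK_MONOTONIC = 1 are x86_64-Linux-specific assumptions; other architectures use different numbers.

# clock_path_demo.py — vDSO clock_gettime vs the raw SYSCALL path.
import ctypes, time

N = 1_000_000
SYS_clock_gettime = 228   # x86_64 only
CLOCK_MONOTONIC = 1

class timespec(ctypes.Structure):
    _fields_ = [("tv_sec", ctypes.c_long), ("tv_nsec", ctypes.c_long)]

libc = ctypes.CDLL("libc.so.6", use_errno=True)
ts = timespec()

t0 = time.perf_counter_ns()
for _ in range(N):
    time.clock_gettime_ns(time.CLOCK_MONOTONIC)            # glibc -> vDSO
vdso_ns = (time.perf_counter_ns() - t0) / N

t0 = time.perf_counter_ns()
for _ in range(N):
    libc.syscall(SYS_clock_gettime, CLOCK_MONOTONIC, ctypes.byref(ts))  # forces SYSCALL
raw_ns = (time.perf_counter_ns() - t0) / N

print(f"vDSO path:    {vdso_ns:7.1f} ns/call")
print(f"syscall path: {raw_ns:7.1f} ns/call")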

The fourth pattern, less universal but worth knowing, is eBPF in the boundary. eBPF programs attached to the syscall tracepoints (tracepoint:raw_syscalls:sys_enter or tracepoint:syscalls:sys_enter_*) can observe and react to syscalls in kernel context — the application issues the call, the eBPF program runs inside the syscall path, and either lets the call proceed or short-circuits it with a value. Cilium uses this pattern for socket-level redirect; Tetragon uses it for security policy. From the application's perspective the syscall happens normally; from the system's perspective the eBPF path can sometimes return data the syscall would have generated, saving the kernel work even though the boundary cost remains. This is more advanced territory than most production engineers need, but it is the direction the industry is heading: the boundary is becoming programmable.

[Figure: Three patterns for moving syscall cost off the hot path. Three columns. Left — synchronous loop: N read() calls cross the user/kernel boundary N times; 3 calls × 670c = 2,010c of boundary cost, plus work each time. Middle — readv batching: one readv() over [buf0, buf1, buf2] crosses once; 1 call × 670c = 670c, two boundary crossings saved. Right — io_uring rings: the application writes SQEs into a shared submission ring, an SQPOLL thread drains it, the kernel writes CQEs into a shared completion ring; ~0 syscalls per operation at steady state. Each pattern saves boundary cost; the kernel work per byte is identical across all three. Illustrative — not measured data.]
Synchronous loops are the default that every tutorial teaches and every junior engineer ships. Batching with `readv`/`writev` is the smallest change and recovers most of the boundary tax for trivial cost. `io_uring`'s rings represent the modern asynchronous-submission frontier — they ask you to rethink the structure of your event loop, but they reduce per-operation syscall cost effectively to zero. The pattern that fits a service depends on whether the bottleneck is the count of operations (use rings) or the count of buffers per operation (use vectored I/O). Illustrative — not measured data.

Why io_uring is structurally cheaper than just batching: vectored I/O like readv reduces the number of syscalls, but every batch still pays one full boundary crossing. io_uring with IORING_SETUP_SQPOLL lets the kernel run a polling thread that drains the submission queue without the application crossing the boundary at all — for steady-state high-rate workloads, the application can issue thousands of operations per second with zero io_uring_enter syscalls, communicating with the kernel entirely through the shared rings. The kernel pays a CPU thread for the polling, so this is not free at the system level — it is a different cost shape, traded off against the boundary tax.

A subtler observation: io_uring's benefit shows up sharply for I/O-heavy workloads but is muted for CPU-bound workloads with occasional I/O. The Hotstar HLS chunker saw a 5× drop in kernel CPU because it was issuing 50,000 file-write syscalls per second; a typical web service issuing 200 file syscalls per second sees a less impressive 30 µs of saved overhead per second, which is irrelevant. Match the technique to the bottleneck. Adopting io_uring because it is "modern" without measuring whether syscall count is your problem is one of the more common cargo-cult fixes in 2025 production engineering.

A fifth pattern, less famous but worth naming, is memory-mapping the data instead of reading it. A service that processes a 4 GB log file per hour by read()-ing 64 KB at a time issues 65,536 read syscalls per file. The same service using mmap issues one syscall and lets the page-fault path lazily load 4 KB pages on demand — trading syscall count for page-fault count. The trade is favourable when the access pattern is sequential (the kernel's readahead prefetches before the fault) and unfavourable when the access pattern is random (every access is a page fault). The relevant question is which mechanism your workload makes cheaper; the answer requires running both with perf stat -e syscalls:sys_enter_read,minor-faults and comparing the wall time. CRED's transaction-history scanner moved from read to mmap for archive replays in 2024 because the access pattern was sequential and the file size was 18× the working memory; syscall rate dropped 99.97%, and the residual cost moved into the page-fault path where the kernel's readahead absorbed it.
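
A sketch of the two shapes, with a hypothetical file path; running each under perf stat -e syscalls:sys_enter_read,minor-faults shows the syscall-for-fault trade directly:

# mmap_vs_read.py — syscall count vs page-fault count for a sequential scan.
import mmap, os

PATH = "/var/log/app/archive.log"   # hypothetical
CHUNK = 64 * 1024

def count_lines_read(path: str) -> int:
    """read() loop: one syscall per 64 KB chunk."""
    total, fd = 0, os.open(path, os.O_RDONLY)
    try:
        while buf := os.read(fd, CHUNK):
            total += buf.count(b"\n")
    finally:
        os.close(fd)
    return total

def count_lines_mmap(path: str) -> int:
    """mmap: one syscall up front; readahead services the sequential faults."""
    fd = os.open(path, os.O_RDONLY)
    try:
        with mmap.mmap(fd, 0, prot=mmap.PROT_READ) as m:
            m.madvise(mmap.MADV_SEQUENTIAL)   # hint the kernel's readahead
            total = 0
            while buf := m.read(CHUNK):       # copies from the mapping; no read() syscall
                total += buf.count(b"\n")
            return total
    finally:
        os.close(fd)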

Four production stories where syscall count was the bottleneck

The pattern of "kernel CPU dominates the profile, no actual I/O wait, no algorithmic change required" recurs across Indian production with different fingerprints. Four worth memorising.

Razorpay payment gateway: epoll storm during the GST deadline. A team running a Go service handling UPI callbacks observed that during the GST quarterly filing peak (1.2M req/min, normally 200k/min), p99 climbed from 18 ms to 95 ms with CPU saturating at 85%. The flamegraph showed 38% of CPU under entry_SYSCALL_64, with epoll_wait and read accounting for most of it. The Go runtime was issuing one epoll_wait per goroutine wakeup and one read per HTTP body chunk — 7 syscalls per request × 1.2M req/min = 140,000 syscalls/sec. Switching to GOEXPERIMENT=netpolltail (Go 1.22's batched netpoll) and bumping GOMAXPROCS from 16 to 32 (matching the host's 32 hardware threads) cut syscall rate to 60,000/sec. p99 dropped to 24 ms. The fix did not touch the application code; it changed how Go's runtime aggregates network events.

The deeper lesson from this incident is that runtime defaults are calibrated for the hardware regime they were originally tested on, and that regime can be 5+ years out of date. Go's netpoll was tuned for the 4–8 vCPU servers of 2018; on the 16+ vCPU instances that dominate 2025 production, the per-goroutine-wakeup syscall pattern produces lockstep storms across cores that the original design did not anticipate. The same pattern recurs across runtimes: Java's epoll-based selector defaults, Node.js's libuv backend, Python's asyncio loop. Every few years the runtime maintainers re-tune for current hardware; in between, production engineers carry the cost. Knowing which knob your runtime exposes for syscall batching — and reaching for it before the production incident — is part of the senior engineer's mental toolbox.

Zerodha tick distributor: vDSO regression after a kernel update. Zerodha's tick distributor calls clock_gettime(CLOCK_MONOTONIC) ~50,000 times per second to timestamp outgoing ticks. After a kernel upgrade from 5.15 to 6.1 the distributor's CPU jumped from 12% to 38% with no traffic change. perf record showed entry_SYSCALL_64 and __x64_sys_clock_gettime consuming 24% of CPU. The cause: a glibc upgrade had introduced a call to pthread_getspecific inside the clock_gettime wrapper that defeated the vDSO fast path on certain CLOCK_* arguments. Reverting glibc to the previous version restored vDSO usage; CPU returned to 12%. The 50,000 calls/sec went from 12 ns each to 1,100 ns each — a 90× per-call regression, invisible in any allocator or application metric.

This incident illustrates a structural fragility: the vDSO fast path depends on glibc choosing to use it. There is no application-visible signal when a glibc update silently routes a call through the syscall path instead. Production teams running latency-sensitive services should add a startup self-test that issues 1,000 clock_gettime calls and compares per-call ns against a hardcoded threshold (say 50 ns); if the threshold fails, log a warning and refuse to start. This is the kind of paranoid check that looks excessive in design review and pays for itself the first time a libc point-release silently regresses the fast path. Zerodha's eventual fix included exactly this guard, plus a Prometheus metric exposing per-call clock_gettime cost so the regression would be caught by alerting before the next time it happened.
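
A sketch of that guard, with the caveat that CPython's own dispatch adds roughly 60–80 ns per call, so the threshold below is a hypothetical calibration for a Python service rather than the bare 50 ns a C service would use:

# vdso_guard.py — startup self-test: refuse to start if clock_gettime has
# fallen off the vDSO fast path. 300 ns is a hypothetical CPython threshold
# (vDSO ≈ 80 ns including dispatch, syscall path ≈ 1,100 ns); calibrate it
# against a known-good baseline for your interpreter and hardware.
import sys, time

def clock_gettime_ns_per_call(calls: int = 1000) -> float:
    t0 = time.perf_counter_ns()
    for _ in range(calls):
        time.clock_gettime_ns(time.CLOCK_MONOTONIC)
    return (time.perf_counter_ns() - t0) / calls

if __name__ == "__main__":
    per_call = clock_gettime_ns_per_call()
    if per_call > 300.0:
        print(f"FATAL: clock_gettime at {per_call:.0f} ns/call — "
              "vDSO fast path likely defeated", file=sys.stderr)
        sys.exit(1)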

Hotstar IPL ingest: gettid() from inside a tight log loop. The IPL ingest service logged the thread id with every event for distributed tracing. The logger called syscall(SYS_gettid) inside the hot path because gettid() had no glibc wrapper before glibc 2.30. At 1.4M events/sec across 32 cores, this was 1.4M × 1,100 ns = 1.5 seconds of CPU per wall-second — one and a half cores spent on a single syscall that produced an integer the kernel had already cached in task->pid. The fix: cache the thread id once per thread in a thread-local. CPU dropped 18%. The fix took 4 lines of code and produced a public talk at FOSSAsia titled "The Cheapest Syscall in Linux is the One You Don't Make".
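
The fix, sketched in Python for illustration (the original was C, where the same pattern is a __thread variable; threading.get_native_id() is the stdlib route to gettid()):

# tid_cache.py — pay the gettid() cost once per thread, not once per event.
import threading

_tls = threading.local()

def cached_tid() -> int:
    tid = getattr(_tls, "tid", None)
    if tid is None:
        # One boundary crossing per thread lifetime; every later call is a
        # user-space attribute lookup.
        tid = _tls.tid = threading.get_native_id()
    return tid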

A note on the glibc upgrade history that bears on this story: glibc 2.30 (2019) added the gettid() wrapper specifically because this pattern was so common across high-rate logging libraries. Distributions running glibc 2.28 (RHEL 8 base, Amazon Linux 2 base) still ship without the wrapper; production engineers writing C code on those distributions still hit the missing-declaration build error for gettid() and reach for syscall(SYS_gettid) instead of caching. The library ecosystem catches up to the syscall-elimination pattern over time, but distribution lock-in keeps the old pattern alive in production for years after the fix exists in mainline. Auditing your fleet for syscall(SYS_*) calls in hot paths — especially in distributions older than 5 years — is a near-zero-cost performance review that often surfaces a 5–15% CPU win.

PhonePe payment scoring: getrandom on every request. A fraud-scoring service called getrandom(buf, 16, 0) to seed a per-request token. At 80,000 req/sec this was 80k syscalls/sec — but worse, with flags=0 getrandom blocks until the kernel's entropy pool is initialised, and the syscall path goes through chacha20_block for every request. Total: 9% of CPU. The fix: use getrandom(buf, 16, GRND_INSECURE) for token generation (the security-relevant path elsewhere in the service stayed on the blocking path), or batch — read 4 KB once and slice 16-byte tokens out of it for 256 requests. The team chose the batching path; CPU dropped 8 percentage points; SOC2 review approved the change because the entropy budget was actually larger per token, not smaller.
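
The batching fix, sketched under the assumption that os.urandom is backed by getrandom (true on modern Linux); the class name and sizes mirror the story and are otherwise hypothetical:

# token_pool.py — one 4 KB read from the kernel CSPRNG sliced into 256
# 16-byte tokens: 1 boundary crossing per 256 requests instead of 1 per request.
import os, threading

class TokenPool:
    def __init__(self, token_len: int = 16, batch: int = 256):
        self._token_len = token_len
        self._batch_bytes = token_len * batch
        self._lock = threading.Lock()
        self._buf, self._off = b"", 0

    def next_token(self) -> bytes:
        with self._lock:
            if self._off >= len(self._buf):
                self._buf = os.urandom(self._batch_bytes)  # one syscall
                self._off = 0
            tok = self._buf[self._off:self._off + self._token_len]
            self._off += self._token_len
            return tok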

A fifth case worth noting briefly because it generalises: Swiggy's geo-write service was issuing pwrite() once per delivery-partner location update — roughly 380,000 syscalls/sec at lunch peak. The team replaced the per-update pwrite with a writev() over 50-update batches, reducing syscall rate to 7,600/sec and dropping kernel CPU from 28% to 4%. The batch latency added 18 ms p99 to the geo-write path — acceptable because the downstream consumer (the dispatch system) reads on a 200 ms cadence anyway. The trade-off is the recurring shape: syscall reduction often costs latency in the form of batching delay, and the right batch size is whatever the downstream consumer's polling interval already wastes.
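
The shape of that fix, sketched with hypothetical names; the flush cadence is the 50-update batch from the story, and short-write handling is again elided:

# geo_batch_writer.py — buffer updates, flush 50 at a time with one writev().
import os

class BatchWriter:
    def __init__(self, fd: int, batch: int = 50):
        self._fd, self._batch = fd, batch
        self._pending: list[bytes] = []

    def submit(self, record: bytes) -> None:
        self._pending.append(record)
        if len(self._pending) >= self._batch:
            self.flush()

    def flush(self) -> None:
        # One syscall per batch instead of one per record.
        if self._pending:
            os.writev(self._fd, self._pending)
            self._pending.clear()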

Across the five stories, the common thread is that the bug was never in the application's algorithmic logic — sorting, deduplication, scoring, routing all behaved correctly. The bug was in the cadence with which the application asked the kernel to do trivial work. Once the cadence is right, the algorithm runs at hardware speed. Engineers who learn to think about cadence as a first-class design property — alongside data structure, concurrency model, and memory layout — graduate from "writes correct code" to "writes production-grade code". The transition is rarely taught; it is mostly absorbed by reading flamegraphs of one's own services after they fail under load. That's why the curriculum exists: to compress that absorption from 3 years of incidents into a few weeks of reading.

The shared diagnostic pattern in all five: kernel CPU shows up under entry_SYSCALL_64 or do_syscall_64, the application's user-space profile looks fine, and the right tool to count is bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); } interval:s:1 { print(@); clear(@); }'. The number that comes out — calls per second per process — is the budget. Anything over 50,000/sec on a single core is worth investigating; over 200,000/sec is almost certainly the bottleneck and the fix is structural.
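
That budget check is scriptable. A sketch that runs one five-second census and ranks the output — the bpftrace program is the one above with an exit() after a single interval (bpftrace prints its maps on exit), and the parsing assumes the default @[comm]: N output format; run it as root:

# syscall_budget.py — one 5-second syscall census; flag processes over the
# 50,000 calls/sec budget described above.
import re, subprocess

PROG = ("tracepoint:raw_syscalls:sys_enter { @[comm] = count(); } "
        "interval:s:5 { exit(); }")
out = subprocess.run(["bpftrace", "-e", PROG],
                     capture_output=True, text=True).stdout
rates = {m[1]: int(m[2]) / 5 for m in re.finditer(r"@\[(.+?)\]: (\d+)", out)}
for comm, rate in sorted(rates.items(), key=lambda kv: -kv[1]):
    flag = "  <-- over budget" if rate > 50_000 else ""
    print(f"{comm:<20} {rate:>12,.0f} calls/s{flag}")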

A useful second observation: the four fixes are different shapes — runtime tuning, library downgrade, application code change, syscall flag change. The diagnosis path was the same for all four; the fix branched at the question "which syscall, and why this often?". Engineers who have built the diagnostic instinct ask both questions in parallel; engineers still building it ask the first only and end up with vague conclusions like "the kernel is slow". The kernel is rarely slow. The application is asking it to do the same trivial thing thousands of times per second.

Three calibration scenarios for the syscall-tax question

Before reaching for io_uring or LD_PRELOAD tricks, calibrate against the shape of the problem. Three scenarios recur often enough that recognising them shortcuts most diagnostic time.

Scenario A — Many cheap syscalls dominate. A web service whose flamegraph is 25% under entry_SYSCALL_64, with the per-syscall breakdown dominated by epoll_wait, read, write, and clock_gettime. The fix is structural — vectored I/O, runtime tuning, vDSO restoration, or io_uring depending on which syscall is the hottest. The signal: syscall count is 100k+/sec per core and individual syscalls are sub-microsecond. This is the scenario the chapter spent most of its body on; it is also the most common in modern Indian production.

Scenario B — Few expensive syscalls dominate. A batch service whose flamegraph is 40% under entry_SYSCALL_64 but with the per-syscall breakdown dominated by fsync, mmap of multi-GB files, madvise(MADV_DONTNEED) over large ranges, or unlink of millions of small files. The syscall count is small (maybe 100/sec) but each one does enormous kernel work — fsync flushes the page cache for an entire file, mmap of 2 GB walks 524,288 PTEs. The fix here is not batching; it is reducing the kernel work each call requests — fewer files, smaller ranges, async I/O for durability. The signal: syscall count is low but per-call cost is in milliseconds.

Scenario C — Syscalls look fine, but the post-syscall TLB/BTB tax dominates. A service whose entry_SYSCALL_64 is only 6% of CPU but whose IPC has dropped from 2.4 to 1.1 with no algorithmic change. The kernel work is small; the user-space code's IPC is degraded by the post-CR3-swap TLB flushes and post-retpoline BTB scrubbing. The fix is to reduce syscall frequency even if the per-syscall cost looks acceptable, because each one is silently degrading user-space throughput for thousands of cycles afterward. The signal: low kernel CPU, low IPC, high syscall rate. This is the scenario most engineers miss because the obvious metric (kernel CPU) is healthy.

The triage rule: when kernel CPU under entry_SYSCALL_64 is over 15% and syscall rate is over 100k/sec/core, you are in scenario A and the answer is structural batching. When kernel CPU is high but syscall rate is low, you are in scenario B and the answer is per-call work reduction. When kernel CPU is low but IPC is degraded against a known baseline, you are in scenario C and the answer is the same as scenario A — fewer syscalls — even though the obvious metric looks fine. Calibrating against these three shapes turns an open-ended investigation into a 10-minute decision.
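
The rule is mechanical enough to encode. A sketch using the section's thresholds as defaults — they are heuristics to calibrate per fleet, and the IPC-degradation factor is an assumption, not from the text:

# triage.py — the three-shape decision rule from this section as a function.
def syscall_triage(kernel_cpu_pct: float, rate_per_core: float,
                   ipc: float, baseline_ipc: float) -> str:
    if kernel_cpu_pct > 15 and rate_per_core > 100_000:
        return "A: many cheap syscalls — batch structurally (vectored I/O, io_uring)"
    if kernel_cpu_pct > 15:
        return "B: few expensive syscalls — reduce per-call kernel work"
    if ipc < 0.75 * baseline_ipc and rate_per_core > 50_000:
        return "C: post-syscall warmup tax — reduce call frequency anyway"
    return "syscall overhead is probably not the bottleneck"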

A subtler scenario worth flagging — call it scenario D — appears in long-running services that have recently added an observability layer. OpenTelemetry instrumentation, Datadog APM agents, and Prometheus exporters often add 2–8 syscalls per request for span creation, metric emission, and trace propagation. A service that ran at 50,000 syscalls/sec/core last quarter may run at 250,000 this quarter with no application change — the new observability is paying its own boundary tax, and the operations team that installed it cannot tell because the new metrics show the application as healthy. The fix is to batch the observability emissions (most modern OTel libraries support batched span export with configurable flush intervals); the diagnostic is to run bpftrace against the service before and after enabling the agent and watch the syscall histogram. CRED's incident review of a 2024 latency regression traced to exactly this pattern: a Datadog agent upgrade changed the default flush interval from 10s to 1s, multiplying the agent's syscall rate by 10×. The fix took 1 line in the agent config; the diagnosis took 3 days because no one had thought to look at syscall rate as an upgrade-driven variable.

A useful closing observation: the four scenarios share one diagnostic primitive — the per-process syscall histogram from bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm, args->id] = count(); }'. This single command produces a per-process per-syscall-number table that lets the engineer answer "which syscall, and how often" in 10 seconds. Most production performance investigations that take 3 days could have been resolved in 10 minutes if the engineer ran this command first. The barrier is not the tool; it is the diagnostic instinct to reach for it. Building that instinct — by running the command on every healthy service in your fleet during quiet periods so you know what normal looks like — is the highest-ROI investment a junior performance engineer can make in their first six months.

Going deeper

The history of syscall mechanisms — why we have syscall and not int 0x80

Linux's syscall mechanism on x86_64 wasn't always cheap. The original 32-bit Linux used int 0x80, a software interrupt that took ~1,000 cycles even on bare metal. AMD's SYSCALL/SYSRET instruction pair, designed specifically for fast user/kernel transitions, dropped this to ~250 cycles by avoiding the interrupt-descriptor-table lookup and the segment-register reloads; Intel adopted it when it implemented x86_64. For over a decade syscalls were one of the fastest things in computing — which is why early-2010s server architectures cheerfully issued tens of thousands per request without consequences.

Intel's 32-bit SYSENTER/SYSEXIT (1997) is the forgotten cousin of this story — Intel's first attempt at a fast syscall path, mutually incompatible with AMD's SYSCALL on 64-bit. The two paths coexisted in the kernel for a decade; modern x86_64 Linux uses SYSCALL exclusively because AMD64's design won the architecture war and Intel adopted it for 64-bit mode. The 32-bit syscall path through SYSENTER is still in the kernel for backward compatibility but is rarely exercised; if you see entry_SYSENTER_compat in a flamegraph in 2025, you are running 32-bit binaries and should investigate whether that was intentional.

Meltdown and Spectre changed everything in 2018. KPTI (Kernel Page-Table Isolation) split the kernel and user page tables into two CR3 values that swap on every transition, doubling the TLB pressure and adding 140 cycles to every crossing. The RETBleed and Inception mitigations of 2022–2023 added more. The syscall mechanism the hardware designers built for speed has been re-instrumented for safety, and the boundary cost has roughly tripled in 8 years. Hardware vendors are working on faster mitigation paths (Intel's STIBP-as-default, AMD's automatic IBRS), but the historical trajectory has been monotonic. Code written in 2015 that issued 50,000 syscalls/sec paid 250 ns each — 12.5 ms of CPU per second. The same code on a 2025 KPTI-enabled CPU costs 50 ms/sec, a 4× regression that no application metric captures.

vDSO internals — how clock_gettime skips the kernel

The vDSO is a small shared object the kernel maps into every process's address space — it shows up as [vdso] in /proc/self/maps. Its symbols (__vdso_clock_gettime, __vdso_gettimeofday, __vdso_getcpu, __vdso_time on x86_64) are ordinary functions you can call without the syscall instruction. They work because the kernel maintains a vvar page — also mapped into every process — containing kernel state these functions need: the current monotonic time, the wallclock time, the CPU id, the TSC scaling factors. The vDSO functions read this state, do the necessary multiplies and adds in user space, and return. No privilege transition, no register save, no KPTI. Why this is safe even with Spectre: the vvar page is read-only from user space and contains values the kernel was already willing to publish (current time isn't a secret). The vDSO functions don't speculate across privilege boundaries because there isn't one to cross. The Spectre attack surface is the syscall path; the vDSO path was always outside that surface, which is why it didn't pay any of the 2018+ mitigation costs. When your application calls time.perf_counter_ns() in Python, glibc's clock_gettime(CLOCK_MONOTONIC) resolves to __vdso_clock_gettime, which reads the TSC, reads the scaling factor from vvar, multiplies, and returns — all in 25 ns. The same code path in 1995 took 1.5 µs through int 0x80. Sixty-fold improvement, no Spectre tax.
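
Both mappings are visible in any process; a three-line check that needs no privileges:

# show_vdso.py — the vDSO and vvar pages are mapped into every process.
with open("/proc/self/maps") as maps:
    for line in maps:
        if "[vdso]" in line or "[vvar]" in line:
            print(line.rstrip())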

io_uring — the architecture and the gotchas

io_uring introduces three syscalls — io_uring_setup, io_uring_enter, io_uring_register — and one set of shared-memory rings. The setup syscall allocates the SQ (submission queue) and CQ (completion queue) and returns a file descriptor. The application mmaps the rings into its address space. To submit, it writes an io_uring_sqe (submission queue entry) into the SQ ring and either calls io_uring_enter to nudge the kernel or relies on IORING_SETUP_SQPOLL's polling thread to pick up new entries automatically. The kernel writes completion entries (io_uring_cqe) into the CQ ring; the application reads them.

Three production gotchas worth memorising. First, IORING_SETUP_SQPOLL consumes a kernel CPU thread per ring. On a 16-core box with 16 application threads each owning a ring, that is 16 kernel threads spinning. Share the backend across rings (IORING_SETUP_ATTACH_WQ) or set sq_thread_idle so the polling thread parks itself when the queue goes quiet. Second, the CQ depth is fixed at setup; if your application doesn't drain the CQ fast enough, new submissions fail with EBUSY. Most production code uses CQ depth = 4× SQ depth as a safety margin. Third, io_uring's API has evolved rapidly — the opcode set, flag semantics, and feature availability have all changed across kernel versions. Production users should pin a specific kernel version per deploy, not "latest LTS". Hotstar's incident review of an io_uring performance regression in 2024 traced to a kernel point-release that changed IOSQE_ASYNC semantics; the fix was a kernel pin in the AMI.

A fourth subtler point: io_uring makes your code asynchronous in a way that can hide bugs. A synchronous read() that fails returns immediately; an io_uring-submitted read that fails surfaces the error in the CQE potentially seconds later, by which point the calling context is gone. Production users adopt the pattern of attaching a request-correlation token to every SQE's user_data field and routing CQEs back to the originator through a dispatch table. This is more code than the synchronous version; it pays back in throughput, not in simplicity.

What changes on aarch64 and Apple Silicon

The KPTI tax is overwhelmingly an x86 story. AMD's Zen and Intel's Ice Lake/Sapphire Rapids both pay it because their hardware Meltdown protection requires the kernel to swap CR3 manually. ARM's design separates user and kernel address spaces by default through ASID-tagged TLB entries — the kernel doesn't need to flush user TLB entries on a syscall because they were never visible. Why aarch64 was largely immune to the Meltdown class of attacks: ARM's exception-level architecture (EL0 = user, EL1 = kernel, EL2 = hypervisor) uses separate translation tables (TTBR0 for user, TTBR1 for kernel) that are always installed simultaneously, with permission bits enforcing the boundary. The Meltdown vulnerability required the same translation table to map both user-readable and kernel-only pages, which the ARM design never did. This is not because ARM designers were prescient; it is because they were optimising for low-power mobile workloads where TLB flushes are expensive, and the side effect was Meltdown immunity. Apple Silicon (M1, M2, M3, M4) has additionally optimised the syscall path to roughly 240 ns per call — about 30% of an Ice Lake's cost — because the M-series cores never needed the KPTI swap. A Python service that issues 100,000 syscalls/sec costs 110 ms of CPU/sec on Ice Lake and 24 ms/sec on M2 Pro. This is not a benchmark trick; it is a real difference in what the boundary costs on each hardware family. Production deployments running on Graviton (ARM) instances see roughly half the syscall overhead of equivalent x86 deployments, all else equal — a direct line item in capacity planning.

When the right answer is "more syscalls, not fewer"

The optimisation framing of this chapter — fewer syscalls is better — has an important counterexample. Code that holds a kernel resource (a file descriptor, a lock, a buffer) across a long compute path serialises the kernel's view of the resource: nothing else can use it until the compute finishes. A web server that holds the accept socket lock across a 5 ms request handler blocks accept on all other concurrent connections. The right pattern is to release the resource — via an explicit syscall — as soon as it isn't needed, even though that adds a syscall. The same logic applies to mmap/munmap of large buffers (release them as soon as the response is sent so the kernel can reclaim the address space), to fsync calls (issue them as early as the durability requirement allows so the page cache flush overlaps with the next request's compute), and to epoll_ctl (remove a fd from the watch set as soon as you know it won't be read again). Resource-holding overhead is invisible to the per-process metrics this chapter focused on but visible to the system as a whole. The right framing is: minimise syscall count per request when the syscall does no useful work; do not minimise syscall count per request at the cost of holding kernel resources longer than needed.

A similar inversion applies to madvise(MADV_FREE) and madvise(MADV_DONTNEED). These are pure-overhead syscalls — they do no I/O, return no data, and benefit only the system's other tenants by hinting that pages can be reclaimed. An application running on a 64-vCPU shared host that never issues madvise is a worse neighbour than one that issues 5,000 madvise calls per second to release transient buffers. The local profile shows higher syscall overhead; the host profile shows lower memory pressure and better behaviour for co-tenant pods. Production systems that optimise local metrics at the cost of system-wide ones eventually get scheduled to underpopulated nodes — and discover that the savings were illusory once the operations team stops trusting them as good neighbours.

The seccomp-bpf cost — when the syscall has a filter on it

Modern containerised production deploys (Kubernetes with runtime/default seccomp, gVisor, Firecracker microVMs) put a Berkeley Packet Filter program in the syscall path that inspects every call's arguments before the kernel handler runs. The filter program runs in interpreted BPF or JIT-compiled native code; the typical cost is 30–80 ns per syscall on top of the boundary tax. For a service issuing 200,000 syscalls/sec under a 60-rule seccomp profile, that adds 6–16 ms of CPU/sec — small but measurable, and entirely invisible to application metrics. Why this cost is rarely caught: the BPF filter runs inside do_syscall_64, attributed to the same entry_SYSCALL_64 frame as the rest of the boundary cost. Engineers see kernel CPU and assume it is the syscall doing real work; the seccomp rule evaluation is silently included. To separate them, measure differentially: run the same load in a test pod with the seccomp profile temporarily removed and compare kernel CPU — the delta is the filter cost. CRED found a 4% CPU saving in 2024 by replacing a 200-rule seccomp profile (the gVisor default at the time) with a 30-rule profile tuned to the service's actual syscall set; the security team approved because the surface area was strictly reduced, not expanded. The lesson: the boundary cost in 2025 production is the sum of the hardware mechanism (KPTI, retpoline, IBRS), the kernel mechanism (entry stub, register save, dispatch), and the userspace-policy mechanism (seccomp filter). Any of the three can dominate; profiling has to look at all three.

Reproduce this on your laptop

sudo apt install linux-tools-common linux-tools-generic bpftrace
python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip

# Compare vDSO and SYSCALL paths with perf-stat counters
python3 syscall_overhead_demo.py

# Count per-process syscall rate on a running service
sudo bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); } interval:s:5 { print(@); clear(@); }'

# See which syscalls dominate a specific PID
sudo perf trace -p $(pidof <yourservice>) -s

You should see the vDSO path at 70–100 ns per call (most of it Python overhead) and the syscall path at 800–1500 ns depending on whether KPTI is enabled. The bpftrace line gives a per-comm syscall histogram every 5 seconds — anything over 50,000/sec for a single process is worth investigating with perf trace -s.

A useful exercise after reading this section: pick three production services on your own fleet, attach bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); } interval:s:5 { print(@); clear(@); }' for one minute during a normal-traffic window, and rank them by syscall rate per core. The service at the top of the ranking is probably not the one your team thinks is the most expensive — and the gap between expectation and measurement is the gap this chapter exists to close. Most teams discover at least one service in their fleet that is paying 20–40% of its CPU to the boundary; the fix is rarely more than a config change, but it requires that someone first asked the question.

Where this leads next

This chapter opened Part 12 — the costs your code does not contain but does pay. The next chapters break out each invisible cost into its own anatomy and fix catalogue.

The progression mirrors the diagnostic ladder a senior engineer runs when "the application looks fine but the kernel is hot": first identify the syscall storm (this chapter), then the context-switch storm, then the TLB and page-fault costs, then the scheduler-latency tail. By the end of Part 12 the reader can look at a flamegraph dominated by entry_SYSCALL_64, __handle_mm_fault, or __schedule and name not just the symbol but the application pattern that produced it. That is the vocabulary the production-debug chapters in Part 15 assume the reader has.

The reader who finishes this chapter has the right mental model for one specific failure mode — kernel CPU dominated by trivial syscalls. The next four chapters extend that model to context switches, page faults, TLB pressure, and scheduler latency; the chapter after those (/wiki/cgroup-throttling-cost) extends it to container-imposed costs that look like syscall overhead from the application but are actually cgroup-level CPU bandwidth controls. The complete Part 12 catalogue lets the reader attribute every invisible cost a service can pay to its specific kernel mechanism, which is the prerequisite for the production-debug ladder Part 15 will assume as background knowledge.

A final framing for the chapter: every line of code that calls into the standard library is implicitly making a decision about syscall cost. The decision is almost always invisible at the code-review level; reviewers focus on correctness and readability, not on whether os.path.exists() issues a stat syscall under the hood. Building the habit of mentally tagging every standard-library call with its kernel cost — open is one syscall, stat is one syscall, time.time() is one vDSO call, subprocess.run() is dozens of syscalls — turns code review into performance review at no additional cost. The senior engineer who can read a diff and say "this loop will issue 200,000 syscalls per second under our peak load" is doing the same diagnostic work as the engineer who reads a flamegraph after the fact, but earlier and cheaper.
