VDSO and vsyscall

Karan profiles the order-acknowledgement service at Zerodha Kite during the 09:15 IST market open and sees clock_gettime consuming 0.8% of CPU. He shrugs — sub-percent is noise. A week later the same service, after a base-image upgrade from Ubuntu 22.04 to Ubuntu 24.04, shows clock_gettime consuming 11% of CPU at the same load. No code changed. The function is called the same number of times per second. The per-call cost went from 18 ns to 1,140 ns — a 63× regression — because the new image's glibc routes one specific clock id through the syscall path instead of the vDSO. Karan now knows two things he did not know last week: there is a piece of kernel code mapped into every process called the vDSO, and it is not free of failure modes just because it doesn't enter the kernel.

The vDSO (virtual dynamic shared object) is a small ELF object — a few kilobytes of kernel code — that Linux maps read-only into every user-space process so a handful of syscalls — clock_gettime, clock_getres, gettimeofday, getcpu, and time — can run as ordinary function calls without the privilege transition. It costs ~12 ns per call instead of ~1,100 ns. Vsyscall is the obsolete predecessor at a fixed kernel address, kept alive in emulated form for ABI compatibility, and its emulation page has caused production incidents on its own. Knowing which fast path your binary uses, and how to verify the kernel is still giving you that fast path after every glibc and kernel update, is a 5-minute check that prevents 11% CPU regressions.

What the vDSO actually is

A normal shared library — say libc.so.6 — lives on disk, and the dynamic linker maps it into your process address space when the program starts. The vDSO does not live on disk. It is a tiny ELF object — typically 8 KB on x86_64, with maybe 4 KB of actual code — built into the kernel image at compile time and mapped into every user process's address space at exec time. The kernel chooses the address (near the top of the user-space layout, randomised by ASLR), exposes it to userspace via the AT_SYSINFO_EHDR auxiliary vector entry, and from then on user code can call into it like any other shared library.

What's inside it is a handful of functions whose answer the kernel can publish to user-space without giving up any safety guarantee. clock_gettime(CLOCK_MONOTONIC) reads the current monotonic time — the kernel computes this from the TSC counter and a few coefficients (offset, multiplier, shift) that change rarely. The kernel writes those coefficients to a page called vvar (kernel-managed, user-readable, never user-writable), the vDSO function reads rdtsc, applies the coefficients, and returns the answer. No syscall instruction. No CR3 swap. No retpoline. No KPTI tax. The privilege boundary is never crossed because the data needed to answer the question was published to user-space ahead of time.
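
To make the arithmetic concrete, here is the shape of that computation as a Python sketch. The field names are simplified and the real struct vdso_data layout differs; the point is the structure: counter delta, multiply, shift, add the published base.

# Illustrative sketch of the vDSO-style time computation. Simplified field names,
# not the real vdso_data layout; the coefficients below are made up.
def vdso_style_clock_gettime(tsc_now: int, vvar: dict) -> tuple:
    # Everything in `vvar` was published by the kernel ahead of time; user code only reads it.
    delta = (tsc_now - vvar["cycle_last"]) & vvar["mask"]             # TSC ticks since the last kernel update
    ns = vvar["base_ns"] + ((delta * vvar["mult"]) >> vvar["shift"])  # ticks -> nanoseconds
    return ns // 1_000_000_000, ns % 1_000_000_000                    # (tv_sec, tv_nsec)

# Example: a ~3 GHz TSC, so mult/shift approximate "divide by 3".
vvar = {"cycle_last": 1_000_000, "mask": (1 << 64) - 1,
        "base_ns": 86_400 * 10**9, "mult": 1_431_655_765, "shift": 32}
print(vdso_style_clock_gettime(1_003_000_000, vvar))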

This works for exactly five things on x86_64 Linux 6.x: clock_gettime, clock_getres, gettimeofday, getcpu, and time. It does not work for read, write, epoll_wait, mmap, or anything that mutates kernel state, because those genuinely need the kernel to do work that requires kernel privileges. The vDSO is for queries whose answer is a function of kernel-managed state that the kernel can safely keep up to date in shared memory. Why these five and not others: each of them returns a small answer (a few u64s) that can be derived from data the kernel publishes ahead of time (and refreshes on its timer tick, every 1-10 ms) plus an unprivileged hardware instruction (rdtsc for the clocks, rdtscp for getcpu), and requires no privilege check or capability — anyone in the system is allowed to know what time it is or which CPU it is running on. A function like read() cannot live in the vDSO because the kernel has to validate the file descriptor, check permissions, and possibly block — none of which is safe to do in user mode.

[Figure: vDSO and vvar in the user-space address layout (illustrative, not measured data). A vertical diagram of the upper user-space layout: kernel space, the emulated vsyscall page at the fixed address 0xffffffffff600000 (kept for ABI), the vvar page (kernel writes only: TSC multiplier, offset, shift), the vDSO page (__vdso_clock_gettime and friends, executable in user mode), then the stack, mmap region, heap, and .bss/.data/.text. Side panels trace the fast path (a kernel timer interrupt updates vvar roughly every millisecond; clock_gettime jumps into the vDSO, reads vvar, executes rdtsc, applies the coefficients, and returns in ~12 ns with no CR3 swap and no syscall) against the syscall path (syscall instruction, entry_SYSCALL_64, KPTI CR3 swap, register save, retpoline dispatch, __x64_sys_clock_gettime, then the reverse, ~1,100 ns). The same TSC math runs both ways; the cost is the boundary, not the math.]
The vvar page is the trick. The kernel keeps it up to date with whatever a user-space query needs to see, and the vDSO function reads it the way any other library function would read its own data. Because the user-side mapping is read-only, the kernel can publish without the user being able to corrupt it — which is exactly the safety property that justifies skipping the privilege transition.

A subtler property worth naming: the vDSO is not strictly necessary for correctness — every function in it could be replaced by a real syscall and the program would behave identically, just slower. The vDSO is purely a performance abstraction. This makes the failure modes interesting: when the vDSO path silently falls back to the syscall path (because of a glibc change, a kernel config, or a CPU clocksource change), the program still works. The bug is invisible until someone reads a flamegraph and notices entry_SYSCALL_64 where they expected __vdso_clock_gettime. That invisibility is the dangerous property — silent regressions persist longer than loud ones.

How the vDSO gets into your process

The mapping happens at execve() time, before the dynamic linker even runs. The kernel's arch_setup_additional_pages() function (in arch/x86/entry/vdso/vma.c) allocates two contiguous virtual memory regions in the new process's address space — one for vvar, one for the vDSO code itself — and inserts entries into the process's VMA tree pointing at the kernel's pre-built vDSO image. The addresses are randomised by ASLR (8 bits of entropy on 32-bit, 28 bits on 64-bit), so two processes started milliseconds apart will see the vDSO at different virtual addresses. The kernel then writes the vDSO base address into the auxiliary vector under the key AT_SYSINFO_EHDR (value 33) — a small array of (key, value) pairs the kernel passes to the new process alongside argv and envp.
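
You can inspect that auxiliary-vector entry yourself. A small sketch using ctypes and glibc's getauxval(3); the constant 33 is AT_SYSINFO_EHDR from <elf.h>.

# aux_vdso.py: print the vDSO base address the kernel handed this process.
import ctypes

AT_SYSINFO_EHDR = 33                      # auxv key for the vDSO ELF header address
libc = ctypes.CDLL(None, use_errno=True)  # handle to the already-loaded glibc
libc.getauxval.restype = ctypes.c_ulong
libc.getauxval.argtypes = [ctypes.c_ulong]

base = libc.getauxval(AT_SYSINFO_EHDR)
print(f"vDSO mapped at {hex(base)}")      # 0 would mean no vDSO entry in the auxv

# Cross-check against the [vdso] line in /proc/self/maps; the two should agree.
with open("/proc/self/maps") as maps:
    for line in maps:
        if "[vdso]" in line:
            print("maps:", line.strip())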

Glibc's startup code (_dl_aux_init in elf/dl-sysdep.c) walks the auxiliary vector, finds AT_SYSINFO_EHDR, parses the vDSO's ELF header at that address, locates its .dynsym table, and resolves the symbol names — __vdso_clock_gettime, __vdso_gettimeofday, __vdso_getcpu, __vdso_time — into function pointers stored in glibc's GLRO(dl_vdso_*) slots. From that point, every call to clock_gettime() in the application checks whether the slot is non-null and, if so, jumps through it. If the slot is null (no vDSO available, or symbol not found, or this clock id not supported by this kernel), glibc falls through to the real syscall.

This handshake is the structural reason vDSO regressions are so quiet. The fallback path is in glibc, not the kernel — so if a glibc update changes the conditions under which the slot is set or used, the symptom is "your application is now slower" with no error, no log, and no kernel-side counter. Running your program with LD_SHOW_AUXV=1 set will dump the auxiliary vector and confirm AT_SYSINFO_EHDR is present, but it does not tell you whether glibc is actually using the vDSO for any specific call. The only way to confirm usage is to count syscalls (via perf trace or strace -c) and verify the count is zero for the calls you expect to be vDSO-resolved.

An additional implementation note: glibc resolves the vDSO symbols lazily by default — the actual function-pointer lookup happens on the first call, not at startup, which adds about 2 µs of one-time cost to the first clock_gettime of any new process. For latency-critical services this is a real concern only if the service measures time during the first few microseconds of its lifetime. The standard mitigation is to call clock_gettime once at startup explicitly, "warming" the vDSO resolution before the hot path begins.

A neat consequence: a program built with -static against a glibc that doesn't know about the vDSO (very old glibc, or musl libc before 1.2) won't use it even on a modern kernel. Static binaries become time capsules of the libc they were built with, and a binary built in 2008 calling clock_gettime will still pay the full syscall cost on a 2026 kernel even though the vDSO is sitting right there in its address space, untouched. This is one reason the migration to dynamic linking — long advocated by the glibc team and long resisted by deployment engineers who want reproducible binaries — has a real performance dimension separate from the security one.

The Go runtime takes a middle path on this question. Go programs are statically linked by default, but the runtime contains its own vDSO-resolution code (in runtime/vdso_linux.go) that walks the auxiliary vector and resolves __vdso_clock_gettime directly, bypassing the C library entirely. This is why Go's time.Now() runs at vDSO speed even in fully static binaries — the runtime has internalised the vDSO contract instead of delegating to libc. Other runtimes vary: Rust's standard library uses libc (so vDSO usage depends on libc); the JVM has its own vDSO resolver similar to Go's; Node.js uses libuv which uses libc; Python uses libc. Knowing which path your runtime takes is the first step in diagnosing a "why is my time-reading slow" issue when the obvious libc check looks fine.

Vsyscall — the predecessor that won't die

Before there was a vDSO, there was vsyscall, introduced in Linux 2.5 (2003) as the first attempt to skip the syscall boundary for time-related calls. The design was simple: map a single executable page at a fixed virtual address (0xffffffffff600000 on x86_64, at the top of the canonical address space), put three functions in it (gettimeofday, time, getcpu), and let user code jump there directly. Glibc was patched to use it. Performance went up. Everyone was happy until the security implications became clear.

Two problems killed it. First, the fixed address was a security disaster — every program had executable code at exactly the same virtual address, which gave ROP-chain attackers a stable jump target across every Linux process on the planet. Second, the vsyscall page exposed a kernel-mapped page directly to userspace, which is incompatible with KPTI's promise that no kernel-mapped pages are visible from user mode (Meltdown, 2018). The vsyscall mechanism was deprecated as soon as the vDSO existed (2007 on x86_64), but glibc binaries from before that era still call into the vsyscall page, so removing it would break every old binary on the system. The kernel maintainers settled on a compromise: the vsyscall page is still mapped at 0xffffffffff600000, but it no longer contains real code. It contains emulation stubs — the page is set up so that any execution attempt at one of the three legacy entry points causes a trap into kernel mode, where the kernel handles the call as if it were a real syscall and returns. Why this is so much slower than even a regular syscall: the trap path through the fault handler and emulate_vsyscall() adds about 700 ns on top of the normal syscall floor, because the kernel has to take the trap, identify which legacy entry point was called from the faulting address, and route the call manually. A vsyscall on a modern kernel costs roughly 1,800 ns — slower than a real syscall and 150× slower than the vDSO equivalent.

The vsyscall page has its own kernel parameter — vsyscall= — which can be set to emulate (default), xonly (executable but not readable, slightly more secure), or none (page not mapped at all). Setting vsyscall=none breaks any binary that still uses the legacy interface, which on modern distributions is essentially zero binaries — but on older distros (RHEL 6 / Amazon Linux 1) some statically linked tools still depend on it. The default of emulate exists for the same reason the page exists at all: ABI compatibility. Linux's contract with userspace is "binaries that worked 15 years ago still work today", and that contract has a CPU cost.

[Figure: Three paths for the same call: vsyscall, syscall, vDSO (illustrative, not measured data). Three stacked flow diagrams for clock_gettime/gettimeofday on Linux x86_64. The vsyscall path (legacy, emulated): call the fixed address, trap into the kernel's emulation, return, ~1,800 ns; used by binaries built before the vDSO era. The syscall path (modern, no fast path): syscall instruction, KPTI swap, handler, sysret, ~1,100 ns; what you get when glibc routes around the vDSO, e.g. CLOCK_BOOTTIME on older kernels. The vDSO path (modern, fast): call the vDSO function, read vvar, rdtsc, return, ~12 ns; no boundary crossed. The last bar is dramatically shorter than the first two.]
The default that ships in modern distros is the vDSO path for CLOCK_MONOTONIC, CLOCK_REALTIME, CLOCK_MONOTONIC_RAW, CLOCK_REALTIME_COARSE, and CLOCK_MONOTONIC_COARSE. CLOCK_BOOTTIME got vDSO support in kernel 5.3 (2019); CLOCK_TAI got it in 5.6 (2020). Older kernels route those clock ids through the syscall path silently, so a service that started using CLOCK_BOOTTIME for uptime accounting in 2018 and never re-tested the path is still paying 90× per call today on hosts running pre-5.3 kernels.

The practical implication for production: if your service is running on a kernel older than 5.3 and uses CLOCK_BOOTTIME, the per-call cost is roughly 1,100 ns instead of 12 ns. For a logging library that timestamps every event, this is the difference between 0.1% and 9% of CPU. The kernel team has been steadily moving more clock ids into the vDSO over the past decade, but the upgrade is invisible to the application, and the regression is only visible to engineers who know to check. The check: grep -E '\[(vdso|vsyscall)\]' /proc/$PID/maps to confirm the mappings are present, then perf trace -p $PID -e clock_gettime for one second; perf trace should report no clock_gettime syscalls if the vDSO path is in use.

Measuring all three paths from one Python script

Putting numbers on each path makes the cost shape concrete. The script below issues 5 million clock_gettime calls through two routes — the libc wrapper (which uses the vDSO when available) and the raw syscall(SYS_clock_gettime) path through ctypes (which forces the boundary) — and reports per-call ns for each. The third route, a deliberate jump to the vsyscall emulation address, is discussed in the notes after the sample run; it cannot be exercised safely from Python.

# vdso_paths.py — measure clock_gettime through the vDSO and syscall paths.
# Run: python3 vdso_paths.py   (Linux x86_64, glibc 2.31+)
import ctypes, os, time

N = 5_000_000
SYS_clock_gettime = 228   # x86_64 Linux syscall number

class Timespec(ctypes.Structure):
    _fields_ = [("tv_sec", ctypes.c_long), ("tv_nsec", ctypes.c_long)]

libc = ctypes.CDLL("libc.so.6", use_errno=True)
libc.clock_gettime.restype = ctypes.c_int
libc.clock_gettime.argtypes = [ctypes.c_int, ctypes.POINTER(Timespec)]
libc.syscall.restype = ctypes.c_long
libc.syscall.argtypes = [ctypes.c_long, ctypes.c_int, ctypes.POINTER(Timespec)]

def vdso_path(clk: int) -> float:
    """libc wrapper — routed through vDSO for CLOCK_MONOTONIC on modern kernels."""
    ts = Timespec()
    t0 = time.perf_counter_ns()
    for _ in range(N):
        libc.clock_gettime(clk, ctypes.byref(ts))
    return (time.perf_counter_ns() - t0) / N

def syscall_path(clk: int) -> float:
    """Raw syscall — bypasses libc's vDSO trampoline, forces boundary crossing."""
    ts = Timespec()
    t0 = time.perf_counter_ns()
    for _ in range(N):
        libc.syscall(SYS_clock_gettime, clk, ctypes.byref(ts))
    return (time.perf_counter_ns() - t0) / N

if __name__ == "__main__":
    CLOCK_MONOTONIC = 1
    CLOCK_BOOTTIME  = 7   # vDSO since kernel 5.3
    print(f"kernel: {os.uname().release}")
    for name, clk in (("CLOCK_MONOTONIC", CLOCK_MONOTONIC),
                      ("CLOCK_BOOTTIME", CLOCK_BOOTTIME)):
        v = vdso_path(clk)
        s = syscall_path(clk)
        print(f"\n{name}")
        print(f"  libc wrapper (vDSO if available): {v:7.1f} ns/call")
        print(f"  raw syscall (forced boundary):    {s:7.1f} ns/call")
        print(f"  speedup of vDSO path:             {s/v:7.1f}x")

Sample run on a c6i.4xlarge (Ice Lake, kernel 6.5, glibc 2.39, KPTI on, retpolines on):

kernel: 6.5.0-1024-aws

CLOCK_MONOTONIC
  libc wrapper (vDSO if available):    72.3 ns/call
  raw syscall (forced boundary):     1141.4 ns/call
  speedup of vDSO path:                15.8x

CLOCK_BOOTTIME
  libc wrapper (vDSO if available):    74.1 ns/call
  raw syscall (forced boundary):     1138.9 ns/call
  speedup of vDSO path:                15.4x

Re-run the same script on an older box (Skylake-X, kernel 5.4, glibc 2.31) and CLOCK_BOOTTIME's libc-wrapper number jumps from 74 ns to 1,180 ns — because kernel 5.4 added vDSO support for BOOTTIME only as a backport in late patches, and the distribution may not have included it. The same code, the same call, the same hardware era — but the cost differs by 16× depending on whether the kernel chose to put CLOCK_BOOTTIME in the vDSO. Why the libc wrapper number is 72 ns rather than the 12 ns the kernel function actually costs: the bulk of the 72 ns is Python interpreter dispatch — ctypes.CFUNCTYPE setup, argument marshalling, the bytecode loop. Subtracting the per-iteration interpreter floor (run an empty loop and measure: ~60 ns/iteration on this CPython) leaves ~12 ns for the actual vDSO call, matching the kernel's documented number. The syscall number is dominated by kernel work, not Python, which is why subtracting the same 60 ns gives ~1,080 ns of actual syscall cost.
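
Estimating the floor is a one-off measurement. A sketch of the baseline: two loops bracket it, a bare loop and a loop that makes a trivially cheap libc call through the same ctypes machinery. Exact numbers depend on the CPython build.

# interp_floor.py: estimate the CPython per-iteration overhead to subtract from
# the vdso_paths.py numbers. isdigit() never enters the kernel, so the second
# loop measures loop + ctypes dispatch cost with no syscall or vDSO work.
import ctypes, time

N = 5_000_000
libc = ctypes.CDLL("libc.so.6")

def bare_loop() -> float:
    t0 = time.perf_counter_ns()
    for _ in range(N):
        pass
    return (time.perf_counter_ns() - t0) / N

def ctypes_noop_loop() -> float:
    t0 = time.perf_counter_ns()
    for _ in range(N):
        libc.isdigit(48)
    return (time.perf_counter_ns() - t0) / N

print(f"bare loop:          {bare_loop():6.1f} ns/iteration")
print(f"ctypes no-op loop:  {ctypes_noop_loop():6.1f} ns/iteration")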

Three implementation notes worth flagging. First, the script does not attempt to call vsyscall directly — that requires inline assembly because the address 0xffffffffff600000 cannot be safely jumped to from Python. Note also that a seccomp filter blocking the clock_gettime syscall would break the forced-syscall measurement but leave the vDSO path untouched: seccomp only sees real syscalls, which is one more reason vDSO usage is invisible to syscall-level tooling. Second, time.perf_counter_ns() itself uses clock_gettime(CLOCK_MONOTONIC) via the vDSO — measuring the vDSO path with a vDSO-based timer is fine because both endpoints of the measurement use the fast path, but if the timer itself had to take the syscall path the overheads would compound. Third, the ctypes overhead is the same for both paths, so the ratio is correct even if the absolute numbers carry interpreter noise.

A useful follow-up: add getpid() to the same harness (Python 3.11 caches it, Python 3.12+ caches it more aggressively). Pre-3.12 the cost is one call into libc per Python invocation; from 3.12 it's a process-local memo. The standard library catches up to syscall-elimination one fast-path at a time, and tracking which fast-path has been added in which Python version is part of the senior performance engineer's mental inventory. The same is true of Java's System.nanoTime() (always vDSO since Java 8 on Linux) and Go's time.Now() (uses vDSO on Linux/amd64 since 1.9 with continuing optimisations).

A more rigorous variant of this microbenchmark — and the one to use in production probes — drops Python entirely and runs as a tiny C program: gcc -O2 -o vdso_probe vdso_probe.c where the program calls clock_gettime 10 million times in a tight loop and divides the wall time by the iteration count. A C harness measures ~12 ns per call directly with no interpreter floor to subtract, and the resulting binary is small enough (12 KB) that it can be shipped to every host as a static asset and run from a cron job. The Python version above is good for understanding the cost shape; the C version is what you want when you need an accurate per-host metric. The two-version approach — Python for pedagogy and exploration, C for production measurement — recurs across this whole curriculum, and the discipline of knowing which version to use when is part of the performance engineer's craft.

Three production stories where the vDSO mattered

The pattern of "kernel CPU shows up where you don't expect it because the vDSO path silently broke" recurs in Indian production. Three worth memorising.

Zerodha Kite tick distributor: glibc 2.34 regression on CLOCK_MONOTONIC_RAW. The tick distributor calls clock_gettime(CLOCK_MONOTONIC_RAW) 80,000 times per second to timestamp outgoing market-data packets. A glibc upgrade from 2.31 to 2.34 (Ubuntu 22.04 base) introduced an extra pthread_getspecific call inside the wrapper that defeated the vDSO fast path for CLOCK_MONOTONIC_RAW specifically — the wrapper began routing through the syscall path for that one clock id while keeping the vDSO path for CLOCK_MONOTONIC. CPU usage on the distributor jumped from 14% to 41%. Diagnosis: perf record -F 999 -p $PID -- sleep 10 && perf report showed entry_SYSCALL_64 and __x64_sys_clock_gettime accounting for 26% of samples. The fix: switch the timestamp source to CLOCK_MONOTONIC (which still hit the vDSO) and absorb the small-but-nonzero NTP-adjusted drift in the downstream consumer. CPU returned to 14%. Total resolution time including the post-mortem: 9 hours.

The deeper lesson: glibc point releases periodically reshuffle which clock ids get the vDSO fast path, and the change is invisible at the application layer. Production teams running latency-sensitive services should add a startup probe that times 1,000 clock_gettime calls per clock id and warns if any one exceeds 100 ns. This is a 30-line check that catches the regression before it reaches production. Zerodha's eventual fix included exactly this guard, plus a Prometheus gauge clock_gettime_path_ns{clock="MONOTONIC_RAW"} that lets the SRE team see the per-clock-id cost in a dashboard, so the next regression is caught by alerting rather than by a panicked engineer staring at a flamegraph at 09:30 IST.
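
A minimal sketch of such a guard, using Python's time.clock_gettime_ns (which goes through libc). The clock list and the 300 ns threshold are assumptions; the threshold leaves headroom for interpreter overhead on top of the ~100 ns a native probe would use.

# clock_path_guard.py: warn at startup if any clock id looks like it is taking
# the syscall path instead of the vDSO. Clock list and threshold are illustrative.
import sys, time

CLOCKS = {
    "CLOCK_MONOTONIC":     time.CLOCK_MONOTONIC,
    "CLOCK_MONOTONIC_RAW": time.CLOCK_MONOTONIC_RAW,
    "CLOCK_REALTIME":      time.CLOCK_REALTIME,
    "CLOCK_BOOTTIME":      getattr(time, "CLOCK_BOOTTIME", None),   # absent on very old Pythons
}
CALLS = 1_000
THRESHOLD_NS = 300   # vDSO path from CPython lands well under this; syscall path lands well over

def per_call_ns(clk: int) -> float:
    t0 = time.perf_counter_ns()
    for _ in range(CALLS):
        time.clock_gettime_ns(clk)
    return (time.perf_counter_ns() - t0) / CALLS

slow = []
for name, clk in CLOCKS.items():
    if clk is None:
        continue
    cost = per_call_ns(clk)
    print(f"{name:22s} {cost:7.1f} ns/call")
    if cost > THRESHOLD_NS:
        slow.append(name)

if slow:
    print(f"WARNING: possible vDSO regression for: {', '.join(slow)}", file=sys.stderr)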

Hotstar live-stream chunker: vsyscall trap from a stale binary in the build pipeline. The chunker pipeline included a 2017-era statically linked C tool that called gettimeofday() directly. On a kernel with vsyscall=emulate the tool's calls trapped into the vsyscall emulation handler and cost ~1,800 ns each. The tool ran in the inner loop of the chunk-encoding pipeline — about 12,000 calls per second per worker, across 480 workers — and accounted for 8% of the cluster's CPU. After containerising the pipeline and switching to a base image with vsyscall=none, the tool segfaulted on its first vsyscall call. The fix: rebuild the tool with current glibc, which routes gettimeofday through the vDSO. Per-call cost dropped from 1,800 ns to 12 ns; cluster CPU dropped 7.4%. The savings paid for one full month of the cluster's compute bill in the first week.

A second lesson worth carrying: containerisation does not insulate you from ABI compatibility decisions made at the host kernel level. The vsyscall= setting is a kernel boot parameter — it applies to all containers on the host, regardless of what base image they use. A team that ships a "modern" Ubuntu 24.04 container is still subject to the host's vsyscall=emulate if the host operator left the default in place. Auditing the host kernel command line for vsyscall= and the container base images for binaries that still use the legacy interface is a 1-hour exercise that reveals stale binaries hiding in industrial-strength infrastructure.
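
A sketch of the host-side half of that audit. It reads the kernel command line and checks whether the legacy page is mapped into a long-lived process (PID 1 is used here as an assumption; any long-lived PID works, and reading another process's maps needs appropriate privileges).

# vsyscall_audit.py: report the host's vsyscall= setting and whether the legacy
# page is mapped. Run on the host itself, not inside a container namespace.
import re

cmdline = open("/proc/cmdline").read().strip()
match = re.search(r"vsyscall=(\S+)", cmdline)
print(f"kernel cmdline vsyscall setting: {match.group(1) if match else 'not set (default: emulate)'}")

# The [vsyscall] mapping, when present, appears at the same fixed address in every process.
with open("/proc/1/maps") as maps:
    vsyscall_lines = [line.strip() for line in maps if "[vsyscall]" in line]
print("vsyscall page mapped:", bool(vsyscall_lines))
for line in vsyscall_lines:
    print(" ", line)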

Razorpay payment scoring: CLOCK_BOOTTIME on Amazon Linux 2. A fraud-scoring service used CLOCK_BOOTTIME to enforce per-second rate limits because it didn't want NTP step-adjustments to corrupt the rate calculation. Amazon Linux 2 ships with kernel 4.14, where CLOCK_BOOTTIME is not in the vDSO. At 60,000 RPS each request issued one clock_gettime(BOOTTIME) call — 60,000 syscalls per second per host, costing 1,100 ns each, or 6.6% of one core per host across 80 hosts: 5.3 cores' worth of CPU spent on a clock read. After migrating to AL2023 (kernel 6.1, where CLOCK_BOOTTIME is in the vDSO), the same calls dropped to 12 ns each — 720 µs of CPU per second per host, well below 0.1% of one core. The migration ticket had been sitting in the team's backlog for 18 months; the syscall-cost savings alone would have justified prioritising it 18 months earlier.

The structural lesson from the Razorpay case is that clock-id selection is a performance decision, not just a correctness one. Engineers who pick CLOCK_BOOTTIME for "uptime that doesn't move when NTP steps the clock" or CLOCK_MONOTONIC_RAW for "time that isn't slewed by NTP frequency-disciplining" are making the right correctness choice for their specific use case, but they're picking it without consulting the kernel-version map of which clock ids have vDSO support. The right team practice is: maintain a one-page wiki of "which clock id is in the vDSO on which kernel version" alongside the team's deployment matrix, and require that any new use of a non-default clock id include a code comment justifying the choice and noting the per-call cost. This is the kind of small institutional habit that prevents the next 18-month backlog item from becoming a quarterly cost-review surprise.

A more general pattern from these three: the vDSO path is quietly extended over time by the kernel and glibc teams, and the gains are quietly available to applications that just call the standard wrapper. The losses are equally quiet — when something defeats the path, no error is logged, no metric blinks. Engineers who treat "is my service using the vDSO?" as a recurring check, not a one-time setup question, catch these regressions in hours; engineers who don't, catch them in CPU bills three months later. CRED's SRE team added a synthetic probe in 2024 — a tiny C program that runs every minute on every host and reports per-clock-id clock_gettime ns to Prometheus — and it has caught two glibc-update-driven regressions in the last 18 months that would otherwise have surfaced as quarterly cost reviews.

A fourth story worth naming in passing because it surfaces a different failure mode: PhonePe transaction logger on a KVM guest with TSC scaling. The logger called clock_gettime(CLOCK_REALTIME) from a hot path; on bare metal it cost 12 ns, on the KVM guest it cost 1,100 ns. The /sys/devices/system/clocksource/clocksource0/current_clocksource file reported kvm-clock instead of tsc, and the kvm-clock vDSO path was disabled on that hypervisor version because of a known correctness bug under live migration. The fix was a hypervisor upgrade that re-enabled the vDSO path for kvm-clock; the diagnosis took two weeks because nobody on the application team thought to look at the clocksource file. The lesson — applicable far beyond the vDSO — is that virtualisation introduces a layer of cost translation that can silently 90× any operation that depends on hardware features, and the diagnostic procedure for "why is my service slow on the cloud" should always include reading /sys files about the clocksource, the CPU governor, and the NUMA topology before reaching for application-level profilers.

How to verify the vDSO is actually being used

The diagnostic primitive is perf trace, which reports every syscall a process issues. If a process is calling clock_gettime but perf trace shows no clock_gettime syscalls, the vDSO is doing its job. If you see them, the vDSO is being defeated and you need to find out why.

# Attach to a running process for 5 seconds, count clock_gettime syscalls.
sudo perf trace -p $(pgrep my-service) -e clock_gettime -- sleep 5 2>&1 | tail -20

If the output shows zero clock_gettime events, the application is using the vDSO. If it shows hundreds or thousands, the application has somehow opted out — check for a clock id the running kernel does not support in the vDSO (CLOCK_BOOTTIME before 5.3, CLOCK_TAI before 5.6), a glibc version that routes that clock id through the syscall path, a statically linked binary built against a libc that predates vDSO support, a clocksource other than tsc, or a binary that issues the syscall directly instead of going through the libc wrapper.

A second useful check: read /proc/$PID/maps and look for [vdso] and [vvar] entries. Both should be present. If [vdso] is missing, the kernel is not mapping it (rare; check kernel config) or the process explicitly unmapped it (rare; some sandboxes do this for security). If [vvar] is present but [vdso] is not, that's a corrupted state and worth opening a kernel bug. If the map is present but the calls still go through the syscall path, the binary is issuing the syscall directly — objdump -d $BINARY | grep -w syscall may surface the offending call site.

Sample output from a healthy process, illustrating what the entries should look like:

$ cat /proc/$(pgrep -f my-service)/maps | grep -E '\[(vdso|vvar)\]'
7ffd31bf6000-7ffd31bf8000 r-xp 00000000 00:00 0                          [vdso]
7ffd31bf2000-7ffd31bf6000 r--p 00000000 00:00 0                          [vvar]

The r-xp permissions on the [vdso] mapping (read, execute, private) are the giveaway — it's a code page. The r--p on [vvar] (read-only, private) is the data page. The addresses will be different on every process and every restart because of ASLR. If you see the page sizes deviating from these (vdso is typically 8 KB on x86_64, vvar is 16 KB), you may be on a kernel with a non-default vDSO build, which is rare but worth noting. The [vsyscall] entry, if present, lives at the fixed 0xffffffffff600000 address and shows up as a separate line — its presence means the kernel was booted with vsyscall=emulate or vsyscall=xonly; its absence means vsyscall=none and the legacy interface is gone entirely.
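
The same check automates cleanly for a smoke test. A sketch, with the target PID and the pass/fail policy as assumptions to adapt to your service:

# vdso_maps_check.py: fail fast if a process is missing its [vdso] or [vvar] mapping.
# Usage: python3 vdso_maps_check.py <pid>   (reading another process's maps needs privileges)
import sys

def special_mappings(pid: int) -> dict:
    found = {}
    with open(f"/proc/{pid}/maps") as maps:
        for line in maps:
            for tag in ("[vdso]", "[vvar]", "[vsyscall]"):
                if line.rstrip().endswith(tag):
                    found[tag] = " ".join(line.split()[:2])   # address range + permissions
    return found

if __name__ == "__main__":
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else 1        # PID 1 as an illustrative default
    found = special_mappings(pid)
    for tag in ("[vdso]", "[vvar]", "[vsyscall]"):
        print(f"{tag:11s} {found.get(tag, 'not mapped')}")
    if "[vdso]" not in found or "[vvar]" not in found:
        sys.exit(1)                                           # treat a missing mapping as a failed smoke test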

For services that need defensive monitoring, the right pattern is a Prometheus exporter that times one clock_gettime per clock id per minute and exposes the per-call ns as a gauge. Alert if the gauge crosses 100 ns for any clock id that should be in the vDSO. This catches glibc regressions, kernel regressions, accidental seccomp-filter introductions, and base-image upgrades within minutes of deployment instead of weeks. The exporter is 80 lines of Go or 60 lines of Python; the alert config is one line of PromQL. The cost is essentially nothing; the saved engineering time is measured in person-days per regression caught.
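
A sketch of that exporter using the prometheus_client package. The gauge name mirrors the one mentioned earlier; the port, interval, and clock list are assumptions.

# clock_gettime_exporter.py: expose per-clock-id clock_gettime cost as a Prometheus gauge.
# Requires the prometheus-client package; port and interval are illustrative.
import time
from prometheus_client import Gauge, start_http_server

CLOCKS = {
    "MONOTONIC":     time.CLOCK_MONOTONIC,
    "MONOTONIC_RAW": time.CLOCK_MONOTONIC_RAW,
    "REALTIME":      time.CLOCK_REALTIME,
    "BOOTTIME":      getattr(time, "CLOCK_BOOTTIME", None),
}
CALLS = 1_000
GAUGE = Gauge("clock_gettime_path_ns", "Per-call clock_gettime cost in nanoseconds", ["clock"])

def per_call_ns(clk: int) -> float:
    t0 = time.perf_counter_ns()
    for _ in range(CALLS):
        time.clock_gettime_ns(clk)
    return (time.perf_counter_ns() - t0) / CALLS

if __name__ == "__main__":
    start_http_server(9109)                       # illustrative port
    while True:
        for name, clk in CLOCKS.items():
            if clk is not None:
                GAUGE.labels(clock=name).set(per_call_ns(clk))
        time.sleep(60)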

A complementary diagnostic is bpftrace -e 'tracepoint:syscalls:sys_enter_clock_gettime { @[comm] = count(); } interval:s:5 { print(@); clear(@); }', which prints a per-process count of clock_gettime syscalls every 5 seconds across the whole system. Run this on a quiet host and the count for healthy processes should be near zero — only NTP-related daemons, some monitoring agents, and processes using non-vDSO clock ids should show up. A process that shows thousands of events per 5-second interval is paying the syscall cost on every clock read. This system-wide view is more useful than per-process inspection when you're auditing a fleet for vDSO regressions, because it surfaces unexpected offenders without you having to know which processes to check first.

The bpftrace view also helps with a class of bugs that perf trace misses: short-lived processes. A web service that spawns a per-request shell to invoke date (don't laugh — this pattern shows up in legacy ETL pipelines and in some Python-via-subprocess monitoring scripts) issues thousands of clock_gettime syscalls per request from the spawned subprocess. perf trace -p $PID won't see them because the PID is gone before perf can attach; bpftrace's tracepoint-based approach catches them in flight. Hotstar's 2024 latency review found exactly this pattern: a backup-tag generator was forking date '+%s%N' 12 times per snapshot, costing 14% of CPU on the snapshot path. Replacing it with a single Python time.time_ns() call (which uses the vDSO via the host process's libc) dropped the cost to 0.02%.

Common confusions

Going deeper

How the kernel publishes vvar atomically

The vvar page contains a vdso_data struct: TSC offset, multiplier, shift, sequence counter, and a clock_mode enum. The kernel updates these on every timer tick (~1000 Hz on x86) but cannot do so atomically without a lock — a multi-word write is not atomic on x86 even at cache-line granularity. The trick is a seqlock: the kernel increments the sequence counter, writes the data, increments the counter again. The vDSO function reads the counter, reads the data, reads the counter again; if the two counter reads differ or the counter is odd (write in progress), the vDSO retries. This is the classical seqlock pattern, mature and well-understood, but it has one performance implication: a vDSO call that races with a kernel timer interrupt may retry once, costing an extra ~10 ns. At 100 Hz tick rate the probability per call is ~1e-8; at 1000 Hz it's ~1e-7 — invisible for most workloads, and visible only as an occasional extra ~10 ns in the deep latency tail of ultra-low-latency services. Tickless kernels (nohz_full) eliminate the bump entirely on isolated cores.
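
The retry loop is easy to express as a sketch. Real vDSO code pairs these reads with compiler and memory barriers, which Python cannot express, so treat this purely as an illustration of the control flow.

# Illustrative seqlock reader and writer: the control flow the vDSO uses against
# the vvar page. Real implementations need memory barriers around the seq reads.
def seqlock_read(shared: dict) -> dict:
    while True:
        seq1 = shared["seq"]
        if seq1 & 1:                        # odd: a write is in progress, try again
            continue
        snapshot = dict(shared["data"])     # read the published coefficients
        seq2 = shared["seq"]
        if seq1 == seq2:                    # unchanged: the snapshot is consistent
            return snapshot
        # changed: a write raced with the read; loop and retry

def seqlock_write(shared: dict, new_data: dict) -> None:
    shared["seq"] += 1                      # now odd: readers will retry
    shared["data"] = new_data
    shared["seq"] += 1                      # even again: readers may proceed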

The seqlock pattern also matters for what it does not allow: the vvar page is read-only from user space, so even a buggy or malicious application cannot corrupt the kernel's view of time. The asymmetry — kernel writes, user reads, both lock-free — is the key to why the vDSO is safe to expose as kernel code in user-mapped pages. Any attempt to write the vvar page from user space causes a SIGSEGV; the kernel's mapping flags ensure this. Engineers writing custom shared-memory protocols between processes (the moral analogue of the vvar page for inter-process communication) often forget the read-only-on-the-consumer side and end up with both producer and consumer races; the vDSO's design is a useful template for how to do it correctly.

The TSC-stability requirement and what happens when it's violated

The vDSO's TSC-based fast path requires that the TSC is invariant (constant rate across power states), stable (does not drift between cores), and synchronised (reads from any core return the same value modulo a fixed offset). Modern x86 CPUs satisfy all three (advertised by the constant_tsc, nonstop_tsc, and tsc_reliable flags in /proc/cpuinfo), but virtualisation environments sometimes don't. KVM with TSC scaling enabled is fine; older Xen versions were not; nested virtualisation often breaks TSC stability silently. When the kernel detects TSC instability at boot it switches the clocksource to HPET or acpi_pm, neither of which has a vDSO path. Every clock_gettime then becomes a real syscall, costing 1,100 ns instead of 12 ns. The diagnosis: cat /sys/devices/system/clocksource/clocksource0/current_clocksource should report tsc. If it reports anything else, you are paying full syscall cost on every clock read, and no glibc upgrade will fix it — the kernel has to be told the TSC is reliable (tsc=reliable boot parameter) or you have to live with the cost.

A separate failure mode worth knowing about: even with a reliable TSC, the kernel may temporarily mark it unstable during runtime (visible in dmesg | grep -i tsc) if it detects drift between cores during a watchdog check. When this happens the vDSO path is silently disabled until the next reboot. The clocksource_unstable event is logged in dmesg but rarely makes it to alerts. Long-running production hosts that have been up for years can accumulate one of these events early in their lifetime and run for years afterwards in slow-clock mode, paying 90× cost on every clock read. Adding dmesg | grep -i 'tsc.*unstable' to your host-monitoring playbook is a small, free check.
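
Both checks fold into one small host probe. A sketch that reads the sysfs clocksource files and scans the kernel log for TSC-unstable events (reading dmesg typically needs root):

# clock_health.py: host-level checks for the two silent vDSO killers, a non-tsc
# clocksource and a TSC marked unstable at runtime.
import subprocess

BASE = "/sys/devices/system/clocksource/clocksource0"
current = open(f"{BASE}/current_clocksource").read().strip()
available = open(f"{BASE}/available_clocksource").read().strip()
print(f"current clocksource:    {current}")
print(f"available clocksources: {available}")
if current != "tsc":
    print("WARNING: clock_gettime is likely taking the syscall path on this host")

try:
    log = subprocess.run(["dmesg"], capture_output=True, text=True, check=True).stdout
    unstable = [l for l in log.splitlines() if "tsc" in l.lower() and "unstable" in l.lower()]
    if unstable:
        print("WARNING: TSC was marked unstable at runtime:")
        for line in unstable[-3:]:
            print(" ", line)
except (subprocess.CalledProcessError, FileNotFoundError):
    print("note: could not read dmesg (insufficient privileges?)")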

Why getpid is conditionally in the vDSO

getpid() was added to the vDSO in 2010 (kernel 2.6.34) but removed in 2012 (kernel 3.5) because it was deemed insufficiently useful — a process knows its own PID at startup and can cache it in user space without any kernel involvement. Glibc since 2.3 has cached the PID after the first syscall and returned the cached value on subsequent calls (with a clear-and-refetch on fork()), making the vDSO version redundant. Some embedded glibc forks (musl, uClibc) cache PID more or less aggressively. The lesson is structural: the vDSO is for state that changes (time, current CPU); for state that's invariant within a process (PID, page size), user-space caching is strictly better.

What ARM64 and RISC-V do differently

ARM64 Linux ships an equivalent vDSO that exposes __kernel_clock_gettime, __kernel_gettimeofday, __kernel_clock_getres, and __kernel_rt_sigreturn. The mechanism is identical: kernel-mapped page, seqlock-protected vvar, user-mode execution. Apple Silicon (M1/M2/M3) uses a similar mechanism in Darwin, exposing _mach_absolute_time via the commpage (the macOS analogue of vvar). RISC-V Linux added vDSO support in 2021 with the same five clock functions, plus __vdso_flush_icache for instruction-cache flushing — useful for JIT compilers. The pattern is universal: every modern OS on every modern architecture exposes time-reading via shared memory rather than a syscall, because the syscall cost is intolerable for the call frequency these primitives see.

One ARM64-specific quirk worth knowing about: on ARM64 the time-reading primitive is cntvct_el0 (the virtual count register) rather than rdtsc, and reading it requires no privilege transition. The vDSO's per-call cost on a recent ARM64 (Graviton 3, Apple M2) is closer to 7 ns than the 12 ns typical of x86 — the simpler counter-read instruction and the absence of KPTI on ARM (ARM is largely immune to Meltdown by design) combine to make the fast path cheaper. Services migrating from x86 to ARM64 often see a small but real CPU win on clock-heavy paths even with no application changes. The Hotstar 2024 ARM migration measured a 1.8% cluster-wide CPU improvement attributable specifically to faster clock_gettime calls.

When to add a custom vDSO function

Almost never. The vDSO is part of the kernel ABI — adding a function requires kernel patches, glibc patches, and a permanent commitment to maintain backward compatibility. The bar for inclusion is roughly "every distribution will use this for 20 years". Recent additions (getcpu, vDSO support for new clock ids) took 5+ years from proposal to mainline. If you have a custom kernel with a custom syscall that's hot enough to want vDSO treatment, the answer is almost always to redesign the application to call it less often, not to add it to the vDSO. The only major vDSO extension in the last decade was __vdso_getrandom, user-space reads from the kernel's CSPRNG state, and even that took years of iteration before landing in kernel 6.11 (2024).

How restartable sequences (rseq) extend the same pattern

The vDSO publishes kernel-managed data into user-space; restartable sequences (rseq, mainlined in kernel 4.18) publish kernel-managed control flow primitives into user-space. An rseq is a short user-space critical section the kernel guarantees will either complete atomically or be restarted from the top — implemented by having the kernel check, on every preemption, whether the preempted IP falls inside a registered rseq region and if so, jumping to a user-supplied abort handler. The mechanism lets user code build per-CPU data structures without atomic operations: the rseq's atomicity guarantee replaces the lock cmpxchg. tcmalloc uses rseq for per-CPU caches; glibc's malloc does not yet but a patch series has been pending. The shared design pattern with the vDSO — let the kernel offer a primitive that user code can use without entering the kernel — generalises: as Linux ages, more and more performance-critical paths are being lifted into shared-memory or user-mode-cooperative designs, and the kernel is increasingly something you set up at boot and avoid talking to during steady-state.

What happens on Windows and macOS

Windows has a similar mechanism called the KUSER_SHARED_DATA page, mapped at a fixed virtual address (0x7FFE0000 on x86) and containing the system time, performance counter frequency, processor count, and a few other rarely-changing values. Windows API functions like GetTickCount and QueryPerformanceCounter read from this page directly. The macOS equivalent is the commpage, mapped near the top of every Mach process's address space; mach_absolute_time() reads from it. Both predate Linux's vDSO (KUSER_SHARED_DATA shipped in NT 4.0, the commpage in Mac OS X 10.0) and use the same fundamental design. The convergence across operating systems is striking: every modern OS has decided that publishing kernel state into user space is the right way to make sub-microsecond reads cheap, and the only meaningful differences are in the API surface, not the implementation strategy.

What LD_DEBUG=symbols reveals about vDSO resolution

A useful debugging trick few engineers know about: setting LD_DEBUG=symbols in the environment causes glibc's dynamic linker to log every symbol resolution it performs at process startup, including the vDSO ones. Run LD_DEBUG=symbols ls 2>&1 | grep -i vdso and you'll see lines like symbol=__vdso_clock_gettime; lookup in file=linux-vdso.so.1 — concrete confirmation that glibc found and bound the symbol. If the lookup fails (kernel mapped a vDSO without that symbol, or the symbol name has changed across kernel versions), the line will be absent and glibc will silently fall back to the syscall path. The diagnostic value is that this is a startup-time check — you can verify vDSO resolution before any application code runs, which is the safest place to catch the problem. Adding LD_DEBUG=symbols to a service's first-boot smoke test, capturing the output, and grepping for missing vDSO bindings is a 5-line addition that catches an entire class of regressions.

Why the vsyscall page persists in 2026

A reasonable question: if the vsyscall mechanism is dangerous, slower than the syscall path, and hasn't been the default for any binary built since 2007, why does the kernel still map it? The short answer is "because removing it breaks a small but nonzero number of binaries the maintainers don't want to lose". The longer answer is that Linux's userspace ABI commitment (Linus's "we don't break userspace" rule) is unusually strict — even a binary nobody has rebuilt in 18 years still has to work — and the cost of keeping vsyscall mapped (one 4 KB page per process, plus the trap-handler code in the kernel) is small enough that the maintainers have repeatedly chosen to keep it. The vsyscall=none option exists for hardened kernels that are willing to break legacy binaries; the vsyscall=xonly option exists as a middle ground. The default of vsyscall=emulate is the conservative choice, and is unlikely to change before 2030 unless the userspace surveys find that essentially zero remaining binaries actually use the page.

Where this leads next

The vDSO is one specific instance of a more general pattern: publish kernel state to user space when safety allows, and let user code read it directly. The same pattern shows up in io_uring's shared rings (the kernel publishes completed I/O events to a memory-mapped ring), in eBPF maps (kernel-side BPF programs publish counters to user-readable maps), and in mmap-based shared memory between processes. Each of them trades a privilege transition for a memory access, and each pays back the trade thousands of times per call.

For the practical engineer, the next chapters in this part of the curriculum dig into:

The deeper lesson is one this whole curriculum keeps returning to: in modern systems, the boundary cost is usually larger than the work cost, and the right design question is "where can I avoid crossing a boundary?" rather than "how can I make this work faster?". The vDSO is one of the cleanest examples of that principle, hiding in plain sight inside every Linux process.

A second thread worth following from here: the vDSO's design — kernel publishes data to a memory-mapped page, user code reads it lock-free — is the same shape as many high-performance inter-process communication patterns, including LMAX Disruptor's ring buffers, Aeron's shared-memory transport, and the iceoryx zero-copy IPC system used in autonomous-vehicle stacks. Once you've internalised why the vDSO is fast, you've internalised the broader template: make the contended state read-only on the consumer side, mutated by exactly one writer, with version numbers for atomicity. That template is the foundation of most lock-free data structures and most high-throughput message-passing systems, and the vDSO is the simplest production example of it that ships in every Linux process.

The third thread is observability: the vDSO is invisible to most tracing tools because it never crosses the syscall boundary. Building intuition for what is not visible to your tracer — vDSO calls, user-space spinlocks, cache-line bounces, branch mispredictions — is as important as building intuition for what is visible. The mature performance engineer keeps a mental list of "things my tracer cannot see" and reaches for hardware-counter tools (perf stat, PMU sampling) when an application looks fast in the syscall trace but slow in the wall clock. The vDSO is the introductory case study for this gap, and every chapter from here onward in the curriculum returns to it.

References