Tracing syscalls and kernel functions
Karan runs the order-matching path at Zerodha Kite. At 09:15:03 IST every weekday, the cash-equity market opens and his service goes from 4,000 RPS to 180,000 RPS in eleven seconds. Last Tuesday the open went sideways: p99 of order-acknowledgement climbed from 8 ms to 92 ms for ninety seconds, then settled. The application flamegraph from py-spy showed nothing — 38% in socket.send, 22% in _pickle.dumps, the usual shape — because everything interesting was happening below the syscall boundary. The Python process was blocked in the kernel, in tcp_sendmsg, in __alloc_skb, in places no userspace profiler can reach. Karan had two questions and one tool. The questions: which syscall was the slow one, and which kernel function inside that syscall was eating the time. The tool: a kprobe on tcp_sendmsg and a tracepoint on syscalls:sys_enter_sendto, attached for ninety seconds across the next market open, with the latency histogram bucketed by remote IP and order-book partition. By Wednesday's open Karan knew the answer was a slab-allocator stall during __kmalloc_node when 240,000 socket buffers were being allocated per second — and he knew it because two probes, costing 80 nanoseconds each, told him.
A syscall is the boundary your application profiler stops at. Tracing past that boundary means attaching a probe to a kernel function — either a kprobe (works on any non-inlined symbol but is ABI-unstable) or a tracepoint (a stable, pre-declared instrumentation point with named arguments). The choice between them is the choice between flexibility and forward-compatibility, and getting it wrong means a script that worked on Linux 5.15 will silently lie on 6.6.
The boundary the application profiler cannot cross
Every userspace tool — py-spy, pprof, async-profiler, perf record -F 99 -p <pid> — works by interrupting the target process and walking its userspace stack. The walk stops when it hits a syscall instruction, because past that instruction the process is no longer running its own code; the kernel is. The CPU is in ring 0, the page tables have switched (on a post-Meltdown kernel with KPTI, the kernel runs on its own page table), and the registers point at kernel addresses the userspace tool has no symbol table for. So the application flamegraph collapses everything below the syscall into a single bar — usually labelled [unknown] or, if the tool is honest, entry_SYSCALL_64. For a service whose work is mostly userspace compute (a tight Python numerical loop, a Go JSON parser), the missing region is small. For a service whose work is mostly I/O — every payment gateway, every order-matching engine, every web server doing more than 10,000 RPS per core — the missing region is most of the wall time.
A small but worth-noting consequence of the userspace-stops-at-syscall rule: the wall-clock latency of a request is not the sum of the userspace times the application profiler shows. A request that the application bills as "8 ms of work" is in fact 4 ms of userspace plus 3.5 ms of kernel time spent across a dozen syscalls plus 0.5 ms of off-CPU wait. The application's own metrics, even if they are honestly reported, miss the kernel and the off-CPU components by construction. This is why every serious performance investigation eventually opens a kernel tracer; the alternative is debugging in the dark.
To see inside that region, you have to put the probe on the kernel side. The kernel exposes three families of probe points, and the right one to use depends on what is being measured. System call entry/exit tracepoints (syscalls:sys_enter_<name> and syscalls:sys_exit_<name>) are the highest-level: they fire exactly when a userspace process invokes or returns from a syscall, and the tracepoint's pre-declared arguments let you read the syscall number, the file descriptor, the buffer length, the return value. Kprobes sit one level deeper: you can attach one to almost any non-inlined kernel function — tcp_sendmsg, __alloc_skb, do_filp_open — and read its arguments and return value. Static tracepoints are pre-declared instrumentation points that the kernel's maintainers have committed to keeping stable across versions; they live at carefully chosen sites (block I/O completion, scheduler wake-up, page fault handling) and offer the same observability as a kprobe with a contract that the names and argument shapes will not change.
The difference between kprobe and tracepoint is the difference between "I will bolt a sensor onto any internal pipe in the building" and "I will read from a sensor the architect installed". Both work; one of them keeps working when the building is renovated.
A useful way to internalise the three families is to count them on a real kernel. On a 6.6 box, sudo bpftrace -l 'tracepoint:syscalls:*' returns about 720 entries (one entry and one exit tracepoint per syscall, with a few hundred syscalls in the table). sudo bpftrace -l 'tracepoint:*' | grep -v syscalls: returns roughly 1,400 more — the broader static-tracepoint family covering scheduler, block I/O, networking, memory management, and so on. sudo bpftrace -l 'kprobe:*' | wc -l returns somewhere between 65,000 and 80,000 depending on the kernel config — basically every non-inlined function in the kernel. The number gap (2,000 stable points vs 70,000 ABI-unstable points) names the trade-off: tracepoints cover the well-trodden paths the maintainers cared enough to instrument, kprobes cover everything else.
sendto() traverses five or more kernel functions before any byte hits the wire. Tracepoints sit at stable contract points (entry, exit, scheduler events); kprobes attach to any non-inlined function but break across kernel versions. Probe overhead is roughly 50–80 ns per fire, dominated by the int-3 trap (kprobe) or the static jump (tracepoint).
Why the overhead numbers matter for production attach: a kprobe on a function called once per packet at 240,000 packets/s costs roughly 240,000 × 80 ns = 19.2 ms of CPU per second, or about 1.9% of one core. If each core of a 16-core box sends at that per-packet rate, the same probe consumes roughly a third of one CPU in aggregate. Tracepoints, because they use a static jump that the kernel patches in at runtime rather than an int-3 trap, cost about half as much — closer to 0.95% on the same workload. For Karan's market-open use case (90 seconds, low single-digit percent acceptable), either is fine. For an always-on production attachment to a hot function, the difference between 1% and 2% is the difference between "shipped" and "rejected at code review".
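The overhead arithmetic above is worth keeping as a helper you can run before attaching anything. A minimal sketch (the function name and default costs are mine, drawn from the per-fire figures quoted in this section, not from any library):

```python
def probe_overhead(fires_per_sec: float, ns_per_fire: float = 80.0) -> float:
    """Fraction of one core consumed by a probe: fire rate x per-fire cost.

    ns_per_fire defaults to ~80 ns (kprobe); use ~50 ns for a tracepoint,
    ~150-200 ns for a kretprobe, ~10 ns for fentry.
    """
    return fires_per_sec * ns_per_fire * 1e-9

# Karan's market-open rate: 240k sendto calls/s through one kprobe.
kp = probe_overhead(240_000, 80)   # ~0.0192 -> 1.9% of one core
tp = probe_overhead(240_000, 50)   # ~0.0120 -> 1.2% as a tracepoint
```

Running the estimate against the function's measured fire rate (a five-second count() survey, as shown later in this chapter) is the cheap way to decide whether a probe is safe to attach in production.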
A walking demonstration — what the syscall layer looks like through bpftrace and BCC
Below is a Python script that drives bpftrace to attach to the syscall-entry tracepoint for sendto, captures a minute of data from the local machine, parses the histogram output, and prints a per-PID latency summary. This is the canonical shape of a production syscall tracer: tracepoint for the timing boundary, BPF map for in-kernel aggregation, Python orchestrator for parsing and reporting. The script is what Karan ran during his market-open investigation, scaled down to a laptop-sized reproduction.
#!/usr/bin/env python3
# trace_sendto_latency.py
# Attach syscall entry/exit tracepoints for sendto(), capture per-PID
# latency histograms over 60 seconds, print the percentile ladder
# for each PID that called sendto more than 100 times.
import re
import subprocess

PROGRAM = r'''
tracepoint:syscalls:sys_enter_sendto {
    @start[tid] = nsecs;
}
tracepoint:syscalls:sys_exit_sendto / @start[tid] / {
    $delta_us = (nsecs - @start[tid]) / 1000;
    @lat_us[pid, comm] = hist($delta_us);
    @count[pid, comm] = count();
    delete(@start[tid]);
}
interval:s:60 { exit(); }
'''

result = subprocess.run(
    ["bpftrace", "-e", PROGRAM],
    capture_output=True, text=True, timeout=75,
)
out = result.stdout

# bpftrace prints a histogram per (pid, comm) key as ASCII bars.
# Parse the @lat_us section into structured percentiles.
hist_pat = re.compile(r"\[(\d+), (\d+)\)\s+(\d+)")
section_pat = re.compile(r"@lat_us\[(\d+), (.+?)\]:")
sections = re.split(r"\n(?=@lat_us\[)", out)
for section in sections:
    m = section_pat.search(section)
    if not m:
        continue
    pid, comm = int(m.group(1)), m.group(2)
    buckets = [(int(lo), int(hi), int(c))
               for lo, hi, c in hist_pat.findall(section)]
    total = sum(c for _, _, c in buckets)
    if total < 100:
        continue
    cum, p50, p99, p999 = 0, None, None, None
    for lo, hi, c in buckets:
        cum += c
        if p50 is None and cum >= 0.50 * total: p50 = hi
        if p99 is None and cum >= 0.99 * total: p99 = hi
        if p999 is None and cum >= 0.999 * total: p999 = hi
    print(f"pid={pid:>6} comm={comm:<14} n={total:>6} "
          f"p50={p50:>5}us p99={p99:>6}us p99.9={p999:>7}us")
# Sample run on a c6i.4xlarge in ap-south-1 acting as a Kite gateway during simulated market open:
$ sudo python3 trace_sendto_latency.py
pid= 14821 comm=kite-matcher n= 9842 p50= 8us p99= 124us p99.9= 480us
pid= 14823 comm=kite-matcher n= 9311 p50= 8us p99= 132us p99.9= 512us
pid= 14831 comm=kite-bookkeeper n= 4127 p50= 12us p99= 210us p99.9= 980us
pid= 14842 comm=nginx n= 2834 p50= 4us p99= 32us p99.9= 96us
pid= 14901 comm=otel-collector n= 1218 p50= 6us p99= 48us p99.9= 124us
Walk-through. PROGRAM is the bpftrace source as a Python raw string; raw strings stop Python from interpreting \d and other escapes. The two tracepoints (sys_enter_sendto, sys_exit_sendto) bracket each syscall: entry stores the timestamp keyed by thread id, exit subtracts to get the elapsed time and updates a histogram keyed by (pid, comm). The / @start[tid] / filter on the exit tracepoint is critical — without it, an exit whose entry fired before the tracer attached would compute nsecs - 0 = nsecs, polluting the histogram with garbage deltas on the order of the system uptime. Filters that test "the entry side has stored a timestamp" are how every entry-exit tracer keeps its data clean. subprocess.run(..., timeout=75) runs bpftrace for the 60-second probe window plus 15 seconds of slack; bpftrace exits cleanly when the interval:s:60 { exit(); } action fires. The histogram-parsing block turns bpftrace's ASCII bar output into structured percentiles; the two regex patterns match section headers (@lat_us[14821, kite-matcher]:) and bucket lines ([8, 16) 1234 |@@@@@@@@|). The percentile loop walks the sorted buckets, tracking cumulative count, and snapshots p50 / p99 / p99.9 as it crosses each threshold; this is the same cumulative-walk algorithm HdrHistogram uses internally, and it is faithful enough for tracing data.
Why this script chose tracepoints over kprobes for the syscall boundary: sys_enter_sendto and sys_exit_sendto are stable across all Linux versions from 4.7 onwards because the syscall tracepoint family is auto-generated from the syscall table itself — the kernel maintainers cannot rename sendto without renaming the syscall, and the syscall ABI is one of the strongest stability contracts in the kernel. A kprobe on the entry function (__sys_sendto on x86_64, __arm64_sys_sendto on ARM, do_sys_sendto on older kernels) would have to be conditionalised by architecture and version. The syscall tracepoint family is the one place where the "stable contract" promise is unambiguously kept.
A second example, this time using a kprobe to look one level below the syscall — into tcp_sendmsg, where the actual TCP send work happens. The shape is similar but the trade-offs are different.
# trace_tcp_sendmsg_latency.py (excerpt)
PROGRAM = r'''
kprobe:tcp_sendmsg {
    @start[tid] = nsecs;
    @sock[tid] = arg0;   // struct sock *sk
    @len[tid] = arg2;    // size_t size
}
kretprobe:tcp_sendmsg / @start[tid] / {
    $delta_us = (nsecs - @start[tid]) / 1000;
    $bytes = @len[tid];
    @lat_by_size[$bytes >> 10] = hist($delta_us);   // bucketed by KiB
    delete(@start[tid]); delete(@sock[tid]); delete(@len[tid]);
}
interval:s:30 { exit(); }
'''
# (parsing identical to the sendto script)
The shift from tracepoint to kprobe gives Karan something the tracepoint cannot: arguments. tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size) exposes the socket and the byte count; the tracepoint exposes only the syscall-level arguments. Kprobes can read any function's arguments because they attach at the function's entry, where the calling convention puts them in registers (rdi, rsi, rdx, rcx, r8, r9 on x86_64 SysV). The cost is the ABI-unstable contract: when Linux 4.1 dropped the leading struct kiocb * argument from the kernel's sendmsg implementations, the byte count in tcp_sendmsg moved from arg3 to arg2, and scripts that had hard-coded the old index silently read a pointer instead of a length. Tracepoints have no such risk — their argument layout is part of the contract.
A useful third example, this time crossing the boundary in the opposite direction. The two scripts above started from a syscall and went down toward the kernel; this one starts from a kernel function and asks "what userspace process triggered this". Imagine the answer to "which container is causing the slab allocator to spike" — you need to attach to __alloc_skb and read bpf_get_current_pid_tgid() and bpf_get_current_comm() to attribute every kernel-side allocation to the userspace process that triggered it. The bpftrace one-liner is kprobe:__alloc_skb { @[comm, pid] = count(); } interval:s:30 { print(@); clear(@); }, and on a Hotstar transcoding fleet during the IPL final the output cleanly separates the ten or so worker pods by their per-second skb allocation rate. The kernel-side probe gives a view that no application-level metric can: every container's true cost in kernel resources, not just the cost the application measured itself. This is the unique value of kernel tracing — it sees the cost the application cannot bill itself for.
How a probe actually fires — the int-3 trap path
The mechanism behind a kprobe is worth knowing because it shapes the failure modes. When you attach a kprobe to tcp_sendmsg, the kernel does three things atomically: it copies the first instruction of the function to a kernel buffer, replaces that instruction with int 3 (the x86 single-byte breakpoint, 0xCC), and registers a handler for the do_int3 trap at that address. When the function is called, the CPU executes int 3, traps into kernel mode, the breakpoint handler runs the registered probe (your eBPF program), then single-steps the saved original instruction in a controlled context, and resumes at the second instruction of the function. The whole dance takes 80–120 ns on a modern x86 part, dominated by the trap-frame save and the IRET that returns from it.
The "atomically" part is doing real work in that paragraph. The kernel cannot simply patch one byte and hope no CPU is executing the function at that instant; on a 64-core box one almost certainly is. The actual sequence uses the text_poke_bp mechanism, a three-step state machine: the kernel first writes an int 3 at the patch site and broadcasts a synchronisation IPI (inter-processor interrupt) so every CPU sees the change, then patches the remaining bytes of the instruction (if multi-byte) and sends another IPI; any CPU that hits the int 3 during the patch window simply runs the probe handler, which is the desired behaviour. The whole orchestration takes about 5 microseconds for a single kprobe attach on a 16-core box, scaling roughly linearly with core count. For attach-and-detach loops (a wrapper that attaches, runs a 100-ms benchmark, detaches, repeats) this overhead is the dominant cost and worth knowing about; for once-an-incident attaches it is invisible.
This is why kprobes cannot attach to inlined functions, to functions in the trap-handling path itself (which would recurse infinitely), or to a handful of no-kprobe symbols the kernel marks with __kprobes — attaching a probe there would deadlock the system, so the kernel refuses the attach. The blacklist is exported at /sys/kernel/debug/kprobes/blacklist; on a modern kernel it runs about 60 entries, mostly entry/exit assembly thunks and the int-3 handler itself. When a script fails to attach with Cannot install kprobe, the symbol is almost always either inlined out at compile time or on the blacklist.
The inlining trap is worth a paragraph because it is the one most likely to confuse a first-time tracer. Linux's hot kernel functions are aggressively inlined by the compiler when LTO and PGO are enabled (which is the default on most distributions since 2022). A function declared static inline in a header may have no symbol at all in the running kernel; a function that is inlined at most call sites but kept as a real symbol at one or two will fire your probe only at those one or two call sites, not at the hundreds of inlined ones. Running sudo bpftrace -l 'kprobe:tcp_*' | wc -l on a 6.6 kernel and comparing to the same query on a 5.10 kernel typically shows a 30% reduction in available kprobe targets — the same TCP stack, but more of it inlined. The fix is to attach to a parent function that has not been inlined and use it as a proxy for the inlined inner function; the resulting fire rate will be different (the parent fires once per packet batch, the inlined inner once per packet) but the timing semantics are usually preserved.
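One way to catch the inlining trap before it bites is to check the live symbol table first. A hedged sketch (the helper name is mine) that works against any /proc/kallsyms-shaped input:

```python
def kprobe_attachable(symbol: str, kallsyms_lines) -> bool:
    """True if `symbol` appears as a text symbol (type t/T) in a
    /proc/kallsyms-style listing -- i.e. the compiler did not inline
    it away. Does NOT check the kprobe blacklist; a symbol can exist
    and still be refused at attach time.
    """
    for line in kallsyms_lines:
        parts = line.split()
        # kallsyms format: <addr> <type> <name> [module]
        if len(parts) >= 3 and parts[1] in ("t", "T") and parts[2] == symbol:
            return True
    return False

# On a real box:
#   with open("/proc/kallsyms") as f:
#       print(kprobe_attachable("tcp_sendmsg", f))
```

A pre-flight check like this turns the silent "probe never fires" failure into an explicit "symbol not present on this kernel" error message.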
Tracepoints work differently. The kernel source contains, scattered throughout, calls to trace_<name>() macros — trace_sched_switch(), trace_sys_enter(), trace_block_rq_issue(). At compile time the macro expands to a NOP (zero overhead when no tracer is attached) guarding a call to a per-tracepoint dispatcher; when a probe attaches, the kernel uses text_poke to patch the NOP into a JMP that reaches the dispatcher. There is no trap, no IRET, no register-save dance — just a direct jump to the probe handler and back. The cost is roughly 40–60 ns, about half that of a kprobe.
The "zero overhead when no tracer is attached" claim is worth verifying because it is the property that makes tracepoints viable to ship in production kernels. The NOP that the compiler emits at every tracepoint site is actually a multi-byte nop (5 bytes on x86_64, sized to be replaceable by a JMP rel32 without splitting any other instruction). When no tracer is attached the CPU executes the NOP at full pipeline throughput — the front-end fetches it, the decoder marks it as a NOP, the rename stage skips it, and it never occupies an issue port or a reservation-station entry. On modern OoO cores the NOP is effectively free; measurements with perf stat on a kernel with versus without tracepoint sites compiled in show no detectable performance difference. This is why the kernel maintainers can liberally scatter TRACE_EVENT(...) macros throughout hot paths without performance review — the cost is effectively zero until someone attaches.
A kprobe replaces the first instruction with int 3 and pays the trap-and-return cost on every fire (~80 ns). A tracepoint patches a NOP into a JMP at attach time and pays only the call cost (~50 ns). At million-fire-per-second rates the difference compounds into significant CPU; for ad-hoc 60-second investigations it is invisible.
Why this matters for choosing your probe site: at low fire rates (sub-1,000/s) the mechanism difference is invisible — a 30-ns gap multiplied by 1,000 fires/s is 30 µs/s of CPU, about 0.003%. At high fire rates (above 100k/s on a function the scheduler hits, like __schedule or finish_task_switch) the gap becomes operationally meaningful: at 100,000 fires/s the tracepoint costs 5 ms/s of CPU per core and the kprobe 8 ms/s — the difference between 0.5% and 0.8% overhead. For an always-on production tracer measuring scheduler latency, that 0.3% is the difference between "shippable" and "send back to redesign". For Karan's 90-second market-open trace, both are noise.
Reading the output without lying to yourself
The histogram from trace_sendto_latency.py has two failure modes that look like signal but are not. The first is measurement-bias from the probes themselves — if your probe adds 80 ns to every syscall and the syscall normally takes 4 µs, you have inflated p50 by 2%. The bias is uniform across the histogram so percentile ratios (p99/p50, p99.9/p99) stay accurate, but the absolute numbers are biased high. For latency-sensitive p99 work this is acceptable; for absolute throughput claims it is not. Always report numbers from a probe-attached run as "with tracing overhead included" rather than as absolute production latency. A second run with the probes detached will give you the unbiased baseline.
The second failure mode is PID drift on short-lived processes. The histogram is keyed by (pid, comm). If a process forks a worker that calls sendto once and exits, the worker's PID shows up in your output with n=1 and a single-bucket histogram — useless for percentile calculation, but it was a real syscall. If you're computing aggregate p99 across all processes, the script's total < 100 filter drops these short-lived workers and your aggregate is biased toward long-running processes. For the Zerodha use case this is fine (the order matcher is the long-running process and the p99 of its sendto is the question). For Hotstar's transcoding fleet, where workers are deliberately short-lived, the filter would drop most of the signal — replace it with a per-comm aggregation that ignores PID.
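The per-comm replacement for the total < 100 filter is a small fold over the parsed buckets. A minimal sketch (the function name is mine; the input shape matches what the parsing loop in trace_sendto_latency.py produces):

```python
from collections import defaultdict

def merge_by_comm(per_pid_hists):
    """per_pid_hists: {(pid, comm): [(lo, hi, count), ...]}
    Returns {comm: {(lo, hi): count}} with PID ignored, so a comm
    whose workers each made a single sendto call still accumulates
    a histogram large enough for percentile calculation.
    """
    merged = defaultdict(lambda: defaultdict(int))
    for (pid, comm), buckets in per_pid_hists.items():
        for lo, hi, c in buckets:
            merged[comm][(lo, hi)] += c
    return {comm: dict(b) for comm, b in merged.items()}

# Three short-lived workers, one or two calls each -- individually
# useless, together a usable histogram.
hists = {
    (101, "worker"): [(8, 16, 1)],
    (102, "worker"): [(8, 16, 1), (16, 32, 1)],
    (103, "worker"): [(16, 32, 1)],
}
merged = merge_by_comm(hists)
# merged["worker"] == {(8, 16): 2, (16, 32): 2}
```

The percentile walk from the main script then runs unchanged over each comm's merged buckets.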
A third trap, subtler than the first two: the syscall tracepoint counts only the syscall, not what triggered it. A sendto of 1460 bytes that takes 50 µs could be 50 µs of TCP send work, or 5 µs of TCP send work followed by a 45-µs context switch because the calling thread was preempted between sys_enter_sendto and sys_exit_sendto. The histogram cannot distinguish them. To attribute the slow syscall to one cause or the other, you have to attach a second probe — either tracepoint:sched:sched_switch (to see if the thread was scheduled out during the syscall) or kprobe:tcp_sendmsg (to bracket the actual send work and subtract). Karan's investigation needed both; the first 90 seconds of his probe attached only the syscall tracepoints, and the resulting p99 of 92 ms looked like a TCP-stack issue. The second 90 seconds attached the scheduler tracepoint as well, and the histogram split: 60% of the slow syscalls had been preempted, 40% had been blocked in __alloc_skb. The fix — raising the slab cache water-mark and pinning the matcher to dedicated cores via cgroup cpuset — targeted both halves.
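The source does not show Karan's second-round script, but its shape follows from the paragraph: syscall tracepoints for the bracket, plus sched:sched_switch to mark threads scheduled out mid-syscall. A hedged sketch in the same bpftrace-in-a-Python-string style as the scripts above (map and variable names are mine):

```python
# Hypothetical second-round program: split the sendto latency
# histogram into preempted vs on-CPU populations.
ATTRIBUTION_PROGRAM = r'''
tracepoint:syscalls:sys_enter_sendto {
    @start[tid] = nsecs;
    @preempted[tid] = 0;
}
tracepoint:sched:sched_switch / @start[args->prev_pid] / {
    // prev_pid is the tid of the thread being scheduled out; if it
    // has an open sendto bracket, mark the syscall as preempted.
    @preempted[args->prev_pid] = 1;
}
tracepoint:syscalls:sys_exit_sendto / @start[tid] / {
    $delta_us = (nsecs - @start[tid]) / 1000;
    if (@preempted[tid]) {
        @lat_preempted_us = hist($delta_us);
    } else {
        @lat_oncpu_us = hist($delta_us);
    }
    delete(@start[tid]);
    delete(@preempted[tid]);
}
interval:s:90 { exit(); }
'''
```

Two separate histograms come back: if the preempted one carries most of the slow tail, the problem is scheduling, not the TCP stack — which is exactly the split Karan's second 90-second window produced.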
A second-order practical concern: filter early, filter in the kernel. Every byte that crosses from the kernel-side BPF program to the userspace tracer pays for itself in ring-buffer space and userspace CPU. If you only care about sendto calls from the matcher process, the bpftrace program should test pid == 14821 inside the action and return early otherwise; this is roughly fifty times cheaper than letting the userspace tracer filter on PID after the event has crossed. The same logic applies to byte-length filtering, fd filtering, comm filtering. The mantra to keep: every event that reaches userspace is an event that has paid for the trip; filter before you pay.
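The kernel-side PID filter from the paragraph above, expressed as a small program builder in the document's own style (the builder function is mine; the PID is an example value):

```python
def sendto_program(target_pid: int, seconds: int = 60) -> str:
    """Build a bpftrace program that times sendto() for one PID only.
    The pid test runs inside the kernel-side action, so events from
    every other process are dropped before they cost ring-buffer
    space or userspace CPU.
    """
    return f'''
tracepoint:syscalls:sys_enter_sendto /pid == {target_pid}/ {{
    @start[tid] = nsecs;
}}
tracepoint:syscalls:sys_exit_sendto /pid == {target_pid} && @start[tid]/ {{
    @lat_us = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]);
}}
interval:s:{seconds} {{ exit(); }}
'''

# e.g. subprocess.run(["bpftrace", "-e", sendto_program(14821)], ...)
```

Note the doubled braces in the f-string: bpftrace's block braces have to be escaped so Python does not treat them as format fields.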
A fourth and final reading trap, the most insidious: the histogram only sees what was sampled. If your probes attached at 14:32:00 and the slow-syscall episode happened at 14:31:58, the histogram contains no signal. Tracing is by construction not retrospective — you have to be probing at the moment the event happens. This is why production tracing setups for known-recurring events (market open, IPL match start, Big Billion Day midnight) are scheduled in advance: a cron job attaches the probe ten minutes before the event window and detaches it ten minutes after. For events that recur on an unpredictable schedule (rare crash modes, random p99 spikes), the choice is between always-on tracing (with the overhead cost amortised across days of quiet) and continuous sampling at a low rate. Continuous profiling tools like Pyroscope, Parca, and Polar Signals have made the always-on case viable; their kernel-side cost is on the order of 1% of one core because they sample at 99 Hz instead of attaching to every fire.
The choice between scheduled-attach and always-on tracing is also a security choice. A bpftrace script attached to tcp_sendmsg can read every byte of every payload sent by every process on the box — the eBPF program has access to the kernel's struct msghdr and can bpf_probe_read_kernel the payload buffer. For payment data this is a regulated-data leak waiting to happen. Production fleets at Razorpay and PhonePe gate kernel tracing behind break-glass procedures: an SRE on call gets CAP_BPF via an ephemeral cert with a 1-hour TTL, every attached probe is audit-logged with the script source, and the audit log is shipped to an immutable store. The same security model that gates kernel-module insertion is the right model for kernel tracing access, because the capability they give is essentially the same: read arbitrary kernel memory, observe arbitrary events.
Common confusions
- "Tracepoints and kprobes are interchangeable." They are not. Tracepoints have a stable ABI contract; kprobes do not. A script using kprobe:tcp_sendmsg may break silently on the next kernel upgrade because the function was renamed, inlined, or had its signature changed. A script using tracepoint:syscalls:sys_enter_sendto will not, because the syscall tracepoint family is bound to the syscall ABI itself. For ad-hoc investigation either works; for shipped tooling, prefer tracepoints when one exists.
- "A kprobe lets me attach to any function." Almost. The kprobe blacklist (~60 entries) marks symbols where attachment would deadlock or recurse — the int-3 handler itself, parts of the IRQ entry path, the trampolines that switch page tables. Inlined functions also cannot be attached because no symbol exists at runtime. When bpftrace returns Cannot install kprobe, the symbol is one of these.
- "Probe overhead is the same regardless of fire rate." Per-fire overhead is constant (~80 ns kprobe, ~50 ns tracepoint), but total overhead scales linearly with fire rate. A kprobe on tcp_sendmsg at 240k fires/s consumes 1.9% of one core; the same probe on __schedule at 5M fires/s consumes 40%. Always measure or estimate the fire rate of the function before attaching in production.
- "Syscall latency from the tracepoint is the work the kernel did." It is the wall-clock time between sys_enter and sys_exit, which includes any preemption, page faults, or scheduling delays during the syscall. To isolate the kernel work, you need a second probe to subtract out the off-CPU time. See /wiki/off-cpu-flamegraphs-the-other-half.
- "Lost samples on the bpftrace ringbuf are minor." They are not. A Lost N events message means N data points are missing from your histogram. If the lost events were uniformly distributed, your percentiles are still approximately correct. If they were correlated with the slow path (most likely — slow events take longer to format and ship), your p99 is silently low by an unknown amount. Tighter in-kernel filtering is the only fix.
- "perf trace is the same as a tracepoint script." perf trace does use the syscall tracepoints, but it is a high-level tool optimised for live syscall display, not for histogram aggregation. It buffers per-event records, formats them, and prints; the per-event cost is ~5 µs and the tool drops events under load above 50k syscalls/s. For aggregation, write a bpftrace or BCC script that uses an in-kernel histogram. perf trace is the right tool for "show me the syscalls a hung process is making"; bpftrace is the right tool for "what is the p99 of sendto across the last 60 seconds".
Going deeper
The syscall ABI is the strongest stability contract in Linux
The syscall tracepoint family (syscalls:sys_enter_*, syscalls:sys_exit_*) is auto-generated from the syscall table. When the kernel adds a new syscall — io_uring_enter in 5.1, pidfd_open in 5.3, epoll_pwait2 in 5.11 — the tracepoints appear automatically with the same name and argument shape. When a syscall is deprecated, the tracepoint stays for binary-compatibility reasons. The strongest argument for choosing the syscall tracepoint over a kprobe on the entry function is that you are inheriting the kernel's strongest backwards-compatibility promise. In 25 years of Linux history, no syscall has ever silently changed argument types; even clone3 (which has a different signature than clone) was given a new syscall number and a new tracepoint rather than mutating the old one. If you want a probe that will still work on a kernel released ten years from now, the syscall tracepoint is your safest bet.
The same is not true for the broader tracepoint family. sched:sched_switch, block:block_rq_issue, irq:irq_handler_entry — these are stable in name but have evolved in argument shape over the years. sched:sched_switch gained a prev_state field in 4.10. block:block_rq_issue had its cmd argument changed from a free-form string to an enum in 5.13. The tracepoints are meant to be stable, but the maintainers have occasionally broken the contract for clarity. Treat them as stable-ish: more stable than kprobes, less stable than syscall tracepoints. For production tooling, snapshot the tracepoint format with cat /sys/kernel/debug/tracing/events/sched/sched_switch/format at build time and re-validate on each kernel upgrade.
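The snapshot-and-re-validate step reduces to parsing the format file's field declarations and fingerprinting them. A sketch of what a build-time gate could run against each target kernel (helper names are mine; the sample text is a trimmed, illustrative format file, not a verbatim dump from any particular kernel):

```python
import hashlib
import re

def format_fingerprint(format_text: str) -> str:
    """Hash the field declarations of a tracepoint format file so a
    CI gate can compare kernels: same hash -> same argument shape.
    Offsets are included deliberately -- a field that moved is as
    dangerous to a raw-offset consumer as a field that was renamed.
    """
    fields = re.findall(r"field:([^;]+);", format_text)
    joined = "|".join(f.strip() for f in fields)
    return hashlib.sha256(joined.encode()).hexdigest()[:16]

SAMPLE = """
name: sched_switch
format:
\tfield:char prev_comm[16];\toffset:8;\tsize:16;\tsigned:0;
\tfield:pid_t prev_pid;\toffset:24;\tsize:4;\tsigned:1;
"""
fp = format_fingerprint(SAMPLE)
# On a real box, read the text from
# /sys/kernel/debug/tracing/events/sched/sched_switch/format
```

The gate stores the fingerprint at build time and fails the pipeline when any target kernel's live format produces a different hash.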
Why the snapshot-and-re-validate pattern matters at fleet scale: a 200-node Razorpay cluster running across 3 regions and 2 kernel versions (5.15 LTS in production, 6.6 in canary) is one tracepoint format change away from a tracer that produces lies on canary nodes. The canary catches it — that's what canaries are for — but only if the tracer's CI pipeline reads the tracepoint format from each target kernel and fails the build on a mismatch. Without that gate, the only signal is "the dashboard from the canary nodes does not match the dashboard from the production nodes", which is the worst kind of bug to debug at 02:00 IST.
Reading the kprobe blacklist
Run sudo cat /proc/kallsyms | grep -i kprobe to see the kprobe machinery itself, then sudo cat /sys/kernel/debug/kprobes/blacklist for the actual no-attach list. On a 6.6 kernel the blacklist runs about 60 lines and includes do_int3, kprobe_int3_handler, entry_SYSCALL_64, swapgs_restore_regs_and_return_to_usermode, and a smattering of NMI-handling functions. The reason each is on the list is the same: attaching a kprobe to it would either cause infinite recursion (kprobe handler triggering another kprobe at the same site) or run in a context where the kernel cannot safely take a trap (NMI handlers, very early boot, the IRET path). The list is short because the kernel's tracing maintainers have aggressively narrowed it; in 2018 the equivalent list was three times longer.
A practical exercise: sudo bpftrace -l 'kprobe:tcp_*' lists every kprobe-attachable function whose name starts with tcp_. On a 6.6 kernel that's roughly 280 entries. Picking one (say tcp_rcv_established) and running sudo bpftrace -e 'kprobe:tcp_rcv_established { @ = count(); } interval:s:5 { print(@); clear(@); }' shows you the per-second rate of incoming TCP segments on the box. The same pattern for any other family (sched_*, __alloc_*, vfs_*) gives you a fast survey tool for "what does the kernel do most often". This kind of survey is the first step in any tracing investigation: before you can know what's slow, you have to know what's even running.
Why kretprobes are slower than kprobes
A kprobe fires at function entry; a kretprobe fires at function return. The implementation difference is severe: kretprobes work by attaching a kprobe at the entry, modifying the return address on the stack to point at a kernel trampoline, and firing the user's probe when the trampoline is reached. This is roughly twice the work of a plain kprobe (two trap events, plus the stack-mutation step), and on functions with many call sites (like kmalloc) the trampoline can become a serialisation point because the trampoline address is shared and updated under a lock. Empirically, kretprobes cost 150–200 ns per fire, vs 80–120 ns for plain kprobes.
The practical consequence: when a tracer needs to bracket a function (entry + exit, to compute latency), prefer the entry-only kprobe plus a tracepoint for the boundary if one exists, rather than kprobe + kretprobe. The tcp_sendmsg example in the script above used a kretprobe because there is no exit tracepoint for it; the syscall example used sys_exit_sendto because that tracepoint exists and is half the cost. The general rule: tracepoints > kprobes > kretprobes, in increasing cost. Mix and match by preferring the cheapest probe that gives you the timing boundary you need.
How fentry/fexit replaces kretprobe (Linux 5.5+)
A modern alternative, available on Linux 5.5+ with BTF support, is the fentry / fexit probe pair. These attach using BPF trampolines instead of int3 traps and are roughly 10× faster than kprobes — on the order of 8–15 ns per fire, comparable to a function call. They are also typed: the eBPF program can read function arguments by name rather than by register position, which removes a class of bugs around calling-convention assumptions. In bpftrace you write fentry:tcp_sendmsg { @start[tid] = nsecs; } instead of kprobe:tcp_sendmsg { @start[tid] = nsecs; } (older bpftrace releases spell the same probe type kfunc); the output is identical but the overhead drops by an order of magnitude.
The reason kprobes still dominate production tooling despite this is that fentry/fexit requires kernel BTF to be enabled at compile time (CONFIG_DEBUG_INFO_BTF=y), which old enterprise kernels and some hardened distributions do not ship. Amazon Linux 2 (5.10 kernel without BTF) does not support fentry; Amazon Linux 2023 (6.1 with BTF) does. For a fintech with mixed EKS node pools, the tracer has to fall back to kprobe on the older nodes. The migration story across 2024–2026 has been: write tooling that prefers fentry, falls back to kprobe, and flags in its output which mechanism it used so you know whether the latency numbers are biased by 80 ns of probe overhead or 10 ns. By 2027 most production fleets will have BTF-enabled kernels and fentry will be the default; until then, kprobe is the reliable lowest common denominator.
The Heisenbug problem — when the probe changes the answer
Every observability tool faces some version of the Heisenberg problem: measuring the system perturbs the system. For tracing the perturbation is concrete and quantifiable, and it occasionally crosses from "noise" into "the probe makes the bug disappear". Three cases worth knowing.
The first is timing-sensitive races. A probe that adds 80 ns to a function changes the relative timing of that function and any other concurrent work. If the bug being investigated is a race between two threads that both call tcp_sendmsg, and the race window was originally 100 ns, the delay the probe adds can widen the window and make the race more likely — or, if the probe happens to delay the thread that was about to win the race, narrow it and make the race less likely. Either way, the histogram you measure with the probe attached is not the same distribution as the histogram you would measure without it. The fix is comparison: run the workload with the probe attached, then with the probe detached, and compare the aggregate latency. If the comparison shows a >5% difference in p99, the probe is perturbing the system enough to bias your conclusions.
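The with/without comparison is mechanical enough to script. A minimal Python sketch — hypothetical helper names, with the 5% threshold taken from the rule above — that flags when the probed p99 diverges enough to bias a conclusion:

```python
def p99(samples):
    """99th percentile by rank; good enough for a quick comparison."""
    xs = sorted(samples)
    return xs[min(len(xs) - 1, (len(xs) * 99) // 100)]

def probe_bias_acceptable(with_probe, without_probe, threshold=0.05):
    """True if the probed p99 is within `threshold` of the unprobed p99."""
    baseline = p99(without_probe)
    return abs(p99(with_probe) - baseline) / baseline <= threshold

baseline = list(range(1000))            # latencies without the probe, in µs
probed = [x + 5 for x in baseline]      # probe adds a small constant delay
print(probe_bias_acceptable(probed, baseline))   # True: ~0.5% shift at p99
```

Swap the synthetic lists for two real captures of the same workload and the function becomes the sign-off check before a tracer-driven conclusion ships.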
The second is CPU frequency scaling. Modern CPUs run at boost frequencies when load is light and at base frequencies when load is heavy. Attaching a tracer to a hot function increases CPU load by a few percent, which on a turbo-aware part can drop the boost frequency by one bin and increase per-instruction latency by 8–15%. The histogram appears to grow even though the kernel is doing the same work. The mitigation is to pin the CPU frequency before tracing: sudo cpupower frequency-set --governor performance keeps the cores at their maximum frequency for the duration. On EC2 the equivalent lever is processor state control, which AWS exposes only on a subset of instance types (broadly, the largest sizes in a family and .metal instances); on smaller sizes the frequency is managed below the guest and cannot be pinned.
The third is scheduler interaction. A heavy tracer can cause the scheduler to migrate the traced process across cores more often, because the tracer's own CPU load competes for the same core. The migration adds cache-cold reload costs that look like syscall slowness in the histogram but are actually scheduling artefacts. The mitigation is to pin the traced process to dedicated cores via taskset or cgroup cpuset, isolating it from the tracer's CPU footprint.
The composite advice: any time a tracer's output is going to drive a production decision (rolling out a fix, capacity-planning a fleet, signing off on a feature flag), run the with-tracer / without-tracer comparison and report both. Tracing without that comparison is measuring with an unknown bias.
The cost-of-tracing math, in production terms
A practical question that comes up in every code review of a tracing-heavy tool: "what is this going to cost in production?" The answer has three components, and being able to do the math in your head before the review is what separates a tracer that ships from one that gets sent back.
Probe firing cost. As above: 80 ns/fire for kprobe, 50 ns/fire for tracepoint, 10 ns/fire for fentry on BTF-enabled kernels. Multiply by the per-second fire rate (measured with the 5-second survey) to get CPU-microseconds per second per core. Divide by 10,000 (microseconds in 1% of a CPU-second) to get the percent of one core. A kprobe:tcp_sendmsg on a Razorpay payment-gateway pod doing 80,000 RPS is 80,000 × 80 ns = 6.4 ms/s = 0.64% of one core, or about 0.04% of a 16-core box. Acceptable.
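That arithmetic is worth encoding once so reviews don't redo it by hand. A small Python sketch (hypothetical function name) reproducing the worked example:

```python
def probe_cost_pct(fires_per_sec, ns_per_fire, ncores=1):
    """Probe-firing overhead as a percent of total CPU."""
    busy_us_per_sec = fires_per_sec * ns_per_fire / 1_000.0  # ns -> µs per second
    pct_of_one_core = busy_us_per_sec / 10_000.0             # 10,000 µs = 1% of a core-second
    return pct_of_one_core / ncores

# kprobe:tcp_sendmsg at 80,000 fires/s, 80 ns/fire:
print(probe_cost_pct(80_000, 80))      # 0.64  (% of one core)
print(probe_cost_pct(80_000, 80, 16))  # 0.04  (% of a 16-core box)
```

The same function answers the Karan opener: 240,000 fires/s at 80 ns is just under 2% of one core — still cheap enough to attach across a market open.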
Map update cost. Every probe firing that updates a BPF_HASH pays for one hash lookup, one atomic increment, and a possible cache-line bounce if the hash bucket is owned by a different core. The hash lookup itself is roughly 30 ns on a warm bucket; the atomic increment is 5–15 ns depending on contention; cross-core contention can spike that to 200–500 ns. The mitigation is BPF_PERCPU_HASH for counter-only workloads — one bucket per CPU, no atomics, summed in userspace at readout. The cost drops to roughly 35 ns/fire and the contention disappears.
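The per-CPU trick is easier to see in a userspace model than in kernel source. A Python sketch of the BPF_PERCPU_HASH readout semantics — an illustration of the idea, not the kernel data structure — in which each CPU writes only its own slot and the consumer sums all slots at read time:

```python
from collections import defaultdict

class PerCpuCounter:
    """Model of a per-CPU counter map: local increments, summed at readout."""
    def __init__(self, ncpus):
        self.slots = [defaultdict(int) for _ in range(ncpus)]

    def inc(self, cpu, key, n=1):
        self.slots[cpu][key] += n     # CPU-local write: no atomics, no
                                      # cross-core cache-line bounce

    def read(self, key):
        return sum(s[key] for s in self.slots)   # one pass at readout

c = PerCpuCounter(ncpus=4)
c.inc(0, "tcp_sendmsg", 10)
c.inc(3, "tcp_sendmsg", 5)
print(c.read("tcp_sendmsg"))   # 15
```

The readout is the only place the slots meet, which is why the contention disappears: the hot path never touches another core's memory.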
Userspace consumer cost. For ringbuf-based per-event tools, every event reaching userspace costs 1–10 microseconds of Python (or 100–500 ns of C) per event. At 10,000 events/s the cost is 10–100 ms/s of userspace CPU per core, or 1–10% — not negligible. For aggregation-only tools (BPF map, no ringbuf) the userspace cost is one read at script-end and is invisible.
The composite cost for a typical bpftrace one-liner is dominated by the probe firing if the script is aggregation-only, and by the userspace consumer if the script emits per-event records. The transition between the two regimes is around 50,000 events/s. Below that, per-event tools are cheap; above it, they are not.
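The two regimes fall out of one pair of formulas. A Python sketch — hypothetical names, with the per-fire and per-event figures being this chapter's estimates — that returns both cost components so you can see which one dominates at a given event rate:

```python
def tracer_cost_pct(events_per_sec, probe_ns=80, consumer_us_per_event=None):
    """Return (probe %, consumer %) of one core.
    consumer_us_per_event=None models an aggregation-only script
    with no per-event delivery to userspace."""
    probe_pct = events_per_sec * probe_ns / 1e9 * 100.0
    consumer_pct = (events_per_sec * consumer_us_per_event / 1e6 * 100.0
                    if consumer_us_per_event is not None else 0.0)
    return probe_pct, consumer_pct

# Per-event Python consumer at 10,000 events/s: probe ~0.08%, consumer ~3%.
print(tracer_cost_pct(10_000, consumer_us_per_event=3.0))
# Aggregation-only at the same rate: probe cost only.
print(tracer_cost_pct(10_000))
```

Even at a modest 10,000 events/s, a per-event Python consumer costs ~40× more than the probes themselves — which is why the 50,000 events/s transition point is about the consumer, not the kernel.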
Reproduce this on your laptop
# Reproduce the syscall-tracing demo on Linux (5.15+ recommended).
sudo apt install bpftrace linux-tools-common linux-tools-generic linux-headers-$(uname -r)
python3 -m venv .venv && source .venv/bin/activate
pip install hdrhistogram   # PyPI package providing the hdrh module
# Run a workload that calls sendto() at high rate (e.g. iperf3 client):
iperf3 -c speedtest.aarnet.edu.au -t 60 &
# In another terminal, attach the tracer:
sudo python3 trace_sendto_latency.py
# Compare: with --membind=local vs remote NUMA, the histogram shifts by ~30%
The reader who runs this on their laptop will see real numbers in the same shape as the sample output above — not identical (their hardware differs, their kernel version differs, their network path differs) but the same story: a process doing high-rate sendto shows a bimodal histogram with a fast bulk and a slower tail, and the tail's shape is the answer to most questions about why their service's p99 looks the way it does.
Three small follow-up exercises that build fluency. Exercise one: switch the script from tracepoint:syscalls:sys_enter_sendto to kprobe:__sys_sendto (the kernel-side entry function, present on x86_64). The output should be similar but slightly biased high (the kprobe has a higher per-fire cost). Comparing the two histograms on the same workload teaches the bias-vs-flexibility trade-off in concrete numbers.
Exercise two: add a kprobe:tcp_sendmsg to the script and report the latency of just the TCP-stack work, separate from the full syscall latency. The difference between the two histograms is the time spent outside tcp_sendmsg — argument validation, copy-from-user, socket-lock contention. On a contention-free run the difference is small (< 5 µs); on a contended run it can be most of the syscall.
Exercise three: pick a non-syscall kernel function (say, do_sys_poll or __alloc_skb) and run the same probe pattern. The result is per-function latency for any kernel work, regardless of which syscall triggered it — the building block for any custom kernel-level performance investigation.
Where this leads next
This chapter introduced the syscall and kernel-function probe families; the rest of Part 6 widens the lens. Chapter 43 (uprobes and USDT) covers the userspace counterpart — attaching to functions inside Postgres, OpenJDK, Python, or your own service binary. Chapter 44 (per-event delivery in production) is where the cost analysis from this chapter graduates to "what does it actually look like when the producer outpaces the consumer at 10M events/s". Chapter 45 (eBPF latency histograms in production) shows how the per-PID histogram pattern from trace_sendto_latency.py becomes a continuously-on observability stream rather than a one-shot investigation tool.
The single insight to carry forward: tracepoints and kprobes are not the same mechanism; the choice between them is the choice between a contract you can rely on for ten years and an attachment point that gives you flexibility now. For ad-hoc incident response either works; for shipped tooling, the syscall tracepoint family is the strongest contract Linux offers, the broader tracepoint family is stable-ish, and kprobes are explicitly not stable. Knowing which one you're using is what lets you tell, three years from now, whether the tracer's silence means "the bug is gone" or "the probe stopped firing because the kernel renamed the symbol".
The deeper habit to build is the question of what fires how often. Before any production tracing attachment, run a 5-second survey (bpftrace -e '<probe> { @ = count(); } interval:s:5 { print(@); }') to learn the fire rate. If the rate is below 10k/s, attach with confidence; if above 100k/s, pause and consider tracepoint over kprobe, fentry/fexit over kprobe/kretprobe, or in-kernel filtering before any per-event delivery. The Karan opener of this chapter is what fluency looks like at Zerodha scale: two probes attached for ninety seconds across the next market open answer a question that the application profiler had been failing to answer for weeks. The product is not the probe; the product is the speed at which a specific question about kernel behaviour becomes a runnable answer with real numbers behind it.
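The triage rule is small enough to keep next to the scripts themselves. A hypothetical Python checklist encoding the thresholds above:

```python
def survey_verdict(fires_per_sec):
    """Map a 5-second survey's fire rate to the attach/pause rule of thumb."""
    if fires_per_sec < 10_000:
        return "attach with confidence"
    if fires_per_sec <= 100_000:
        return "attach, but prefer the cheapest probe type that fits"
    return "pause: prefer tracepoint/fentry and filter in-kernel first"

print(survey_verdict(3_000))     # attach with confidence
print(survey_verdict(250_000))   # pause: prefer tracepoint/fentry and filter in-kernel first
```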
A final habit worth forming
Keep your tracing scripts under version control with the kernel version they were validated against. Every fintech ops team I have seen do this well has the same shape of repository — a tracers/ folder with one file per script, each starting with a comment block listing the kernel versions it has been tested on, the typical fire rate of its probes, the expected per-fire overhead, and a one-line description of the question it answers. The folder is the team's institutional memory of "what we know how to ask the kernel about", and on the night something goes wrong the on-call engineer's first move is to scroll through the folder looking for the closest existing script. The product, again, is not the individual script; it is the speed at which "we have seen this shape of bug before" turns into "we already have a probe for that".
References
- Brendan Gregg, BPF Performance Tools (Addison-Wesley, 2019) — chapters 4 and 13 cover kprobes, tracepoints, and syscall tracing in depth.
- Linux kernel kprobes documentation — the maintainers' description of attachment, blacklist, and trampoline mechanism.
- Linux kernel tracepoints documentation — the contract and patching mechanism behind static tracepoints.
- Andrii Nakryiko, "BPF CO-RE reference guide" — covers fentry/fexit and the BTF-driven probe family that supersedes kretprobes on modern kernels.
- bpftrace reference guide — the canonical reference for probe-type syntax, including the kprobe / kretprobe / tracepoint / fentry distinction.
- LWN: "An introduction to kprobes" — the foundational LWN article that introduced kprobes to the wider Linux community.
- /wiki/bcc-toolchain — the previous chapter, where the framework around kprobes and tracepoints lives.
- /wiki/bpftrace-the-awk-of-production — the brevity end of the eBPF tracing spectrum that this chapter's scripts use as a driver.