Tracepoints and dynamic instrumentation
It is 22:14 IST and Karan, on call for Zerodha Kite's order-matching service, is staring at the flame graph he captured 90 seconds ago. The fat box says match_order is 71% of CPU. He already knows that. What he needs to know is whether it is 71% because every call to match_order is uniformly slow, or because one call in a thousand takes 400 ms while the other 999 take 30 µs — and the slow ones happen to align with the 09:15 IST market open. A flame graph cannot answer that question; it tells you where time is spent in aggregate, never in distribution. Karan needs to attach a probe to the entry and exit of match_order, time the difference for every invocation for the next 60 seconds, bucket the differences into an HdrHistogram, and read the p99 and p99.99 directly. He cannot recompile the binary. He cannot restart the pod. He cannot ship a debug build. He needs the kernel to give him a live measurement of a function he did not instrument, on a binary he did not build today, with overhead small enough that the live trading session does not notice. That is what tracepoints, kprobes, uprobes, and USDT exist for.
Tracepoints are the kernel's static, stable observation points; kprobes and uprobes attach dynamically to any kernel or userspace symbol; USDT is the userspace equivalent of tracepoints — pre-placed, zero-cost markers a developer left for you to attach to. Together they turn a running binary into an instrument you can measure without recompiling, restarting, or pausing it. The cost is correctness work the verifier enforces and overhead measured in CPU cycles per probe firing; the win is per-call latency distributions, argument capture, and stack snapshots from production at the moment the bug fires.
Four kinds of probe — what each one costs and when each one fits
A modern Linux kernel exposes four probe families, and the choice between them determines whether your trace is a 5-line script or a 300-line debugging saga. The families differ in where they attach (kernel vs userspace), how they attach (statically pre-declared vs dynamically inserted), and what they cost per firing.
Tracepoints are the kernel's stable, hand-curated observation points. Every kernel release ships a few thousand of them — sched:sched_switch fires on every context switch, block:block_rq_complete fires when a block-I/O request completes, syscalls:sys_enter_openat fires on every openat syscall, tcp:tcp_retransmit_skb fires when the kernel retransmits a TCP segment. The kernel maintainers commit to keeping the field layout stable across versions, the attachment cost is 8–15 ns per firing because the tracepoint site is a pre-placed nop patched into a jmp only when at least one consumer is attached, and the field schema is queryable at runtime via /sys/kernel/debug/tracing/events/<subsystem>/<event>/format. Tracepoints are the right default for any kernel-level observation question — disk-I/O latency, scheduler decisions, syscall counts, page faults — because they are the answer the kernel community already agreed to maintain.
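Discovery is two read-only commands. A minimal sketch, assuming bpftrace is installed and tracefs is mounted at its default path:
# Which tracepoints does this kernel expose for a subsystem?
sudo bpftrace -l 'tracepoint:block:*'
sudo bpftrace -l 'tracepoint:syscalls:sys_enter_openat'
# Read the stable field schema before writing a script against it:
sudo cat /sys/kernel/debug/tracing/events/block/block_rq_complete/format
# Minimal consumer: block-I/O completion sizes as a log2 histogram for 10 seconds.
sudo bpftrace -e '
tracepoint:block:block_rq_complete {
    @sectors = hist(args->nr_sector);
}
interval:s:10 { exit(); }'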
Kprobes are the dynamic counterpart. You can attach a probe to any kernel function — kprobe:do_sys_open, kprobe:tcp_v4_rcv, kprobe:ext4_writepages — even if no tracepoint exists there. The mechanism is an int3 software-breakpoint instruction the kernel writes over the first byte of the target function; when execution reaches it, the trap handler runs your eBPF program before single-stepping the displaced instruction and resuming normally. The cost per firing is 70–300 ns, dominated by the trap-handler entry and exit. The catch is that kernel functions are not a stable ABI — do_sys_open was renamed do_sys_openat2 in 5.6, tcp_v4_rcv has been inlined and re-emerged across releases, and a kprobe script that worked on Ubuntu 20.04 may attach to nothing on Amazon Linux 2023. Always-current production scripts pin to specific kernel versions or use kfunc/fentry (the BPF-trampoline successor to kprobes) for ABI-stable kernel attach.
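The kprobe workflow therefore starts with a reachability check, because the symbol may simply not exist on this kernel. A sketch, with function names that are illustrative and version-dependent:
# Empty output means the symbol is missing (renamed or inlined) - pick another attach point.
sudo bpftrace -l 'kprobe:do_sys_openat2'
# Count file opens per process for 10 seconds via the dynamic probe:
sudo bpftrace -e '
kprobe:do_sys_openat2 {
    @opens[comm] = count();
}
interval:s:10 { exit(); }'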
USDT (Userspace Statically Defined Tracing) is the user-space mirror of tracepoints. The library or runtime author placed a nop instruction at a meaningful point — python:function__entry fires every Python frame entry, mysql:query__start fires on every MySQL query, node:http__server__request fires on every Node HTTP request handler — and the kernel patches the nop to a trap only when a consumer attaches. Idle USDT cost is roughly 1–2 ns (the nop), and active cost is comparable to kprobes (70–300 ns) plus whatever your handler does. The contract is the same as tracepoints: the developer who placed the probe is committing to keep its arguments stable, so a USDT-based script written today will keep working on tomorrow's release of the library if the developer holds up their end. PostgreSQL, MySQL, OpenJDK, CPython, Node.js, libpython, libc-malloc, and OpenSSL all ship USDT probes.
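Discovering what a binary ships is the same one-liner pattern; the python3 path below is an assumption, substitute whatever binary or library you care about:
# Which USDT probes does this binary ship? Two views of the same ELF metadata:
sudo bpftrace -l 'usdt:/usr/bin/python3:*'
readelf -n /usr/bin/python3 | grep -A2 stapsdt | head -30
# The probe names survive strip because they live in the .note.stapsdt note
# section, not in the symbol table.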
Uprobes are the dynamic counterpart for userspace. You name a binary path and a symbol — uprobe:/usr/lib/x86_64-linux-gnu/libssl.so.3:SSL_read, uprobe:/opt/zerodha/match_engine:match_order — and the kernel attaches an int3 to that virtual address inside the target's address space. Cost is 1–3 µs per firing, which is an order of magnitude worse than kprobes because the trap crosses the userspace/kernel boundary in both directions. Uprobes are also brittle on optimised builds: the symbol you named may have been inlined into all callers and not exist as a callable function, the binary may be dlopened by some processes and statically linked into others, and Go binaries put their own complications on top because the Go runtime moves stacks around in ways the kernel's uprobe machinery did not originally anticipate (this is fixed in Go 1.17+ and modern kernels but still trips teams running older toolchains).
Why the cost gap matters: at Zerodha order-matching scale (200,000 messages/second per pod during market open), a uprobe at 2 µs per firing on the message-handler hot path adds 400 ms of CPU per second per core — a 40% overhead that is unacceptable. The same observation via a tracepoint or kfunc on an adjacent kernel-side event (the recvmsg syscall return) costs 10 ns × 200k = 2 ms per second per core, a 0.2% overhead that disappears in the noise. Picking the right probe family is not a stylistic choice; on hot paths it is the difference between "tracing in production" and "tracing brought down production". When the function you want to time is hot enough that probe cost dominates, change the question — measure once per N invocations using a sampling filter, or move the probe to a colder adjacent boundary (the syscall, the next-stage queue write) where you can reconstruct the timing from the kernel-side signal.
The flowchart for picking a probe: if a tracepoint or USDT exists at the boundary you care about, use it; if not, use kfunc/fentry on kernel paths or uprobe on userspace paths, knowing you have signed up for ABI fragility. The probe family is not the experiment — it is the instrument; pick the one with the lowest cost and highest stability that still observes what you need.
A live latency histogram from a uprobe — the artefact that closes the question Karan started with
Karan's question — is match_order slow on every call or only the tail? — is answered with a dozen-line bpftrace program driven from a Python orchestrator. The script attaches a uprobe at the entry and a uretprobe at the return of match_order, computes the elapsed time per call, buckets the values into a log2 histogram, prints the result every 10 seconds, and stops cleanly after a duration the operator picks.
# trace_match_order.py - per-call latency distribution from a live binary
# Run inside the sp-profiler sidecar from the previous chapter:
#   kubectl debug -it zerodha-match-7c8-x2k -n trading-prod \
#     --image=internal/sp-profiler:v3 --target=app -- \
#     python3 trace_match_order.py --duration 60 --binary /opt/zerodha/match_engine
#
# Prereqs (already in the sidecar): bpftrace, python3, regex stdlib
import argparse
import re
import subprocess
import time
from collections import defaultdict

PROG_TEMPLATE = r"""
uprobe:{binary}:match_order {{
    @start[tid] = nsecs;
}}
uretprobe:{binary}:match_order /@start[tid]/ {{
    @lat_ns = hist(nsecs - @start[tid]);
    delete(@start[tid]);
}}
interval:s:{interval} {{
    print(@lat_ns);
    clear(@lat_ns);
}}
"""

def run_bpftrace(binary: str, duration: int, interval: int) -> str:
    prog = PROG_TEMPLATE.format(binary=binary, interval=interval)
    cmd = ["bpftrace", "-q", "-e", prog]
    t0 = time.time()
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                         text=True, bufsize=1)
    out_lines = []
    try:
        while time.time() - t0 < duration:
            line = p.stdout.readline()
            if not line:
                break
            out_lines.append(line.rstrip())
            print(line, end="")            # stream to operator
    finally:
        p.terminate()
        p.wait(timeout=5)
    return "\n".join(out_lines)

# bpftrace prints bucket bounds with K/M/G suffixes, e.g. "[1K, 2K)  34218 |@@|"
_SUFFIX = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}
_LINE_RE = re.compile(r"\[(\d+)([KMG]?),\s*(\d+)([KMG]?)\)\s+(\d+)")

def parse_buckets(text: str) -> dict:
    """Parse @lat_ns histogram blocks into {(low_ns, high_ns): count}."""
    buckets = defaultdict(int)
    for line in text.splitlines():
        m = _LINE_RE.search(line)
        if m:
            lo = int(m.group(1)) * _SUFFIX[m.group(2)]
            hi = int(m.group(3)) * _SUFFIX[m.group(4)]
            buckets[(lo, hi)] += int(m.group(5))
    return buckets

def percentiles(buckets: dict, ps=(50, 90, 99, 99.9, 99.99)) -> dict:
    """Approximate percentiles from log2 buckets - use bucket-high as upper bound."""
    total = sum(buckets.values())
    if total == 0:
        return {p: None for p in ps}
    items = sorted(buckets.items())        # sorted by bucket low bound
    cum, out = 0, {}
    remaining = {p: total * p / 100.0 for p in ps}
    for (lo, hi), n in items:
        cum += n
        for p, target in list(remaining.items()):
            if cum >= target:
                out[p] = hi                # log2 upper bound
                del remaining[p]
    for p in remaining:
        out[p] = items[-1][0][1]
    return out

def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--binary", required=True)
    ap.add_argument("--duration", type=int, default=60)
    ap.add_argument("--interval", type=int, default=10)
    a = ap.parse_args()
    print(f"# attaching uprobe+uretprobe on {a.binary}:match_order")
    print(f"# capturing for {a.duration}s, printing every {a.interval}s")
    raw = run_bpftrace(a.binary, a.duration, a.interval)
    pct = percentiles(parse_buckets(raw))
    print("\n# summary across the full window:")
    for p, v in pct.items():
        if v is None:
            print(f"  p{p:<5} (no samples captured)")
        else:
            print(f"  p{p:<5} <= {v:>10} ns ({v/1000:.1f} us)")

if __name__ == "__main__":
    main()
# Sample run (60 seconds against a synthetic load, 110k calls captured):
# attaching uprobe+uretprobe on /opt/zerodha/match_engine:match_order
# capturing for 60s, printing every 10s
@lat_ns:
[1K, 2K)           34218 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                      |
[2K, 4K)           58711 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4K, 8K)           12044 |@@@@@@@@@@                                          |
[8K, 16K)           3120 |@@                                                  |
[16K, 32K)           842 |                                                    |
[32K, 64K)           201 |                                                    |
[64K, 128K)           68 |                                                    |
[128K, 256K)          19 |                                                    |
[256K, 512K)           7 |                                                    |
[512K, 1M)             3 |                                                    |
[1M, 2M)               1 |                                                    |
[2M, 4M)               1 |                                                    |
# summary across the full window:
  p50    <=       4096 ns (4.1 us)
  p90    <=       8192 ns (8.2 us)
  p99    <=      32768 ns (32.8 us)
  p99.9  <=      65536 ns (65.5 us)
  p99.99 <=     524288 ns (524.3 us)
Three lines tell the story.
- uprobe:{binary}:match_order { @start[tid] = nsecs; } stores the entry timestamp keyed by thread id. The @start[tid] map is a per-thread scratchpad that lets the entry and return probes find each other even in a multithreaded process where different cores are running match_order simultaneously.
- uretprobe:.../match_order /@start[tid]/ { @lat_ns = hist(nsecs - @start[tid]); delete(@start[tid]); } is the entire measurement. The /@start[tid]/ filter drops the firing if the entry probe missed (e.g. because the process was already in match_order when the probe attached); the delete reclaims the map slot to bound memory; hist() is bpftrace's built-in log2 bucketer that holds counts in a kernel BPF map and prints the histogram on print(@lat_ns).
- The interval:s:10 block prints and clears every 10 seconds, so the operator sees the distribution evolve over time. A latency cliff that opens up at minute 4 is visible as a tail growing in successive prints; a steady-state distribution looks the same in every print.
The interpretation closes Karan's question. The body of the distribution is in the 1–8 µs range — the median order match takes 4 µs, in line with the team's design budget. The tail is what the flame graph hid. 99.99% of calls finish within roughly half a millisecond, but the slowest call in the 60-second window took between 2 and 4 ms (the [2M, 4M) bucket), and a handful of calls landed between 256 µs and 1 ms — orders of magnitude slower than the median. The flame graph could not have shown this because flame graphs aggregate; they answer "where is time spent" not "what is the distribution". With the tail visible, Karan can now correlate the slow calls to specific request features — symbol IDs, order sizes, side (buy/sell), match-book depth — by adding kstack or args to the probe, which is the next chapter's territory.
Why log2 bucketing is the right default at this scale: at 200,000 calls per second, storing a precise nanosecond per call would consume 1.6 MB/s of map memory plus the kernel-userspace transfer cost. Log2 bucketing collapses each call to one of ~30 buckets (0–1 ns, 1–2 ns, 2–4 ns, ..., 16 s–32 s), which is enough resolution to read p50 / p90 / p99.99 cleanly from the printed output, and the per-firing cost is one CLZ instruction plus an atomic counter increment — under 30 ns. HdrHistogram (the Java/Python library) gives finer resolution (typically two significant figures throughout the range) at higher per-firing cost; for in-kernel BPF where memory and instruction count both matter, log2 is the standard.
The same pattern works for any function in any binary. Replace /opt/zerodha/match_engine:match_order with /usr/lib/x86_64-linux-gnu/libpq.so.5:PQexec and you get per-call latency for every PostgreSQL query the process issues; with /usr/lib/libssl.so.3:SSL_read and you get per-read TLS latency including the kernel buffer wait; with the python3.11 binary's _PyEval_EvalFrameDefault (though where Python's USDT probes are available, prefer those) and you get per-frame Python latency at the interpreter level. The probe is the same shape; only the symbol changes.
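As a concrete instance of the swap, a sketch of the libpq variant, assuming the library path shown above and a process that loads it dynamically:
# Per-call latency for every PostgreSQL query issued through libpq's PQexec:
sudo bpftrace -e '
uprobe:/usr/lib/x86_64-linux-gnu/libpq.so.5:PQexec { @q[tid] = nsecs; }
uretprobe:/usr/lib/x86_64-linux-gnu/libpq.so.5:PQexec /@q[tid]/ {
    @query_lat_us = hist((nsecs - @q[tid]) / 1000);
    delete(@q[tid]);
}
interval:s:10 { print(@query_lat_us); clear(@query_lat_us); }'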
When a tracepoint beats every fancier option — kernel-side measurement
Sometimes the right answer is to not touch the userspace binary at all and instead read the kernel's own observations. A common Indian-fintech debugging task is "are the slow Razorpay UPI requests slow because of the network or because of the application?" A laptop intuition reaches for ss, tcpdump, or a userspace timer. The production answer is usually a tracepoint on tcp:tcp_probe or tcp:tcp_retransmit_skb plus block:block_rq_complete for the disk side — all kernel-side, all stable across releases, all measuring the kernel's own view of what the application's TCP and disk operations actually did.
# How long is each TCP send blocked on the network vs the application?
# Tracepoints on tcp:tcp_probe expose snd_cwnd, snd_wnd, srtt, ssthresh per packet.
sudo bpftrace -e '
tracepoint:tcp:tcp_probe {
    @rtt_us = hist(args->srtt >> 3);   // srtt is in 1/8th-us units
    @cwnd = hist(args->snd_cwnd);
}
interval:s:30 {
    print(@rtt_us); print(@cwnd); clear(@rtt_us); clear(@cwnd); exit();
}'
# Sample output from a Razorpay payments-core pod under steady load:
@rtt_us:
[256, 512) 412 |@@@ |
[512, 1K) 4218 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[1K, 2K) 6481 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2K, 4K) 918 |@@@@@@@ |
[4K, 8K) 184 |@ |
[8K, 16K) 28 | |
[16K, 32K) 4 | |
@cwnd:
[8, 16) 1207 |@@@@@ |
[16, 32) 5481 |@@@@@@@@@@@@@@@@@@@@@@ |
[32, 64) 12428 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[64, 128) 843 |@@@ |
The first histogram is the smoothed round-trip time the kernel has measured for every connection in this pod; the second is the congestion window — the kernel's estimate of how much data it can have in flight. Together they bound the network's contribution to per-request latency. If @rtt_us is concentrated under 2 ms but the application's p99 is 720 ms, the network is not the explanation — go back to userspace probes. If @rtt_us has a fat tail past 32 ms or @cwnd is stuck under 16, the kernel itself is telling you the network is the reason. Either way the answer is grounded in the kernel's own measurement, not a userspace timer that conflates everything between send() and the next recv().
The deeper reason this pattern matters: a userspace timer can only measure what the application can see, which excludes everything the kernel did on the application's behalf — epoll_wait blocked because no data arrived, sendmsg blocked because the socket buffer was full, the TCP stack was waiting for an ACK, the disk-flush in fsync was queued behind a different process's writes. Every one of those is invisible to the application but visible to the kernel through tracepoints. Why this often surprises teams: the application's metric library reports a single request_duration_seconds and the team treats it as the truth. It is the truth from the application's vantage point only — it includes any time the application thread was blocked in the kernel but cannot decompose where the block went. Tracepoints decompose the block: sched:sched_switch shows when the thread was descheduled, block:block_rq_complete shows when the disk I/O finished, tcp:tcp_probe shows when the next TCP ACK arrived. The 720 ms p99 the application reports is some sum across these; the only way to attribute it back to its sources is kernel-side probes.
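A sketch of the first decomposition step: total blocked (off-CPU) time for one thread via sched:sched_switch. The thread id 4242 is a placeholder for whatever tid the orchestrator picked.
# prev_state != 0 means the thread blocked (I/O, lock, sleep) rather than being
# preempted while still runnable; the second probe fires when it is scheduled back in.
sudo bpftrace -e '
tracepoint:sched:sched_switch /args->prev_pid == 4242 && args->prev_state != 0/ {
    @off[args->prev_pid] = nsecs;
}
tracepoint:sched:sched_switch /args->next_pid == 4242 && @off[args->next_pid]/ {
    @blocked_us = hist((nsecs - @off[args->next_pid]) / 1000);
    delete(@off[args->next_pid]);
}
interval:s:30 { print(@blocked_us); exit(); }'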
For Hotstar's IPL final, the relevant tracepoints during the 25M-concurrent-viewer peak were sock:inet_sock_set_state (TCP socket lifecycle for the CDN edge), block:block_rq_complete (disk reads on the segment cache), and sched:sched_stat_runtime (per-task CPU runtime as accounted by the scheduler). A debugger who reaches only for userspace timers misses all three signals; one who knows the tracepoint ABI sees the picture in fifteen minutes.
Probe overhead in production — the budget you actually have
Probe overhead is not zero, and the difference between a successful production trace and an outage is usually a careful budget rather than a clever script. The numbers you can trust on a modern kernel (Linux 6.x, x86_64, no PTI/MDS-related extra costs):
- Tracepoints: 8–15 ns per firing when active. Idle cost is the nop — effectively zero.
- kprobes: 70–300 ns per firing. Idle cost is zero (no consumer = no breakpoint).
- kfunc / fentry (the BPF trampoline): 30–80 ns per firing — half of kprobe cost because they bypass the trap handler.
- uprobes: 1–3 µs per firing. Idle cost is zero.
- USDT: 1–2 ns idle (one nop); 70–300 ns when an attached consumer triggers the trap.
The budget rule of thumb: pick a probe whose total cost (firings/sec × per-firing cost) is under 1% of one CPU core, and you will not visibly perturb production. At 1 million firings per second on a tracepoint, the cost is 1M × 12 ns = 12 ms per second per core = 1.2% of one core — fine. The same volume on a uprobe is 1M × 2 µs = 2000 ms per second = 200% of one core — catastrophic. Either reduce the work per firing (capture the expensive data for only one call in 100, or filter on a selective predicate before the histogram update) or move the probe to a cheaper family.
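A sketch of the first option. The tracepoint below still fires on every call; what the filter avoids is the expensive part of the handler, here a user-stack capture, which runs only once per 100 firings per thread. The syscall chosen is illustrative.
# Sample the costly work, not the probe: capture a user stack on 1-in-100 firings.
sudo bpftrace -e '
tracepoint:syscalls:sys_enter_recvmsg {
    @n[tid]++;
    if (@n[tid] % 100 == 0) {
        @stacks[ustack] = count();
    }
}
interval:s:30 { exit(); }'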
The eBPF verifier protects you from a different class of failures — infinite loops, unbounded memory, unprivileged kernel reads — but not from overhead. A verified program that fires 10 million times per second will brown out the system as effectively as one that crashed it. Test your probe against representative load in staging, read bpftool prog show for instruction counts and run-time, and watch top for kernel-side CPU climb on the cores running the probe. The first time you trace a hot path in production is not the time to learn this lesson.
A second-order overhead concern is map memory. A per-thread hist() keyed by tid (one ~30-bucket histogram per thread) in a process with 4,000 threads costs 4000 × 30 buckets × 8 bytes = ~1 MB of kernel memory. A hist() keyed by (comm, pid, syscall) tuples in a busy system can balloon to hundreds of megabytes in seconds. Set BPFTRACE_MAP_KEYS_MAX explicitly when scripting unbounded keys, and prefer lhist() (linear histogram with explicit bounds) over unbounded hist() when the value range is known.
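A sketch of the bounded form: if the plausible range of match latencies is already known, lhist fixes the bucket count, and the map-keys cap is an explicit ceiling on top. The environment-variable name is the one the bpftrace reference guide documents; the binary path is the one from earlier.
# Linear histogram: 0-1000 us in 20 us steps = 50 fixed buckets, plus an explicit
# ceiling on total map keys as a safety net.
sudo BPFTRACE_MAP_KEYS_MAX=65536 bpftrace -e '
uprobe:/opt/zerodha/match_engine:match_order { @start[tid] = nsecs; }
uretprobe:/opt/zerodha/match_engine:match_order /@start[tid]/ {
    @lat_us = lhist((nsecs - @start[tid]) / 1000, 0, 1000, 20);
    delete(@start[tid]);
}'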
Common confusions
- "A tracepoint and a kprobe are the same thing." They are not. A tracepoint is pre-declared by the kernel community with a stable field schema; a kprobe attaches to any function symbol and gets whatever the function's calling convention exposes (registers on x86-64). Tracepoints survive kernel upgrades; kprobes break when the function is renamed or inlined. Use a tracepoint when one exists at the boundary you care about.
- "USDT is just kernel-side tracing for userspace." USDT probes live in userspace memory but the machinery that fires them goes through the kernel — the
int3trap on the patchednopis handled by the kernel's uprobe infrastructure, which delivers the event to your eBPF program. So USDT is "userspace-declared, kernel-mediated", which is exactly why a stripped binary still exposes USDT (the probe metadata is in the ELF.note.stapsdtsection) but does not expose arbitrary uprobes (those need symbols, which strip removes). - "Tracing in production is dangerous because it changes timing." It changes timing by the per-firing cost of the probe. A tracepoint at 12 ns is below the noise floor of every userspace metric you can read; a uprobe at 2 µs on a hot loop is observably perturbing. The danger scales with overhead, not with "tracing"; well-budgeted tracing is safer than running a debugger or a profiler that pauses the process.
- "I can use uprobes on Go binaries the same way as on C binaries." You can on modern Go (1.17+) and modern kernels, but the Go runtime moves goroutine stacks during garbage-collected memory motion, which classical uprobe
pt_regs-based unwinding handles imperfectly. For Go-specific instrumentation, prefer the runtime-native pprof endpoints, USDT probes that the Go ecosystem ships, or eBPF programs that use the Go runtime's well-known offset table (go-finderand similar tools). - "All four probe families need root." They need
CAP_BPFandCAP_PERFMON(Linux 5.8+) or root on older kernels. Rootless tracing is possible inside a sidecar with the right capabilities granted (the previous chapter's sp-profiler image is configured exactly this way), so production policy should be "grant CAP_BPF and CAP_PERFMON to a dedicated debug image" rather than "no tracing in production". - "The bpftrace one-liner is the production answer." A bpftrace one-liner is the prototype. The production answer is the same probe wrapped in a Python orchestrator (slowest-pod selection, duration cap, redaction, output upload — the previous chapter's pattern), with explicit overhead budgeting and a kill-switch the on-caller can hit. Treat one-liners as proof-of-concepts; treat the orchestrated version as the deliverable.
Going deeper
kfunc and fentry — kprobes without the trap
Modern kernels (Linux 5.5+) added the BPF trampoline mechanism, exposed to bpftrace as kfunc: and to libbpf as fentry/fexit. Instead of an int3 trap at the function entry, the kernel rewrites the function prologue to call a small JIT-compiled stub directly. Cost drops from 70–300 ns to 30–80 ns per firing, the BPF program receives typed arguments instead of raw pt_regs, and (most importantly) the kernel commits to a stable BTF schema for the arguments — so a kfunc:tcp_v4_rcv script does not break across kernel versions the way the equivalent kprobe does.
The trade is that kfunc only attaches to functions reachable through the BPF trampoline list, which is most of the kernel but not literally every function — some heavily inlined or notrace-marked functions are excluded. For production tracing of stable kernel ABIs, prefer kfunc; fall back to kprobe only when the symbol is not kfunc-reachable. Brendan Gregg's BPF Performance Tools second edition (in progress) and the Cilium team's libbpf documentation are the canonical references for the fentry pattern.
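A sketch of the kfunc form in bpftrace, assuming a BTF-enabled kernel (5.5+) and that the function is trampoline-reachable; the -l check confirms reachability before you attach:
# Confirm reachability, then time tcp_recvmsg with the trampoline-based probes:
sudo bpftrace -l 'kfunc:tcp_recvmsg'
sudo bpftrace -e '
kfunc:tcp_recvmsg { @t[tid] = nsecs; }
kretfunc:tcp_recvmsg /@t[tid]/ {
    @recv_us = hist((nsecs - @t[tid]) / 1000);
    delete(@t[tid]);
}
interval:s:30 { print(@recv_us); exit(); }'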
USDT in CPython — what the runtime gives you for free
CPython compiled with --with-dtrace (the default in distribution packages on Fedora, Amazon Linux 2023, and most Pythons via manylinux) exposes USDT probes for function__entry, function__return, line, gc__start, gc__done, import__find__load__start, and import__find__load__done. Attaching a bpftrace script to usdt:python:function__entry tells you every Python frame entry — function name, filename, line number — without a single change to the application code, with overhead in the 70–300 ns range per frame entry.
Real numbers: at a CPython service running 50,000 frame entries per second per core, the USDT-attached histogram costs ~1% of one core. At 500,000 per core (a hot async service), it climbs to 10% — still tolerable for a 60-second incident-time capture but not for steady-state monitoring. The right framing: USDT is your incident-time answer for CPython; py-spy (covered earlier in Part 5) is your steady-state continuous-profiling answer. Both walk the same PyThreadState; py-spy samples it from outside, USDT lets the kernel tell you about every entry.
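A sketch of the incident-time capture, assuming a distribution python3 with the probes compiled in; the binary path and the pgrep-based pid selection are placeholders for whatever the orchestrator chooses:
# Top 20 most-entered Python functions over 30 seconds.
# CPython's function__entry arguments: arg0 = filename, arg1 = function name.
sudo bpftrace -p $(pgrep -n python3) -e '
usdt:/usr/bin/python3:python:function__entry {
    @calls[str(arg0), str(arg1)] = count();
}
interval:s:30 { print(@calls, 20); exit(); }'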
Tracepoint stability and the format file
Every tracepoint has a queryable schema at /sys/kernel/debug/tracing/events/<subsystem>/<event>/format. For block:block_rq_complete it looks like:
name: block_rq_complete
ID: 1234
format:
field:dev_t dev; offset:8; size:4; signed:0;
field:sector_t sector; offset:16; size:8; signed:0;
field:unsigned int nr_sector; offset:24; size:4; signed:0;
field:int errors; offset:28; size:4; signed:1;
field:char rwbs[8]; offset:32; size:8; signed:1;
field:char comm[16]; offset:40; size:16; signed:1;
...
The kernel community committed to keeping these field names and offsets stable. A bpftrace script that uses args->sector for tracepoint:block:block_rq_complete will keep working across kernel upgrades; a kprobe-based script that reads the same value out of pt_regs will not. For long-lived production tracing pipelines (Cilium-style network observability, Falco-style security) tracepoints and BTF-typed kfuncs are the only sustainable substrate. The reason teams sometimes prefer kprobes despite this is that tracepoints exist only where the kernel community placed one; kprobes go anywhere. The right architectural posture: read tracepoints first, fall back to kfunc for ABI-stable kernel functions, and use kprobes only when both options are unavailable.
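The payoff is that correlation scripts written against args keep working across upgrades. A sketch of the classic issue-to-complete disk-latency measurement built only from stable tracepoint fields:
# Block-I/O latency from queue issue to completion, keyed by (device, sector):
sudo bpftrace -e '
tracepoint:block:block_rq_issue {
    @issued[args->dev, args->sector] = nsecs;
}
tracepoint:block:block_rq_complete /@issued[args->dev, args->sector]/ {
    @disk_lat_us = hist((nsecs - @issued[args->dev, args->sector]) / 1000);
    delete(@issued[args->dev, args->sector]);
}
interval:s:30 { print(@disk_lat_us); exit(); }'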
The Razorpay UPI debugging escalation ladder
Razorpay's payments-platform team published (Kubernetes Bangalore meetup, 2024) the escalation ladder they run during a UPI latency spike. The sequence:
- Tracepoints first — tcp:tcp_probe, tcp:tcp_retransmit_skb, block:block_rq_complete, sched:sched_switch — driven from a Python orchestrator that prints a side-by-side histogram every 10 seconds. Cost: under 1% of one core. Time-to-signal: 2 minutes.
- USDT on the application — usdt:libpq:query__start, usdt:libssl:read__start — to attribute network/disk wait to specific application stages. Cost: under 5% of one core. Time-to-signal: 5 minutes.
- kfunc on the suspect kernel path — typically kfunc:tcp_recvmsg or kfunc:do_sendfile — when the tracepoints suggest a kernel-side path that has no tracepoint. Cost: under 10% of one core. Time-to-signal: 10 minutes.
- uprobes on the application's own functions — only after steps 1–3 have located the suspect function. Cost: depends on call rate; the team budgets 20% of one core for a 60-second uprobe-driven trace. Time-to-signal: 15+ minutes.
The order is important. Reaching for uprobes first is a common rookie move that costs 10× more probe overhead and 10× more debugging time, because the tracepoints would have ruled out the network and disk in two minutes if they had been tried first. The escalation ladder is what experience looks like; documenting it in the runbook is how an organisation captures that experience for the next on-caller.
Reproduce this on your laptop
sudo apt install linux-tools-generic bpftrace bpfcc-tools
python3 -m venv .venv && source .venv/bin/activate
pip install hdrh
# Tracepoint demo - count syscalls by name for 10 seconds:
sudo bpftrace -e '
tracepoint:raw_syscalls:sys_enter {
    @[ksym(*(kaddr("sys_call_table") + args->id * 8))] = count();
}
interval:s:10 { exit(); }'
# Uprobe demo - time libssl reads on every TLS-using process:
sudo bpftrace -e '
uprobe:/lib/x86_64-linux-gnu/libssl.so.3:SSL_read { @s[tid] = nsecs; }
uretprobe:/lib/x86_64-linux-gnu/libssl.so.3:SSL_read /@s[tid]/ {
    @lat = hist(nsecs - @s[tid]); delete(@s[tid]);
}
interval:s:30 { print(@lat); exit(); }'
The first one-liner reads the kernel's stable tracepoint and prints a syscall-count histogram. The second attaches uprobes to a specific shared library and produces a per-call latency histogram for SSL_read — the same shape as Karan's match-engine probe, against a binary every Linux laptop has installed.
Where this leads next
The probe families covered here are the kernel-side instruments. The next chapters in Part 15 turn them into operational disciplines:
- /wiki/perf-record-and-perf-script-the-survival-kit — the kernel-side primitive underneath every flame-graph and tracepoint workflow, and the only tool that captures stack samples and tracepoint events in the same file format.
- /wiki/continuous-profiling-in-production — the always-on companion that makes incident-time tracepoints unnecessary for the cases continuous profiling already covers, and zooms-in for the cases it does not.
- /wiki/differential-flamegraphs — the visualisation that uses the same probes to compare two captures (before/after deploy, baseline/incident) and find the changed code path.
- /wiki/ebpf-architecture-verifier-jit-maps — the kernel-side machinery that makes all of this safe: the verifier proves your probe terminates and reads only memory it is allowed to, the JIT compiles the program to native code, and the maps connect kernel to userspace.
The arc is: a flame graph tells you where; a tracepoint or probe tells you when, how often, and how long; a continuous profiler turns those snapshots into a time-series; a differential flamegraph tells you what changed. Each chapter adds a dimension the previous one could not see. The teams that learn to compose all four — and put them in a runbook so the on-caller does not invent the workflow at 03:00 — are the teams that keep their UPI p99 under 200 ms when the rest of the country is paying for Diwali groceries.
References
- Brendan Gregg, BPF Performance Tools (2019) — the canonical reference for tracepoints, kprobes, uprobes, USDT, and the bpftrace/bcc toolchain.
- Linux kernel tracepoint documentation — the upstream spec for tracepoint stability and the format file.
- bpftrace reference guide — the language reference for one-liners and scripts, including hist, lhist, and interval.
- Linux uprobes documentation — the uprobe attachment ABI and the cost notes referenced in §4.
- USDT (SystemTap-compatible) probes — man dtrace — the userspace probe markup format, supported across CPython, OpenJDK, MySQL, PostgreSQL, Node.js, and OpenSSL.
- /wiki/flame-graphs-in-production — the previous chapter that this one extends with per-call distributions.
- /wiki/ebpf-for-latency-histograms — the eBPF-specific deep dive on log2 histograms and HdrHistogram-style precision in kernel space.