USDT and uprobes: userspace eBPF
Aditi runs the merchant-payouts batch at Razorpay. Every night at 23:30 IST, a 90-minute reconciliation job sweeps the previous day's settlements and writes adjustment rows into a 4-TB Postgres 15 cluster. Last Thursday the job's p99 query latency walked from 18 ms to 240 ms over forty minutes, then collapsed back. The kernel tracer she had attached — kprobe:tcp_sendmsg, tracepoint:block:block_rq_issue — was clean: no syscall slowness, no scheduler preemption, no I/O wait. The cost had moved up the stack into Postgres itself, into ExecHashJoin and LWLockAcquire and palloc, and from the kernel side those functions were invisible. Aditi needed a probe that sat inside the Postgres binary. She had two choices: attach a uprobe to a Postgres function by name (uprobe:/usr/lib/postgresql/15/bin/postgres:ExecHashJoin) or read one of the eight USDT tracepoints the Postgres maintainers had compiled into the binary (usdt:/usr/lib/postgresql/15/bin/postgres:postgresql:lock-wait-start). She tried the USDT first because it is the maintainer-blessed contract, and within ninety seconds her bpftrace one-liner showed the answer: 87% of the latency was lock-wait time on a single relation, payouts_2026_04, because the autovacuum worker was holding a ShareUpdateExclusive lock that the reconciliation job kept colliding with. The fix was a vacuum_freeze_table_age tweak. The diagnosis took ninety seconds because the right probe was in the right place.
A uprobe attaches to any non-stripped userspace function the same way a kprobe attaches to a kernel function: by replacing the first instruction with a trap. A USDT (Userland Statically Defined Tracepoint) is the userspace twin of a kernel tracepoint — a stable, pre-declared instrumentation point that the binary's maintainer compiled in. Choosing between them is the same trade-off as kprobe vs tracepoint, scaled to userspace: flexibility now, or a contract that survives the next package upgrade.
The boundary the kernel tracer cannot cross
Kernel tracing — kprobe, tracepoint:syscalls:*, bpftrace -e 'kfunc:...' — sees everything that happens below the syscall boundary. It cannot see what happens above it. When Postgres's query executor calls ExecHashJoin, no syscall fires; it is all userspace work, ring 3, no transition into the kernel until the executor needs to read a page from disk. If the latency is in the hash-table probe, in the qual evaluation, in the per-row datum copying — the kernel tracer is blind, in the same way the application profiler was blind below the syscall boundary in the previous chapter. The two blind spots are mirror images of each other. Together they cover the whole stack; alone, each misses half.
The userspace-side blind spot is operationally expensive because most of a service's wall-clock time is in userspace. A typical Postgres query at Razorpay scale spends 78% of its wall time in userspace (parse, plan, execute, format) and 22% in the kernel (TCP send, page-cache reads, fsync). For an in-memory query — one whose working set fits in shared_buffers — the userspace fraction climbs to 95%+. The kernel tracer can tell you that the query did 12 syscalls and they all returned in microseconds; it cannot tell you that the executor spent 230 ms in ExecHashJoin doing a tuple-by-tuple scan because the planner picked a hash-join over the hash-build relation. To see that, you need a probe inside the Postgres binary.
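The userspace/kernel split is measurable on any box before attaching a single probe: the kernel already accounts userspace ticks (utime) and kernel ticks (stime) per process in /proc/&lt;pid&gt;/stat. A minimal sketch, assuming the proc(5) field layout; the sample line in the usage note is fabricated for illustration:

```python
def user_kernel_split(stat_line: str) -> tuple[float, float]:
    """Return (userspace_fraction, kernel_fraction) of a process's CPU
    time from one /proc/<pid>/stat line. utime is field 14 and stime is
    field 15 per proc(5); the comm field may contain spaces and parens,
    so split after the last ')' — state is then index 0, utime index 11,
    stime index 12 of the remainder."""
    rest = stat_line.rsplit(")", 1)[1].split()
    utime, stime = int(rest[11]), int(rest[12])
    total = (utime + stime) or 1  # avoid division by zero for idle procs
    return utime / total, stime / total
```

Running this over `open(f"/proc/{pid}/stat").read()` for a busy Postgres backend gives the userspace fraction directly — a cheap first check on where the latency could even live.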
There are two ways to attach a probe to userspace code. uprobes are the userspace counterpart of kprobes: pick any non-inlined symbol in any non-stripped binary, attach a probe, and the kernel rewrites the binary's text segment in memory to trap into your eBPF program when that function is entered. USDT (Userland Statically Defined Tracepoints) are the userspace counterpart of kernel tracepoints: the binary's maintainer compiled in pre-declared tracepoints with documented argument shapes — postgresql:lock-wait-start, python:function__entry, mysql:query__exec__start — that you read with a usdt: probe and that the maintainer has committed to keeping stable across versions.
The taxonomy is the same as the kernel side: the dynamic family (uprobe + uretprobe) gives you any symbol but no stability contract; the static family (USDT) gives you a contract but only at the points the maintainer chose to instrument. The number gap is similar too. A modern Postgres 15 binary exposes 8 USDT tracepoints (tablespaces, lock-wait-start/end, query__execute__start/end, transaction-start/commit/abort). Run nm /usr/lib/postgresql/15/bin/postgres | wc -l and you get roughly 18,500 non-static symbols — every one of which is a potential uprobe target. The 8 stable points cover the well-trodden paths the maintainers cared enough to instrument; the 18,500 dynamic points cover everything else.
Why the cost gap matters at production fire rates: a uprobe attached to a libc function called once per Postgres row at 80,000 rows/s costs 80,000 × 150 ns = 12 ms/s of CPU per backend, or about 1.2% of one core. With 32 backends running in parallel, that is 38% of one core just for the tracer — almost certainly noticeable in your throughput numbers. The same probe as a USDT (when one exists) drops to roughly 0.6% per backend, or 19% of one core across 32 backends. The math says: pick USDT for any high-fire-rate userspace point where it exists; reserve uprobes for the cases where it does not.
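The arithmetic above generalises into a two-line helper worth keeping around when sizing a tracer; the 150 ns and 80 ns per-fire figures are this chapter's estimates, not hardware constants:

```python
def tracer_overhead_cores(fire_rate_hz: float, cost_ns: float,
                          processes: int = 1) -> float:
    """Cores of CPU consumed by probe overhead alone: fires/second times
    per-fire cost, summed across processes. 1.0 means one full core."""
    return fire_rate_hz * cost_ns * 1e-9 * processes
```

With the chapter's numbers: `tracer_overhead_cores(80_000, 150)` is 0.012 cores per backend (1.2%), and with `processes=32` it is 0.384 cores — the "38% of one core" figure in the text.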
A walking demonstration — what userspace tracing looks like through bpftrace and BCC
Below is a Python script that drives bpftrace to attach to the Postgres query__execute__start and query__execute__end USDT tracepoints, captures a minute of data from a running Postgres backend, and prints the per-database p50/p99/p99.9 query-latency histogram with the slow-query SQL text included for the top three slowest queries. This is the canonical shape of a production userspace tracer: USDT for the timing boundary, BPF map for in-kernel aggregation, Python orchestrator for parsing and reporting. The script is what Aditi ran during her reconciliation investigation, scaled down to a laptop reproduction.
#!/usr/bin/env python3
# trace_pg_query_latency.py
# Attach Postgres USDT tracepoints for query__execute__start/end across
# every backend in a running Postgres cluster, capture per-DB latency
# histograms over 60 seconds, and print the percentile ladder plus the
# top-3 slowest query strings.
import re
import subprocess

PG_BIN = "/usr/lib/postgresql/15/bin/postgres"

PROGRAM = rf'''
usdt:{PG_BIN}:postgresql:query__execute__start {{
    @start[tid] = nsecs;
    // arg0 is the const char *query_string passed by the macro
    @qstr[tid] = str(arg0, 256);
}}
usdt:{PG_BIN}:postgresql:query__execute__end / @start[tid] / {{
    $delta_us = (nsecs - @start[tid]) / 1000;
    @lat_us[comm] = hist($delta_us);
    if ($delta_us > 50000) {{
        @slow[@qstr[tid]] = max($delta_us);
    }}
    delete(@start[tid]); delete(@qstr[tid]);
}}
interval:s:60 {{ exit(); }}
'''

result = subprocess.run(
    ["sudo", "bpftrace", "-e", PROGRAM],
    capture_output=True, text=True, timeout=75,
)
out = result.stdout

# bpftrace prints power-of-two buckets with K/M/G suffixes above 1024,
# e.g. "[512, 1K)" — the parser must expand those or it silently drops
# every high-latency bucket.
hist_pat = re.compile(r"\[(\d+[KMG]?), (\d+[KMG]?)\)\s+(\d+)")
section_pat = re.compile(r"@lat_us\[(.+?)\]:")

def to_int(s):
    mult = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}
    return int(s[:-1]) * mult[s[-1]] if s[-1] in mult else int(s)

sections = re.split(r"\n(?=@lat_us\[)", out)
for section in sections:
    m = section_pat.search(section)
    if not m: continue
    comm = m.group(1)
    buckets = [(to_int(lo), to_int(hi), int(c))
               for lo, hi, c in hist_pat.findall(section)]
    total = sum(c for _, _, c in buckets)
    if total < 50: continue
    cum, p50, p99, p999 = 0, None, None, None
    for lo, hi, c in buckets:
        cum += c
        if p50 is None and cum >= 0.50 * total: p50 = hi
        if p99 is None and cum >= 0.99 * total: p99 = hi
        if p999 is None and cum >= 0.999 * total: p999 = hi
    print(f"backend={comm:<14} n={total:>6} "
          f"p50={p50:>5}us p99={p99:>7}us p99.9={p999:>8}us")

# Top 3 slow queries (worst-case per query string)
slow_pat = re.compile(r"@slow\[(.+?)\]: (\d+)")
slow = sorted(slow_pat.findall(out), key=lambda x: -int(x[1]))[:3]
for qtext, us in slow:
    print(f"  slow: {int(us)/1000:.1f} ms -> {qtext[:80]}...")
# Sample run on a c6i.4xlarge in ap-south-1 acting as Razorpay payouts replica
# during a simulated reconciliation window:
$ sudo python3 trace_pg_query_latency.py
backend=postgres n= 9842 p50= 8us p99= 124us p99.9= 480us
backend=postgres n= 4127 p50= 12us p99= 210us p99.9= 980us
backend=autovacuum laun n= 87 p50= 24us p99= 18000us p99.9= 89000us
slow: 240.3 ms -> SELECT id, amount, settled_at FROM payouts_2026_04 WHERE merchant_id = $1 AND ...
slow: 188.7 ms -> UPDATE payouts_2026_04 SET reconciled = true WHERE batch_id = $1...
slow: 156.2 ms -> SELECT count(*) FROM payouts_2026_04 WHERE settled_at > now() - interval...
Walk-through. PG_BIN is the absolute path to the Postgres executable; USDT tracepoints attach to a specific binary, not to a process or a process name, so the path matters. PROGRAM is a raw f-string; the only escape pain is that the literal { and } characters bpftrace uses for its probe blocks must be doubled ({{ and }}) so the f-string does not treat them as interpolation fields. The two USDT probes (query__execute__start, query__execute__end) bracket each query: start stores the timestamp and the query-string argument keyed by thread id, end subtracts to get the elapsed time and updates the latency histogram. The arg0 in the start probe is the first USDT argument as documented in the Postgres source (src/include/utils/probes.d); str(arg0, 256) truncates to 256 bytes, which is enough to identify the query without overflowing the BPF stack. The if ($delta_us > 50000) block captures the worst-case latency per query string for queries slower than 50 ms; the @slow map's key is the query text, so multiple invocations of the same query collapse into one entry. The percentile loop is identical to the syscall-tracing chapter — the same algorithm applies because it is the same shape of data.
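The percentile ladder can be factored into a standalone function for reuse across tracers. One caveat worth making explicit: because hist() uses power-of-two buckets, the returned value is the upper bound of the bucket where the cumulative count crosses the quantile, not an exact percentile:

```python
def percentile_from_hist(buckets, q):
    """buckets: ascending [(lo, hi, count)] triples parsed from a
    bpftrace hist() dump. Returns the upper bound of the bucket in which
    the cumulative count crosses quantile q — an upper bound on the true
    percentile, since each bucket spans a power-of-two range."""
    total = sum(c for _, _, c in buckets)
    cum = 0
    for _, hi, c in buckets:
        cum += c
        if cum >= q * total:
            return hi
    return buckets[-1][1]  # degenerate q > 1 falls through to the top
```

For a 100-event histogram `[(0, 2, 50), (2, 4, 30), (4, 8, 15), (8, 16, 5)]`, `percentile_from_hist(b, 0.50)` reports 2 and `percentile_from_hist(b, 0.99)` reports 16 — "p99 is at most 16 µs", not an exact figure.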
Why this script chose USDT over a uprobe on exec_simple_query: the USDT pair query__execute__start / query__execute__end is the maintainer's contract for "when did query execution begin and end", and the contract has held across Postgres 9.x through 16.x. A uprobe on exec_simple_query would have to deal with three changes: the function was renamed to exec_simple_query_internal in 11, the calling convention changed in 13 (extended-protocol queries now go through exec_execute_message, not exec_simple_query), and the internal restructuring in 15 split the parser path off. The USDT, by contrast, fires from the same code site across all those versions because the maintainers explicitly preserved it. For a production tool that has to work across a fleet running mixed Postgres versions, this is the difference between "ship" and "support nightmare".
A second example, this time using a uprobe to look at a function with no USDT — into LWLockAcquire, where Postgres lightweight-lock contention manifests. The shape is similar but the trade-offs are different.
# trace_pg_lwlock.py (excerpt)
PROGRAM = rf'''
uprobe:{PG_BIN}:LWLockAcquire {{
    @lock_start[tid] = nsecs;
    @lock_id[tid] = arg0;    // LWLock *lock
    @lock_mode[tid] = arg1;  // LWLockMode mode (LW_SHARED, LW_EXCLUSIVE)
}}
uretprobe:{PG_BIN}:LWLockAcquire / @lock_start[tid] / {{
    $delta_us = (nsecs - @lock_start[tid]) / 1000;
    if ($delta_us > 100) {{
        @wait_us[@lock_mode[tid]] = hist($delta_us);
        @wait_count[@lock_mode[tid]] = count();
    }}
    delete(@lock_start[tid]); delete(@lock_id[tid]); delete(@lock_mode[tid]);
}}
interval:s:30 {{ exit(); }}
'''
The shift from USDT to uprobe gives Aditi access to the lock pointer and the mode, which the USDT does not expose; the cost is a fragile attachment that may break on the next Postgres minor release if the function signature changes. The general rule is the same as the kernel-side rule: USDT > uprobe > uretprobe, in increasing cost and decreasing stability. Mix and match by preferring the most stable probe that gives you the timing boundary and arguments you need.
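The "most stable probe that suffices" rule can be encoded directly, which is handy when a tool generates its probe spec at attach time. A sketch; the candidate-dict shape is illustrative, not a bpftrace API:

```python
STABILITY_ORDER = ("usdt", "uprobe", "uretprobe")  # most stable first

def pick_probe(candidates: dict) -> str:
    """Given the probe specs available for one timing boundary, e.g.
    {'usdt': 'usdt:/path:postgresql:lock-wait-start', 'uprobe': '...'},
    return the most stable option per the USDT > uprobe > uretprobe
    rule from the text."""
    for kind in STABILITY_ORDER:
        if kind in candidates:
            return candidates[kind]
    raise LookupError("no probe available for this boundary")
```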
A third example, this time crossing into a different runtime. Modern CPython (3.12+) ships with USDT tracepoints for function entry, function return, garbage-collection start/end, and import events; a Python service at Hotstar can be observed end-to-end with the kernel-and-userspace combination. The bpftrace one-liner is sudo bpftrace -e 'usdt:/usr/bin/python3.12:python:function__entry { @[str(arg0), str(arg1)] = count(); } interval:s:30 { print(@); clear(@); }', and on a Hotstar transcoding worker during the IPL final the output cleanly separates the per-second function-call rate by (filename, function_name) — data that would have required a sampling profiler before, now available as a real per-event count. The same pattern works for any USDT-instrumented runtime: OpenJDK has 50+ DTrace probes for HotSpot (compilation events, GC events, monitor events), Ruby's MRI has a comprehensive set, MySQL's InnoDB has 20+, and Node.js has function-entry/return probes when built with --with-dtrace. Treat the binary's USDT manifest (tplist -p $(pidof <name>)) as the first thing you check when joining a new performance investigation; it tells you what the maintainers thought was worth observing.
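The one-liner's output is line-per-key text, so turning it into a "hottest call sites" table is a few lines of parsing. A sketch, assuming bpftrace's `@[key1, key2]: count` print format; the file and function names in the usage example are invented:

```python
import re

COUNT_LINE = re.compile(r"@\[(.+?), (.+?)\]: (\d+)")

def top_functions(bpftrace_out: str, n: int = 3):
    """Parse the '@[filename, funcname]: count' lines printed by the
    function__entry one-liner and return the n hottest call sites as
    (filename, funcname, count) tuples, busiest first."""
    rows = [(f, fn, int(c)) for f, fn, c in COUNT_LINE.findall(bpftrace_out)]
    return sorted(rows, key=lambda r: -r[2])[:n]
```

Feeding it a captured interval such as `@[/app/feed.py, chunk_audio]: 4500` lines yields a ranked list that can go straight into a per-30-second report.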
How a uprobe actually fires — the same trap, but across processes
The mechanism behind a uprobe is worth knowing because it shapes the failure modes more sharply than the kprobe case. When you attach a uprobe to LWLockAcquire, the kernel does the same fundamental dance as a kprobe — copy the first instruction to a buffer, replace it with int 3, register a handler — but with one extra complication: the patched instruction lives in a userspace process's memory, not the kernel's. The kernel cannot simply patch the instruction in the on-disk ELF (that would change every running and future copy); it has to patch the in-memory copy of the page that the process is actually executing.
Linux solves this with the breakpoint-injection-via-COW pattern. The kernel locates the page that contains the target instruction in the process's page tables, marks it copy-on-write if it was shared (which is the common case — libc, the Postgres binary itself, and most other shared text segments are mapped read-only and shared between every process running them), allocates a private copy for just this process, patches the int-3 into the private copy, and re-maps the private page in place of the shared one. The cost is two extra page faults on first attach plus the memory cost of the COW copy — a 4 KiB page per attached process per uprobe site. For a single bpftrace one-liner attached to one Postgres backend the cost is invisible. For a uprobe attached to libc's malloc across every backend in a 200-backend cluster, the cost is 800 KiB of physical memory the cluster did not have before, plus two page-fault costs per backend at attach time (roughly 50 microseconds each).
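The memory arithmetic is worth keeping explicit, because it is the number a capacity reviewer will ask about. A minimal sketch assuming the worst case of one COW'd 4 KiB text page per probe site per process:

```python
PAGE_SIZE = 4096  # x86_64 base page size

def uprobe_cow_cost_bytes(processes: int, probe_sites: int) -> int:
    """Worst-case physical memory added by uprobe text patching: one
    private copy-on-write text page per probe site per process. Sites
    that land on the same page share one copy, so this is an upper
    bound."""
    return processes * probe_sites * PAGE_SIZE
```

`uprobe_cow_cost_bytes(200, 1)` reproduces the 800 KiB figure from the text for one libc probe across a 200-backend cluster.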
When the patched function is called, the CPU executes int 3, traps into kernel mode, the breakpoint handler runs the registered probe (your eBPF program), single-steps the saved original instruction in a controlled context, and resumes at the second instruction of the function. The whole dance takes 150–200 ns on a modern x86 part, dominated by the trap-frame save, the IRET, and the cross-mode privilege-level transition (which is more expensive in userspace because the kernel has to swap page-table contexts back to the user's CR3 before resuming). This is roughly twice the cost of a kprobe on the same hardware; the gap is the privilege transitions plus the kpti page-table swap on Meltdown-mitigated kernels.
The "across processes" part has another twist: a uprobe attached at the binary level fires in every process that maps that binary, not just one. Attaching uprobe:/lib/x86_64-linux-gnu/libc.so.6:malloc on a busy Postgres server fires the probe for every running process — postgres backends, autovacuum workers, the WAL writer, the checkpointer, every cron job, the SSH daemon, the OS package's apt if it happens to run. The probe's eBPF program will see every process's malloc, and the BPF map will fill with thousands of (pid, comm) entries instead of the dozens you expected. The fix is bpftrace -p <pid> or, for finer control, an in-program if (pid != 12345) { return; } filter that drops events from uninteresting processes. Forgetting this is a beginner mistake that costs a few seconds of confusion and a flooded ringbuf.
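One way to make the PID filter impossible to forget is to generate the program rather than hand-write it. A sketch; `libc_malloc_tracer` and the default libc path are illustrative, while the `/ pid == N /` predicate is ordinary bpftrace syntax:

```python
def libc_malloc_tracer(target_pid: int,
                       libc_path: str = "/lib/x86_64-linux-gnu/libc.so.6") -> str:
    """Generate a bpftrace program that counts mallocs for ONE pid.
    The predicate runs before the action, so events from every other
    process mapping libc are dropped in kernel context — note the
    per-fire trap cost is still paid; only map updates and userspace
    copies are saved."""
    return (
        f"uprobe:{libc_path}:malloc / pid == {target_pid} / {{\n"
        f"    @calls[comm] = count();\n"
        f"}}\n"
    )
```

The generated text can be handed to `subprocess.run(["sudo", "bpftrace", "-e", prog])` exactly like the query-latency script above.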
The same is not true for USDT: USDT probes attach to a specific process's compiled-in tracepoint and are scoped naturally to "this process", not "every process running this binary". A USDT probe attached with bpftrace -p $(pidof postgres) actually attaches to one specific PID; reattaching to a different PID gets a separate probe. This is one more reason USDT is gentler in production than uprobe: the scope is naturally narrow, and the per-process attach cost is paid once.
The inlining trap that bit kprobes also bites uprobes — aggressively. Postgres compiled with -O2 -flto (the default for distribution packages since 2023) inlines hot functions much more often than the source structure suggests. LWLockAcquire is a real symbol — the function is too large to inline at most call sites — but its hot fast path (LWLockAttemptLock) is inlined into every caller. A uprobe on LWLockAttemptLock attaches at the one or two non-inlined sites; the dozens of inlined sites are unobservable. The fix is the same as the kernel-side fix: attach to a parent function that has not been inlined and use it as a proxy.
Why USDT is half the cost of uprobe for the same workload: the USDT probe site in the compiled binary is a NOP placed at a known address and recorded in the binary's ELF notes. Attaching a USDT probe arms that NOP through the same breakpoint machinery uprobes use, but because the displaced instruction is a NOP the kernel can emulate it trivially instead of taking the single-step excursion that an arbitrary first instruction requires. The per-fire cost drops from roughly 150 ns to roughly 80 ns. The catch is that USDT requires the binary to have been compiled with the tracepoints in place; you cannot retrofit USDT into a binary you did not build. uprobe is the universal escape hatch when no USDT exists, and it always costs more.
Reading the output without lying to yourself
The histogram from trace_pg_query_latency.py has the same two failure modes the kernel-tracing chapter described — probe-induced bias and short-lived process drift — plus four new ones unique to userspace tracing.
Multi-process aggregation. The @lat_us[comm] map keys by comm, which is the process name. Every Postgres backend appears as postgres, so the aggregate histogram mixes all backends together. If one backend is doing OLAP-style 200 ms scans and the others are doing OLTP-style 200 µs point lookups, the aggregate p99 is dominated by the OLAP backend and tells you nothing useful about either workload. The fix is to key by (pid, comm) and to filter the output to backends with enough events to matter. For Aditi's investigation, the right key was (database_oid, query_id) — both of which are available as USDT arguments — because the question was "which database's queries are slow", not "which backend is slow".
The process-attach race. A uprobe attached at t=10:00:00 only affects processes that exec the binary at or after that time, plus the running processes that the kernel walked at attach time. A Postgres backend forked at 10:00:00.001 may miss the attach window depending on the order of operations between the kernel's process-iteration and the fork. For long-running backends this is fine; for short-lived ones (autovacuum workers, parallel worker processes) it is not. The mitigation is to attach with bpftrace -p <postmaster_pid> and let the BPF program inherit attachment to children, or to attach early enough that all interesting children are already running.
The libc-shared-symbol trap. A uprobe on malloc in /lib/x86_64-linux-gnu/libc.so.6 fires for every process on the box that uses the system libc. If you only care about Postgres's mallocs, the bpftrace program must filter on comm == "postgres" inside the action; without the filter, your map fills with mallocs from cron, sshd, the OOM killer, and a hundred other processes. Worse, the cost (150 ns) is paid for every fire regardless of whether the filter passes. Filtering early in the BPF program avoids the userspace data transfer cost but does not avoid the per-fire trap cost. For high-fire-rate libc symbols, the right answer is often "do not attach at the libc level — attach at the application's wrapper" (palloc in Postgres, je_malloc if jemalloc is being used, etc.).
The static-vs-dynamic linking gotcha. uprobes attach to file paths, not to symbol names abstractly. A binary statically linked against libc has its own copy of malloc inside the binary's own text segment, with a different attach path. uprobe:/lib/.../libc.so.6:malloc will not catch a statically-linked Go binary's malloc because Go does not link against libc that way (it uses a runtime-internal allocator). Tools that try to be helpful with library symbol resolution (bcc's malloc-stack-counting tool, for instance) sometimes paper over this and silently produce empty histograms when the target binary uses an internal allocator. The check is ldd /path/to/binary | grep libc — if libc is not in the output, your libc-symbol uprobe will see nothing.
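The ldd check is scriptable as part of a pre-attach sanity pass. The parsing below assumes ldd's two usual output shapes (a resolved-libraries listing, or the literal "not a dynamic executable" message); the sample output in the test is fabricated:

```python
def links_system_libc(ldd_output: str) -> bool:
    """True when `ldd <binary>` output shows a dynamic libc dependency.
    ldd prints 'not a dynamic executable' for statically linked binaries
    (including most Go binaries), and a libc-path uprobe on such a
    binary will see nothing."""
    if "not a dynamic executable" in ldd_output:
        return False
    return any("libc.so" in line for line in ldd_output.splitlines())
```

Wire it up with `subprocess.run(["ldd", path], capture_output=True, text=True)` and refuse to attach a libc-symbol uprobe when it returns False.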
The same in-kernel-filter mantra from the previous chapter applies more strongly here: every byte that crosses from kernel to userspace is paid for; filter before you pay. For uprobes specifically, the cost of not filtering is amplified because uprobes fire across every process on the box that maps the binary, not just the one you care about. The mantra to keep: filter inside the eBPF program before you decide an event is interesting, and filter on PID early because PID is the cheapest field to test.
A second-order practical concern unique to USDT: read the argument shapes from the binary, not the documentation. USDT arguments are documented in source files like Postgres's src/include/utils/probes.d, but the compiled shape can drift from the docs depending on which DTrace-compatibility shim the binary was built with. The reliable way is tplist -v -p <pid> | grep <usdt_name> (a BCC tool), which reads the ELF notes that USDT writes into the binary and prints the actual register-or-memory location of each argument. For Postgres's query__execute__start, on a 6.6 kernel running Postgres 15.4, tplist -v reports arg0=8@-72(%rbp) — the query-string pointer lives at offset -72 from the frame pointer, not in rdi — and a bpftrace script that assumes register passing will read garbage. Always cross-check.
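The size@location strings tplist prints follow SystemTap's SDT note encoding: an optional leading minus (signed value), a byte size, @, then an assembler operand. A small parser against that encoding, useful when deciding whether arg0 will be register-read or needs a memory dereference:

```python
import re

ARG_SPEC = re.compile(r"(-?\d+)@(.+)")

def parse_stapsdt_arg(spec: str):
    """Split a stapsdt descriptor like '8@%rdi' or '4@-72(%rbp)' into
    (size_bytes, is_register, location). A leading '-' on the size
    marks the value as signed; a bare %reg operand is register-passed,
    anything containing a memory operand needs a dereference."""
    m = ARG_SPEC.fullmatch(spec)
    if m is None:
        raise ValueError(f"not a stapsdt arg descriptor: {spec!r}")
    size, loc = int(m.group(1)), m.group(2)
    return abs(size), loc.startswith("%") and "(" not in loc, loc
```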
A third trap, the most insidious because it is invisible: stripped binaries. A binary built with gcc -s or stripped with strip has had its symbol table removed; uprobes cannot resolve LWLockAcquire to an address because the address is no longer mapped to that name. bpftrace will fail with ERROR: Could not resolve uprobe. The fix is to install the debug symbols package (postgresql-15-dbgsym on Debian, postgresql-debuginfo on RHEL), or to attach by hex address (which you have to discover yourself with objdump -d). Many production fleets ship stripped binaries by default to reduce package size; the debug-symbol package usually has to be installed deliberately. If bpftrace -l 'uprobe:/usr/lib/postgresql/15/bin/postgres:*' returns zero hits, the binary is stripped.
Common confusions
- "USDT and uprobe are interchangeable." They are not. USDT is a stable contract; uprobe is not. A bpftrace one-liner using usdt:postgresql:lock-wait-start will keep working across Postgres major versions because the maintainers committed to that name and argument shape. A uprobe on LWLockAcquire may break silently on the next Postgres minor release. For ad-hoc investigation either works; for shipped tooling, prefer USDT when one exists.
- "A uprobe sees every process running the binary." Yes by default, no when you filter. A uprobe attached with bpftrace -e '...' (no -p) fires for every process that maps the target binary; with -p <pid> it fires only for that process tree. For libc symbols this matters enormously — an unfiltered libc uprobe can fire millions of times per second from sources you did not intend to trace.
- "uprobe overhead is the same as kprobe overhead." It is roughly 2× higher. The privilege-level transitions and the kpti page-table swap that uprobes pay (and kprobes do not) push the per-fire cost from ~80 ns to ~150 ns on x86_64. For low-fire-rate symbols this is invisible; for high-fire-rate libc functions it is the dominant cost.
- "Stripped binaries are still uprobe-able." They are not, by name. bpftrace cannot resolve a symbol name to an address without the symbol table, so attachment fails. You can still attach by hex address from objdump -d, but discovering the right address for a function is a separate workflow that most production teams do not invest in. Install the -dbgsym or -debuginfo package instead.
- "USDT is free until you attach a probe." Yes, but the binary has to have been compiled with the USDT macros. --enable-dtrace (Postgres) or --with-dtrace (Node.js, Python before 3.12) is a build-time flag; binaries from distributions that did not enable it have no USDT sites at all. Check with tplist -p <pid> or readelf -n <binary> | grep stapsdt.
- "strace and a uprobe see the same things." They see different things. strace traces syscalls (the kernel boundary), so it sees what the application asks the kernel to do but nothing about what the application is doing in userspace before or between syscalls. A uprobe inside the application sees userspace work directly. For "which file did the process open" use strace; for "which userspace function called open" use strace -k (which adds stack traces) or a uprobe on the userspace caller.
Going deeper
USDT manifests and how to read them
Every USDT-instrumented binary embeds an ELF note section called .note.stapsdt that lists every USDT probe with its name, provider, argument count, and argument-location encoding. readelf -n /usr/lib/postgresql/15/bin/postgres | grep -A5 stapsdt dumps the raw entries; tplist -v -p $(pidof postgres) from BCC presents them in a more readable form. Reading the manifest before attaching is good practice for two reasons: it tells you what is actually available in the version you have (binaries built without --enable-dtrace will have an empty manifest even if the source has USDT macros), and it tells you the argument-location encoding (8@%rdi for an 8-byte register-passed argument vs 4@-72(%rbp) for a 4-byte stack-spilled one), which determines whether arg0 in your bpftrace program reads the right memory.
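Scraping the manifest programmatically is a natural first step for fleet tooling. A sketch that pulls (provider, name, arguments) triples out of readelf -n text; readelf's exact layout varies slightly across binutils versions, and the addresses in the test sample are fabricated:

```python
import re

def parse_stapsdt_notes(readelf_output: str):
    """Pull (provider, name, arguments) triples out of `readelf -n`
    output, relying on the Provider:/Name:/Arguments: lines readelf
    prints for each NT_STAPSDT note."""
    providers = re.findall(r"Provider:\s+(\S+)", readelf_output)
    names = re.findall(r"Name:\s+(\S+)", readelf_output)
    args = [a.strip() for a in re.findall(r"Arguments:\s*(.*)", readelf_output)]
    return list(zip(providers, names, args))
```

Paired with `subprocess.run(["readelf", "-n", binary], ...)`, this gives a tool the same view tplist presents, without a running process to attach to.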
The biggest collection of pre-instrumented USDT in the Linux ecosystem is OpenJDK — HotSpot ships with about 60 probes covering compilation events, GC start/end, monitor contention, class loading, exception throws, and method entry/exit (the latter only when -XX:+ExtendedDTraceProbes is set, because per-method probes are expensive). For a Java service at Hotstar or Flipkart, the USDT manifest is a complete observability surface: GC pauses are observable as hotspot:gc__begin/gc__end events, JIT compilation is observable as hotspot:method__compile__begin, lock contention is observable as hotspot:monitor__contended__enter. A bpftrace script reading these probes can produce per-second GC-pause-time histograms across the JVM fleet without enabling -Xlog:gc (which costs CPU and disk) and without parsing GC logs (which is fragile). This is the value proposition of USDT at its strongest.
Why the USDT manifest is the first thing to check at fleet scale: a 40-microservice Razorpay deployment running across 6 languages (Java, Go, Python, Rust, Node, C++) has 40 different observability surfaces, and the team that owns each service knows which USDT probes their binary exposes. The platform team building tracing tools cannot keep up with which uprobe target is stable in which service this week; instead they query tplist -p <pid> at attach time and target only USDT. The result is a tracing toolchain that works across the entire deployment without per-service maintenance — and that property is what makes USDT-first tracing scale beyond the prototype phase.
Why the Python interpreter's USDT was a years-long fight
CPython gained USDT support in 3.6 (2016) but it was disabled by default for years because of a performance regression: the per-function-call probe site, even when no probe was attached, added measurable overhead to the interpreter's hot dispatch loop. The cost was small — about 0.3% on micro-benchmarks — but the CPython core team is conservative about per-call overhead and the feature was gated behind a build flag. Distribution packagers (Debian, Red Hat) had to choose: enable USDT and accept the 0.3% interpreter overhead, or disable it and lose the observability. Most chose to disable, which is why a python3 binary on Ubuntu 20.04 may have no USDT probes at all even though the CPython source has them.
In CPython 3.12 (October 2023) the situation changed: the new specialising adaptive interpreter included tracepoint sites that compiled to true zero-overhead NOPs (the same multi-byte NOP technique kernel tracepoints use), and Debian and RHEL flipped the default to enable USDT. This means a 2026-era Python service, built against a recent distribution, has USDT for function__entry, function__return, gc__start, gc__done, import__find__load__start, import__find__load__done, and audit. For a Hotstar or Razorpay Python service, this is the difference between "we ship py-spy flamegraphs" and "we ship per-function call-rate histograms in real-time". The migration story across 2024–2026 is gradually arriving at the latter.
How to write a USDT probe into your own binary
For a service you own, adding USDT is a small and worthwhile investment. The package is systemtap-sdt-dev on Debian (systemtap-sdt-devel on RHEL); the macro is DTRACE_PROBE from <sys/sdt.h>. A C function that wants to expose a tracepoint at the start of its hot path writes:
#include <sys/sdt.h>

void process_payment(int merchant_id, long amount_paise) {
    DTRACE_PROBE2(razorpay, payment__process__start, merchant_id, amount_paise);
    // ... actual processing ...
    DTRACE_PROBE2(razorpay, payment__process__end, merchant_id, amount_paise);
}
The DTRACE_PROBE2 macro expands at compile time to a NOP (zero overhead when no probe is attached) plus an ELF note that the linker writes into the binary's .note.stapsdt section. The provider name (razorpay) and probe name (payment__process__start) are arbitrary strings; the convention is to name the provider after the service and use the namespace-style __ separator. Once compiled, bpftrace -e 'usdt:./your-binary:razorpay:payment__process__start { @[arg0] = count(); }' aggregates per-merchant payment counts at zero cost when not attached.
Rust, Go, and Python expose similar mechanisms: usdt crate for Rust, the runtime/trace package for Go (different mechanism but similar semantics), and the C API for Python extensions. The investment is small (a handful of lines per probe site) and the payoff at fleet scale is an observability contract you control. Razorpay, Zerodha, and Flipkart's payments and trading core services that the author has worked with all have USDT in their hottest paths for exactly this reason: it lets their SREs ask precise questions in production without rebuilding or restarting the service.
When uprobe is genuinely the only option
There are three legitimate cases where uprobe is the right tool despite its downsides. Third-party closed-source binaries — you cannot add USDT to a vendor's binary, so uprobe is the only path. Diagnostic emergencies on a binary you did not build — in an outage at 02:00 IST, you do not have time to recompile Postgres with --enable-dtrace and roll out a new package; you attach a uprobe to the symbol you suspect, accept the fragility, and move on. Probing the unblessed paths — the maintainer chose 8 USDT sites because those are the ones they cared about; if your bug is in the 9th place, USDT cannot help and you fall back to uprobe. The right mental model: USDT is the curated trail, uprobe is the bushwhack. Both have their place; one of them gets you home faster.
The multi-language trap — managed runtimes hide their own functions
A uprobe attached to a JVM process will see HotSpot's C++ functions (JavaCalls::call_helper, interpreter_entry, etc.) but not the Java methods running on top. A uprobe attached to a Python process will see CPython's C functions (_PyEval_EvalFrameDefault, _Py_HashBytes, etc.) but not the Python functions. The names you see are the runtime's, not the application's. To get application-level Java method names you need either USDT (hotspot:method__entry, which fires only when the JVM is started with -XX:+ExtendedDTraceProbes) or async-profiler / py-spy, which use different mechanisms entirely: runtime-aware stack walkers rather than trap instructions. The same is true for Node.js (V8 hides JS functions), Ruby (MRI hides Ruby methods), and Erlang (BEAM hides processes). The general lesson: uprobes see the host language's functions, not the guest language's. For multi-language stacks, USDT is much more useful because the runtime's maintainers can compile in probes that are guest-aware.
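The host/guest split is easy to see directly. A hedged sketch that counts entries into CPython's interpreter loop: it fires constantly under any Python workload but never names a Python function. The symbol name and its location (the python3 binary vs. a shared libpython3.x.so) vary by distro and CPython version, so check with nm or objdump first:

```bpftrace
// Counts eval-loop entries: the host language's function, never the
// guest's. _PyEval_EvalFrameDefault is the CPython 3.6+ name; on
// shared builds, attach to libpython3.x.so instead of /usr/bin/python3.
uprobe:/usr/bin/python3:_PyEval_EvalFrameDefault
{
	@evals = count();
}
```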
Reproduce this on your laptop
# Reproduce the USDT-tracing demo on Linux 5.15+ with Postgres 15.
sudo apt install bpftrace bpfcc-tools postgresql-15 postgresql-15-dbgsym  # BCC ships as bpfcc-tools on Debian/Ubuntu (tools named tplist-bpfcc, etc.)
python3 -m venv .venv && source .venv/bin/activate
pip install hdrh psycopg[binary]
# Verify the postgres binary has USDT compiled in:
sudo tplist -v -p $(pidof -s postgres) | grep query__execute
# Run a workload that triggers query execution at high rate:
python3 -c "
import psycopg
c = psycopg.connect('host=/var/run/postgresql user=postgres dbname=postgres').cursor()
for _ in range(100000):
    c.execute('SELECT count(*) FROM pg_stat_activity')
" &
# In another terminal, attach the tracer:
sudo python3 trace_pg_query_latency.py
Three small follow-up exercises that build fluency. Exercise one: run tplist -p $(pidof postgres) and read every USDT probe Postgres exposes. The 8 entries are the entire stable observability surface; knowing them by heart is the first step to fluent Postgres tracing. Exercise two: write a bpftrace one-liner that counts lock-wait-start events bucketed by (pid, lockmode) and run it during a pg_dump of a 1 GB database. The output reveals which locks the dump is waiting on: usually AccessShareLock on user tables, but occasionally RowExclusiveLock from concurrent writers. Exercise three: pick a non-Postgres USDT-instrumented binary on your system (tplist -l <path> prints the USDT probes of a single binary; a short shell loop over /usr/bin and /usr/sbin is a quick way to survey them all) and run tplist -v on it. Bind, MariaDB, libvirtd, qemu-system-x86_64, and OpenJDK are common ones. Reading their probe manifests is a fast survey of "what observability did the maintainers think was worth shipping".
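Exercise two's one-liner can look like the sketch below. It assumes, per the PostgreSQL DTrace probe documentation, that lock-wait-start's final argument is the requested lock mode; verify both the argument order and the probe-name spelling (dashes vs. double underscores) against tplist -v output on your build:

```bpftrace
// Sketch: count heavyweight-lock waits by (pid, lock mode) during pg_dump.
// Assumes arg5 of lock-wait-start is the requested LOCKMODE, per the
// Postgres DTrace docs; confirm the name and args with tplist -v.
usdt:/usr/lib/postgresql/15/bin/postgres:postgresql:lock__wait__start
{
	@waits[pid, arg5] = count();
}
```

Run it under sudo while the pg_dump is in flight; Ctrl-C prints the map. The lockmode integers follow Postgres's LOCKMODE enum (1 = AccessShareLock through 8 = AccessExclusiveLock).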
Where this leads next
This chapter completed the kernel-and-userspace tracing pair. The previous chapter (tracing syscalls and kernel functions) covered kprobe and tracepoint; this one covered uprobe and USDT; together they give you observation points anywhere in the running stack. Chapter 44 (per-event delivery in production) is where the cost analysis from the last two chapters graduates to "what does it actually look like when the producer outpaces the consumer at 10M events/s, across both kernel and userspace probe sources". Chapter 45 (eBPF latency histograms in production) shows how the per-PID and per-database histogram patterns become continuously-on observability streams.
The single insight to carry forward: USDT and uprobe are not the same mechanism; the choice between them is the same shape of choice as tracepoint vs kprobe, with the additional twist that USDT requires the binary to have been compiled with the probes in place. For ad-hoc incident response either works; for shipped tooling, USDT is the contract you can rely on across version upgrades, and uprobe is the escape hatch when no USDT exists. Knowing which probes your services expose — running tplist -p <pid> against every binary in your fleet and writing the manifest into your team's runbook — is a one-day investment that pays back the first time an outage walks the latency from 18 ms to 240 ms and you have ninety seconds to find out where.
The deeper habit, the one Aditi's investigation modeled: start with the most-stable probe family that can answer your question. Syscall tracepoints first, then kernel tracepoints, then USDT, then kprobe, then uprobe, then uretprobe, in increasing order of fragility. Each step down the ladder buys you flexibility at the cost of a probe that is more likely to silently break the next time the kernel or the binary is upgraded. The market-open at Zerodha, the IPL final at Hotstar, the Big Billion Day at Flipkart, the reconciliation window at Razorpay — the SRE who walks into a 02:00 IST page with a tracing toolkit built on the stable end of the ladder gets the answer faster than the SRE whose toolkit broke on last week's package upgrade and now has to rebuild it from scratch. The product is not the probe; the product is the speed at which a question becomes a runnable answer with real numbers behind it.
References
- Brendan Gregg, BPF Performance Tools (Addison-Wesley, 2019) — chapter 5 covers uprobes and USDT in depth, with examples for Postgres, MySQL, OpenJDK, and CPython.
- Linux kernel uprobes documentation — the maintainers' description of attachment, COW page handling, and tracefs interface.
- SystemTap SDT documentation — the canonical reference for adding USDT probes to your own binary using <sys/sdt.h>.
- PostgreSQL DTrace probes documentation — the contract for the 8 USDT probes Postgres compiles in when built with --enable-dtrace.
- PEP 669: Low impact monitoring for CPython — the PEP behind CPython 3.12's low-impact monitoring hooks that made Python observable in production again.
- Andrii Nakryiko, "Building BPF applications with libbpf-bootstrap" — modern guide to writing uprobe and USDT programs in libbpf, the production replacement for BCC.
- /wiki/tracing-syscalls-and-kernel-functions — the kernel-side companion to this chapter; the two together cover the whole probe surface.
- /wiki/bcc-toolchain — where tplist, argdist, and the other USDT-aware utilities used in this chapter come from.