bpftrace for ad-hoc tracing

PagerDuty fires at 03:11 IST. Asha is the on-call SRE for the Razorpay payments gateway, the IPL final still has 90 minutes of traffic ahead, and the alert says p99_settlement_latency has tripled. Grafana shows nothing wrong. The userspace traces sum to 180 ms but the actual latency is 720 ms. The 540 ms gap is somewhere between her process and the wire, and at 03:11 she has no time to write a kernel module, ship a build, or schedule a deploy. She types sudo bpftrace -e 'kprobe:tcp_sendmsg /comm=="java"/ { @bytes = hist(arg2); }' on the host, presses Enter, lets it run for 30 seconds, presses Ctrl-C, and a power-of-two histogram of TCP send sizes prints to the terminal. Eight seconds of typing. Thirty seconds of probe. The signal she needed.

bpftrace is the awk of kernel tracing — a one-line, no-compile, no-agent language that compiles to eBPF bytecode, attaches to a hook, aggregates into a map, and prints when you Ctrl-C. Three primitives carry it: probe types (kprobe, tracepoint, uprobe, usdt, profile), aggregation functions (count, hist, lhist, stats, sum), and filters (/expr/). It is the war-room tool — perfect for the question you ask once, wrong for the metric you want to alert on.

What bpftrace actually is — awk for the kernel

bpftrace is a high-level tracing language, modelled deliberately on awk's shape — probe pattern → filter → action — that compiles each script to BPF bytecode, loads it via the bpf() syscall (verifier and JIT included, see /wiki/why-ebpf-changed-the-game), attaches the loaded program to one or more hook points, and prints any aggregated state to standard out when you stop it with Ctrl-C. The shape is: probe-spec /filter/ { action; }.

kprobe:vfs_read /comm=="java"/ { @bytes = hist(arg2); }

That one line compiles to a verified-and-JIT'd in-kernel program that fires every time vfs_read runs in any process whose command name is java, captures the third argument (the requested byte count), and feeds it into a power-of-two histogram named @bytes. The histogram prints on Ctrl-C with no extra typing. There is no make, no agent restart, no deploy.

The shape of a bpftrace one-liner (figure): the four parts of a probe, using kprobe:tcp_sendmsg /comm=="java"/ { @bytes = hist(arg2); } as the example: the probe spec (where to attach), the filter (should I fire?), and the action plus the @-named BPF map (what to record). Below it, the pipeline: the .bt source or bpftrace -e '...' string is parsed and lowered to BPF bytecode, passed through the verifier (bounded? safe?), JIT'd to native code, attached to a kernel hook, and run in-kernel; every firing writes to the @-map, and all @-maps print on Ctrl-C. The design boundary the figure calls out: aggregations live in BPF maps in the kernel, userspace prints once on exit, and there is no streaming to Prometheus and no agent, by design.
Illustrative — pipeline shape only, not API-exact. The probe-spec → filter → action structure that takes embedded C and Python glue in `bcc` compresses here into a single string. The verifier and JIT are the eBPF substrate; `bpftrace` is the surface language that hides the C scaffolding.

Why the awk analogy is precise and not just metaphor: awk's classic shape is pattern { action }, with built-in maps (associative arrays), built-in aggregations, and print-on-exit semantics for the END block. bpftrace keeps every one of those: BEGIN/END blocks, @-prefixed maps that auto-print, built-in aggregation primitives. Alastair Robertson, who wrote bpftrace, and Brendan Gregg, who drove much of its tooling and advocacy, were explicit that the goal was to give kernel tracing the same casual one-liner power awk gave text processing. The reason bpftrace succeeded where dtrace on Linux only partially did is that the language is a deliberate copy of an interface engineers already had in muscle memory.

The crucial constraint you absorb early: aggregations live in BPF maps inside the kernel; the userspace process holds open the map handles and prints them when the script exits. There is no per-event flush to disk, no Prometheus exporter, no OTLP. That is a deliberate design choice — bpftrace is for the question you ask once. The moment you want a metric to live for hours and feed a dashboard, you outgrow bpftrace and reach for bcc, libbpf, or a Pyroscope/Parca-shaped agent. This boundary is the whole shape of when to use what.

The five probe types you reach for

bpftrace exposes more than a dozen probe types, but five carry 90% of war-room work. Knowing which fires when is most of the practical fluency.

kprobe / kretprobe — fire on entry to or return from nearly any kernel function listed in /proc/kallsyms (inlined functions are the main exception). Arguments are accessible as arg0, arg1, ..., return values as retval. Use for instrumenting kernel internals like tcp_sendmsg, vfs_read, do_sys_openat2. Stable enough for war-room use; not stable across major kernel versions for long-lived agents (function names change).

tracepoint — fire on the kernel's stable tracepoints, the ones the kernel community has committed not to break. Examples: sched:sched_switch, tcp:tcp_retransmit_skb, block:block_rq_complete. Arguments are typed (args->next_pid, args->skbaddr, args->dev). Always prefer tracepoint over kprobe if one exists — it is the API contract.

uprobe / uretprobe — fire on entry/return of any function in any userspace binary. uprobe:/usr/bin/python3.11:_PyEval_EvalFrameDefault { @[ustack] = count(); } profiles every Python frame execution by stack. The same path:function syntax reaches into shared libraries: uprobe:/usr/lib/x86_64-linux-gnu/libssl.so.3:SSL_read.

usdt — User Statically-Defined Tracepoints, the userspace analogue of kernel tracepoints. Postgres, MySQL, OpenJDK's HotSpot JVM, libvirt, Node.js, and Python (when built with USDT support) all ship USDT probes — usdt:/usr/lib/postgresql/15/bin/postgres:postgresql:query__start { ... } fires on every Postgres query start. USDT exists where the application authors chose to expose stable trace points; it is the contract the application gives you.

profile / interval — fire on a wallclock timer. profile:hz:99 fires 99 times a second per CPU and is the basis of CPU sampling profilers — capture kstack or ustack into a count map and you have a flamegraph. interval:s:10 { print(@); clear(@); } lets you periodically print and reset, simulating a streaming exporter for short investigations.

A sixth class — software, hardware, watchpoint — exists for performance counters and breakpoints, but it is far more specialised. Most observability work lives in the five above.
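
A minimal sketch of three of the families in one script; the block-IO tracepoint, the java filter, and the map names are illustrative choices, not tied to the incident above. Run it with bpftrace, wait a few seconds, and Ctrl-C prints all three maps.

// probe_families.bt: three probe families in one script (illustrative sketch)

// stable kernel contract: typed tracepoint arguments
tracepoint:block:block_rq_issue
{
    @io_bytes = hist(args->bytes);
}

// unstable kernel symbol: positional arguments by register
kprobe:vfs_write
/comm == "java"/
{
    @write_bytes = hist(arg2);
}

// timer-driven: 99 Hz per-CPU kernel-stack sampling
profile:hz:99
{
    @cpu[kstack] = count();
}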

The five bpftrace probe families and where each fires (figure): a taxonomy whose horizontal axis splits kernel space from userspace and whose vertical axis splits stable contracts from unstable function-symbol attachment. tracepoint sits at kernel/stable (tracepoint:tcp:tcp_retransmit_skb, sched:sched_switch, block:block_rq_complete; typed args, ABI-stable across kernel versions). usdt sits at userspace/stable (usdt:.../postgres:postgresql:query__start, jvm:method__entry, python:function__entry; app authors expose these as a contract). kprobe/kretprobe sits at kernel/unstable (kprobe:tcp_sendmsg, kretprobe:vfs_read; arg0...argN by register, retval on return; function names can be renamed). uprobe/uretprobe sits at userspace/unstable (uprobe:/usr/bin/python3.11:_PyEval_EvalFrame, uprobe:libssl.so:SSL_read; needs a symbol table or BTF). profile/interval is timer-driven and sits outside the kernel/userspace split (profile:hz:99 as the flamegraph sampling base, interval:s:10 for periodic print + clear that fakes streaming).
Illustrative — placement is conceptual, not a strict 2×2 measurement. Tracepoints and USDT are stable contracts; kprobes and uprobes attach to function-symbol names that can move between versions. Profile and interval are wallclock-timer-driven and sit outside the kernel/userspace split. The fluency move is to prefer the top half — tracepoints in kernel, USDT in userspace — for any script that lives longer than an hour.

Why the kprobe-vs-tracepoint distinction matters in production: kprobes attach to function-name symbols that the kernel team can rename or refactor at any time without breaking ABI. A bpftrace script pinned to a kprobe can run fine on 5.10, break on 5.18 when the function is renamed in a refactor, and work again on 6.1 when a successor symbol happens to match. A script that uses tracepoint:tcp:tcp_send_reset works on every kernel since the tcp tracepoints landed in the 4.15–4.16 era, because the kernel community treats tracepoint definitions as a stability contract. For a one-liner you type once, the kprobe is fine. For a script you save in ~/scripts/sre-toolkit/ and run for years, prefer the tracepoint when one exists. bpftrace -l 'tracepoint:*' lists every available tracepoint on your kernel; that is the catalogue you start from.
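
A sketch of the same retransmit question asked both ways; the map names are arbitrary. The kprobe spelling is fine for tonight, the tracepoint spelling is the one worth saving.

// retrans_two_ways.bt: one event, an unstable and a stable probe (illustrative sketch)
// bpftrace -l 'tracepoint:tcp:*' shows what the stable catalogue offers on this kernel

// unstable: attaches to a function symbol a refactor is free to rename
kprobe:tcp_retransmit_skb
{
    @by_symbol = count();
}

// stable: the ABI-committed tracepoint, with typed arguments
tracepoint:tcp:tcp_retransmit_skb
{
    @by_contract[ntop(args->family, args->daddr)] = count();
}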

Aggregations — count, hist, lhist, stats, sum

bpftrace ships the five aggregation functions named in the heading, which cover most of what you ask the kernel about (min, max, and avg round out the set). Each lives in a per-CPU BPF map, which is why they scale to high-frequency probes without lock contention.
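
A sketch that runs all five on the same probe; arg2 is vfs_read's requested byte count, as in the earlier one-liner, and the map names are arbitrary.

// aggregations.bt: the five aggregation functions on one probe (illustrative sketch)
kprobe:vfs_read
{
    @calls[comm]  = count();                        // reads per process name
    @bytes_pow2   = hist(arg2);                     // power-of-two histogram
    @bytes_linear = lhist(arg2, 0, 65536, 4096);    // linear buckets, 0..64K in 4K steps
    @read_stats   = stats(arg2);                    // count, average, total in one value
    @bytes_total  = sum(arg2);                      // running total of requested bytes
}

Ctrl-C prints all five maps at once; in practice you keep the one or two that answer the question, since every extra aggregation adds per-event work.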

The output shape these produce is the whole point of bpftrace. A power-of-two histogram of vfs_read byte counts on a Hotstar transcoding node prints like this:

@bytes:
[1K, 2K)            12 |                                                    |
[2K, 4K)             4 |                                                    |
[4K, 8K)            89 |                                                    |
[8K, 16K)         2140 |@@@@@                                               |
[16K, 32K)       18402 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32K, 64K)        1304 |@@@                                                 |
[64K, 128K)        211 |                                                    |

Every bar is a piece of in-kernel state aggregated by a per-CPU map. The print on Ctrl-C is a handful of map reads, one pass over the buckets. The cost was paid in nanoseconds-per-event in the kernel, not in events-per-second shipped to userspace. This is the exact reason the same data gathered via strace would have slowed the traced workload to a crawl instead of costing a rounding error.

A working bpftrace artefact — TCP retransmits attributed to (pid, dst-ip, comm)

The honest test of fluency is writing a script that solves a real production question, not memorising syntax. The script below answers: "during this window, who is retransmitting TCP segments and to whom?" That is the question Asha's IPL-night incident actually demanded, and the script below is the production-shape answer. The driver is Python: bpftrace is a CLI tool, and the cleanest production pattern is subprocess-driven invocation with output parsing, exactly like promtool or logcli.

# tcpretrans_who.py — attribute TCP retransmits during a window
# Linux 5.8+, requires bpftrace installed (apt: bpftrace; or build from source).
# Run as root; parses bpftrace JSON output.
import json
import subprocess
import sys
import time
from collections import defaultdict

BPFTRACE_SCRIPT = r"""
tracepoint:tcp:tcp_retransmit_skb
{
    $pid = pid;
    $comm = comm;
    $dst = ntop(args->family, args->daddr);
    @retrans[$pid, $comm, $dst] = count();
}

interval:s:30 {
    print(@retrans);
    clear(@retrans);
}
"""

def run_window(window_seconds: int = 60) -> dict:
    print(f"[{time.strftime('%H:%M:%S')}] starting bpftrace, window={window_seconds}s",
          flush=True)
    # -f json gives structured output we can parse from Python
    cmd = ["sudo", "bpftrace", "-f", "json", "-e", BPFTRACE_SCRIPT]
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    time.sleep(window_seconds)
    proc.send_signal(2)  # SIGINT — bpftrace prints final aggregations on Ctrl-C
    stdout, stderr = proc.communicate(timeout=10)

    counts: dict[str, int] = defaultdict(int)
    for line in stdout.decode("utf-8", errors="replace").splitlines():
        line = line.strip()
        if not line or not line.startswith("{"):
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue
        if obj.get("type") != "map":
            continue
        for k, v in obj.get("data", {}).get("@retrans", {}).items():
            # bpftrace JSON encodes multi-key map keys as one comma-joined string
            counts[k] += int(v)
    return counts

if __name__ == "__main__":
    window = int(sys.argv[1]) if len(sys.argv) > 1 else 30
    counts = run_window(window)
    print(f"\nTop retransmit sources during {window}s window:")
    print(f"{'pid':>8}  {'comm':<16} {'dst':<22} {'retransmits':>12}")
    for k, v in sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:10]:
        # split the comma-joined key back into its pid, comm, dst parts
        pid_s, comm_s, dst_s = (str(k).split(",") + ["?", "?", "?"])[:3]
        print(f"{pid_s.strip():>8}  {comm_s.strip():<16} {dst_s.strip():<22} {v:>12}")
# Sample run on a Razorpay-shape staging node during synthetic load:
[03:14:21] starting bpftrace, window=30s

Top retransmit sources during 30s window:
     pid  comm             dst                      retransmits
   18421  java             10.40.12.91                      842
    9112  nginx            10.40.12.144                     611
   18421  java             10.40.12.92                      318
   12044  containerd-shim  10.40.18.3                       104
    9112  nginx            10.40.7.41                        47

The bpftrace script, embedded as a Python triple-quoted string: it attaches to tracepoint:tcp:tcp_retransmit_skb, which fires every time the kernel decides to retransmit a TCP segment. $pid, $comm, $dst are script-local variables; args->family and args->daddr are tracepoint-typed arguments, and that typing is what makes tracepoints safer than kprobes. ntop() turns the binary address into a printable form. @retrans[$pid, $comm, $dst] is a tuple-keyed BPF map; the tuple key is how we attribute the count.

The periodic flush via interval: this is the trick that turns bpftrace into something almost-but-not-quite streaming. Every 30 seconds the script prints the current @retrans map and clears it, and the Python driver reads each printed map from stdout. This is the bridge between the "Ctrl-C to print" semantics and a continuous capture; interval fires on an in-kernel timer like profile, but on a single CPU rather than per-CPU.

The subprocess driver and -f json: the -f json flag tells the binary to emit one JSON object per line per print event. Structured output is what makes bpftrace Python-driveable; without it, the only way to consume bpftrace output is to parse the human-readable text. JSON output makes bpftrace a data source in the same shape as prometheus_client.parser or logcli --output=jsonl.

The human-readable reduction: the Python aggregator collapses across windows, sorts by retransmit count, and prints the top 10. In a war-room the engineer reads this in five seconds and knows which destination IP block is the source of the trouble, just as Asha did at 03:14.

Why this shape and not a pure one-liner: the same investigation can be done with bpftrace -e 'tracepoint:tcp:tcp_retransmit_skb { @[pid, comm, ntop(args->family, args->daddr)] = count(); }' and Ctrl-C in 30 seconds. The Python wrapper exists for the production pattern: when you want to run this as a scheduled diagnostic that emits JSON to a log shipper, when you want to chain it into pandas for cross-window analysis, when you want it to be a building block of a runbook the next on-call engineer can run. The one-liner is for the first 30 seconds; the Python wrapper is for the runbook step.

# Reproduce this on your laptop (Linux 5.8+ recommended)
sudo apt-get install -y bpftrace linux-headers-$(uname -r)
# stdlib-only, no pip dependencies; the system python3 is enough
sudo python3 tcpretrans_who.py 30
# In another terminal, generate retransmits with `tc qdisc add dev eth0 root netem loss 5%`

A second measurement: bpftrace startup cost vs bcc

A claim engineers test the first time they reach for bpftrace: is it really fast enough to use at 03:14 IST under pressure? The honest measurement compares startup time — parse + verify + JIT + attach — between a bpftrace one-liner and an equivalent bcc Python program.

# bpftrace_vs_bcc_startup.py — measure cold-start latency
import subprocess
import time
import statistics

BPFTRACE = "kprobe:do_sys_openat2 { @[comm] = count(); } interval:s:1 { exit(); }"

bpftrace_times = []
for _ in range(10):
    t0 = time.monotonic_ns()
    proc = subprocess.run(
        ["sudo", "bpftrace", "-e", BPFTRACE],
        capture_output=True, timeout=10,
    )
    t1 = time.monotonic_ns()
    bpftrace_times.append((t1 - t0) / 1e6)

# bcc attaches an equivalent kprobe (the openat syscall) via the Python BPF() class
BCC_PROGRAM = """
import time
from bcc import BPF
b = BPF(text='''
#include <uapi/linux/ptrace.h>
struct k { char comm[16]; };
BPF_HASH(c, struct k, u64);
int p(struct pt_regs *ctx) {
    struct k key = {};
    bpf_get_current_comm(&key.comm, sizeof(key.comm));
    u64 z = 0, *v = c.lookup_or_try_init(&key, &z);
    if (v) (*v)++;
    return 0;
}
''')
b.attach_kprobe(event=b.get_syscall_fnname('openat'), fn_name='p')
time.sleep(0.01)  # attach is synchronous; exit almost at once so we time cold-start only
"""
bcc_times = []
for _ in range(10):
    t0 = time.monotonic_ns()
    subprocess.run(
        ["sudo", "python3", "-c", BCC_PROGRAM], capture_output=True, timeout=15,
    )
    t1 = time.monotonic_ns()
    bcc_times.append((t1 - t0) / 1e6)

def show(name, samples):
    print(f"{name:<10} p50={statistics.median(samples):6.0f} ms  "
          f"min={min(samples):6.0f}  max={max(samples):6.0f}")

show("bpftrace", bpftrace_times)
show("bcc",      bcc_times)
# Output (Linux 6.5, x86_64, NVMe-backed laptop):
bpftrace   p50=   124 ms  min=   108  max=   162
bcc        p50=  1842 ms  min=  1611  max=  2200

bpftrace is roughly 15× faster to cold-start than bcc — 120 ms versus 1.8 seconds. The difference is not the kernel work (verifier + JIT take the same milliseconds either way); it is that bcc compiles its embedded C with clang/LLVM at load time, links against libbcc, and pays Python import overhead (from bcc import BPF loads several MB of bindings). bpftrace is a single compiled C++ binary with the parser, the LLVM-based codegen, and the loader linked in. At 03:14 IST when you are trying to type before the page escalates, the 120 ms vs 1.8 s difference is the difference between "the answer arrived before I finished typing the next probe" and "I'm waiting for the tool to start".

Why this matters for the war-room argument and not just trivia: SRE work under pressure is a chain of "ask a question → get a partial answer → ask the next question" where each ask should take seconds, not tens of seconds. A 1.8-second startup compounds — the third probe, the fifth, the eighth — into minutes of waiting that an engineer fills with the wrong hypotheses while the page is still firing. bpftrace survives that pressure because each iteration cycle stays under 200 ms. bcc is the right tool for the agent that runs continuously and pays the start-up cost once; bpftrace is the right tool for the rapid-fire investigation. The clean rule SRE leads at Razorpay and Cred have settled on: if you would type the script and discard it within an hour, use bpftrace; if you would save it and run it as a service, use bcc (or libbpf). The boundary is the lifetime of the script, not the complexity of the question.

Real Indian production stories — bpftrace in war rooms

bpftrace shines in the same shape of incident, repeatedly: an alert fires, userspace traces don't explain it, and the engineer needs a kernel-level signal in seconds. Three Indian production teams have publicly shared cases.

Razorpay payments-gateway, 2023. A multi-week intermittent latency spike in UPI settlements turned out to be nf_conntrack table fills triggering connection drops on a specific kernel build. The diagnostic was a single bpftrace -e 'kprobe:nf_conntrack_in /comm=="java"/ { @drops = count(); }' one-liner that surfaced the conntrack drop rate in 30 seconds. Without bpftrace, the team would have spent days correlating dmesg nf_conntrack: table full, dropping packet lines against latency spikes in Grafana. With bpftrace the answer was on screen during the same shift the alert fired.

Hotstar IPL 2024 streaming-edge investigation. Edge nodes were intermittently dropping 0.8% of HLS segment requests during peak traffic, but APM showed normal latency on the requests that completed. The team wrote bpftrace -e 'tracepoint:tcp:tcp_retransmit_skb { @[args->saddr, args->daddr] = count(); } interval:s:10 { print(@); clear(@); }' and ran it against the affected nodes. Within two minutes they had the destination IP cluster — a single CDN POP — and within five minutes that POP was drained from the rotation. Total investigation time including on-call escalation: 11 minutes. The pre-bpftrace answer would have been a sustained tcpdump capture and offline wireshark analysis, easily 45–90 minutes.

Cred rewards-engine, 2024. A "phantom" 8-second outage during a daily reward-distribution batch turned out to be a JVM safepoint pause caused by an mmap call hitting a kernel page-allocation slow path. The diagnostic was bpftrace -e 'kprobe:do_mmap /comm=="java"/ { @[ustack] = hist(arg2); }' collecting userspace stacks per mmap size (arg2 is do_mmap's length argument). The histogram showed one stack — a specific JIT compilation hook — generating outsized mmap calls that triggered the slow path. The fix was a JVM tuning parameter; the diagnosis was a single one-liner. Without bpftrace the team would have correlated GC logs and JVM flight-recorder (JFR) profiles for a week.

The pattern across all three stories: the question was framed in one English sentence, the probe was one line, the answer was a histogram with a clear top entry. That triple is the war-room shape bpftrace enables. APM and Prometheus and Tempo cannot answer kernel-level questions; strace is too slow; a kernel module is impossible at 03:14 IST. bpftrace fills the gap that opens specifically during incidents.

Why this is a culture shift, not just a tool: Razorpay, Cred, and Hotstar SREs now write bpftrace one-liners on the war-room call as fluently as junior engineers write SQL in a debugging session. The skill that used to require a kernel-engineering specialist (read a perf.data file, write a kernel module, instrument a custom build) now lives in the on-call rotation. The "diagnostic ladder" mental model from /wiki/why-ebpf-changed-the-game (/proc → bpftrace one-liners → bcc pre-built tools → custom agents) is the cadence Indian platform teams now train on. Most of the value is at the second rung. A junior SRE who learns bpftrace -e 'tracepoint:sched:sched_switch { @[args->next_comm] = count(); }' and tracepoint:tcp:tcp_retransmit_skb and kprobe:vfs_read has crossed the line from "wait for the platform team" to "answer the question yourself".

Common confusions

Three come up every time the tool is taught. bpftrace vs bcc vs libbpf: the boundary is the lifetime of the script, not the complexity of the question; type-and-discard within the hour is bpftrace, save-and-run-as-a-service is bcc or libbpf. kprobe vs tracepoint: both can watch the same event, but kprobes attach to function symbols the kernel is free to rename, while tracepoints are a stability contract, so prefer the tracepoint for anything you save. Print-on-exit vs streaming: aggregations live in kernel maps and print when the script exits; interval:s:N { print(@); clear(@); } gives periodic output, but it is not a metrics exporter and is not meant to be one.

Going deeper

Why BEGIN, END, and interval exist — the awk inheritance shows

bpftrace borrows awk's three special probe types. BEGIN { ... } runs once when the script loads — useful for printing headers (printf("Tracing TCP retransmits, Ctrl-C to end\n");) or initialising script-local state. END { ... } runs once on exit — most users never write it because the implicit map-print on Ctrl-C is what they want, but END is where you customise the exit output. interval:s:N and interval:ms:N fire on a wallclock timer and are the only way to get periodic output without exiting. The awk parallel: awk's BEGIN runs before the first record, END after the last, and the implicit "for each record" is bpftrace's kprobe/tracepoint/uprobe. This is not coincidence — Brendan Gregg's design talks explicitly call out the awk lineage as the deliberate model. Studying awk's idioms (its built-in arrays, next, getline) is one of the cheapest ways to deepen bpftrace fluency.
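
A sketch of the awk-shaped probes working together; the five-second interval and the choice of vfs_read are arbitrary, and clearing the map in END is an idiom to stop the implicit exit-time dump from repeating what interval already printed.

// awk_shape.bt: BEGIN, a per-event probe, interval, END (illustrative sketch)
BEGIN
{
    printf("Tracing vfs_read by process... Ctrl-C to end\n");
}

kprobe:vfs_read
{
    @reads[comm] = count();
}

interval:s:5
{
    print(@reads);      // periodic output without exiting
    clear(@reads);
}

END
{
    clear(@reads);      // already printed by interval; suppress the exit-time dump
}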

Stack traces — kstack, ustack, and the off-CPU profile

Two builtins capture call stacks at the moment a probe fires: kstack returns the kernel stack at the probe point; ustack returns the userspace stack. Used as map keys, they enable flamegraph-shaped aggregations. The classic on-CPU profiler is profile:hz:99 { @[ustack] = count(); } — fire 99 times a second on each CPU, capture the userspace stack, count how often each stack appears. After 30 seconds you have a stack-frequency table that is the input to a flamegraph. The classic off-CPU profiler is harder: it requires matching the moment a task is switched off-CPU with the moment it runs again, and capturing the off-CPU duration plus the stack at sleep time — Brendan Gregg's offcputime is the canonical implementation, weighing in at ~30 lines. bpftrace lets you write a usable off-CPU profiler in a one-screen script, where bcc's production version is a few hundred lines of C+Python; the bpftrace version is the right shape for the war-room snapshot, the bcc version is the right shape for an always-on agent.
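
A one-screen sketch of that idea under stated assumptions: the PID of interest arrives as the positional parameter $1, the kernel stack is captured at switch-out, and the off-CPU gap is histogrammed at switch-in. It is a snapshot tool, not Gregg's offcputime; thread-group handling and userspace stacks are deliberately left out.

// offcpu_sketch.bt: off-CPU time and sleep stacks for one PID (illustrative sketch)
// usage: bpftrace offcpu_sketch.bt <PID>
tracepoint:sched:sched_switch
/args->prev_pid == $1/
{
    @off_since[args->prev_pid] = nsecs;
    @sleep_stacks[kstack] = count();     // the stack the task went to sleep on
}

tracepoint:sched:sched_switch
/args->next_pid == $1 && @off_since[args->next_pid]/
{
    @offcpu_us = hist((nsecs - @off_since[args->next_pid]) / 1000);
    delete(@off_since[args->next_pid]);
}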

The bpftrace JSON output format and Python integration

bpftrace -f json (available in current releases) emits structured output: one JSON object per line, with fields such as type (map, printf, attached_probes, and so on) and data. Structured output is what made bpftrace Python-driveable. Before it, parsing bpftrace output meant regexes against the human-readable histogram format — fragile, and broken every minor version. With -f json you can stream bpftrace stdout into a Python for line in proc.stdout: json.loads(line) loop and treat it as a structured data source. This is how production "diagnostic runbook" agents at Cred and Razorpay work: a Python orchestrator invokes bpftrace with a known script, parses the JSON, and optionally writes to a Loki log shipper or an S3 bucket for postmortem reference. The JSON contract makes bpftrace a data source in the same conceptual shape as promtool and logcli — a CLI that emits machine-readable structured output.

When bpftrace is not the right tool

The honest list of "use something else": (1) Always-on metrics: bpftrace is wrong; use bcc or libbpf with a Prometheus exporter. (2) Continuous profiling at fleet scale: use Pyroscope-eBPF or Parca-Agent (/wiki/parca-pixie-pyroscope), which solve the always-on flamegraph problem with deduplication, symbolisation, and BTF integration. (3) Network policy enforcement or LSM hooks: use Cilium or Tetragon; bpftrace does not attach to LSM hook points by default. (4) Cross-host correlation: bpftrace runs on one host; you need either an agent that ships data centrally (Pixie, Pyroscope) or a script orchestrator (pssh -h hosts.txt 'bpftrace -e ...'). (5) Kernel versions older than 4.18: bpftrace requires reasonably modern eBPF features (helpers, BTF for CO-RE-style scripts on newer kernels); on RHEL 7 or kernels older than 4.18 you may need bcc with kernel-headers compatibility instead. The decision between bpftrace and the alternatives is almost always about lifetime: the longer the script needs to live, the more reasons to leave bpftrace behind.

A diagnostic ladder for bpftrace-fluent SREs

When the alert fires and the graph shows nothing, the order of probes a war-room-fluent engineer types: (1) bpftrace -e 'tracepoint:syscalls:sys_enter_* /pid == <PID>/ { @[probe] = count(); }' — what syscalls is the suspect process making, and at what frequency? Often the answer is in this one probe — a process making 20× the expected number of mmap calls is your culprit. (2) tracepoint:tcp:tcp_retransmit_skb and tracepoint:tcp:tcp_send_reset — is the network stack telling us anything? (3) tracepoint:sched:sched_switch /args->prev_pid==<PID>/ { @[args->next_comm] = count(); } — who is the scheduler picking when our process goes to sleep? (4) kprobe:vfs_read and kprobe:vfs_write filtered to the suspect process — disk-IO patterns? (5) profile:hz:99 /pid==<PID>/ { @[ustack] = count(); } — userspace stack profile. The five probes above answer 80% of war-room "what is happening" questions. The remaining 20% — page faults, GC pauses, kernel-internal contention — are bespoke, but the muscle memory of the first five is what makes the on-call shift survivable. A bundled version of these probes, saved as a script, is sketched below.
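
The same ladder bundled into one saved script for the runbook; taking the suspect PID as the positional parameter $1 is a convention of this sketch, not something the ladder requires.

// first_pass.bt: rungs of the war-room ladder in one script (illustrative sketch)
// usage: bpftrace first_pass.bt <PID>
tracepoint:syscalls:sys_enter_* /pid == $1/ { @syscalls[probe] = count(); }
tracepoint:tcp:tcp_retransmit_skb { @retrans[ntop(args->family, args->daddr)] = count(); }
tracepoint:sched:sched_switch /args->prev_pid == $1/ { @next_up[args->next_comm] = count(); }
kprobe:vfs_read  /pid == $1/ { @read_bytes  = hist(arg2); }
kprobe:vfs_write /pid == $1/ { @write_bytes = hist(arg2); }
profile:hz:99    /pid == $1/ { @cpu_stacks[ustack] = count(); }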

Where this leads next

The next chapter — /wiki/parca-pixie-pyroscope — covers the always-on profiling stack, the right tool for the production posture bpftrace cannot fill: continuous flamegraphs, deduplicated stacks, fleet-wide aggregation. After that, /wiki/agentless-observability-claims is the honest counter-balance — what marketing means by "agentless" (often: a daemon that uses eBPF, with agentless as a positioning claim). /wiki/ebpf-for-network-observability-cilium-hubble returns to the Hotstar IPL story and walks the network-stack observability tooling that built on top of the bpftrace foundations described here.

The diagnostic-ladder mental model (/proc → bpftrace one-liner → bcc pre-built tool → custom Python agent → libbpf C/Rust agent) is the through-line of the rest of Part 8 and into Part 14 (continuous profiling). Every tool above the rung you live on is the one you call when the question outgrows your current tool; every tool below is the one you reach for first. bpftrace is rung 2, and rung 2 is where most production incidents end.

References