bpftrace: the awk of production

At 22:14 IST on a Tuesday, Karan, an SRE at Zerodha Kite, gets paged: order-acknowledgement p99 has climbed from 4 ms to 38 ms on a single matching-engine pod. The CPU profile is flat, the application's own latency histogram is flat, and the network metrics look fine. The bug is happening inside a system call, somewhere between the userspace write() to the order-event socket and the kernel's TCP send path. Six years ago this would have meant attaching strace -T to a production process and praying the 30× slowdown didn't make the bug disappear. Tonight Karan types a single line — bpftrace -e 'kprobe:tcp_sendmsg /comm == "kite-match"/ { @t[tid] = nsecs; } kretprobe:tcp_sendmsg /@t[tid]/ { @ns = hist(nsecs - @t[tid]); delete(@t[tid]); }' -c "sleep 30" — and 30 seconds later has a power-of-two-bucket latency histogram of every tcp_sendmsg call from the matching engine, aggregated in-kernel, costing the pod under 1% of one core. The histogram has a fat second mode at 8–16 ms that the application metrics cannot see. Forty minutes later he has the fix. The thing that turned a four-hour bisect into a forty-minute incident is not eBPF in the abstract — it is the existence of a tool that lets you compose a probe, a filter, and an aggregation in one line, the way awk lets you compose a record matcher and a print action in one line.

bpftrace is a high-level tracing language that compiles one-line probe expressions to eBPF and runs them in the kernel. The grammar is probe /filter/ { action }, the actions include in-kernel aggregations like count(), sum(), hist(), and lhist(), and the user-space side prints results when the program ends. It hides verifier mechanics, map allocation, and JIT details behind a syntax that takes ten minutes to learn and an afternoon to be fluent in.

The grammar that fits in your head

bpftrace's surface area is small enough that a working SRE can hold all of it in their head after one afternoon of practice. A program is a list of clauses; each clause has three parts:

probe /filter/ { action }

The probe is where the program runs — kprobe:tcp_sendmsg, tracepoint:sched:sched_switch, uprobe:/usr/bin/python3:_PyEval_EvalFrameDefault, interval:s:1. The filter is an optional predicate; only events satisfying it run the action. The action is what to do — increment a counter, sample a stack, build a histogram. Clauses are independent; a single bpftrace script can attach to twelve different probes, each with its own filter and action, and they all run in-kernel concurrently against the same set of maps.

The action language is roughly C-shaped — if, else, ==, &&, integer arithmetic — with three things tacked on that make it useful for production tracing. First, maps: any identifier prefixed with @ is a kernel-resident map that bpftrace allocates, sizes, and tears down for you. @count[comm] = count(); means "increment the slot keyed by the current process name in a map called @count". Second, builtins for the common things you want to know about an event: pid, tid, comm (process name), cpu, nsecs (current monotonic time in nanoseconds), kstack (kernel stack), ustack (user stack), arg0..argN (probe arguments), retval (kretprobe return value). Third, aggregating functions that run in-kernel and produce summary outputs: count(), sum(x), avg(x), min(x), max(x), hist(x) (power-of-two histogram), lhist(x, min, max, step) (linear-bucketed histogram). At end-of-script — Ctrl-C, exit(), or the -c "sleep N" wrapper completing — bpftrace dumps every map to stdout in human-readable form.
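The map-plus-aggregation semantics are easy to model in plain Python: each @map is a dict keyed however the script keys it, and each aggregating function folds one new sample into the existing slot, so state stays proportional to the number of keys rather than the number of events. A minimal illustrative model — not bpftrace's internals:

```python
# Illustrative model of bpftrace's @map aggregations: each builtin folds
# one sample into a per-key slot, so state is O(keys), not O(events).
from collections import defaultdict

class AggMap:
    def __init__(self):
        self.slots = defaultdict(lambda: {"n": 0, "sum": 0, "min": None, "max": None})

    def record(self, key, value=1):
        s = self.slots[key]
        s["n"] += 1                      # count()
        s["sum"] += value                # sum(x)
        s["min"] = value if s["min"] is None else min(s["min"], value)
        s["max"] = value if s["max"] is None else max(s["max"], value)

    def avg(self, key):                  # avg(x) = sum/count, computed at dump time
        s = self.slots[key]
        return s["sum"] / s["n"]

m = AggMap()
for comm, latency_ns in [("kite-match", 4000), ("kite-match", 8000), ("postgres", 90000)]:
    m.record(comm, latency_ns)

print(m.slots["kite-match"]["n"])   # 2 events from kite-match
print(m.avg("kite-match"))          # 6000.0 ns average
```

The point of the model is the shape of the work: one small, constant-cost update per event, and all the expensive formatting deferred to dump time.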

[Figure: probe /filter/ { action } — what each piece compiles to. The probe (kprobe:tcp_sendmsg) becomes a kprobe attach point; the filter (/comm == "kite-match"/) becomes a verifier-checked branch; the action ({ @ns = hist(nsecs - @start[tid]); }) becomes an in-kernel map update; on Ctrl-C bpftrace dumps every @map to stdout as a power-of-two histogram sorted by key.]
Each piece of the bpftrace clause maps to a piece of the eBPF program the verifier sees. The probe becomes an attach point; the filter becomes a verified conditional branch; the action becomes a sequence of map operations. The user-space side wakes up only at end-of-script to print results.

The minimal one-liner that demonstrates all three pieces is bpftrace -e 'tracepoint:syscalls:sys_enter_openat { @[comm] = count(); }'. There is no probe filter and the action is a single map increment keyed by process name. Run it for ten seconds, hit Ctrl-C, and bpftrace prints a sorted table of which processes opened files and how often. The same exercise in strace -e openat -p <pid> would either pin to a single PID (missing the cross-process question) or capture every event to userspace at 30× slowdown. The bpftrace version costs under 1% CPU on a busy box and answers the cross-process question by construction.

Why the in-kernel aggregation matters specifically for tools like this: a tracepoint:syscalls:sys_enter_openat fires roughly 50,000–500,000 times per second on a busy production box. Copying every event to userspace via a perf ring buffer costs roughly 200 ns per event in copy and queue management — at 500k events/s that's 100 ms of CPU per second, or 10% of one core, just to ship the events. Aggregating in-kernel into a hash map keyed by comm reduces the userspace data flow to one dump every N seconds (where N is whatever bpftrace's interval:s:N clause says, default end-of-script only), and the per-event cost drops to ~50 ns of map-update work. That reduction — per-event cost cut roughly 4× and the per-event userspace processing eliminated entirely — is what makes "leave it running for an hour" feasible; without it, the tracer becomes the bottleneck and the bug it was meant to find disappears under the observation cost.
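The back-of-envelope arithmetic can be checked directly; the constants are the chapter's rough estimates, not measurements:

```python
# Rough overhead model from the text: shipping every event to userspace
# vs aggregating in-kernel. The per-event nanosecond costs are the
# chapter's estimates, not benchmarked values.
PER_EVENT_SHIP_NS = 200     # perf ring buffer copy + queue management
PER_EVENT_AGG_NS = 50       # in-kernel hash-map update

def cpu_fraction(events_per_sec, per_event_ns):
    """Seconds of CPU consumed per wall-clock second (i.e. fraction of one core)."""
    return events_per_sec * per_event_ns / 1e9

rate = 500_000   # busy-box sys_enter_openat rate from the text
print(cpu_fraction(rate, PER_EVENT_SHIP_NS))   # 0.1   -> 10% of one core
print(cpu_fraction(rate, PER_EVENT_AGG_NS))    # 0.025 -> 2.5% of one core
```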

Aggregations are the actual product

The reason bpftrace replaced strace and printf-style printk for most production tracing is not the syntax — it is the in-kernel aggregations. A handful are worth knowing in detail, because each maps to a class of question.

count() answers "how often". @[comm] = count(); keyed by process name builds a histogram of which processes generated the event. @[kstack] = count(); builds a histogram of which kernel stack traces produced the event — the foundation of every flamegraph-from-kprobes workflow. The cost is one atomic increment per event, which on per-CPU maps drops to one non-atomic increment.

hist(x) answers "what is the distribution". It allocates a 64-slot map (one slot per power-of-two bucket across the 64-bit value range — bpftrace does not know or care what the units are) and increments the slot corresponding to floor(log2(x)). The output looks like:

@ns:
[1K, 2K)             1287 |@@@@                                                |
[2K, 4K)            14392 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4K, 8K)             8127 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                       |
[8K, 16K)             892 |@@@                                                 |
[16K, 32K)             14 |                                                    |
[8M, 16M)              43 |                                                    |

Six bucket rows, immediately readable, with the bimodal shape of "fast common case + tail" jumping out. Compare this to the experience of populating an HdrHistogram from per-event userspace samples — you can see why the in-kernel hist() won. It is the "histogram in one keystroke" idiom.
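Both the power-of-two bucketing and the bar rendering are simple to model. A sketch of the idea — bucket index is floor(log2(x)), bars scale to the largest bucket — not bpftrace's actual output code:

```python
# Sketch of hist()-style power-of-two bucketing and ASCII bar rendering.
from collections import Counter

def bucket(x):
    # floor(log2(x)) via bit_length; the kernel-side hist() uses a
    # count-leading-zeros instruction to compute the same index.
    return x.bit_length() - 1 if x > 0 else 0

def render(samples, width=52):
    buckets = Counter(bucket(s) for s in samples)
    top = max(buckets.values())
    lines = []
    for b in sorted(buckets):
        lo, hi = 1 << b, 1 << (b + 1)
        bar = "@" * (buckets[b] * width // top)
        lines.append(f"[{lo}, {hi}){buckets[b]:>12} |{bar:<{width}}|")
    return "\n".join(lines)

print(render([3, 3, 5, 6, 7, 9, 14, 300]))
```

Note that a single outlier (the 300) gets its own row far below the main mass, exactly the "fat tail bucket" shape that makes bimodality jump out of the dump.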

lhist(x, min, max, step) is the linear-bucket variant for cases where the value range is bounded and you want even bucket widths. lhist(retval, 0, 4096, 64) distributes return values from 0 to 4096 into 64-byte-wide buckets — useful for read-size distributions where the power-of-two bucketing of hist() smears together too much.
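lhist's bucketing is just integer division by the step, with values outside the range clamped into underflow and overflow buckets. A sketch under those assumptions:

```python
# Sketch of lhist(x, min, max, step): linear buckets, clamped at both ends.
def lhist_bucket(x, lo, hi, step):
    if x < lo:
        return "(..., lo)"          # underflow bucket
    if x >= hi:
        return "[hi, ...)"          # overflow bucket
    start = lo + ((x - lo) // step) * step
    return (start, start + step)

print(lhist_bucket(100, 0, 4096, 64))   # (64, 128)
print(lhist_bucket(64, 0, 4096, 64))    # (64, 128) -- buckets are half-open
print(lhist_bucket(5000, 0, 4096, 64))  # overflow bucket
```

Compare with hist(): for a read-size distribution, hist() would collapse 2048–4095 into one bucket while lhist(x, 0, 4096, 64) keeps 32 evenly-spaced buckets across the same range.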

avg(x), sum(x), min(x), max(x) answer point-summary questions and are mostly useful when keyed: @us[comm] = avg(elapsed / 1000); gives the per-process average elapsed time. The trap with avg is the same trap as anywhere — it hides the tail. Most production bpftrace scripts prefer hist() over avg() for latency analysis. The shape of the distribution is almost always more useful than its mean.
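The avg-hides-the-tail trap is easy to demonstrate with synthetic bimodal data shaped like this chapter's example — many fast events plus a small slow mode (illustrative numbers, not the chapter's measured data):

```python
# Why avg() misleads on bimodal latency: the mean lands between the two
# modes, in a region where almost no events actually occur.
fast = [6] * 18_000          # ~18k reads at ~6 us (page cache)
slow = [6_000] * 300         # ~300 reads at ~6 ms (disk)
samples = fast + slow

mean = sum(samples) / len(samples)
print(round(mean, 1))                              # ~104 us: in neither mode
print(sum(1 for s in samples if 50 <= s <= 200))   # 0 events anywhere near the mean
```

A hist() over the same samples would show two clean modes; the scalar average reports a latency that literally no request experienced.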

Maps can be keyed by tuples. @[comm, pid] = count(); keys by both process name and PID, useful when the same name has many instances. @[ustack, kstack] = count(); keys by the joint user-and-kernel stack — the standard idiom for "what called this kernel function and from what userspace context". @latency[probe, comm] = hist(elapsed); runs separate histograms per probe per comm, all in the same map, all dumped at end-of-script. There is no upper bound on key arity; the verifier checks the total key size against the map definition.

A subtle property: maps in bpftrace are typed by first use. The first write @x = 5; makes @x an integer scalar. The first write @y = count(); makes @y a map of count-aggregations. The first write @z[pid] = nsecs; makes @z a map keyed by pid with integer values. Mixing types — assigning a count to a slot you previously set to a hist — is rejected at compile time. This is the source of many beginner errors that read like "type mismatch on map @x"; the fix is to rename one of the maps or make all uses consistent, because once a map's type is set by its first use it cannot change for the rest of the script.

Probes are the alphabet

The probe types you will reach for, in roughly the order you will reach for them, are these.

kprobe:<function> and kretprobe:<function> attach to the entry and exit of any kernel function. The argument values are accessible as arg0, arg1, arg2, ... in the entry probe; the return value as retval in the return probe. To turn an entry-time start into an exit-time elapsed, store the start in a per-thread map and subtract on exit:

kprobe:vfs_read { @start[tid] = nsecs; }
kretprobe:vfs_read /@start[tid]/ {
    @us[comm] = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]);
}

This is the single most-typed bpftrace pattern in production: enter-time stash keyed by tid, exit-time read-and-delete, histogram of elapsed. The delete(@start[tid]) is critical — without it, threads that exit between the entry and the return leak slots in the map until it fills.
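The stash-and-delete discipline can be modeled in Python to show the leak directly: simulate entry/exit events where one thread never returns, and count the slots left behind with and without the delete:

```python
# Model of the enter/exit stash pattern: per-tid start times, deleted on
# exit. Threads that die mid-call leave slots behind; skipping the delete
# leaks a slot for EVERY thread, not just the dead ones.
def trace(events, delete_on_exit=True):
    start = {}          # plays the role of @start[tid]
    elapsed = []        # what would feed hist()
    for kind, tid, t in events:
        if kind == "enter":
            start[tid] = t
        elif kind == "exit" and tid in start:
            elapsed.append(t - start[tid])
            if delete_on_exit:
                del start[tid]
    return elapsed, len(start)   # leaked slots = entries never deleted

events = [("enter", 1, 100), ("exit", 1, 150),
          ("enter", 2, 200)]    # tid 2 dies before the kretprobe fires

print(trace(events, delete_on_exit=True)[1])   # 1: only the dead thread leaks
print(trace(events, delete_on_exit=False)[1])  # 2: every thread leaks a slot
```

In a real script the dead-thread leak is bounded by thread churn; the no-delete leak grows with every call and fills the map.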

tracepoint:<subsystem>:<event> attaches to a stable, source-marked tracepoint. The argument fields are accessible by name through args->: tracepoint:syscalls:sys_enter_openat { @[str(args->filename)] = count(); } reads the filename string argument from the tracepoint's args struct. Tracepoints are ABI-stable; kprobes are not. Always prefer a tracepoint when one exists for the event you want.

uprobe:<binary>:<symbol> and uretprobe:<binary>:<symbol> attach to functions inside a userspace binary. uprobe:/usr/bin/python3:_PyEval_EvalFrameDefault fires every time a Python frame is evaluated; uprobe:/usr/lib/x86_64-linux-gnu/libc.so.6:malloc fires on every libc allocation. The cost per fire is higher than kprobes (the breakpoint traps from userspace into the kernel before the eBPF program runs) but the visibility into a userspace process is total.

usdt:<binary>:<provider>:<probe> attaches to userland statically-defined tracepoints baked into the binary by its author. Postgres, OpenJDK, Python (with --with-dtrace), and Node.js all ship USDT probes for their internal events. These are stable across versions of the binary, like kernel tracepoints.

profile:hz:<rate> fires at a fixed rate per CPU, regardless of what the kernel is doing. profile:hz:99 { @[ustack] = count(); } is the one-line on-CPU profiler that produces a flamegraph-ready histogram of userspace stacks at 99 Hz. The 99 (not 100) is to avoid lock-step sampling against periodic kernel work.
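The 99-not-100 choice is about aliasing: a sampler whose period exactly matches a periodic task sees that task at the same phase on every sample. A toy simulation (assumed workload: busy for the first 1 ms of every 10 ms) makes the distortion concrete:

```python
# Why profile:hz:99, not 100: a 100 Hz sampler phase-locks with work that
# recurs every 10 ms and samples the same instant forever; a 99 Hz sampler
# drifts through every phase. Toy model: task busy 1 ms out of each 10 ms.
def busy_fraction_seen(sample_hz, seconds=10):
    hits = 0
    n = int(sample_hz * seconds)
    for i in range(n):
        t_ms = i * 1000.0 / sample_hz
        if (t_ms % 10.0) < 1.0:       # sample landed inside the busy window
            hits += 1
    return hits / n

print(busy_fraction_seen(100))  # 1.0 -- always samples phase 0: "100% busy"
print(busy_fraction_seen(99))   # ~0.101 -- drifts through phases: ~10% busy
```

The true busy fraction is 10%; the phase-locked sampler reports ten times that. The same aliasing in the other direction (always sampling the idle phase) would hide the work entirely.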

interval:s:N and interval:ms:N fires once per N seconds (or ms) on one CPU. Used for periodic readouts: print a map every second, then clear it.

BEGIN and END fire once at script start and once at script end. Use BEGIN { printf("tracing tcp_sendmsg, ctrl-c to end\n"); } to print a banner; use END { print(@); clear(@); } to control the final dump format.

software:<event>:<count> and hardware:<event>:<count> fire on perf events — every Nth context-switch, every Nth cache miss, every Nth branch miss. hardware:cache-misses:1000000 fires once every million cache misses; combined with @[ustack] = count(); it gives you a stack-traced view of where cache misses are happening in your code.

The probe namespace is reflected in the kernel's /sys/kernel/debug/tracing/events/ and /sys/kernel/debug/kprobes/list directories; bpftrace -l 'kprobe:tcp_*' lists every kprobe matching the pattern, and bpftrace -lv 'tracepoint:sched:sched_switch' prints the available argument fields for a tracepoint. These two commands are how you discover what's available without reading kernel source.

A real script, end to end

The retransmit-counting script from Chapter 39 used bpftrace from a Python harness. Here is a slightly larger, self-contained example that demonstrates the patterns from this chapter in one piece.

# vfs_read_latency.py
# Trace vfs_read latency by process for 30 seconds, render a per-process
# power-of-two histogram, and identify the slowest tail.
#
# This is the script Karan from the chapter opener could have run after
# the matching-engine pod showed unexplained 38ms p99: the SLA path goes
# write() -> sock_sendmsg -> tcp_sendmsg, but the symmetric question
# "where are reads slow" is the canonical first probe.

import re
import subprocess
import sys

BPFTRACE_PROG = """
BEGIN {
    printf("tracing vfs_read latency, %d seconds\\n", $1);
}
kprobe:vfs_read /pid != 0/ {
    @start[tid] = nsecs;
}
kretprobe:vfs_read /@start[tid]/ {
    $elapsed_us = (nsecs - @start[tid]) / 1000;
    @us[comm] = hist($elapsed_us);
    @count[comm] = count();
    @sum_us[comm] = sum($elapsed_us);
    delete(@start[tid]);
}
"""

def run(duration_s: int) -> str:
    """Spawn bpftrace with -c 'sleep N' so the kernel decides when to stop."""
    cmd = ["bpftrace", "-e", BPFTRACE_PROG, "-c", f"sleep {duration_s}"]
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=duration_s + 10)
    if proc.returncode != 0:
        sys.stderr.write(proc.stderr)
        sys.exit(proc.returncode)
    return proc.stdout

def parse_top_avg(out: str, k: int = 5) -> list[tuple[str, float, int]]:
    """Find the top-k processes by average vfs_read latency."""
    sums = re.findall(r"@sum_us\[(\S+?)\]:\s*(\d+)", out)
    counts = dict(re.findall(r"@count\[(\S+?)\]:\s*(\d+)", out))
    rows = []
    for comm, total in sums:
        n = int(counts.get(comm, "0"))
        if n < 100:    # ignore long-tail noise
            continue
        rows.append((comm, int(total) / n, n))
    return sorted(rows, key=lambda r: -r[1])[:k]

def main() -> None:
    raw = run(duration_s=30)
    top = parse_top_avg(raw, k=5)
    print(f"{'process':<24} {'avg us':>10} {'reads':>10}")
    print("-" * 46)
    for comm, avg_us, n in top:
        print(f"{comm:<24} {avg_us:>10.1f} {n:>10}")
    # Also show the full histogram dump for the worst offender:
    if top:
        worst = top[0][0]
        m = re.search(rf"@us\[{re.escape(worst)}\]:\n((?:.+\n)+?)\n", raw)
        if m:
            print(f"\nlatency histogram for '{worst}':")
            print(m.group(1))

if __name__ == "__main__":
    main()

Sample run on a c6i.4xlarge in ap-south-1 acting as a Zerodha Kite matching pod:
$ sudo python3 vfs_read_latency.py
tracing vfs_read latency, 30 seconds
process                      avg us      reads
----------------------------------------------
kite-match                    412.7      18394
postgres                       89.3       4271
prometheus                     34.8       1820
node_exporter                  18.2        641
gunicorn                        9.1        492

latency histogram for 'kite-match':
[1, 2)                  812 |@@                                                  |
[2, 4)                 5118 |@@@@@@@@@@@@@@@@                                    |
[4, 8)                15921 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8, 16)                3294 |@@@@@@@@@                                           |
[16, 32)                841 |@@                                                  |
[32, 64)                127 |                                                    |
[64, 128)                14 |                                                    |
[128, 256)                3 |                                                    |
[256, 512)                4 |                                                    |
[512, 1K)                14 |                                                    |
[1K, 2K)                 38 |                                                    |
[2K, 4K)                 91 |                                                    |
[4K, 8K)                117 |                                                    |

Walk-through. BEGIN prints a banner with the duration passed via the $1 positional. kprobe:vfs_read /pid != 0/ stashes the entry timestamp keyed by thread id; the filter excludes pid 0 (idle/swapper) noise. kretprobe:vfs_read /@start[tid]/ runs only when there is a matching entry timestamp (the filter /@start[tid]/ reads the map and treats non-zero as truthy); it computes elapsed microseconds, updates a per-comm histogram, increments a per-comm count, accumulates a per-comm sum, and deletes the entry — without the delete, threads that exit mid-read would leak slots forever. parse_top_avg uses regex over bpftrace's stdout dump to compute per-process averages; the threshold n < 100 filters out processes with too few samples to be statistically meaningful. The histogram block at the end of the output reveals the kite-match service's bimodal shape: most reads complete in 4–8 µs, but there is a non-negligible second mode at 4–8 ms. That second mode is the bug.

The output's two modes are what you cannot get from a single average. The 412.7 µs average for kite-match is misleading — it is the consequence of 18,000 fast reads averaged with a few hundred slow ones. The histogram tells the real story: most reads are fast, but the tail extends three orders of magnitude further than expected, and that tail is what's burning the SLA.

[Figure: bimodal vfs_read latency histogram (illustrative, not measured data). Power-of-two buckets from 1 µs to 16 ms on the x-axis, log count on the y-axis: a fast mode at 4–8 µs, near-zero counts between 64 and 512 µs, and a second smaller mode at 1–8 ms. A dashed vertical line marks the 412 µs average, sitting between the two modes where almost no events actually live.]
The bimodal shape that bpftrace's hist() reveals at a glance and a single average completely hides. Production tracing lives or dies on whether the operator can see the second mode; the average sits in a bucket where almost nothing actually occurs.

The shape of the bug, once you see it, is recognisable: the kite-match service is occasionally hitting a code path where vfs_read blocks on disk for 4–8 ms instead of completing from page cache in 4–8 µs. The follow-up bpftrace script narrows the cause: kretprobe:vfs_read /comm == "kite-match" && @start[tid] && (nsecs - @start[tid]) > 1000000/ { @[ustack] = count(); } (a simplified form, reusing the entry-time stash) captures the userspace call stacks of reads slower than 1 ms, and the top stack turns out to be a periodic configuration-reload that bypasses the in-memory cache. Forty minutes from page to fix.

Why the bpftrace -c "sleep N" wrapper is preferred over Ctrl-C: when bpftrace is invoked with -c <cmd>, it spawns the command, runs the probes for the duration of the command's lifetime, and tears down cleanly when the command exits. With manual Ctrl-C, the SIGINT can race the in-progress probe firings, and on rare paths the per-thread @start[tid] map can leak entries because the kretprobe never fires. The -c "sleep 30" idiom is the deterministic way to get a fixed duration with clean teardown — important when scripting bpftrace from Python harnesses where you want reproducible runs.

Going deeper

The bpftrace tools repository — every tool you wish you had written

The bpftrace/tools/ directory in the upstream repository contains dozens of production-tested scripts shipped as part of bpftrace itself: biolatency.bt (block-IO latency histogram), tcplife.bt (TCP connection lifetime tracker), runqlat.bt (scheduler run-queue latency), oomkill.bt (capture OOM-kill stack traces), mallocstacks.bt (malloc bytes by userspace stack). Most are under a hundred lines of bpftrace with extensive comments. Reading these scripts in order is the fastest way to internalise the idioms and the probe vocabulary; most of them are direct translations of the BCC tools from Brendan Gregg's BPF Performance Tools book, ported into the bpftrace one-language form. For a working SRE the practical advice is: install bpftrace from the distro's package, then find /usr/share/bpftrace/tools -name '*.bt' and read every file in alphabetical order over a fortnight. The investment is half an hour an evening for two weeks; the payoff is a permanent vocabulary expansion of what the kernel can be asked.

Why the in-kernel histogram cost stays bounded as event rate climbs: the hist() action compiles to a __builtin_clz (count-leading-zeros) instruction over the value, an array index into a 64-slot per-CPU map, and a non-atomic increment of the slot. The total cost is roughly 8–12 cycles per probe firing on modern x86_64 — under 4 ns at 3 GHz. At 10 million events per second per core that is 4% of one core; at 1 million events/s/core it is 0.4%. The cost scaling is linear in event rate and independent of histogram width, which is why a single hist() over nsecs is the canonical "tell me the distribution" tool — the answer is paid for in advance, not at readout time.

Why bpftrace cannot do everything BCC can

BCC (the bcc Python library) and bpftrace share the same eBPF substrate but make different trade-offs. BCC programs are written in C with Python orchestration; they have full access to eBPF's program-types, helpers, and memory layout. bpftrace programs are higher-level and lose access to a few rare features: writing to BPF_MAP_TYPE_RINGBUF for per-event delivery in the modern shape, attaching to LSM hooks for security policy, certain newer program types (raw_tracepoint, kfunc, BPF iterators) that take time to land in bpftrace's grammar. For 95% of observability scripts bpftrace is enough, and the brevity advantage is decisive. For the remaining 5% — a custom packet processor, a per-event ring buffer with rich payloads, an LSM policy — drop down to BCC or libbpf-bootstrap. Knowing where the line is matters: trying to force bpftrace into a use case it does not support produces frustration, while the same problem attacked with the tool whose model actually fits it is often solved in twenty minutes.

The --unsafe flag and the system() action

bpftrace has a system("shell command") action, gated behind the --unsafe flag, that lets a probe spawn a userspace command. This is the foot-gun. A kprobe:vfs_open { system("rm -rf /tmp/foo"); } will fork a shell on every file open — an unboundedly fast loop will fork-bomb the box in seconds and require a hard reboot. The flag exists because there are legitimate uses (a one-shot probe that captures a core dump when a specific kernel error fires) but the failure mode is so severe that the bpftrace developers gated it explicitly. Real production scripts almost never need system(); if you find yourself reaching for it, the right answer is usually a small Python harness around bpftrace that watches the map and triggers the action in userspace where it cannot fork-bomb the host.

Map sizing and the silent-drop failure mode

Every bpftrace map has a maximum size. The default is 4,096 entries; the runtime knob is the BPFTRACE_MAP_KEYS_MAX environment variable. Once a map fills, further insertions silently fail — bpftrace warns on stderr, but the in-flight script continues and produces a map that is missing data. For aggregation maps keyed by something low-cardinality (comm, pid, kernel-stack-id) the default is generous. For maps keyed by something high-cardinality — every IPv4 src/dst pair, every UUID, every conntrack flow — the default fills in seconds and the resulting histogram is a lie. The two fixes: bump the limit via the environment (BPFTRACE_MAP_KEYS_MAX=131072 bpftrace ...) or change the script's keying to bucket the cardinality before the insert (@[ip & 0xffffff00] to coarsen IPv4 keys to /24). Always read the end-of-script warnings — a script that silently dropped half its events is worse than no script at all, because the histogram looks plausible.
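The coarsening trick works because masking off the host byte collapses every address in a /24 into one key. A quick check of the cardinality reduction on synthetic addresses (illustrative numbers):

```python
# Key coarsening for high-cardinality maps: the equivalent of bpftrace's
# @[ip & 0xffffff00] collapses all hosts in a /24 into one slot.
import ipaddress

# Synthetic traffic: 254 hosts in each of eight /24 networks.
ips = [int(ipaddress.ip_address(f"10.1.{b}.{h}"))
       for b in range(8) for h in range(1, 255)]

print(len(set(ips)))                          # 2032 distinct host keys
print(len({ip & 0xFFFFFF00 for ip in ips}))   # 8 distinct /24 keys
```

2,032 keys fits comfortably inside the 4,096-entry default; the same pattern against real internet traffic would not, and the mask is what keeps the map bounded.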

Reproduce this on your laptop

# Reproduce the vfs_read latency demo on a Linux box (5.8+ recommended for CAP_BPF).
sudo apt install bpftrace
# No pip install needed — the script uses only the Python stdlib.
sudo python3 vfs_read_latency.py
# Generate read load with: dd if=/dev/zero of=/tmp/big bs=1M count=2048
# then in another terminal: cat /tmp/big > /dev/null
# (do this while vfs_read_latency.py is running)

Where this leads next

This chapter introduced the language; the next chapters in Part 6 widen the lens. Chapter 41 (BCC toolchain) covers the Python-and-C framework underneath bpftrace's one-liners — useful when bpftrace's grammar runs out and you need the full eBPF program model. Chapter 42 (kprobes vs tracepoints) is the deep dive into the kernel-side hook taxonomy this chapter only previewed. Chapter 43 (uprobes and USDT) covers the same for userspace. Chapter 44 covers the per-event delivery path when aggregation is not enough. Chapter 45 (eBPF latency histograms in production) is where bpftrace's hist() graduates from incident-response tool to always-on observability primitive.

The single insight to carry forward: bpftrace fits in your head, and that is its product. Once you are fluent — once the kprobe:tcp_sendmsg entry stash and its kretprobe:tcp_sendmsg { @ns = hist(nsecs - @t[tid]); delete(@t[tid]); } partner flow out of your fingers without thinking — every kernel question becomes a one-line experiment instead of a half-day project. The Karan opener of this chapter is what fluency looks like in production: from "the SLA is breached" to "the histogram is bimodal" in under a minute, because the tool's surface area is small enough that the next line of code is always obvious.

A practical next step: install bpftrace on whatever Linux box you have (any Ubuntu 22.04 or later, any Amazon Linux 2023, any Debian 12 — apt install bpftrace or dnf install bpftrace), then run bpftrace -e 'tracepoint:syscalls:sys_enter_openat { @[comm] = count(); }' for ten seconds. The output is a sorted table of which processes on your machine opened files in those ten seconds. Read it. The first time you see the table — and notice that systemd, snapd, your browser, and a tracker daemon you forgot you installed are all in it — the conceptual switch from "tracing is a special activity" to "tracing is just a question" flips, permanently.
