BCC toolchain

Aditi runs platform engineering at PhonePe. Her UPI gateway pods are dropping 0.4% of requests with ECONNRESET during the 21:30 IST nightly traffic peak — about 4 lakh failed transactions a day, roughly forty-eight crore rupees of payment volume that has to be retried by the merchant SDK. The errors are not in the application logs because the application never sees them; the kernel's TCP stack is sending the RST before the bytes ever reach userspace. A bpftrace one-liner can tell her which tcp_sendmsg calls are slow, but it cannot tell her which struct sock state machine transitioned to CLOSE_WAIT in the 200 ms before the RST, because the answer requires reading three nested fields out of struct sock, doing a bit of arithmetic on a u32, and emitting a per-event record with a 96-byte payload — and bpftrace's grammar does not let her write a struct walker. She drops down to BCC: she writes thirty lines of restricted C, twenty lines of Python, and at 21:31 the next evening the script is printing a stream of (local_port, remote_ip, sock_state_before, sock_state_after, retransmits) tuples she can grep through. By 22:10 she has the bug — a misconfigured conntrack timeout on the NAT gateway — and by 22:45 it's fixed. BCC is the tool you reach for when the question is exact, the kernel struct is awkward, and the answer needs Python on top to format it.

BCC (BPF Compiler Collection) is a Python framework that compiles C eBPF programs at script startup, attaches them to kernel probes, and exposes BPF maps and ring buffers as ordinary Python objects. You write the kernel-side hot path in C and the orchestration, parsing, and reporting in Python. It is the right tool when bpftrace's grammar is too narrow — struct walking, custom output formats, large per-event payloads — and the wrong tool when a bpftrace one-liner already answers the question.

The two-language model — and why it exists

Every BCC program has two halves that run on opposite sides of the kernel boundary. The kernel half is a small C program — typically 20 to 200 lines — that runs in eBPF inside the kernel, attached to a probe, doing the high-rate work: extracting fields out of a struct sock, computing a latency, updating a hash map, optionally pushing a per-event record into a ring buffer. The userspace half is a Python script that compiles the C source at startup (BCC ships a wrapped LLVM/Clang toolchain), instantiates a BPF object that handles attachment and lifecycle, polls the maps or ring buffers for output, and renders the result.

The split exists because eBPF's verifier rejects almost everything you would write in normal C. The verifier requires that your program halt (so for loops are bounded), that every memory access is provably safe (so you call bpf_probe_read_kernel() to read a kernel pointer rather than dereferencing it directly), that the call graph is shallow (no recursion, and only a bounded depth of tail calls), and that you cannot allocate (so all state lives in pre-sized BPF maps). The combination makes the C side resemble a state machine inside a switch statement. The Python side, by contrast, is just Python — it has the heap, the standard library, regex, JSON, argparse, a pandas DataFrame if you want one. By splitting the work across the two languages, BCC keeps the kernel side narrow enough to verify and the userspace side rich enough to be productive.

The phrase "restricted C" is doing real work in that paragraph. eBPF C is a strict subset: no global variables (only BPF maps), no inline assembly outside specific helper macros, no calling arbitrary functions (only the helper set the kernel exposes), no string library, no printf (only the bpf_trace_printk helper, which is rate-limited to a few hundred messages per second across the whole machine and uses a global trace buffer). The first time you write BCC the restrictions feel arbitrary; by the third script they feel natural because they all derive from one rule — the verifier must be able to prove the program halts and never accesses invalid memory in bounded time. Every restriction in the eBPF C dialect is the verifier saying "I cannot prove this construct safe in polynomial time; therefore you cannot write it." The dialect is not C-with-bugs; it is C-with-formal-guarantees.
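What the dialect feels like in practice — an illustrative eBPF C fragment in BCC's dialect, not a standalone program (the probe target and map name are arbitrary), with the verifier-rejected form of each construct left in comments:

```c
BPF_HASH(counts, u16, u64);   /* no globals, no malloc: state lives in a map */

int kprobe__tcp_sendmsg(struct pt_regs *ctx, struct sock *sk) {
    /* Rejected: raw kernel-pointer dereference.
     *     u16 port = sk->__sk_common.skc_num;
     * Accepted: an explicit, fault-handled read through a helper. */
    u16 port = 0;
    bpf_probe_read_kernel(&port, sizeof(port), &sk->__sk_common.skc_num);

    /* Rejected: an unbounded loop (while (p) p = p->next;).
     * Accepted: a loop with a compile-time bound the verifier can check. */
    #pragma unroll
    for (int i = 0; i < 4; i++) {
        /* bounded work per iteration */
    }

    u64 zero = 0, *val = counts.lookup_or_try_init(&port, &zero);
    if (val)
        (*val)++;              /* no printf here; output flows through maps */
    return 0;
}
```

Every accepted form is the rejected form with the proof obligation made explicit: the helper carries the fault handling, the pragma carries the bound, the map carries the allocation.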

[Figure: The BCC two-language architecture — C eBPF program in the kernel, Python orchestrator in userspace. Userspace side: script.py calls BPF(text="...C..."), libbcc compiles the string with a wrapped Clang/LLVM and loads the bytecode via the bpf() syscall; open_perf_buffer(cb) and perf_buffer_poll() feed a callback that ctypes-casts each record and writes to stdout. Kernel side: the verifier and JIT accept the bytecode and attach it to a kprobe (≤512 B stack, no recursion); the attached probe runs on every fire, using bpf_probe_read_kernel() and bpf_perf_event_output(), with BPF maps (HASH, LRU_HASH, ARRAY, PERF_EVENT_ARRAY, RINGBUF) carrying results back.]
BCC's two-language model: a Python orchestrator embeds a C source string, libbcc compiles it through a wrapped Clang to eBPF bytecode, the verifier and JIT turn it into attached machine code, and the kernel program writes results to maps or a ring buffer that Python reads back. The boundary between Python and C is the boundary between "rich, non-real-time, slow" and "narrow, real-time, fast".

Why the compile-at-startup design matters in practice: BCC's BPF(text="...") constructor takes a few hundred milliseconds to several seconds, depending on how big the C source is and whether kernel headers need parsing. On a Razorpay node where an operator spawns a tracing pod for incident response, that startup cost is invisible — the operator types the command and waits. On a long-running daemon that wants to attach a tracer for ten seconds and detach, the startup cost is the dominant overhead. This is one of the two structural reasons libbpf-CO-RE was eventually built (the other is portability across kernel versions): pre-compiling the eBPF bytecode at build time, not script-startup time, removes the LLVM dependency and the multi-second compile from the runtime path. BCC remains the right tool for interactive incident response; libbpf-CO-RE is the right tool for tools shipped in production images.

A real BCC script — TCP retransmit attribution

Below is a self-contained BCC script that solves a problem closely related to Aditi's: it captures every TCP retransmit on the box, attributes it to the local PID and the four-tuple, and prints a stream of records. This is roughly the structure of tcpretrans.py from the BCC tools collection, simplified for clarity. It is the canonical shape of a BCC per-event tool — a per-event struct, a PERF_EVENT_ARRAY for output, a kprobe, and a Python open_perf_buffer callback that formats records.

#!/usr/bin/env python3
# tcpretrans_attrib.py
# Attribute every TCP retransmit on the box to the local PID, command,
# and four-tuple. Useful when you see ECONNRESET storms in the application
# and the application's own metrics insist nothing is wrong.

from bcc import BPF
from socket import inet_ntop, AF_INET
from struct import pack
import ctypes as ct
import sys

bpf_text = """
#include <uapi/linux/ptrace.h>
#include <net/sock.h>
#include <bcc/proto.h>

struct event_t {
    u32 pid;
    u32 saddr;
    u32 daddr;
    u16 sport;
    u16 dport;
    u32 state;
    u64 ts_ns;
    char comm[16];
};

BPF_PERF_OUTPUT(events);

int kprobe__tcp_retransmit_skb(struct pt_regs *ctx, struct sock *sk) {
    struct event_t e = {};
    e.pid    = bpf_get_current_pid_tgid() >> 32;
    e.ts_ns  = bpf_ktime_get_ns();
    bpf_get_current_comm(&e.comm, sizeof(e.comm));
    bpf_probe_read_kernel(&e.saddr, sizeof(e.saddr),
                          &sk->__sk_common.skc_rcv_saddr);
    bpf_probe_read_kernel(&e.daddr, sizeof(e.daddr),
                          &sk->__sk_common.skc_daddr);
    bpf_probe_read_kernel(&e.sport, sizeof(e.sport),
                          &sk->__sk_common.skc_num);
    bpf_probe_read_kernel(&e.dport, sizeof(e.dport),
                          &sk->__sk_common.skc_dport);
    u8 st = 0;  /* skc_state is one byte; a 4-byte read would pull in neighbours */
    bpf_probe_read_kernel(&st, sizeof(st),
                          (const void *)&sk->__sk_common.skc_state);
    e.state = st;
    events.perf_submit(ctx, &e, sizeof(e));
    return 0;
}
"""

class Event(ct.Structure):
    _fields_ = [("pid",   ct.c_uint32),
                ("saddr", ct.c_uint32),
                ("daddr", ct.c_uint32),
                ("sport", ct.c_uint16),
                ("dport", ct.c_uint16),
                ("state", ct.c_uint32),
                ("ts_ns", ct.c_uint64),
                ("comm",  ct.c_char * 16)]

STATES = {1:"ESTAB", 2:"SYN_SENT", 3:"SYN_RECV", 4:"FIN_WAIT1",
          5:"FIN_WAIT2", 6:"TIME_WAIT", 7:"CLOSE", 8:"CLOSE_WAIT",
          9:"LAST_ACK", 10:"LISTEN", 11:"CLOSING"}

def render(cpu, data, size):
    e = ct.cast(data, ct.POINTER(Event)).contents
    src = inet_ntop(AF_INET, pack("I", e.saddr))
    dst = inet_ntop(AF_INET, pack("I", e.daddr))
    dport = (e.dport >> 8) | ((e.dport & 0xff) << 8)
    print(f"{e.ts_ns/1e9:14.3f} {e.comm.decode():<14} {e.pid:>6} "
          f"{src}:{e.sport} → {dst}:{dport} {STATES.get(e.state,'?')}")

b = BPF(text=bpf_text)
b["events"].open_perf_buffer(render, page_cnt=64)
print(f"{'time(s)':>14} {'comm':<14} {'pid':>6} flow state")
print("-" * 80)
try:
    while True:
        b.perf_buffer_poll(timeout=1000)
except KeyboardInterrupt:
    sys.exit(0)

Sample run on a c6i.4xlarge in ap-south-1 acting as a PhonePe gateway pod:

$ sudo python3 tcpretrans_attrib.py
       time(s) comm             pid flow state
--------------------------------------------------------------------------------
 17413210.041 phonepe-gw      8421  10.4.2.18:443 → 10.4.5.91:51284 ESTAB
 17413210.094 phonepe-gw      8421  10.4.2.18:443 → 10.4.5.91:51284 ESTAB
 17413210.241 phonepe-gw      8421  10.4.2.18:443 → 10.4.5.91:51284 CLOSE_WAIT
 17413210.292 nginx          12104  10.4.2.18:443 → 100.64.7.3:54812 ESTAB
 17413211.018 phonepe-gw      8421  10.4.2.18:443 → 10.4.5.99:51301 ESTAB
 17413211.118 phonepe-gw      8421  10.4.2.18:443 → 10.4.5.99:51301 CLOSE_WAIT
 17413211.418 phonepe-gw      8421  10.4.2.18:443 → 10.4.5.99:51301 CLOSE_WAIT
 17413212.041 phonepe-gw      8421  10.4.2.18:443 → 10.4.7.11:51367 CLOSE_WAIT
 17413212.142 phonepe-gw      8421  10.4.2.18:443 → 10.4.7.11:51367 CLOSE_WAIT

Walk-through. bpf_text is a Python string that BCC will hand to its embedded Clang at script startup. Because it is a string, you can templatise it with f-strings, conditionals, or read snippets from disk — a flexibility bpftrace does not give you.

The event_t struct declares the per-event payload that crosses from kernel to userspace; it must match the ctypes.Structure on the Python side byte-for-byte, including padding, which is why low-level BCC scripts often pin the field types with __attribute__((packed)). kprobe__tcp_retransmit_skb uses BCC's name-mangling shortcut: a function named kprobe__<symbol> is auto-attached to the kprobe of <symbol>. The arguments after struct pt_regs *ctx are the kernel function's own arguments, in this case struct sock *sk.

bpf_probe_read_kernel is mandatory — you cannot dereference sk->... directly because the verifier rejects unchecked kernel-pointer accesses; the helper does the read with proper fault handling. events.perf_submit ships the event into the per-CPU perf ring buffer, which the userspace open_perf_buffer is reading.

render is the userspace callback: it casts the raw bytes to the Event ctypes struct, byte-swaps dport (the kernel stores it in network order, like inet_aton output), and prints. The CLOSE_WAIT lines in the output are the bug — three retransmits, each leaving the socket in CLOSE_WAIT, are the signature of an asymmetric close where the peer has half-closed and the local side has not finished its writes.
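The byte-for-byte matching requirement can be exercised without loading any BPF at all, because ctypes applies the same natural-alignment rules as the C compiler on x86-64. A sketch with a fabricated payload (the values echo the sample run; nothing here touches the kernel):

```python
import ctypes as ct
import struct

class Event(ct.Structure):
    # Mirror of the kernel-side struct event_t from the script above.
    _fields_ = [("pid",   ct.c_uint32),
                ("saddr", ct.c_uint32),
                ("daddr", ct.c_uint32),
                ("sport", ct.c_uint16),
                ("dport", ct.c_uint16),
                ("state", ct.c_uint32),
                ("ts_ns", ct.c_uint64),
                ("comm",  ct.c_char * 16)]

# Natural alignment pads the struct: ts_ns (8-byte aligned) cannot start at
# offset 20, so the C compiler and ctypes both insert 4 bytes before it.
assert Event.ts_ns.offset == 24
assert ct.sizeof(Event) == 48

# Build a payload the way the compiler would lay it out ("4x" is the padding)...
raw = struct.pack("=IIIHHI4xQ16s",
                  8421, 0x1202040a, 0x5b05040a, 443, 0x54c8, 8,
                  17413210041000000, b"phonepe-gw")
# ...and decode it exactly as the perf-buffer callback does.
e = ct.cast(ct.create_string_buffer(raw, len(raw)), ct.POINTER(Event)).contents
assert e.pid == 8421 and e.state == 8 and e.comm == b"phonepe-gw"

# dport arrives in network byte order; swap it just as render() does.
dport = (e.dport >> 8) | ((e.dport & 0xff) << 8)
print(dport)  # → 51284, the flow from the sample run
```

If the two struct definitions drift — a reordered field, a u32 where the C side has a u16 — the cast still succeeds and every field silently reads garbage, which is why checking offsets like this is cheap insurance.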

Why the perf ring buffer is the right output channel here, not a hash map: this is a per-event tool, not an aggregation tool. Each retransmit needs to be reported individually, with full context, in roughly real time. A BPF_HASH would force you to aggregate (count by tuple, count by PID), which would lose the temporal ordering — you would never see the "three CLOSE_WAITs in a row on the same flow" pattern that names the bug. The cost of the per-event channel is real — at high event rates (above ~10⁵/s) the perf ring buffer's userspace consumer can fall behind, and BCC will print "Lost N samples" warnings — but for a low-rate, high-context event like TCP retransmits (typically 10–1000/s on a busy gateway pod), the cost is negligible and the per-event detail is the entire product.

Maps, ring buffers, and the choice between them

BCC programs communicate kernel-to-userspace through one of two channels, and picking the right one is most of the design choice for any non-trivial tool.

BPF maps are kernel-resident hash tables, arrays, or LRUs that the eBPF program writes and the userspace program polls or iterates. Use a map when the question is "what is the current state, summed over all events so far". BPF_HASH(latency_us, u64) keyed by some identifier and updated with the elapsed time gives you a per-key latency total; on script exit you iterate the map in Python with for k, v in b["latency_us"].items(): .... The cost is one update per probe firing (a hash insert or atomic increment) and zero userspace cost per firing — the count-in-place, read-once trade-off that perf stat makes.
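The readout side of that pattern is ordinary Python. A sketch with a plain dict standing in for b["latency_us"] so it runs anywhere — in real BCC, table.items() yields ctypes key/value pairs and you read .value off each:

```python
# A plain dict stands in for the BPF table; values fabricated for illustration.
latency_us = {b"phonepe-gw": 48_210, b"nginx": 1_905, b"curl": 42}

# The canonical end-of-script readout: sort by aggregate, render top-N.
top = sorted(latency_us.items(), key=lambda kv: -kv[1])[:10]
for comm, total_us in top:
    print(f"{comm.decode():<16} {total_us:>10} us")
```

The entire per-event cost lived in the kernel as hash updates; this loop runs once, at exit, and is the only userspace work the tool ever does.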

Perf event arrays / ring buffers are queues that the eBPF program writes individual records into and userspace reads continuously. Use a ring buffer when the question is "tell me about each event as it happens, with full context". The cost per firing is a copy of your event_t struct into the ring buffer (typically 50–200 ns) plus the userspace processing cost of formatting and printing the record (typically 1–10 µs in Python). At firing rates above ~100k/s the userspace consumer cannot keep up with the producer and the ring buffer drops events — at which point either you switch to map-based aggregation or you raise the page count of the buffer (the page_cnt=64 argument in the script above).

The newer BPF ring buffer — BPF_RINGBUF_OUTPUT in BCC, available since Linux 5.8 — is a single shared ring across all CPUs, with explicit ordering guarantees and lower per-event overhead than the legacy PERF_EVENT_ARRAY. New tools should prefer it; the older BPF_PERF_OUTPUT/open_perf_buffer API appears in the example above because it is what the long tail of production BCC tools at Razorpay, Cred, and Hotstar still uses, and recognising the older shape matters when you read existing tools.

A pattern worth committing to muscle memory: two-stage tools that combine both channels. Use a map to track per-key state on the high-rate path (entry timestamp keyed by tid, syscall counts keyed by comm) and a ring buffer to emit a per-event record only on the exit probe, when you have already filtered the events down to interesting ones. The hot path stays cheap because the map update is constant-time; the ring buffer stays uncongested because most probe firings never produce a userspace record. tcplife.py, runqlat.py, and biolatency.py all use variations of this pattern.
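A sketch of the kernel half of such a two-stage tool, in BCC's C dialect — the probe target and the 10 ms threshold are illustrative, not taken from any shipped tool:

```c
BPF_HASH(start_ns, u32, u64);          /* stage 1: per-thread entry timestamp */
BPF_PERF_OUTPUT(slow_events);          /* stage 2: rare, pre-filtered records */

struct slow_t { u32 tid; u64 delta_ns; };

int kprobe__vfs_fsync(struct pt_regs *ctx) {
    u32 tid = bpf_get_current_pid_tgid();
    u64 ts = bpf_ktime_get_ns();
    start_ns.update(&tid, &ts);        /* constant-time; fires on every call */
    return 0;
}

int kretprobe__vfs_fsync(struct pt_regs *ctx) {
    u32 tid = bpf_get_current_pid_tgid();
    u64 *tsp = start_ns.lookup(&tid);
    if (!tsp)
        return 0;                      /* missed the entry; drop quietly */
    u64 delta = bpf_ktime_get_ns() - *tsp;
    start_ns.delete(&tid);
    if (delta < 10 * 1000 * 1000)      /* in-kernel filter: only >10 ms */
        return 0;                      /* most firings never reach userspace */
    struct slow_t ev = { .tid = tid, .delta_ns = delta };
    slow_events.perf_submit(ctx, &ev, sizeof(ev));
    return 0;
}
```

The entry probe pays only a map update; the ring buffer sees only the events that survive the threshold, which is what keeps the consumer from falling behind.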

The map types BCC exposes are themselves worth a short tour, because picking the wrong type wastes memory or bottlenecks under contention. BPF_HASH is the default — an unordered hash table with a fixed maximum size; insertions past the size silently fail. BPF_LRU_HASH evicts the least-recently-used entry on insertion, which is what you want for caches keyed on something unbounded (per-flow, per-IP). BPF_ARRAY is a flat array indexed by integer — fast, cache-friendly, but only useful when keys are dense small integers (CPU id, syscall number). BPF_PERCPU_HASH and BPF_PERCPU_ARRAY keep one copy per CPU, eliminating cross-CPU atomic contention on counters at the cost of needing a userspace sum across CPUs at readout — the right choice for high-rate counter-only workloads. BPF_STACK_TRACE stores stack-trace IDs that map to actual PC sequences, enabling stack-keyed aggregations like @[stackid] = count(); for flamegraph-from-kprobes generation. Picking the wrong type rarely causes correctness bugs, but it can turn a 1% overhead tool into a 10% one — which means the tool changes the system enough that the bug it was meant to find moves.
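The per-CPU trade-off shows up concretely at readout: each key's value is a per-CPU list, and userspace owns the sum. A sketch with fabricated numbers (a four-CPU box assumed; in real BCC, reading one key of a BPF_PERCPU_HASH yields one counter per CPU):

```python
# What the values of a per-CPU table look like from Python: one slot per CPU.
per_cpu = {
    b"phonepe-gw": [1_204, 0, 987, 3_311],
    b"nginx":      [88, 12, 0, 41],
}

# The kernel side paid no cross-CPU atomic cost; userspace pays one sum per key.
totals = {comm: sum(cpus) for comm, cpus in per_cpu.items()}
print(totals[b"phonepe-gw"])  # → 5502
```

The zeros are normal — a thread that never ran on CPU 1 leaves that slot untouched — and the readout-time sum is the entire price of contention-free counting.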

[Figure: BPF map vs ring buffer — pick by question shape. Left, BPF_HASH ("sum over all events so far"): the probe increments a per-key slot (key="phonepe-gw" → 41, key="nginx" → 7, ...) and Python iterates the table at exit. Right, PERF_EVENT_ARRAY ("tell me each event"): the probe calls perf_submit() on every fire and Python polls the queue of records continuously.]
The first design decision in any BCC tool is which output channel to use. Maps win when the answer is an aggregate; ring buffers win when the answer needs per-event detail. Two-stage tools combine both — map for the high-rate path, ring buffer only for the rare events that pass the filter.

An idiom worth calling out: BCC's counterpart to the @ maps bpftrace gives you is the BPF_HASH(name, key_type, value_type, max_entries) macro. The macro expands at compile time into a map definition that the verifier recognises and the runtime allocates. That expansion is what gives BCC its tighter control over map types — you specify BPF_TABLE("lru_hash", u64, struct flow_state, flows, 65536); and you get an LRU hash with 65,536 slots, automatic eviction, and a fixed memory footprint. bpftrace would have given you a hash that grows toward BPFTRACE_MAP_KEYS_MAX and silently drops at the limit. This level of control matters when the script will run for hours on a busy box — the difference between a tool that gracefully degrades under load and one that produces lies after the first ten minutes.
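A minimal sketch of the templating move, using the hypothetical --port 443 filter mentioned earlier — only the string building is shown; no BPF is compiled or loaded:

```python
def render_bpf_text(port=None):
    # Bake the filter into the C source at startup: if a port is given, the
    # kernel program drops non-matching events before they cost anything.
    filter_clause = f"if (e.dport != htons({port})) return 0;" if port else ""
    return f"""
    int kprobe__tcp_retransmit_skb(struct pt_regs *ctx, struct sock *sk) {{
        struct event_t e = {{}};
        /* ... field extraction as in the full script ... */
        {filter_clause}
        events.perf_submit(ctx, &e, sizeof(e));
        return 0;
    }}
    """

src = render_bpf_text(443)
assert "htons(443)" in src                     # clause baked in at render time
assert "htons" not in render_bpf_text(None)    # and absent when unfiltered
```

The rendered string is what the embedded Clang sees — which is also why compile errors report line numbers against the rendered source, not your template.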

Failure modes — the four ways BCC scripts go wrong in production

Every BCC author eventually meets each of these, and recognising them on first sight saves hours of debugging.

The verifier rejection at load time. You run the script, BCC compiles cleanly, then the kernel rejects the bytecode and BPF() raises an exception. The verifier log — which BCC prints on failure — is a wall of bytecode-level annotations ending with a reason: R3 invalid mem access 'inv', back-edge from insn 42 to 17, program exceeds 1000000 instructions. The fix follows the rejection class: invalid memory access means a missing bpf_probe_read_kernel(); back-edge means an unbounded loop; instruction-count exceeded means the inlined helper expansion got too large and you need to split the function. The diagnostic loop is the same every time — read the rejection, find the C line, fix it, rerun.

The verifier's instruction-count limit deserves a closer note because it is the failure mode most likely to bite when scripts grow. Before Linux 5.2 the limit was 4,096 instructions, after which it was raised to 1,000,000. That sounds generous; in practice an inner loop with a few bpf_probe_read calls expands through helper inlining to dozens of instructions per source line, and a 50-line C function can hit the cap. The split-the-function fix uses BPF's tail-call mechanism — one program calls another via a program-array map, and the verifier checks each program independently against the limit. BCC exposes this via BPF_PROG_ARRAY and prog_array.call(ctx, idx). Larger tools reach for tail calls when the path through the program is long enough to matter; for typical 50-line scripts the limit is invisible.
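The shape of the tail-call split, as an illustrative BCC C fragment (program and map names are made up):

```c
BPF_PROG_ARRAY(stages, 4);         /* slots holding the follow-on programs */

int stage0(struct pt_regs *ctx) {
    /* ...first chunk of the work; verified against the limit on its own... */
    stages.call(ctx, 1);           /* tail-call into slot 1; does not return */
    return 0;                      /* reached only if slot 1 is empty */
}

int stage1(struct pt_regs *ctx) {
    /* ...second chunk; also verified independently... */
    return 0;
}
```

On the Python side, stage1 is loaded with b.load_func() and its fd written into slot 1 of the array before stage0 is attached — the verifier never sees the two halves as one program, which is the whole point of the split.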

The kernel-version drift. You write a script that reads sk->sk_state on a 5.10 kernel; on a 6.1 box the same script only works because BCC recompiles it there against that kernel's headers — if the headers are missing, stale, or from a different build, the compiled offsets no longer match the running struct sock (which was reorganised between the two versions) and the script fails to load or reads the wrong bytes. The fix on BCC is to use the kernel's running headers (which is why linux-headers-$(uname -r) is in every BCC install command) and to recompile on the target kernel. The fix on libbpf-CO-RE is to use BPF_CORE_READ() macros that emit BTF-relocatable accesses; this is the single biggest reason production tooling teams migrate from BCC to libbpf-CO-RE. For investigation tools that run on one host, BCC's behaviour is fine; for tools shipped to a hundred hosts, it is the wrong abstraction.

The drift is not theoretical. Between Linux 5.4 and 6.1 — the kernel range a typical Indian fintech runs across its fleet because old EKS node pools and new ones coexist — struct sock gained new fields, the surrounding layout was reshuffled to make room, and several fields moved by 4 to 16 bytes. A BCC script that writes bpf_probe_read_kernel(&dst, sizeof(dst), &sk->__sk_common.skc_state) survives all these moves because the offset is resolved at script startup against the running kernel's headers, but the same script frozen as compiled bytecode would silently read garbage. This is why running BCC scripts in a CI pipeline that compiles on one kernel and runs on another is dangerous; the compile and the run must be on matching kernel versions.

The lost-samples ring-buffer overrun. The script runs, output flows for a while, then BCC starts printing Possibly lost 4096 samples. The producer is filling the ring buffer faster than the userspace callback can drain it. Three fixes, in order of preference: filter earlier in the kernel C side so fewer events make it to the ring buffer (most common); raise page_cnt= on open_perf_buffer (multiplies the buffer size, but only buys time, not throughput); switch from per-event ring buffer to in-kernel BPF_HASH aggregation so the userspace cost drops to one read at script-end. The third option is what most production-grade tools end up doing once the rate scales above ~10⁵/s.

The math on page_cnt is worth understanding. Each page is 4 KB, and the buffer is per-CPU. With page_cnt=64 (the typical default for tools doing per-event reporting) on a 16-core box, the total ring buffer footprint is 64 × 4 KB × 16 = 4 MB. That holds roughly 40,000 events of a 100-byte record before overrun. At 10⁴ events/s the buffer holds 4 seconds of slack — enough to ride out a Python GC pause or a brief CPU stall. At 10⁵ events/s the slack is 400 ms — enough only if userspace drains every iteration. Above 10⁵ events/s the buffer is structurally inadequate and the only fix is to drop events at the kernel side via tighter filters, or to switch to aggregation. There is no page_cnt value that saves a tool whose producer rate exceeds its consumer rate by a steady margin; the buffer is a shock absorber, not a solution to a flow imbalance.
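The arithmetic generalises into a three-line capacity check worth running before picking page_cnt (the function name is ours; the numbers are the paragraph's):

```python
PAGE = 4096  # bytes per perf-buffer page

def ringbuf_slack_s(page_cnt, ncpu, record_bytes, events_per_s):
    """Seconds of producer burst the perf buffer can absorb before overrun."""
    capacity_events = (page_cnt * PAGE * ncpu) // record_bytes
    return capacity_events / events_per_s

# 64 pages x 4 KB x 16 CPUs = 4 MB, holding ~40k records of ~100 bytes.
print(round(ringbuf_slack_s(64, 16, 100, 10_000), 1))    # → 4.2 seconds of slack
print(round(ringbuf_slack_s(64, 16, 100, 100_000), 2))   # → 0.42 seconds
```

The output makes the paragraph's point numerically: doubling page_cnt doubles the slack, but no value of it rescues a tool whose steady-state producer rate exceeds its consumer rate.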

The missing-headers install failure. apt install bpfcc-tools succeeds, the script starts, and then BCC complains that it cannot find linux/types.h or net/sock.h. The fix is linux-headers-$(uname -r) — BCC's runtime compile needs the kernel's own headers to resolve struct definitions. On a stripped container image that omitted headers to save disk, BCC will not work; a sidecar pattern (run BCC in a separate debug-image pod that shares the host's /sys/kernel/debug and /proc) is the production-friendly answer.

The sidecar pattern deserves its own paragraph because it is what most Indian fintechs land on once their fleet is fully containerised. The application pod ships with no kernel headers and no LLVM — keeping the image small. A separate debug pod, scheduled on the same node via a nodeSelector, ships with linux-headers, bpfcc-tools, the BCC scripts the team wants on the box, and the privileges to run them (SYS_ADMIN, SYS_RESOURCE, BPF capabilities). The application pod and the debug pod share the host PID namespace via hostPID: true and /sys/kernel/debug via a hostPath volume. When an SRE needs to attach BCC, they kubectl exec into the debug pod, run the script, and see the entire host's events. The cost is one extra pod per node (typically 100MB image, dormant); the benefit is that the application image stays clean and the debug capability is always one exec away.
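A minimal sketch of such a debug pod — the image name, node selector value, and exact capability list are illustrative assumptions, not a vetted production manifest:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: bcc-debug
spec:
  hostPID: true                       # share the host PID namespace
  nodeSelector:
    kubernetes.io/hostname: node-7    # pin next to the pod under investigation
  containers:
  - name: bcc
    image: internal-registry/bcc-debug:latest   # headers + bpfcc-tools baked in
    command: ["sleep", "infinity"]    # dormant until an SRE kubectl-execs in
    securityContext:
      capabilities:
        add: ["SYS_ADMIN", "SYS_RESOURCE", "BPF"]
    volumeMounts:
    - name: sys-kernel-debug
      mountPath: /sys/kernel/debug
  volumes:
  - name: sys-kernel-debug
    hostPath:
      path: /sys/kernel/debug
```

On kernels older than 5.8 the CAP_BPF capability does not exist and SYS_ADMIN alone carries the load; some clusters simply run the debug pod privileged and gate access with RBAC instead.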

Why each of these is a structural property, not a bug: BCC made the deliberate choice to compile-on-target, which buys runtime flexibility (your script can read struct fields by name from the running kernel's headers) at the cost of every failure mode above. libbpf-CO-RE made the opposite choice — pre-compile, embed a BTF relocation table — which removes the headers dependency and the version drift but adds build-time complexity and forces every program to opt into BTF. The two trade-offs are not better or worse; they are pointed at different deployment models. Knowing which model your problem fits is what separates a working SRE from one who knows BCC syntax.

Where BCC fits between bpftrace and libbpf

BCC, bpftrace, and libbpf are three layers on the same eBPF substrate, and choosing among them is mostly a question of where on the brevity-versus-control axis you want to be. The earlier chapter on bpftrace covered the high-brevity end; libbpf-CO-RE is the high-control end. BCC sits in the middle and earns its place because the middle is where most production tools live.

A bpftrace one-liner is the right choice when the answer fits in a single clause and a single map type. bpftrace -e 'kprobe:tcp_retransmit_skb { @[comm] = count(); }' answers "which processes are retransmitting" in one line, in 30 seconds, with no compile step. The trade-off is that bpftrace's grammar does not let you walk arbitrary structs, define your own packed payloads, or run rich Python on the userspace side. The instant the question grows a join, a multi-field per-event payload, or a sort-and-format step, bpftrace stops being enough.

A useful exercise to internalise the boundary: write the same observation twice, once in bpftrace, once in BCC. Take "count tcp_retransmit_skb calls per process". The bpftrace version is the one-liner above. The BCC version is fifteen lines of Python embedding eight lines of C. They produce the same answer in the same time. Now extend the observation: "for each retransmit, also report the four-tuple and the socket state". The bpftrace version becomes a printf per event with five args-> accesses and starts hitting the limits of bpftrace's printf formatting. The BCC version adds five fields to its event_t struct and one bpf_probe_read_kernel per field — 100 lines total, fully structured, easy to extend. The exercise teaches the boundary in 30 minutes; the boundary is exactly where bpftrace's printf stops being enough.

BCC takes over there. You write the C exactly as you would in bpftrace's action body — kprobes, helpers, maps — but you also get a full Python program around it. You can read CSV configuration files, query a MySQL database for the list of PIDs you care about, post results to a Slack webhook, render a histogram with matplotlib, package the whole thing as a pip-installable tool. The cost is the LLVM compile at startup (1–3 seconds typical, occasionally up to 10 on slow boxes), the requirement that the kernel headers be installed (which is why production fleets often ship a custom kernel-devel package), and the fact that BCC scripts are tied to the running kernel's struct layouts — a script written against a 5.10 struct will fail on 6.1 if struct sock was reshuffled.

libbpf-CO-RE removes the runtime compile and the kernel-version pinning by pre-compiling at build time and using BTF (BPF Type Format) relocations to handle struct-layout differences at attach time. The cost is more boilerplate (you write the C, the BPF skeleton header, and the userspace program separately), longer build times during development, and a learning curve that takes weeks rather than hours. For shipped production tooling that runs on a fleet of mixed kernel versions, libbpf-CO-RE is the right answer; for incident response and one-off investigations, BCC remains faster to iterate on.

The CO-RE acronym is worth unpacking because it names the actual mechanism. Compile Once, Run Everywhere is a play on Java's "write once, run everywhere" — except instead of a JVM at runtime, the kernel's BTF (a compact debug-info format that ships with every modern kernel) lets a pre-compiled BPF program ask, at load time, "where is the field sk_state in the running kernel's struct sock?" and patch its own offsets. The patching is done by libbpf in userspace, just before the program is handed to the kernel for verification. This is what removes BCC's runtime LLVM dependency. The catch is that the BPF program's source has to use special accessor macros (BPF_CORE_READ, bpf_core_field_offset) so that the offsets are recognisable as relocation targets; ordinary bpf_probe_read_kernel(&dst, sizeof(dst), &sk->...) is not CO-RE-relocatable. Migrating a BCC tool to libbpf-CO-RE is therefore not a syntactic translation; it is a refactor that touches every kernel-pointer dereference in the program.
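The refactor in miniature — the same one-byte read written both ways (an illustrative fragment; sk is a struct sock * already in scope):

```c
/* BCC style: compiles against the running kernel's headers; the offset of
 * skc_state is frozen into the bytecode at script startup. */
u8 state = 0;
bpf_probe_read_kernel(&state, sizeof(state), &sk->__sk_common.skc_state);

/* libbpf-CO-RE style: the access is recorded as a BTF relocation, and
 * libbpf patches the real offset in at load time on whatever kernel runs it. */
u8 state2 = BPF_CORE_READ(sk, __sk_common.skc_state);
```

Multiply this one-line change by every kernel-pointer dereference in a tool and you have the true size of a BCC-to-CO-RE migration.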

The decision rule, for a working SRE: bpftrace for the first ten minutes, BCC if the question won't fit in bpftrace's grammar, libbpf-CO-RE only when you're building a tool that will run on someone else's machine. Most SREs at Razorpay, Hotstar, and Zerodha use bpftrace and BCC daily; libbpf-CO-RE is the realm of the platform-tooling team that ships the same tool to a hundred production hosts with mixed kernels.

A worked example of the boundary: imagine you want to trace, per database query, the time spent in the kernel's TCP send path versus the time spent in the application. The question has three parts — attach to a USDT probe in the database to mark the query boundary, attach to kprobe:tcp_sendmsg to capture the kernel time, correlate the two with a per-thread map — and it has a per-event payload (query SQL hash, elapsed kernel ns, elapsed app ns). bpftrace can attach to all three probes and key a map by tid, but emitting the per-event record with the SQL hash forces you into bpftrace's printf(), which gives you a line of stdout per event with no structure. BCC handles the same problem cleanly: the C side does the same probe attachments and map updates, but the per-event payload goes through a perf ring buffer to a Python callback that can write CSV, push to InfluxDB, or render to a real-time terminal dashboard. The difference is not "BCC is more powerful" — both tools touch the same kernel facilities — but "BCC has Python at the edge", and Python at the edge is the difference between a one-shot incident tool and a persistent observability tool that anyone on the team can extend.

A second worked boundary: the eBPF community's own tool migrations. The bcc/tools/ directory was historically the canonical home of production tracing tools at Netflix, Facebook, and the Linux kernel community. Starting around 2020 the same tools began to be rewritten as libbpf-tools/ — the same UX, the same output, the same probes, but pre-compiled and CO-RE-relocatable. The migration is roughly two-thirds done in 2026; tools like execsnoop, opensnoop, biosnoop exist in both forms. The choice for an SRE today is rarely "which tool to use" but "which version of the tool happens to be installed on this host", and either form will give you the same answer. The reason to know the BCC version is that when something is broken about the tool — when it fails to compile, when it loses samples, when it crashes — the BCC version's failure mode is something you can debug from inside the script (it's Python) while the libbpf version's failure mode often requires rebuilding the tool. For incident response, debuggability beats portability.

Going deeper

The BCC tools collection — the canon

The bcc/tools/ directory in the upstream repository ships over 100 production-grade tools, each in the same shape as the script above: an embedded C eBPF program, a Python orchestrator, command-line arguments, a help string, a man page. The ones every SRE should know by name are runqlat.py (scheduler run-queue latency histogram), tcpretrans.py (TCP retransmit attribution, the pattern this chapter elaborated), biosnoop.py (per-block-IO event trace with PID and latency), execsnoop.py (every execve() on the box, with command line), opensnoop.py (every open() syscall, with filename), tcplife.py (per-connection lifetime tracker: endpoints, duration, bytes transferred), mysqld_qslower.py (slow query attribution at the MySQL USDT probe level), mallocstacks.py (malloc bytes attributed by userspace stack), slabratetop.py (kernel slab allocation rate, top consumers), softirqs.py (softirq time per CPU per softirq class). Reading these in alphabetical order over a fortnight is the same fluency-building exercise as reading the bpftrace tools directory; the BCC tools have richer userspace logic and are worth the closer read for engineers planning to ship their own tools.

Why these tools are still relevant despite libbpf-CO-RE: most of the BCC tools work, today, on every production Linux box at every Indian fintech, and replacing them with libbpf-CO-RE versions is a slow process — the libbpf-tools rewrite of the BCC collection is ongoing in 2026 but not complete. For an SRE answering a page tonight, the BCC version is the one installed and the one with documented behaviour. The eventual migration to libbpf-CO-RE will preserve tool names and command-line interfaces, so muscle memory carries forward; only the failure mode under kernel upgrades changes.

Templating the C source from Python

A pattern that gives BCC its real expressive power: building the C source as a Python f-string with parameters baked in at script startup. The TCP retransmit script could accept a --port 443 argument and bake it into the filter as if (e.dport != htons(443)) return 0;, eliminating the runtime conditional in the kernel program. More usefully: a script could read a list of PIDs from a YAML config, compile a different filter clause per PID, and run them all as separate kprobe attachments on the same script. This is impossible in bpftrace and exactly the reason BCC's design choice — Python building C strings — pays off. The cost is that errors in the templated C produce confusing line numbers (the compiler sees the rendered string, not your template); the discipline is to log the rendered C source on compile failure with BPF(text=src, debug=DEBUG_PREPROCESSOR | DEBUG_SOURCE).

A concrete shape: a Hotstar SRE writing a per-pod TCP latency tracer can read the list of pod IPs from the Kubernetes API at script startup, render a C if ladder that only emits events for those pods, and attach the kprobe — all within the same Python process. Without templating, the C side would have to read every event and filter on a BPF_HASH of pod IPs, paying a hash lookup per event. With templating, the filter is a few cmp instructions in the JITed code, paying nothing at runtime. The programming model — Python configuring C at startup, then C running in the kernel at full speed — is exactly the model that makes BCC the right tool for tools that have to be both rich and fast.
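The rendering step of that pattern can be sketched as a pure Python function. The template shape, the addr_wanted helper, and the daddr field name are illustrative assumptions — the real C would read the address out of the event struct your probe populates:

```python
import socket
import struct

# Hypothetical C skeleton; the %s slot receives the generated if ladder.
C_TEMPLATE = """
static inline int addr_wanted(u32 daddr) {
%s
    return 0;
}
"""

def render_filter(pod_ips):
    """Render a C if-ladder that returns 1 only for the given pod IPs."""
    clauses = []
    for ip in pod_ips:
        # Bake each address in as the native-u32 reading of its network-order
        # bytes — the same in-memory value the kernel's socket fields hold.
        be32 = struct.unpack("=I", socket.inet_aton(ip))[0]
        clauses.append(f"    if (daddr == {be32:#010x}) return 1;  /* {ip} */")
    return C_TEMPLATE % "\n".join(clauses)

src = render_filter(["10.0.3.17", "10.0.3.18"])
print(src)  # log the rendered source so compile errors point at real lines
```

At script startup the rendered string is what gets handed to BPF(text=src); the filter then costs a few JITed compare instructions per event rather than a map lookup.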

Three Python orchestration idioms recur across well-written BCC tools and pair well with templated C. open_perf_buffer with a closure-captured callback lets you keep render state (cumulative counts, rolling windows, output formatters) in Python without globals; the callback closes over a context object initialised in main(). signal.signal(signal.SIGINT, on_exit) with a clean-up handler that iterates each map and prints a final summary turns Ctrl-C from "abrupt termination" into "produce the exit report"; tools like biolatency.py use this to dump the final histogram. b.attach_kprobe(event="...", fn_name="...") instead of name mangling lets you attach the same C function to multiple kprobes from Python (a single count_event C function attached to ten different syscall entry points), which keeps the C side small and pushes the multi-attach logic to Python where it belongs. Each of these is a small idiom; together they are the difference between a 200-line script that stays readable as it grows and one that balloons into a 2,000-line script the next person cannot read.

The verifier rejection diagnostic loop

Every BCC author eventually meets the verifier rejection: the C compiles cleanly, the kernel rejects the program at load time, and the error message is a wall of bytecode-level annotations from the verifier log. The diagnostic loop has three steps. First, run with BPF(text=src, debug=DEBUG_BPF) so the compiled bytecode is printed to stderr alongside the rejection — this lets you line up the offending instruction with the rejection reason. Second, look for a line of the form R<N> invalid mem access — that is almost always a missing bpf_probe_read_kernel() on a pointer dereference. Third, when the rejection mentions "back-edge" or "infinite loop", a for loop or a while loop has a non-constant bound; rewrite with a constant bound or use bpf_loop() (Linux 5.17+). The verifier's rejections are precise but unfriendly; building a mental library of the dozen common shapes is the work of a fortnight and pays off forever.

When BCC is the wrong tool

Three classes of problem look like BCC problems but aren't. Userspace-only profiling — flamegraphs of pure CPython, pure Go, pure JVM workloads — is better served by language-specific samplers (py-spy, pprof, async-profiler) that understand the runtime's stack-walking conventions. BCC can attach to userspace symbols via uprobes, but the per-event cost of a userspace probe is much higher than a sampling profiler's stop-and-walk-stack approach, and the runtime-aware tools produce richer flamegraphs because they understand interpreter frames. Distributed tracing — following a request across services — is the realm of OpenTelemetry, not BCC; while you can use eBPF to inject correlation IDs at the syscall layer, you are reinventing a wheel that the OpenTelemetry SDKs already turn. High-cardinality metrics — per-customer SLO tracking with millions of distinct tenants — fits Prometheus or VictoriaMetrics better than a BCC tool, because the storage layer is built for this and BCC's maps will overflow at the scale you need. The skill is recognising when the question is shaped right for BCC: any time the answer requires kernel-internal state that the existing per-language tools cannot see.

Reproduce this on your laptop

# Reproduce the TCP retransmit attribution demo on Linux (5.8+ recommended).
sudo apt install bpfcc-tools python3-bpfcc linux-headers-$(uname -r)
# (BCC's Python bindings are installed system-wide by the apt package; no pip
#  install or virtualenv is needed — sudo python3 below runs the system interpreter)
sudo python3 tcpretrans_attrib.py
# Generate retransmit pressure (in another terminal) with:
#   sudo tc qdisc add dev lo root netem loss 30%
#   for i in $(seq 50); do curl -sk -m 5 https://localhost:443/; done
#   sudo tc qdisc del dev lo root

One operational note before the close: privileges. BCC requires CAP_BPF plus CAP_PERFMON (Linux 5.8+) or, on older kernels, CAP_SYS_ADMIN, plus access to /sys/kernel/debug/tracing/. On a hardened production host these are not granted to ordinary users; running BCC scripts means either sudo or running inside a privileged sidecar pod. The hardening trade-off is real: opening a host to BCC means giving anyone with that access the ability to read arbitrary kernel memory through a kprobe, which is the same access a kernel module would give. Most fintechs gate BCC behind break-glass procedures — an SRE on call gets the privilege via an ephemeral cert with a 1-hour TTL, and every BCC invocation is audit-logged. The same security model that gates kernel-module insertion is the right model for eBPF observability tooling.
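A preflight check for those privileges can be done from Python before compiling anything, by decoding the effective capability mask in /proc/self/status. A sketch — the helper names are mine, but the capability numbers are stable kernel constants (CAP_SYS_ADMIN is 21, CAP_PERFMON 38, CAP_BPF 39):

```python
CAP_SYS_ADMIN, CAP_PERFMON, CAP_BPF = 21, 38, 39

def effective_caps(status_text):
    """Parse the CapEff line of /proc/<pid>/status into a set of cap numbers."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            mask = int(line.split()[1], 16)
            return {n for n in range(64) if mask & (1 << n)}
    return set()

def can_run_bcc(caps):
    # Either the legacy catch-all capability, or the 5.8+ split pair.
    return CAP_SYS_ADMIN in caps or {CAP_BPF, CAP_PERFMON} <= caps

# Usage, failing fast with a readable message instead of a load error:
#   with open("/proc/self/status") as f:
#       if not can_run_bcc(effective_caps(f.read())):
#           raise SystemExit("need CAP_BPF+CAP_PERFMON or CAP_SYS_ADMIN")
```

Failing fast here turns an opaque bpf() load error into a one-line message the on-call SRE can act on.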

Where this leads next

This chapter introduced BCC; the rest of Part 6 widens the lens. Chapter 42 (kprobes vs tracepoints) is the deep dive into the kernel-side hook taxonomy that bpftrace and BCC both attach to — when to prefer the ABI-stable tracepoint and when the kprobe is your only option. Chapter 43 (uprobes and USDT) covers the userspace symmetric: attaching to functions inside Postgres, OpenJDK, Python, or your own service binary. Chapter 44 covers per-event delivery via the modern ring buffer at production scale. Chapter 45 (eBPF latency histograms in production) is where these tools graduate from incident-response to always-on observability.

The single insight to carry forward: BCC is not a competitor to bpftrace; it is the next step up the staircase when the question is too rich for a one-liner but not yet shipped product. The brevity ceiling of bpftrace and the boilerplate floor of libbpf-CO-RE leave a wide middle that BCC owns. For an SRE at a place like PhonePe, becoming fluent in both bpftrace and BCC means that any question about kernel behaviour can be turned into a runnable answer in minutes — not because either tool is magical, but because the boundary between "I have a question" and "I have output" is a single Python file you have already written six times this month.

A practical next step: pick any of the BCC tools (opensnoop.py is a good first read because it is short and its output is immediately interesting), read the embedded C, read the Python orchestrator, and rewrite the Python half to emit JSON instead of human-readable text. The exercise is forty minutes and forces you to engage with the ctypes-to-struct-to-format pipeline that is the actual product of BCC. After three of these you have internalised the shape, and the next time you need to answer a kernel question that does not fit in a bpftrace one-liner, the BCC scaffold writes itself.
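The core of that exercise — the ctypes-to-struct-to-format pipeline — looks like this. A sketch only: the struct below is a simplified, hypothetical stand-in for opensnoop's real event (which carries pid, comm, filename, return value, and flags):

```python
import ctypes
import json

class OpenEvent(ctypes.Structure):
    # Simplified stand-in; the real layout must match the tool's C struct.
    _fields_ = [
        ("pid", ctypes.c_uint32),
        ("ret", ctypes.c_int32),
        ("comm", ctypes.c_char * 16),    # fixed-size, NUL-terminated
        ("fname", ctypes.c_char * 64),
    ]

def event_to_json(ev):
    """Turn one ctypes event into a JSON line, decoding the C strings."""
    return json.dumps({
        "pid": ev.pid,
        "ret": ev.ret,
        "comm": ev.comm.decode(errors="replace"),
        "fname": ev.fname.decode(errors="replace"),
    })

# Swapping the tool's print() for print(event_to_json(ev)) in the perf-buffer
# callback is the whole rewrite; downstream, jq and log shippers now work.
```

ctypes returns a c_char array as bytes truncated at the first NUL, so the decode step is all that separates the kernel's fixed-size buffers from clean JSON.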

The deeper habit to build, beyond the syntax, is the question of which channel. Every kernel-observability problem reduces to one of two shapes: aggregate over many events into a small summary, or report each interesting event with full context. Once you can name which shape your problem is, the tool choice falls out — bpftrace for aggregations that fit one clause, BCC with BPF_HASH for richer aggregations, BCC with a perf buffer or ring buffer for per-event reporting, libbpf-CO-RE for any of the above when the tool needs to ship to other people's machines. The Aditi opener of this chapter is what fluency looks like at PhonePe scale: a four-hour mystery becomes a thirty-line tool, the tool's output makes the bug obvious, and the next time the same shape of bug shows up the same scaffold is reused with three lines changed. The product is not the tool; the product is the speed at which the operator can move from "something is wrong" to "I am looking at the data that shows what is wrong". BCC's combination of restricted C in the kernel and full Python in userspace is what makes that speed possible for the questions that one-liners cannot answer.

A final habit worth forming: keep a ~/.bcc-tools/ folder with the four or five BCC scripts you have written or adapted for your stack, version-controlled. Every Indian fintech ops team I have seen do this well has the same shape — a tcpretrans.py adapted for their pod IP layout, a runqlat-pid.py filtered to the application's PID, an oomkill.py extended to ship the captured stack to Slack. The folder is the team's institutional memory; new SREs read it on day one and learn the shape of the system through the questions previous SREs needed to ask.

References