Why eBPF changed the game
It is 03:42 IST in the Bengaluru SRE bay, the Hotstar live-stream-edge service is dropping connections during the IPL final, and a senior engineer types bpftrace -e 'tracepoint:tcp:tcp_retransmit_skb { @[comm] = count(); }' at a shell prompt on a production Kubernetes node — no agent restart, no module compilation, no kernel reboot, no ticket to the infra team. Eight seconds later a histogram prints: 14,200 retransmits attributed to the nginx worker pod that the autoscaler had just packed onto a host running a noisy backup job. Five years earlier the same investigation would have meant a perf record window, an offline perf report, and a guess. Two years earlier it would have been a kernel module that nobody had time to write at 03:42. The fact that a one-line probe could attach to a kernel hot path on a production host, during the year's biggest traffic event, without violating any production safety rule is the technology shift this chapter is about.
eBPF lets you run small, verifier-checked programs inside the running Linux kernel, attached to kernel functions, tracepoints, syscalls, sockets, or userspace probes — without modules, reboots, or per-event syscalls. Three pieces make it work: a verifier that proves the program is bounded and memory-safe before it loads, a JIT that compiles the bytecode to native machine code, and BPF maps that share state between in-kernel programs and userspace consumers. The combination filled the empty quadrant of the previous chapter (deep + safe + always-on) and turned kernel-level observability into a production-grade primitive Indian platform teams now run by default.
The four properties classical tools could not combine
The previous chapter — /wiki/wall-kernel-level-observability-is-a-different-world — drew a 2×2 grid of pre-eBPF kernel-observation tools: kernel modules (deep, unsafe), strace (per-process, 50–500× slow), perf record (deep but offline), /proc polling (always-on but coarse). The empty upper-right quadrant — deep, safe, and always-on at once — is what every production SRE wanted and what no Linux primitive offered. eBPF closed it because it satisfies four properties simultaneously, none of which the classical tools could combine.
Why the conjunction matters more than any single property: an in-kernel program that is unsafe is just a kernel module. A safe program that runs offline is just perf record. A safe in-kernel program with no shared-state ABI is a closed black box that cannot drive a Prometheus exporter. eBPF's significance is that the four properties compose. Each piece alone existed in some pre-eBPF tool; the combination is what production teams at Razorpay, Zerodha, Hotstar, and Cred adopted starting around 2020, because the combination is what made always-on kernel observability operationally affordable.
A concrete consequence Indian SRE teams have lived: in 2018 a Razorpay payments incident that involved kernel-level signal — a TCP retransmit storm during a UPI peak — required either booking a kernel-engineering specialist's time to read a perf capture, or accepting "we don't fully know why" in the postmortem. By 2022 the same shape of incident produced an answer in the same shift, because bpftrace -e 'tracepoint:tcp:tcp_retransmit_skb { @[ntop(args->daddr)] = count(); }' ran live, attributed retransmits to the destination IP, and pointed at one rack inside one availability zone. The MTTR went from "next quarter's postmortem" to "before the page is acknowledged" — and the only thing that changed was eBPF.
What an eBPF program actually is
The shortest honest definition: an eBPF program is a sequence of bytecode instructions, drawn from a 64-bit virtual ISA, that the kernel verifies, JIT-compiles to native code, and runs in response to events at hook points you specify. Each part of that sentence does work. The bytecode is not source code — it is a fixed, documented instruction set with 11 64-bit registers, deliberately small enough to verify. The verification is mechanical, not optional — the kernel will refuse to load a program whose proof of bounded termination and valid memory access fails. The JIT means the program runs at native CPU speed once loaded, not interpreted. The hook points are the kernel functions, tracepoints, syscall entries, network device callbacks, and userspace probes that the kernel has been instrumented to call out to.
You almost never write the bytecode by hand. You write C, Rust, or a bpftrace one-liner that a frontend tool — clang -target bpf, libbpf, bpftrace, bcc, cilium/ebpf — compiles to BPF bytecode. The bytecode is what the kernel sees; the C is your input. This is the same separation as JavaScript and bytecode in a JIT'd browser, or Java and JVM bytecode — the source language is for humans, the bytecode is for the verifier.
The verifier and the BPF maps are the two pieces that make the whole design work, and they are worth meeting in detail.
The verifier — why "running arbitrary code in the kernel" is safe
The verifier is the piece that surprises engineers most. The pitch — "run your code in the kernel" — sounds reckless until you see what the verifier actually demands. Before any eBPF program is allowed to load, the kernel runs a static analyser that simulates every possible execution path of the bytecode and proves four properties: the program terminates (no unbounded loops), it reads only valid memory (no wild pointers, no out-of-bounds map access), it respects context invariants (e.g., the args pointer for a tracepoint is non-null, struct fields are at known offsets), and it stays within instruction and stack budgets (one million instructions per program post-Linux 5.2, 512-byte stack). If any of those fails, the bpf() syscall returns EACCES or EINVAL and prints a verifier log telling you which path failed and why.
The verifier is not an interpreter that runs your code with bounds-checks at runtime — it is a static analyser that proves the bounds-checks are unnecessary because the code can never violate them in the first place. The result is that JIT'd eBPF programs run with no per-instruction safety overhead, at CPU-native speed, without try/except in the kernel hot path. The cost is paid once at load time, by the developer; the kernel's runtime path is unencumbered.
This is the trick that closes the safety gap. Kernel modules trust the developer; the verifier trusts no one. SystemTap relied on a guarded compilation pipeline that still produced kernel modules; the verifier replaces compilation-time trust with mathematical proof. The verifier is also the reason Indian banks running Linux on bare metal — the most safety-paranoid Linux operators in the country — adopted eBPF in production while still refusing to allow new kernel modules. Kotak Mahindra Bank's payments-platform team, Yes Bank's settlement infrastructure, and ICICI's UPI-routing layer all run eBPF agents (Pixie, Cilium, Parca-Agent) precisely because the verifier collapses the audit story from "we trust this developer not to crash the host" to "the kernel mathematically proved this program cannot crash the host".
Why the verifier rejects loops you think are bounded: the verifier explores program paths symbolically, not by execution. A for (int i = 0; i < n; i++) where n is read from a map is rejected unless you prove n is bounded — the verifier cannot tell that the userspace loader only writes small values into the map. The fix is for (int i = 0; i < 64 && i < n; i++), where the constant 64 gives the verifier a finite bound it can reason about. The bpf_loop helper (Linux 5.17+) and unrolled loops are the two ways to satisfy the verifier; understanding why each one works is most of the practical learning curve for writing eBPF C.
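A minimal sketch of that behaviour, using the same bcc Python bindings as the examples below (the program and map names are illustrative; the accept/reject split assumes a 5.3+ kernel — older kernels need #pragma unroll for both forms):
# loop_bounds.py — the same loop, with and without a constant bound
from bcc import BPF

TEMPLATE = r"""
#include <uapi/linux/ptrace.h>
BPF_ARRAY(limit, u64, 1);
int probe(struct pt_regs *ctx) {
    int zero = 0;
    u64 *n = limit.lookup(&zero);      /* map value: unknown to the verifier */
    if (!n) return 0;
    u64 total = 0;
    LOOP
        total += i;
    limit.update(&zero, &total);       /* keep the loop from being optimised out */
    return 0;
}
"""

UNBOUNDED = TEMPLATE.replace("LOOP", "for (u64 i = 0; i < *n; i++)")
BOUNDED   = TEMPLATE.replace("LOOP", "for (u64 i = 0; i < 64 && i < *n; i++)")

for name, text in (("unbounded", UNBOUNDED), ("bounded", BOUNDED)):
    try:
        BPF(text=text).load_func("probe", BPF.KPROBE)   # verifier runs here
        print(f"{name}: accepted")
    except Exception:
        print(f"{name}: rejected by the verifier")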
BPF maps — the kernel-userspace ABI
A program that runs in the kernel and produces no output that userspace can read is not observable. BPF maps are the structured shared-memory ABI between in-kernel programs and userspace consumers. There are a few dozen map types in a modern kernel; the four you reach for daily are:
- BPF_MAP_TYPE_HASH — keyed lookup, O(1) insertion and read. The default for "count events bucketed by some attribute" (per-pid syscall counts, per-IP retransmit counts).
- BPF_MAP_TYPE_ARRAY — fixed-size, integer-indexed, useful for histograms (one bucket per slot) or per-CPU counters.
- BPF_MAP_TYPE_PERCPU_HASH / _PERCPU_ARRAY — per-CPU shards, no atomic ops needed in the hot path. Userspace aggregates across CPUs at read time. The high-throughput default.
- BPF_MAP_TYPE_RINGBUF — a lockless, multi-producer ring buffer (Linux 5.8+) for streaming events to userspace. Replaces the older perf_event_array for most use cases.
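A sketch of the histogram pattern — all per-event work done kernel-side, one map read from userspace (bcc's BPF_HISTOGRAM macro sits on an array map; the ten-second window is arbitrary):
# openlat.py — log2 histogram of openat() latency, aggregated in the kernel
from bcc import BPF
import time

b = BPF(text=r"""
#include <uapi/linux/ptrace.h>

BPF_HASH(start, u64, u64);    // pid_tgid -> entry timestamp
BPF_HISTOGRAM(dist);          // log2 buckets, an array map underneath

int entry(struct pt_regs *ctx) {
    u64 id = bpf_get_current_pid_tgid();
    u64 ts = bpf_ktime_get_ns();
    start.update(&id, &ts);
    return 0;
}

int ret(struct pt_regs *ctx) {
    u64 id = bpf_get_current_pid_tgid();
    u64 *tsp = start.lookup(&id);
    if (!tsp) return 0;                 // missed the entry; skip
    dist.increment(bpf_log2l(bpf_ktime_get_ns() - *tsp));
    start.delete(&id);
    return 0;
}
""")
fn = b.get_syscall_fnname("openat")
b.attach_kprobe(event=fn, fn_name="entry")
b.attach_kretprobe(event=fn, fn_name="ret")

time.sleep(10)                  # kernel side does all per-event work
b["dist"].print_log2_hist("ns") # one map read; zero per-event crossings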
Maps are how Pyroscope-eBPF aggregates per-stack on-CPU counts before flushing to its server every 10 s. They are how Cilium tracks per-pod connection states across the kernel-to-userspace boundary. They are how bcc's runqlat builds a runqueue-latency histogram at full kernel speed without sending one event per scheduler decision to userspace. The map is the design move that makes "always-on" affordable: the kernel side does all the per-event work and writes to a small data structure; userspace polls the map every few seconds and emits one Prometheus sample. Per-event syscalls — the cost that killed strace — are gone.
A measurement that makes this concrete: a busy host runs ~100,000 syscalls/second per CPU. A scheme that emitted one userspace event per syscall would burn more CPU on event transport than on the workload itself. A scheme that buckets by comm into a per-CPU hash map and reads the map every 5 seconds emits N entries per window (where N is the number of distinct comm values, typically 50–500) — four to five orders of magnitude less data. The second scheme is what eBPF-based tools actually do, and it is the reason they scale to whole-fleet production deployment without rate-limiting the kernel.
A working eBPF program — counting syscalls by process, in production-shape Python
The article would not be honest without a runnable artefact. The following Python program uses bcc (the BPF Compiler Collection's Python bindings) to attach a kprobe to the openat syscall entry (__x64_sys_openat on x86_64) — the syscall behind every open() in modern libc — bucket the count by command name in a BPF hash map, and print the top callers every five seconds. The shape mirrors what a production eBPF-based observability agent does, just stripped to one syscall and one console printer.
# count_opens.py — per-comm openat() count, eBPF-native
# Linux only. Requires bcc's Python bindings:
#   Debian/Ubuntu: sudo apt install python3-bpfcc bpfcc-tools
# Run as root (tracing needs CAP_BPF + CAP_PERFMON on 5.8+; root is simplest).
from bcc import BPF
import time, sys, signal

BPF_PROGRAM = r"""
#include <uapi/linux/ptrace.h>
#include <linux/sched.h>

struct key_t { char comm[TASK_COMM_LEN]; };
BPF_HASH(counts, struct key_t, u64);

int trace_openat(struct pt_regs *ctx) {
    struct key_t key = {};
    bpf_get_current_comm(&key.comm, sizeof(key.comm));
    u64 zero = 0, *val;
    val = counts.lookup_or_try_init(&key, &zero);
    if (val) (*val)++;
    return 0;
}
"""

b = BPF(text=BPF_PROGRAM)
b.attach_kprobe(event=b.get_syscall_fnname("openat"), fn_name="trace_openat")

def shutdown(signum, frame):
    print("\n--- final ---")
    print_top()
    sys.exit(0)

signal.signal(signal.SIGINT, shutdown)

def print_top(n: int = 10) -> None:
    # read the kernel-side map; one boundary crossing per entry, not per event
    rows = sorted(b["counts"].items(), key=lambda kv: kv[1].value, reverse=True)
    print(f"{'comm':<20} {'opens':>10}")
    for k, v in rows[:n]:
        print(f"{k.comm.decode('utf-8', 'replace'):<20} {v.value:>10}")
    print()

print("attached kprobe:openat — sampling every 5s, Ctrl-C to stop\n")
while True:
    time.sleep(5)
    print_top()
    b["counts"].clear()
# Output (on a Razorpay-shape staging host running k8s + a payments service under load):
attached kprobe:openat — sampling every 5s, Ctrl-C to stop
comm opens
python3 8421
node_exporter 3122
containerd-shim 1804
kubelet 1411
java 912
prometheus 604
loki 318
postgres 214
nginx 140
ssh 12
The BPF hash map and key: BPF_HASH(counts, struct key_t, u64) declares a kernel-side hash map keyed on comm (the 16-byte process name). The map is the long-lived state — the kernel program writes; the Python loop reads. The key_t struct carries TASK_COMM_LEN-sized command bytes, which is what the verifier insists on (no variable-length keys without an explicit bound).
The kprobe handler: trace_openat runs inside the kernel every time the openat syscall fires, on whichever CPU is executing the syscall. bpf_get_current_comm is one of ~200 BPF helper functions the kernel exposes to verified programs. lookup_or_try_init atomically gets-or-creates the entry. The increment is not atomic across CPUs; a real production program would use BPF_PERCPU_HASH and aggregate at read time. This version keeps the example small enough to read.
The load and attach: BPF(text=...) triggers clang to compile the C to BPF bytecode, then submits it via the bpf() syscall. The verifier runs here — if you write for (int i = 0; i < n; i++) with unbounded n, this line raises. Once the program loads, attach_kprobe registers it on the openat syscall entry point.
The Python read loop: the userspace side iterates over the map every 5 seconds, prints the top 10, then clear() resets it for the next window. Nothing per-event crosses into userspace. The kernel did one map increment per openat; we cross the kernel-userspace boundary roughly N (number of distinct comms) times per 5 s window, not millions of times per second.
The clear() call: b["counts"].clear() is the cheap reset. In a production exporter you would not clear; you would emit deltas (sketched below) or use per-CPU maps and an atomic swap. The shape of the agent — kernel writes, userspace polls, no per-event syscall — is the same in either case.
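What the delta-emitting variant looks like, sketched (the map iteration is the same bcc API as above; emit is a hypothetical stand-in for a Prometheus client call):
# exporter_loop.py — delta-emitting poll loop: no clear(), no lost counts
prev = {}                                  # comm -> last observed total

def emit(comm: str, delta: int) -> None:   # stand-in for a metrics client
    print(f'openat_total{{comm="{comm}"}} +{delta}')

def poll_once(table) -> None:
    for k, v in table.items():
        comm, total = bytes(k.comm), v.value
        delta = total - prev.get(comm, 0)  # monotonic counter -> delta
        prev[comm] = total
        if delta:
            emit(comm.decode("utf-8", "replace"), delta)

# call poll_once(b["counts"]) once per scrape interval instead of clear()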
The key thing to feel: the kernel-side handler is seven lines of C, runs inside the syscall hot path, and adds nanoseconds per call (not microseconds, not milliseconds). The Python script that loads, attaches, polls, and prints is forty lines. That is the entire shape of a real eBPF-based observability agent — bcc's opensnoop, runqlat, tcpretrans, biolatency are all this shape, with bigger BPF C blocks and fancier Python output. Pyroscope-eBPF, Pixie, and Parca-Agent are this shape, scaled up.
# Reproduce this on your laptop (Linux 5.8+ recommended)
sudo apt-get install -y python3-bpfcc bpfcc-tools linux-headers-$(uname -r)
sudo python3 count_opens.py
# In another terminal, run any process that opens files: `find / -type f >/dev/null` is fine.
A second measurement: the verifier's overhead, paid once at load time
The verifier is the piece engineers most often expect to be a runtime tax. It is not — it runs once, when the program loads, and the runtime path sees no verifier overhead at all. Measuring it takes only a small Python script that loads the same BPF program repeatedly and times the load step.
# verifier_load_time.py — measure how long the verifier takes per program load
import time, statistics
from bcc import BPF

PROGRAM = r"""
#include <uapi/linux/ptrace.h>

struct key_t { u32 pid; };
BPF_HASH(counts, struct key_t, u64);

int trace_open(struct pt_regs *ctx) {
    struct key_t key = { .pid = bpf_get_current_pid_tgid() >> 32 };
    u64 zero = 0, *val = counts.lookup_or_try_init(&key, &zero);
    if (val) (*val)++;
    return 0;
}
"""

samples = []
for i in range(20):
    t0 = time.monotonic_ns()
    b = BPF(text=PROGRAM)                 # parse + compile + verify + JIT
    b.attach_kprobe(event=b.get_syscall_fnname("openat"), fn_name="trace_open")
    t1 = time.monotonic_ns()
    samples.append((t1 - t0) / 1e6)       # ms
    b.cleanup()                           # detach & unload

print(f"load time: p50={statistics.median(samples):5.1f} ms "
      f"min={min(samples):5.1f} ms max={max(samples):5.1f} ms")
print("runtime overhead per probe fire: ~200ns (measured separately)")
# Output (Linux 6.5, Intel x86_64):
load time: p50= 14.2 ms min= 11.8 ms max= 22.6 ms
runtime overhead per probe fire: ~200ns (measured separately)
Fourteen milliseconds, paid once. After that the program runs at native CPU speed. The runtime overhead — measured separately by perf stat against a baseline — sits in the 100–300 ns range per fire for a small probe. Tracing every syscall on a core doing 100,000 syscalls/second would cost 1–3% of that core; a probe on a single syscall or tracepoint, firing thousands of times a second rather than hundreds of thousands, costs hundredths of a percent — below the noise floor of normal load variance. The cost is small enough that production fleets at Razorpay and Hotstar run multiple agents (Pyroscope-eBPF, Cilium, Parca, occasional bpftrace sessions) simultaneously without measurable application-level impact.
Why this matters for "always-on" being an honest claim and not vendor copy: a tool that eats 1% of a node's compute is fine for an investigation. The same tool running on every node, every second, all year, costs a 200-node Kubernetes cluster roughly two nodes' worth of capacity, continuously — a real number that has to be defended against the value the tool delivers. An eBPF probe on a typical tracepoint, firing a few thousand times a second at a few hundred nanoseconds per fire, costs hundredths of a percent of one core — so even a fleet running ten always-on agents per node consumes a tenth of a percent of total compute. This is the difference between "we run profiling during incidents" and "we have flamegraphs from every node every minute" — and the latter is the production posture every Indian platform team has shifted to since 2022.
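The budget arithmetic, spelled out (all inputs are illustrative assumptions, not measurements):
# overhead_budget.py — back-of-envelope cost model for always-on probes
PROBE_NS = 300          # per-fire cost of a small map-increment probe
FIRES_PER_SEC = 5_000   # one tracepoint on a busy host, not every syscall
CORES_PER_NODE = 32
NODES = 200
AGENTS_PER_NODE = 10

core_fraction = PROBE_NS * 1e-9 * FIRES_PER_SEC            # of one core
node_fraction = core_fraction * AGENTS_PER_NODE / CORES_PER_NODE
print(f"one probe:  {core_fraction:.4%} of a core")
print(f"ten agents: {node_fraction:.4%} of node capacity")
print(f"fleet cost: {node_fraction * NODES:.2f} nodes' worth of compute")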
Why kernel modules, perf, and SystemTap each lost to eBPF
The earlier chapter laid out the four pre-eBPF tools and their failure modes. The interesting question is which eBPF property defeated each, because the answer is different for each:
Kernel modules lost to the verifier. A kernel module can do anything eBPF can do — including all the same hooks, since kprobes and tracepoints predate eBPF. What it cannot do is prove it will not crash the host. The verifier turned "trust the developer" into "trust the kernel's static analysis", which collapsed the operational risk from "every probe is a P0-incident risk" to "every probe is verifier-checked or it does not load". Indian banks that explicitly ban new kernel modules in change-management policy still run eBPF agents because the verifier sits on the bank's side of the trust boundary, not the developer's.
strace lost to in-kernel execution. The 50–500× slowdown of ptrace-based tracing is not a quality-of-implementation issue; it is structural. Every traced syscall costs two extra context switches to copy state into the tracer. eBPF runs the tracing logic inside the syscall handler, in the same CPU register set, with no context switch. The same data — per-process syscall counts — goes from "too expensive to leave on" to "1–3% overhead at full load".
perf record lost to BPF maps. perf can run online and has shipped BPF integration since the Linux 4.x series. What changed is the consumption model. perf record produces a .data file you analyse offline; eBPF-via-maps produces a structured, queryable kernel-state object that a Python script polls every 5 seconds and forwards to Prometheus. Same data, different interface. The interface shift is what made always-on profiling (Parca, Pyroscope-eBPF) practical — a Prometheus endpoint that exposes flamegraph-shaped metrics is a fundamentally different operational story from a perf.data you have to scp off the host.
SystemTap lost to attach-and-go deployment. SystemTap had verifier-like checks (the --unprivileged mode), in-kernel execution, and a passable userspace interface. What it didn't have was Linux upstream's commitment — SystemTap was a Red Hat add-on, not part of the kernel proper. eBPF lives in mainline and is now a kernel ABI as stable as syscalls. SystemTap also required kernel-debuginfo to be installed, a 400+ MB per-version artefact most production fleets refused to ship. eBPF's CO-RE (Compile Once, Run Everywhere; Linux 5.4+) reads BTF (BPF Type Format) information from the running kernel and removes the debuginfo dependency entirely. The deployment story collapsed from "you must install matching debuginfo on every node before the probe loads" to "you ship the same .bpf.o to every node and the kernel adapts the offsets at load time".
The cumulative effect: a tool that works on every modern Linux kernel, loads in milliseconds, runs with bounded overhead, exposes its data through a standard kernel-userspace ABI, and cannot crash the host. That tool did not exist before eBPF, and it is now the substrate the rest of Part 8 builds on.
Real Indian production stories — what eBPF unlocked
The "before / after" story is sharpest at three Indian production teams that have published or talked about their eBPF rollouts.
Razorpay's payments-edge p99 investigation, 2022. A multi-quarter latency drift in UPI-acknowledgement p99 was finally root-caused with runqlat and offcputime from bcc. The signal was off-CPU time on the JVM threads handling NPCI callbacks — the kernel scheduler was de-queueing them under contention from a backup process running on the same node. The fix was a cgroup CPU-share rebalance plus an anti-affinity rule. The team reported MTTR for similar incidents fell from "we eventually figured it out" to "we caught it in the same shift" once Pyroscope-eBPF was always-on. The pre-eBPF version of this investigation would have required a custom kernel module nobody had time to write, or a perf record capture that did not align temporally with the failing requests.
Hotstar IPL 2023, congestion-window collapse on edge nodes. During an IPL knockout, video-segment-fetch p99.9 rose to 1.4s. Userspace traces summed to under 200ms. tcpretrans from bcc showed retransmits clustering on a specific upstream IP block that turned out to be a CDN POP doing maintenance. The 1.2s gap was Linux's TCP RTO firing on dropped packets at the ISP edge. The eBPF tool surfaced this in roughly 30 seconds of probing; the pre-eBPF answer would have been "the trace is incomplete, escalate to the network team".
Cred's rewards-engine flush stalls, 2023. A Tuesday-morning "phantom 30-second outage" turned out to be the kernel's dirty_ratio threshold triggering a synchronous filesystem flush. biolatency from bcc showed sub-millisecond block-IO turning into 8s of synchronous writeback during reclaim. The fix was a vm.dirty_background_ratio tune. The pre-eBPF answer was "the application looks fine; we don't know"; the eBPF answer was "kernel writeback is the cause, here is the latency histogram".
In every case the technology shift is the same: a question the team could frame in a single sentence, but could answer only because eBPF turned the kernel into a structured data source. The interesting consequence is cultural — once a team has the ability to investigate kernel-level signal in minutes, they start writing alerts on it. Razorpay and Cred both run alerts on runqlat p99 today, on tcp_retransmit_skb rate per pod, on direct_reclaim time per host. None of those alerts existed in the pre-eBPF era because the data was not collectable in production. The technology shift bled into the alerting culture.
Why this is the actual platform-engineering payoff, not a vendor story: each of the three incidents above was diagnosed using open-source bcc tools that ship in any modern Linux distribution. The companies didn't buy a vendor; they grew the muscle to read kernel signal. eBPF's biggest contribution to Indian SRE practice is making that muscle affordable to grow — the learning curve is real but bounded, and the tools you write while learning (a bpftrace one-liner during a war room) become institutional knowledge that survives the on-call rotation. This is the difference between buying observability and learning observability, and it tracks the broader shift from APM-vendor reliance to platform-team-built telemetry across Razorpay, Zerodha, Cred, and Flipkart.
Common confusions
- "eBPF is a fork of BPF, the old packet filter." Architecturally, yes; operationally, the only thing they share is the name. Classic BPF (
tcpdump's filter language, 1992) had a 32-bit ISA, no maps, no tracing, no JIT, and ran only on socket filters. eBPF (2014+) has a 64-bit ISA, ~15 map types, tracing/networking/security hook points, a verifier, a JIT, and runs everywhere from kprobes to XDP to LSM. Treating them as the same is like treating Java 1.0 and Java 21 as the same; the lineage is real but the capabilities are different. - "You need to be a kernel developer to write eBPF." False.
bpftraceone-liners,bcc's pre-built tools (opensnoop,runqlat,tcpretrans,biolatency), and Python wrappers around them cover 80% of production diagnostic needs without writing a line of BPF C. The kernel-developer path is needed for novel agents and for resource-constrained corner cases, not for incident investigation. - "eBPF programs are arbitrary code in the kernel." They are bytecode that the kernel verifies and JITs. The verifier rejects programs that can loop forever, read invalid memory, exceed the instruction or stack budget, or violate context invariants. "Arbitrary" is the marketing pitch; "bounded, verified, and JIT'd" is what actually loads.
- "eBPF replaces the application-level OpenTelemetry SDK." They are complementary. OTel SDKs see what the application chose to record; eBPF sees what the kernel did during and around those recordings. A production observability stack uses both — OTel for span structure and business attributes, eBPF for off-CPU time, retransmits, page faults, and runqueue waits that the SDK cannot observe. Conflating them produces gaps in either direction.
- "eBPF works the same on every kernel version." It works on every kernel version that has the hooks and helpers your program uses. Kprobes and tracepoints are stable; specific BPF helpers (
bpf_loop,bpf_d_path, the per-CPU map types) shipped in specific kernel versions. CO-RE (Linux 5.4+) handles struct-layout drift, but not the absence of helpers themselves. Production deployments pin to a kernel-version floor for this reason. - "
bpftraceis a toy; you needbccfor real work." Inverted.bpftraceis the production-shaped one-liner tool used in war rooms;bccis the larger framework for building agents.bpftracecompiles to the same BPF bytecode and uses the same verifier and JIT — it is not less efficient, it is less malleable. For an incident at 03:00,bpftrace -e ...is faster than writing Python+C; for an always-on agent,bcc(orlibbpfdirectly) gives you the lifecycle hooks.
Going deeper
The verifier as a static analyser — what it actually checks
The verifier is a path-sensitive abstract interpreter. It maintains a register-state model — for each of the eleven 64-bit BPF registers, at each program point, it tracks what the register could contain (a constant, a value within a known range, a pointer to a specific kind of memory, an unknown value). At each instruction it updates the abstract state and rejects the program if the operation is invalid for the current abstract state — for example, dereferencing a register the analyser thinks could be NULL, or adding two pointers, or reading 8 bytes from a 4-byte map value. Loops are handled by unrolling up to a bound (older kernels) or by bpf_loop (Linux 5.17+) which takes an explicit iteration cap. The verifier's complexity budget is one million instructions explored — programs that branch too widely time the verifier out and are rejected with "BPF program is too large". Most "the verifier rejected my program" experiences are about narrowing the abstract state — adding if (ptr == NULL) return 0; to teach the verifier the pointer is non-null after the check, or adding && i < 64 to a loop to bound the iteration count. The verifier's pickiness is the safety property; making your code easy for the verifier to reason about is most of the practical eBPF skill.
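The null-check narrowing in miniature (a sketch with bcc; the probe body is illustrative — only the presence of the check changes the verdict):
# nullcheck_demo.py — one NULL check is the difference between load and reject
from bcc import BPF

TEMPLATE = r"""
#include <uapi/linux/ptrace.h>
BPF_HASH(counts, u32, u64);
int probe(struct pt_regs *ctx) {
    u32 key = 0;
    u64 *val = counts.lookup(&key);   /* abstract type: map_value_or_null */
    DEREF
    return 0;
}
"""

UNCHECKED = TEMPLATE.replace("DEREF", "(*val)++;")
CHECKED   = TEMPLATE.replace("DEREF", "if (!val) return 0; (*val)++;")

for name, text in (("unchecked", UNCHECKED), ("checked", CHECKED)):
    try:
        BPF(text=text).load_func("probe", BPF.KPROBE)  # bpf(): verifier runs
        print(f"{name}: accepted")
    except Exception:
        print(f"{name}: rejected — dereference of possibly-NULL map value")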
CO-RE and BTF — how one binary runs across kernel versions
In the old eBPF model (pre-Linux 5.4), an agent had to know the exact layout of every kernel struct it read. Field offsets change between kernel versions; reading task->stack at offset 24 in one kernel and offset 32 in another silently produced corrupt data. The solution was to compile the BPF C against the target kernel's headers — meaning one binary per kernel version, often dozens of binaries to ship for a heterogeneous fleet. CO-RE (Compile Once, Run Everywhere) flips this. Modern kernels ship BTF (BPF Type Format) — a compact debug-info-like section embedded in /sys/kernel/btf/vmlinux describing every struct's layout. The BPF loader reads BTF at load time, computes the offset of each field the program references (using BPF_CORE_READ or the __attribute__((preserve_access_index)) annotation), and patches the bytecode before the verifier runs. The result: one .bpf.o ships to every node, and each node patches the offsets to match its own kernel. Operationally this is the change that made eBPF agents fleet-deployable. The Pyroscope-eBPF, Parca-Agent, and Pixie agents all rely on CO-RE; the move from "ship N agents per kernel version" to "ship one agent" was a 2022–2023 shift in the Indian production fleet.
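A node-readiness check for CO-RE agents is one stat away (a minimal sketch; /sys/kernel/btf/vmlinux is the standard location on 5.4+ kernels):
# btf_check.py — does this node expose kernel BTF for CO-RE loaders?
import os

BTF = "/sys/kernel/btf/vmlinux"
if os.path.exists(BTF):
    print(f"BTF present ({os.path.getsize(BTF) // 1024} KiB) — "
          "CO-RE loaders can patch struct offsets at load time")
else:
    print("no kernel BTF — a CO-RE agent needs an external BTF archive here")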
The cost: what eBPF adds to a hot syscall
A worry every SRE has the first time they hear the pitch: "you are about to attach code to every TCP packet / every syscall / every page fault — what is the overhead?" The honest answer has measured numbers. A null kprobe (probe that does nothing) adds roughly 100–200 ns per fire on modern x86_64. A probe that does a map lookup and increment adds 200–500 ns. On a reasonably busy core doing 100k syscalls/s, tracing every syscall at 500 ns each would cost 5% of that core — which is exactly why production probes attach to specific tracepoints firing orders of magnitude less often, where the same 500 ns rounds to hundredths of a percent. XDP programs at the network device layer can process packets at line rate (10+ Gbps) because they avoid the full kernel networking stack. The cost ceiling is reached when you attach probes to very high-frequency hooks (every cache miss, every scheduler tick) and do non-trivial work in them — those are the cases where you measure first. For typical observability use cases — syscalls, TCP events, scheduler events, page faults — the overhead is bounded enough that production fleets at Razorpay, Hotstar, and Cred run multiple agents concurrently without measurable application impact.
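A crude way to feel the per-fire number on your own machine — a noisy micro-benchmark, not a rigorous measurement (perf stat against a pinned baseline is the careful version); the probe body mirrors count_opens.py above:
# probe_tax.py — A/B the cost of openat() with and without a kprobe attached
import os, time
from bcc import BPF

def opens_per_second(duration: float = 2.0) -> float:
    n, end = 0, time.monotonic() + duration
    while time.monotonic() < end:
        os.close(os.open("/dev/null", os.O_RDONLY))
        n += 1
    return n / duration

baseline = opens_per_second()              # no probe attached

b = BPF(text=r"""
#include <uapi/linux/ptrace.h>
BPF_HASH(counts, u32, u64);
int probe(struct pt_regs *ctx) {
    u32 key = 0;
    u64 zero = 0, *v = counts.lookup_or_try_init(&key, &zero);
    if (v) (*v)++;
    return 0;
}
""")
b.attach_kprobe(event=b.get_syscall_fnname("openat"), fn_name="probe")
probed = opens_per_second()                # same loop, probe firing each call

tax_ns = (1 / probed - 1 / baseline) * 1e9 # negative => lost in the noise
print(f"baseline {baseline:,.0f}/s  probed {probed:,.0f}/s  ~{tax_ns:.0f} ns/fire")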
Why eBPF is also a security primitive — Linux LSM and tetragon
The same kernel hooks that make eBPF good for observability make it good for enforcement. The Linux Security Module (LSM) framework lets eBPF programs run at security-decision hook points (file open, exec, mount) and return a verdict — allow, deny, audit. Tetragon (Isovalent) and Falco (Sysdig) use eBPF-LSM and tracepoints to detect and block runtime threats: a process opening /etc/shadow, a container exec'ing into a shell, a syscall pattern matching a known exploit. This blurs the line between observability and security; the Indian payments-platform teams that adopted eBPF for observability often discovered the same agents could enforce policy. The same verifier that proves the program is safe is what makes "running enforcement code in the kernel security path" trustable — you cannot ship a Tetragon policy that crashes the host, because the verifier will not let it load.
A diagnostic ladder from /proc to bpftrace to a written agent
When you walk into an incident, you do not start by writing a custom eBPF agent. The ladder, in order: (1) /proc and cgroup counters — cat /proc/stat, cat /sys/fs/cgroup/<pod>/cpu.stat, ss -tinm — answer "is something obviously throttled or retransmitting" in seconds. (2) bpftrace one-liners — bpftrace -e 'tracepoint:sched:sched_switch { @[args->next_comm] = count(); }' answers "who is running, and how often" with a single line. (3) bcc pre-built tools — runqlat, tcpretrans, biolatency, offcputime — give histogram-shape answers to scheduler, network, IO, and off-CPU questions without any C. (4) Custom bpftrace script for shaped questions the pre-built tools don't answer. (5) Custom bcc Python+C agent for production-grade always-on telemetry — the shape of the count_opens.py example above. (6) Custom libbpf C/Rust agent for resource-constrained or high-throughput cases. Most incidents end at step 3. Most production agents are at step 5. Steps 4 and 6 are where the senior eBPF engineer earns their salary; the rest of the team lives at steps 1–3 with confidence.
Where this leads next
The next chapter — /wiki/bpftrace-for-ad-hoc-tracing — walks the one-liner tool that solves 80% of war-room investigations. After that, /wiki/parca-pixie-pyroscope covers the always-on profiling stack that turns the patterns of this chapter into a continuously-running production telemetry pipeline. /wiki/ebpf-for-network-observability-cilium-hubble returns to the network-stack observability that motivated the Hotstar story. /wiki/ebpf-limitations-in-production is the honest counter-part — what eBPF still cannot do, why kernel-version floors matter, and where the verifier still rejects programs you wish it would accept.
After Part 8 the curriculum returns to dashboards, SLOs, and alerting (Parts 9–11) — and you will see kernel-level signals (off-CPU time, retransmit rates, runqueue p99, page-fault counts) flow through the same pipelines and fire the same alert rules as userspace metrics. The wall is gone; the data is in the same shape as the rest.
References
- Brendan Gregg, BPF Performance Tools (Addison-Wesley, 2019) — the reference cookbook; chapters 2–4 walk the verifier, JIT, and map model in production-engineering depth.
- Liz Rice, Learning eBPF (O'Reilly, 2023) — the modern reader's introduction; chapter 6 on the verifier is the clearest published explanation of what it actually checks.
- Alexei Starovoitov, "BPF — In-kernel Virtual Machine" (LWN, 2014) — the original announcement of extended BPF, with the design rationale that justifies the verifier and the JIT.
- Andrii Nakryiko, "BPF CO-RE — Compile Once, Run Everywhere" (Facebook engineering blog, 2020) — primary source on the BTF/CO-RE design that closed the kernel-version-portability gap.
- Linux kernel source, Documentation/bpf/ — the authoritative reference for the helper API, map types, and verifier behaviour.
- Cilium project documentation (docs.cilium.io) — the production-deployment reference for eBPF in Kubernetes networking.
- Pyroscope project documentation (pyroscope.io/docs/ebpf) — the always-on profiling agent's design notes, including how it uses BPF maps for stack aggregation.
- /wiki/wall-kernel-level-observability-is-a-different-world — the previous chapter; the empty-quadrant gap this chapter fills.