eBPF architecture: verifier, JIT, maps
At 03:14 IST on a Saturday, Razorpay's payments-API pod starts dropping 0.3% of requests with ECONNRESET. The CPU profile is clean, the wall-clock profile is clean, the application logs say nothing. Aditi, the on-call SRE, suspects a kernel-side issue — maybe a TCP retransmit storm, maybe a conntrack table fill — but the only way to confirm is to instrument the kernel itself. Six years ago this would have meant rebuilding the kernel with a custom tracepoint, deploying it to one canary pod, hoping the bug reproduces, and explaining to the security team why the production fleet is running an unsigned kernel. Tonight Aditi types bpftrace -e 'kprobe:tcp_retransmit_skb { @[comm, kstack] = count(); }', waits eight seconds, hits Ctrl-C, and sees a stack trace that names the offending function. The bug closes in eleven minutes. The thing that made the difference was not a smarter SRE; it was a kernel feature that lets you load a program into the running kernel, run it on every retransmit, aggregate the results in a hash map, and have the kernel itself prove — before the program ever runs — that it cannot crash, leak memory, or loop forever.
eBPF is a sandboxed virtual machine inside the Linux kernel. You write a small program in restricted C (or generate eBPF bytecode directly), the kernel's verifier statically proves the program is safe to run, the JIT compiles it to native machine code, and the program executes at hooks like syscalls, kprobes, tracepoints, and network ingress. Maps move data across the user/kernel boundary without copy_to_user. Verifier + JIT + maps is the trinity that makes "instrument any kernel function in production at <1% overhead" actually true.
Why a sandbox in the kernel was the only answer
Before eBPF, the choices for kernel observability were three, and all of them were bad. You could write a kernel module — full access to kernel memory, full ability to crash the box, no formal safety, every distribution requiring a different Module.symvers. You could use kprobes from the perf toolchain — limited to fixed event types, no in-kernel aggregation, every event copied to userspace and re-parsed there. Or you could use SystemTap, which compiled probes into kernel modules at runtime and was, structurally, choice one wearing a friendlier shirt. None of the three were acceptable for a production fleet of 5,000 pods that could not afford a kernel panic per quarter.
The breakthrough that became eBPF, around 2014, was the realisation that the safety property mattered more than the expressiveness. If the kernel could prove — before running a program — that the program had no unbounded loops, no out-of-bounds memory accesses, no null-pointer dereferences, then the program could be loaded into the kernel by a user holding a narrow capability (CAP_BPF on modern kernels) rather than full root, and run at speeds indistinguishable from native code. The trade-off: the program had to be small (initially 4,096 instructions; now up to 1 million), had to be written in a restricted C dialect (no unbounded loops, no recursion, no global mutable state), and had to pass through a verifier that would reject anything it could not prove safe. In exchange, you got: in-kernel execution at native speed, no copy-to-userspace per event, and a guarantee that a misbehaving probe would be rejected at load time rather than panicking the kernel at 03:14.
The architecture that came out of this trade-off is a bytecode format the user-space toolchain emits, plus three pieces that act on it in order: a verifier that statically analyses the bytecode and rejects anything unsafe, a JIT compiler that turns the verified bytecode into native machine code, and maps that the program reads from and writes to. This chapter walks each of the three.
A historical note that is genuinely useful for orienting yourself in kernel source: the "B" in eBPF stands for "Berkeley" — the lineage goes back to the 1992 Berkeley Packet Filter that tcpdump uses to push filter expressions into the kernel for cheap per-packet matching. Classical BPF was tiny (32-bit, two registers, no maps, no calls), but it established the principle: a small, verifiable bytecode that runs in-kernel beats every alternative for performance-critical observability. eBPF is the same idea generalised — 64-bit, 11 registers, calls, maps, JIT, helpers, attach points across the entire kernel — and the lineage is why the syscall is still called bpf() and the bytecode still has a bpf_insn representation that mirrors classical BPF's structure. Knowing this much history helps when reading kernel source: the kernel/bpf/core.c interpreter still contains code paths that date to 1992, alongside the modern verifier and JIT.
The verifier — a static analyser standing between you and a panicked kernel
The verifier is the piece that earns eBPF its production trust. It is a static analyser that runs on the bytecode at load time, before a single instruction has executed in-kernel. Its job is to prove three properties — termination, memory safety, and bounded resource use — and to reject any program whose proof it cannot construct. The proofs are not symbolic in the academic sense; they are constructed by walking the program's control-flow graph, simulating every possible execution path with an abstract domain that tracks register types and value ranges, and checking that no path violates the rules.
Termination is the property that surprises engineers coming from kernel-module backgrounds. The verifier does not allow unbounded loops. For Linux versions before 5.3 there was no general loop construct at all — every loop had to be unrolled by clang at compile time, with a #pragma unroll and a constant trip count. Linux 5.3 added bounded loops: a for loop is allowed if the verifier can prove the loop variable's range is finite. Linux 5.17 added bpf_loop, a kernel helper that takes a callback and a max iteration count, sidestepping the verifier's per-instruction budget for very long loops. The general principle remains: every loop must have a verifier-provable upper bound, because a program that runs at every packet on a 25 Gbps NIC cannot afford even one infinite loop.
Memory safety is enforced through register types. The verifier tracks every register's type — SCALAR_VALUE, PTR_TO_CTX, PTR_TO_MAP_VALUE, PTR_TO_PACKET, PTR_TO_STACK, and others — across the program. A pointer obtained from bpf_map_lookup_elem is initially PTR_TO_MAP_VALUE_OR_NULL and cannot be dereferenced until the program explicitly compares it to NULL. After the comparison, on the branch where the value is non-null, the verifier narrows the type to PTR_TO_MAP_VALUE and the dereference is permitted. On the other branch, the dereference would be rejected. This is the source of the "you must always null-check map lookups" idiom every eBPF tutorial repeats — the rule isn't a coding convention, it's a verifier rule.
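The narrowing rule is mechanical enough to sketch as a toy model. The following is an illustration only, with invented names and none of the real verifier's machinery (which lives in kernel/bpf/verifier.c); it shows why the same dereference is legal on one branch and rejected on the other:

```python
# toy_verifier.py — a toy model (not kernel code) of the verifier's
# branch-sensitive type narrowing for map-lookup results.

PTR_TO_MAP_VALUE_OR_NULL = "PTR_TO_MAP_VALUE_OR_NULL"
PTR_TO_MAP_VALUE = "PTR_TO_MAP_VALUE"


class RejectedError(Exception):
    """Raised when the toy verifier cannot prove a dereference safe."""


def check_deref(reg_type: str) -> None:
    # A dereference is legal only once the type is a provably non-null pointer.
    if reg_type != PTR_TO_MAP_VALUE:
        raise RejectedError(f"cannot dereference {reg_type}")


def narrow_on_nonnull_branch(reg_type: str) -> str:
    # Simulates following the branch where `if (ptr != NULL)` is true:
    # the _OR_NULL possibility is removed on that path only.
    if reg_type == PTR_TO_MAP_VALUE_OR_NULL:
        return PTR_TO_MAP_VALUE
    return reg_type


def verify_program(null_checked: bool) -> bool:
    """Accept or reject a program that dereferences the result of
    bpf_map_lookup_elem; null_checked says whether the program
    compares the pointer to NULL first."""
    reg = PTR_TO_MAP_VALUE_OR_NULL  # type of a fresh map-lookup result
    if null_checked:
        reg = narrow_on_nonnull_branch(reg)
    try:
        check_deref(reg)
        return True
    except RejectedError:
        return False
```

Dropping the null check is exactly the load-time failure every first-time eBPF author hits: the program is rejected before it ever runs.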
Why the verifier rejects programs that look obviously correct: the verifier's analysis is path-sensitive but type-domain-conservative. If a register's value depends on a kernel pointer the verifier cannot reason about (say, a field read from task_struct whose offset varies across kernel versions), the verifier marks the result as SCALAR_VALUE of unknown range. Any subsequent use of that scalar as an array index — arr[scalar] — is rejected because the verifier cannot prove the index is in bounds. This is why CO-RE (Compile Once, Run Everywhere) was invented: BTF-based field relocations let the verifier know the precise offset of task_struct->tgid on the running kernel, narrowing the type and letting the program load on kernels the developer never tested against.
The verifier's instruction budget is a separate hard limit: each program is allowed up to a million instructions of verification work (Linux 5.2+; earlier kernels capped at 96k). A loop with N iterations counts as N×body instructions, and the verifier walks every path. A program with three nested 1000-iteration loops blows the budget instantly. The practical advice is: keep the program tight, push the heavy lifting into user-space readers of the map, and treat the verifier's "instruction limit reached" error as a sign that the program is doing too much in-kernel.
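The budget arithmetic is worth doing once by hand. A minimal sketch, assuming a 10-instruction loop body (an invented figure for illustration):

```python
# budget_math.py — back-of-the-envelope model of the verifier's
# instruction budget (1,000,000 verified instructions, Linux 5.2+).
# The 10-instruction loop body is an assumed figure for illustration.

VERIFIER_BUDGET = 1_000_000


def verification_cost(trip_counts: list[int], body_insns: int) -> int:
    """Verification work for nested bounded loops: the verifier walks
    every iteration of every path, so nested trip counts multiply."""
    total = body_insns
    for n in trip_counts:
        total *= n
    return total


single = verification_cost([1000], 10)              # 10,000 — fine
nested = verification_cost([1000, 1000, 1000], 10)  # 10,000,000,000 — rejected
```

One 1000-iteration loop is four orders of magnitude under budget; three nested ones are four orders of magnitude over, which is why "instruction limit reached" almost always means a loop problem rather than a long straight-line program.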
Memory accesses to the kernel itself go through helpers, not direct dereferences. A program that wants to read a field from task_struct doesn't write task->pid — it writes bpf_probe_read_kernel(&pid, sizeof(pid), &task->pid). The helper performs the read inside a fault-handling envelope: if the address turns out to be invalid (a bad page table entry, a freed object), the helper returns -EFAULT and the program continues. A direct dereference of an arbitrary kernel pointer would page-fault and panic the kernel; the helper makes the access survivable. CO-RE relocations in modern (Linux 5.5+) eBPF make this read syntactically transparent — you write BPF_CORE_READ(task, pid) and the toolchain expands it to a probe-read with the correct offset for the running kernel — but the underlying mechanism is still a fault-survivable helper.
The verifier also limits the program's stack usage. Each eBPF program has a fixed 512-byte stack — enough for a handful of local variables, not enough for big local arrays. A common beginner trap is to declare a char buf[256] on the stack for a probe-read, then add a second char path[256], blowing the budget. The fix is per-CPU array maps as scratch buffers: allocate a BPF_MAP_TYPE_PERCPU_ARRAY of size 1, with a value-type that holds whatever scratch state you need, and read/write it instead of stack memory. The pattern looks idiosyncratic the first time you see it and becomes second nature once you accept the 512-byte limit as a hard wall.
A useful mental model: the verifier is the compiler's "you cannot do this" pass, except instead of refusing to compile, it refuses to load the already-compiled bytecode. This is necessary because the toolchain (clang) is in user-space and cannot be trusted; the verifier runs in-kernel and is the kernel's own line of defence. An adversarial user can hand-craft eBPF bytecode that no clang would emit; the verifier still has to reject it.
The verifier has grown substantially since the 2014 design — the original was a few thousand lines of C and rejected anything beyond the simplest control flow; the 2026 verifier is closer to 25,000 lines and handles bounded loops, function calls, dynamic pointer arithmetic with range tracking, and Spectre-style speculative-execution mitigations. Each capability the eBPF community wanted (richer programs, larger programs, programs that look more like idiomatic C) cost the verifier maintainers months of work to extend the proof system without weakening the safety guarantee. The tension is permanent: every new feature loosens the constraints, every loosening risks a soundness bug, and a soundness bug in the verifier is a privilege-escalation vector the entire fleet inherits. This is why the eBPF community is conservative about adding new helpers and new program types; "just let the program do X" is rarely just one change.
The JIT — turning verified bytecode into x86_64 / ARM64
Once the verifier accepts a program, the JIT compiler translates the eBPF bytecode (an 11-register, 64-bit RISC-like ISA) into the native instruction set of the host CPU — x86_64 on most servers, ARM64 on AWS Graviton and most Indian-cloud cost-optimised instances. The translation is largely 1:1, because the eBPF ISA was designed in 2014 with this in mind: 11 registers (R0–R10), a calling convention that maps directly onto x86_64's (R1–R5 for arguments, R0 for the return value), 64-bit words, and a small instruction set that maps cleanly onto modern CPUs.
The performance consequence: a JITed eBPF program runs at speeds roughly within 2–3× of an equivalent function written natively in C and called directly from the kernel. For a kprobe firing once per syscall — millions of times per second on a busy server — this is the difference between a usable observability tool (1–2% overhead) and one that itself becomes the bottleneck.
A concrete shape: an eBPF program that increments a hash-map counter on every tcp_retransmit_skb call compiles, after JIT, to roughly 60–80 native instructions including the prologue, the map-lookup helper call (which is itself a regular C function in the kernel), the atomic increment, and the epilogue. On a 3 GHz core that's around 30–50 ns per invocation. On a server doing 10,000 retransmits per second the overhead is 0.5 ms of CPU per second — 0.05% of one core. The same instrumentation done by attaching a kprobe handler that copies every event to userspace via a perf ring buffer would cost 10–100× more, because the per-event copy and parse cost dominates.
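The overhead arithmetic from this paragraph, reproduced so the orders of magnitude are checkable (all figures are the estimates above, not measurements):

```python
# overhead_math.py — per-invocation cost of the JITed retransmit counter.
# 50 ns/event is the upper end of the text's 30-50 ns estimate; the
# event rate is the text's 10,000 retransmits/second scenario.

NS_PER_EVENT = 50
EVENTS_PER_SEC = 10_000

busy_ns_per_sec = NS_PER_EVENT * EVENTS_PER_SEC  # 500,000 ns = 0.5 ms
core_fraction = busy_ns_per_sec / 1e9            # fraction of one core busy
```

Half a millisecond of CPU per wall-clock second is 0.05% of one core, which is why this class of instrumentation is considered always-on-safe.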
The JIT also performs a small set of optimisations: dead-code elimination on branches the verifier proved unreachable, constant folding when the verifier has narrowed a register's range to a single value, and instruction selection that prefers cheap forms (e.g. xor reg, reg to zero a register instead of mov reg, 0). The JIT does not do register renaming, instruction scheduling, or any of the heavyweight optimisations a production C compiler would. It doesn't need to: the bytecode is already small, the hot paths are short, and the verifier's analysis has already removed many of the optimisations a JIT would otherwise need to apply.
On older kernels (before 4.15 on x86, before 4.16 on ARM) the JIT was off by default and eBPF programs ran in an interpreter — a bytecode dispatch loop. The interpreter is roughly 10× slower than the JIT and is the reason early eBPF benchmarks looked unimpressive. Modern kernels enable the JIT by default; you can confirm via cat /proc/sys/net/core/bpf_jit_enable (returns 1 for JIT, 0 for interpreter, 2 for JIT-with-debug). On any production-grade Indian cloud (AWS ap-south-1, Azure Central India, GCP Mumbai) the JIT is on; on older on-prem kernels it occasionally is not, and that is worth checking before claiming "eBPF is too slow for our environment".
Why the 2–3× of native gap exists at all, given that eBPF bytecode maps almost 1:1 to x86_64: the gap comes from two sources. First, every helper call (bpf_map_lookup_elem, bpf_probe_read_kernel, bpf_get_current_pid_tgid) is a real C function call into the kernel, with the calling-convention save/restore overhead and the function-call indirection a hand-written kernel function would inline away. Second, the JITed code includes verifier-mandated runtime checks around memory accesses — the verifier proves the access is in-bounds, but hardening is emitted anyway as a defence against speculative-execution attacks (Spectre v1 mitigation). The kernel applies less of this hardening to programs loaded by sufficiently privileged users, narrowing the gap, but the safer default leaves a small overhead in place. For a tool firing 100k times per second this overhead is invisible; for a tool firing 10M times per second per core (the high-frequency-trading or NIC-line-rate use case) the overhead matters and motivates BPF_PROG_TEST_RUN-style benchmarking before deployment.
Maps — the only mutable state, and the user/kernel bridge
An eBPF program cannot allocate memory, cannot use global mutable variables, and cannot call into arbitrary kernel code. The one thing it can do is read and write maps — kernel-resident key-value stores that live independently of the program and persist across program invocations. Maps are the mechanism for both aggregation (count occurrences in-kernel without copying every event to user-space) and communication (the user-space agent reads the map periodically to harvest results).
The map types form a small but well-chosen set: BPF_MAP_TYPE_HASH (general key-value), BPF_MAP_TYPE_ARRAY (fixed-size, indexed by integer), BPF_MAP_TYPE_PERCPU_HASH and BPF_MAP_TYPE_PERCPU_ARRAY (one copy per CPU, no atomics needed), BPF_MAP_TYPE_LRU_HASH (auto-evicting), BPF_MAP_TYPE_RINGBUF (lock-free per-event delivery to user-space), and BPF_MAP_TYPE_PERF_EVENT_ARRAY (the older perf-buffer variant). Each type has different performance characteristics — per-CPU maps are essentially free for hot updates because there is no cross-CPU contention, while regular hash maps need cross-CPU synchronisation (bucket locking, atomic updates to shared values) on every insert.
The user-space side reads maps via the bpf syscall (BPF_MAP_LOOKUP_ELEM, BPF_MAP_GET_NEXT_KEY) or, more idiomatically, through libbpf wrappers. The cost is a single syscall per batched read; modern eBPF supports BPF_MAP_LOOKUP_BATCH which fetches up to N entries in one call, amortising the syscall overhead.
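A sketch of why batching matters, assuming a nominal 1 µs of per-syscall overhead (an illustrative figure, not a measurement):

```python
# batch_amortisation.py — syscall overhead is paid once per bpf() call,
# not once per map entry, so BPF_MAP_LOOKUP_BATCH amortises it.
# The 1 us per-syscall figure is an assumption for illustration.

SYSCALL_OVERHEAD_US = 1.0


def readback_cost_us(entries: int, batch_size: int) -> float:
    """Total syscall overhead to read `entries` map entries when each
    bpf() call returns up to `batch_size` of them."""
    calls = -(-entries // batch_size)  # ceiling division
    return calls * SYSCALL_OVERHEAD_US


# 10,000 entries read one at a time vs. in batches of 256:
per_entry = readback_cost_us(10_000, 1)    # 10,000 us of overhead
batched = readback_cost_us(10_000, 256)    # 40 us of overhead
```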
# trace_retransmits.py — count TCP retransmits per process using bpftrace and parse the result.
# This is the script Aditi from the chapter opener could have run live during the incident.
# It demonstrates the full eBPF lifecycle: load a program, attach to a kprobe, aggregate
# in a map, read the map back into Python, format the output.
import json
import re
import signal
import subprocess
import sys
import time

# bpftrace program: on every tcp_retransmit_skb, increment a hash-map slot keyed by
# (process name, kernel stack id). The hash map is a BPF_MAP_TYPE_HASH; bpftrace
# manages the user-space readback for us.
BPFTRACE_PROG = """
kprobe:tcp_retransmit_skb {
    @retransmits[comm, kstack] = count();
}
interval:s:1 {
    print(@retransmits);
    clear(@retransmits);
}
"""


def run_bpftrace(duration_s: int = 10) -> str:
    """Spawn bpftrace, capture its stdout, send SIGINT after duration_s."""
    proc = subprocess.Popen(
        ["bpftrace", "-e", BPFTRACE_PROG, "-f", "json"],
        stdout=subprocess.PIPE, stderr=subprocess.PIPE,
    )
    time.sleep(duration_s)
    proc.send_signal(signal.SIGINT)
    out, err = proc.communicate(timeout=5)
    if proc.returncode not in (0, -signal.SIGINT):
        sys.stderr.write(err.decode())
        sys.exit(proc.returncode)
    return out.decode()


def parse_top(jsonl: str, k: int = 5) -> list[tuple[str, int]]:
    """Parse bpftrace's JSON-lines output and return the top-k (comm, count) pairs."""
    counts: dict[str, int] = {}
    for line in jsonl.splitlines():
        if not line.strip():
            continue
        rec = json.loads(line)
        if rec.get("type") != "map":
            continue
        for key, val in rec.get("data", {}).get("@retransmits", {}).items():
            # The map key combines (comm, kstack); keep only the comm component.
            m = re.match(r'"?([^",]+)', key)
            if m is None:
                continue
            comm = m.group(1)
            counts[comm] = counts.get(comm, 0) + int(val)
    return sorted(counts.items(), key=lambda kv: -kv[1])[:k]


def main() -> None:
    print(f"# Tracing tcp_retransmit_skb for 10s on {sys.platform} ...")
    raw = run_bpftrace(duration_s=10)
    top = parse_top(raw, k=5)
    print(f"{'process':<24} {'retransmits':>12}")
    print("-" * 40)
    for comm, n in top:
        print(f"{comm:<24} {n:>12}")


if __name__ == "__main__":
    main()
# Sample run on a c6i.4xlarge in ap-south-1 acting as the Razorpay payments-API pod:
# (run as root or with CAP_BPF + CAP_PERFMON on Linux 5.8+)
$ sudo python3 trace_retransmits.py
# Tracing tcp_retransmit_skb for 10s on linux ...
process retransmits
----------------------------------------
gunicorn 1247
postgres 83
redis-server 14
node_exporter 4
containerd 1
Walk-through. BPFTRACE_PROG is a 6-line bpftrace script that compiles, under the hood, to eBPF bytecode and loads it via BPF_PROG_LOAD. The kprobe:tcp_retransmit_skb clause attaches the program to the entry of the kernel function tcp_retransmit_skb; the program fires every time TCP decides a packet needs retransmitting. @retransmits[comm, kstack] = count(); is bpftrace's syntactic sugar for "look up the entry keyed by (current process name, kernel stack) in a hash map, increment it, store it back" — the hash map is a BPF_MAP_TYPE_HASH allocated by bpftrace at load time. interval:s:1 runs once per second and dumps the current map contents as JSON, then clears the map. run_bpftrace is a Python harness that spawns the bpftrace subprocess, waits 10 seconds, sends SIGINT to flush the final map state, and returns the captured stdout. parse_top parses the JSON-lines output into a (comm, count) table. The whole exercise — load the program, run it for 10 seconds, see which process is suffering retransmits — costs <1% CPU on a busy box and produces a strictly more useful answer than tcpdump | grep retransmit, which would dump tens of thousands of packets to userspace and force you to grep through them post-hoc.
The map abstraction is what makes this cheap. The kernel side writes 1247 increments into a single hash-map slot (or 5 slots, one per process); the user-space side reads 5 entries. There is no per-event userspace copy, no parsing of full packet headers, no tcpdump-style PCAP file. The aggregation happens in-kernel at JITed native speed.
Why per-CPU maps matter for high-frequency hooks: a regular hash map's update path needs cross-CPU synchronisation — a bucket lock plus an atomic increment on the shared value — which on a 64-core EPYC under contention can serialise to thousands of cycles per update during cache-line ping-pong. A per-CPU map gives each CPU its own slot, removing the cross-CPU synchronisation entirely; updates take on the order of 10 cycles and scale linearly with cores. The user-space side reads all per-CPU values and sums them. For any kprobe firing more than ~100k times per second per core (every syscall, every context switch, every packet), per-CPU is the right default and the only one that survives at NIC line rate.
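The user-space half of the per-CPU pattern is just a sum at harvest time. A sketch with invented per-CPU counts (the layout mirrors what a lookup on a BPF_MAP_TYPE_PERCPU_ARRAY returns: one value per possible CPU):

```python
# percpu_sum.py — each CPU increments its own counter slot with no
# cross-CPU atomics; the reader sums across slots at readout.
# The per-CPU counts below are invented for the illustration.

def harvest_percpu(per_cpu_values: list[int]) -> int:
    """Sum one logical counter across its per-CPU slots."""
    return sum(per_cpu_values)


# Retransmit counts as recorded independently on each of 8 CPUs:
slots = [412, 0, 97, 310, 0, 0, 255, 173]
total = harvest_percpu(slots)  # the single fleet-visible count
```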
A subtle property: maps outlive programs. You can detach an eBPF program, the map's contents stay alive, and a fresh program loaded later can pin to the same map and pick up where the first left off. This is the foundation of map-pinned observability — long-running daemons like Cilium, Pixie, and Pyroscope manage maps as first-class objects in /sys/fs/bpf/, and individual programs come and go.
There is a related design decision about where the aggregation should happen. An eBPF program can do arithmetic, conditional branches, hash map lookups, and stack walks — but it cannot do anything that requires sleeping, allocating, or holding a kernel mutex. So the rule is: aggregate the smallest summary you need in-kernel (count, sum, histogram bucket); push the heavy interpretation (formatting, joining, alerting) to user-space. A bpftrace one-liner that prints a count() per comm is the canonical shape; a bpftrace one-liner that tries to build a percentile histogram in-kernel and emit a JSON document per second is asking the verifier for forgiveness it will not grant. The Pyroscope continuous profiler aggregates (pid, kernel_stack_id, user_stack_id) → count in-kernel — three integers and an atomic increment per probe firing — and lets the user-space agent join stack IDs to symbols at readout time. That division of labour is what keeps the in-kernel cost bounded and the user-space cost amortised.
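The division of labour can be sketched end to end. The log2 bucketing below is the shape of bpftrace's hist() aggregation, but the code itself is an illustration, not bpftrace's implementation, and the sample latencies are invented:

```python
# split_aggregation.py — in-kernel-shaped work (constant-time bucket
# increment) vs. user-space-shaped work (formatting at readout).

def log2_bucket(value_ns: int) -> int:
    """In-kernel half: reduce a measurement to a log2 bucket index.
    Constant time, no allocation, no sleeping — verifier-friendly."""
    return value_ns.bit_length()


def render_histogram(buckets: dict[int, int]) -> list[str]:
    """User-space half: formatting, free to allocate and loop."""
    lines = []
    for b in sorted(buckets):
        lo, hi = (0, 0) if b == 0 else (1 << (b - 1), (1 << b) - 1)
        lines.append(f"[{lo}, {hi}] ns: {'#' * buckets[b]}")
    return lines


# The "map": bucket index -> count, one integer increment per event.
counts: dict[int, int] = {}
for sample_ns in [900, 1100, 1300, 40_000, 52_000, 48_000, 51_000]:
    b = log2_bucket(sample_ns)
    counts[b] = counts.get(b, 0) + 1
```

The kernel side of this split ships one small integer per event into a map slot; everything readable by humans is produced in user-space, after the fact, from a handful of counters.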
The choice of map type is one of the few real engineering decisions the eBPF programmer makes. A BPF_MAP_TYPE_HASH is the right default for sparse, key-driven aggregation — counts per process, per kernel stack, per IP. A BPF_MAP_TYPE_ARRAY is right when keys are small dense integers (per-CPU index, histogram bucket, error code). A BPF_MAP_TYPE_LRU_HASH is right when keys can grow unboundedly and you want auto-eviction — tracking active TCP connections, for instance, where new connections appear constantly and old ones go away. A BPF_MAP_TYPE_RINGBUF is right when you genuinely need per-event delivery to user-space (full packet headers, full stack traces with arguments) and aggregation in-kernel is not enough — but the cost is back to roughly one user-space wake-up per N events instead of one per second. For a Hotstar streaming-router team running eBPF probes at 25 Gbps line rate, the choice of PERCPU_HASH over plain HASH is the difference between 1% overhead and 14% overhead at the same probe coverage; the difference is the cost of cross-core atomics on the hash bucket.
Hooks — where a program actually runs
A verified, JITed program with maps is still inert until it is attached to a hook. The hook is the kernel event that calls into your program; the choice of hook determines what your program gets to see. The hook taxonomy is a small list with very different semantics:
- kprobe / kretprobe — attach to the entry or exit of any non-inlined kernel function. The program receives the function's argument registers as its context. This is the most flexible hook (any kernel function is fair game) and the most fragile (function names and signatures change between kernel versions; a kprobe on tcp_v4_connect works on Linux 5.10 but might be renamed on 6.6).
- tracepoint — attach to a stable, in-source-code-marked event point. Tracepoints are explicitly maintained by kernel maintainers as ABI-stable; tracepoint:sched:sched_switch will exist and have the same fields on Linux 5.4 and 6.6. Always prefer tracepoints over kprobes when one exists for the event you want.
- uprobe / uretprobe — the user-space equivalent, attaching to functions in user-space binaries. Useful for tracing into a Python interpreter (uprobe:/usr/bin/python3:_PyEval_EvalFrameDefault) or a Postgres backend. The cost is higher than a kprobe — a trap from the running process into the kernel and then into the eBPF program — but it makes the entire user-space process tree observable.
- USDT (Userland Statically Defined Tracing) — explicit tracepoints baked into a user-space binary by its author (Postgres, the JVM, and Python all ship USDT probes). Stable across versions of the binary, like kernel tracepoints.
- XDP — runs at the earliest possible point in the kernel's network receive path, before any sk_buff allocation. The classic high-performance hook: programs at this layer can drop, redirect, or pass packets at line rate. Cilium's L4 load balancer lives here.
- tc (traffic control) — slightly later in the network path than XDP, with access to the fully-formed sk_buff. More expressive, slightly slower.
- cgroup hooks — fire on socket creation, bind, and connect within a cgroup. The hook for per-pod policy enforcement on Kubernetes.
- fentry / fexit — Linux 5.5+ replacements for kprobe/kretprobe with lower overhead, using BPF trampolines. Where available, prefer them.
- iter programs — Linux 5.8+ programs that the kernel runs once per item in a kernel data structure (every task_struct, every TCP socket, every cgroup). Useful for snapshot-style observability: list all sockets currently in TIME_WAIT, dump every running task's scheduler stats. Different from event-driven hooks; these are pull, not push.
- LSM hooks — attach to Linux Security Module hooks for sandboxing and policy enforcement; the BPF LSM (Linux 5.7+) is the attach point that BPF-based security-enforcement tooling builds on.
The choice of hook is the one architectural decision that does not have a clean default — every observability question has to start with "where in the kernel's lifecycle does the event I care about happen?" and the answer determines which hook the program attaches to. Chapters 42 and 43 walk the kernel-side and user-space-side hooks in detail; Chapter 44 covers how the per-event delivery path (perf buffer vs ring buffer) interacts with hook frequency.
A useful rule of thumb when picking between kprobe and tracepoint: tracepoints are like a public API, kprobes are like reaching into private internals. If the kernel maintainers added a tracepoint for the event you care about, they have promised to keep it stable; if you reach in via kprobe instead, you are on your own when the kernel refactors that function. Production tools that need to span many kernel versions choose tracepoints almost universally; one-off diagnostic scripts use kprobes freely because their lifetime is the length of the incident.
Common confusions
- "eBPF is the same as a kernel module." It is not. A kernel module has full access to kernel memory and runs at native ring-0; an eBPF program runs in a sandboxed VM, has access only to maps and helpers, and is statically verified before load. A bug in a kernel module panics the kernel; a bug in eBPF (almost always) gets the program rejected by the verifier at load time. The whole point of eBPF was to get the kernel-extensibility benefits without the kernel-module risks.
- "The verifier is just a linter; you can disable it." You cannot. The verifier is part of the bpf() syscall's load path. Even root cannot bypass it. The bpf_jit_harden sysctl controls additional hardening (constant blinding, JIT spray defences); the verifier itself is always on. This is intentional: the verifier is the safety property, not a development convenience.
- "eBPF programs can do anything because they have CAP_BPF." They can do exactly what their attach point and helper allowlist permit. A kprobe program can read kernel memory via bpf_probe_read_kernel but cannot call kmalloc. An XDP program can drop packets but cannot send arbitrary packets to userspace. The capability gates the load, not the per-helper authorisation.
- "BPF maps are like shared memory between kernel and userspace." Closer to "an RPC interface". Userspace cannot, in general, directly read a map's memory — reads go through the bpf() syscall, which performs the lookup safely under the kernel's ownership rules (ring buffers and BPF_F_MMAPABLE array maps are the narrow mmap-backed exceptions). The illusion of shared memory works because the syscall is fast and BPF_MAP_LOOKUP_BATCH amortises it; the default mechanism is kernel-mediated, not mmap-shared.
- "If my program loads, it is correct." Verifier acceptance proves the program does not crash, leak, or loop. It does not prove the program is semantically right — that it counts the right events, attaches to the right hook, or interprets the kernel struct field correctly. CO-RE relocations help with the last one but cannot save you from "I attached to tcp_retransmit_skb when I meant tcp_retransmit". Verifier-accepted programs can still be wrong; they just cannot be unsafe.
- "eBPF is a Linux thing only." Most production eBPF is Linux, but a Windows port (ebpf-for-windows, started in 2021, Microsoft-led) brings the same bytecode to Windows with its own verifier and JIT; the hooks differ. For a Linux-fleet engineer in 2026 the practical answer is "Linux", but the design has crossed the OS boundary.
Going deeper
CO-RE — Compile Once, Run Everywhere
The original eBPF model required compiling the program against the exact kernel headers of the target machine, which made distribution a nightmare — Razorpay's fleet might run five kernel versions across staging, prod, and edge boxes, and a different .bpf.o was needed per version. CO-RE fixes this with BTF (BPF Type Format): a compact debug-info representation that the kernel itself emits and that the eBPF program references at load time. When the program reads task->pid, the toolchain emits a relocation pointing at the task_struct.pid field by name, not by offset. At load time, libbpf rewrites the relocation with the field's actual offset on the running kernel — read from /sys/kernel/btf/vmlinux. The result: one binary loads on Linux 5.4, 5.10, 5.15, 6.1, 6.6, with no recompilation. Most production observability vendors (Pyroscope, Cilium, Pixie, Datadog) ship CO-RE binaries by default in 2026; the only fleets without CO-RE are those still on Linux <5.4, which are an increasingly niche population.
The mechanism is more general than just field offsets. CO-RE relocations also cover: enum value rewriting (an enum constant's numeric value can change between kernel versions; bpf_core_enum_value reads the running kernel's value), conditional field existence (bpf_core_field_exists lets a program take different paths based on whether a struct field exists at all), and type compatibility checks (the task_struct of one kernel really is a different type from the task_struct of another, and CO-RE knows the difference). Combined, these primitives let one binary cover a range of kernel versions wide enough to span an entire production fleet — including the kernels you have not yet booted but will inherit when the next Ubuntu LTS lands. A team that adopts CO-RE-aware tooling once stops thinking about kernel-version compatibility for years afterward.
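A toy model of a field relocation makes the load-time rewrite concrete. The struct offsets below are invented; real BTF comes from /sys/kernel/btf/vmlinux and the rewriting is done by libbpf:

```python
# core_reloc_toy.py — toy model of a CO-RE field relocation: the
# program references a field *by name*, and the loader patches in the
# running kernel's offset. Offsets here are invented for illustration.

# Per-kernel "BTF": field name -> byte offset within task_struct.
BTF_BY_KERNEL = {
    "5.4": {"pid": 1256, "tgid": 1260},
    "6.6": {"pid": 1304, "tgid": 1308},
}


def relocate(field: str, running_kernel: str) -> int:
    """Resolve a named field against the running kernel's type info,
    the way libbpf rewrites a CO-RE relocation at load time."""
    btf = BTF_BY_KERNEL[running_kernel]
    if field not in btf:
        raise KeyError(f"field {field!r} absent on kernel {running_kernel}")
    return btf[field]


# The same .bpf.o loads on both kernels; only the patched offset differs.
off_54 = relocate("tgid", "5.4")
off_66 = relocate("tgid", "6.6")
```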
Tail calls and program-to-program calls
A single eBPF program is capped at 1M verified instructions. For programs that need more (a complex packet-processing pipeline at XDP, a multi-stage tracing pipeline), the answer is tail calls: a program can call into another program via bpf_tail_call, which jumps (rather than nests) into the second program. The second program runs in the same context and inherits the same stack budget; the verifier verifies each program independently. Tail calls allow effectively unbounded composition while keeping each piece small enough to verify. The Cilium project's XDP datapath is built almost entirely out of tail-called program chains — packet classification, L4 load balancing, conntrack, encap — each a separate verified program.
A subtler form is bpf-to-bpf function calls (Linux 4.16+), which are real subroutine calls within a single program rather than program-to-program jumps. The verifier verifies callees first and treats them like inlinable functions. This is what lets modern eBPF programs look like reasonably structured C — splitting work across helpers — rather than the giant unrolled blobs the early-eBPF era produced.
The composition story matters in practice because real production eBPF programs are large. Cilium's XDP datapath spans roughly 30,000 lines of C across hundreds of separately-verified programs; Pyroscope's continuous-profiler program tree includes a dozen attached probes that share a single ringbuf. Without tail calls and bpf-to-bpf, every one of these would have to fit in a single 1M-instruction program, which is operationally impossible. The architecture's "small, verifiable units composed via well-defined transitions" is the only reason eBPF scales from one-line bpftrace scripts to fleet-wide observability stacks; it is the same property, applied recursively.
What lives in /sys/fs/bpf/ — pinning and the BPF filesystem
By default an eBPF program and its maps die when the user-space loader exits. Pinning lets you expose them as files in a special filesystem (bpffs), conventionally mounted at /sys/fs/bpf/, where they persist across loader restarts. A daemon can pin its maps once at startup, then detach, restart, re-attach, and pick up the same map state — useful for long-running observability stacks that survive their own crashes. The pinned maps are visible via bpftool map list and bpftool map dump pinned /sys/fs/bpf/my_map, which is the operator's debugging interface for "what does this eBPF program think is going on right now". For incident response on a Razorpay or Hotstar pod where you suspect a stuck eBPF program, bpftool prog list followed by bpftool prog dump xlated id <N> shows the verified bytecode of every loaded program, and bpftool map dump shows live map contents. This is the eBPF equivalent of ps and cat /proc/<pid>/status.
The verifier's most common failure modes — and what to do
The three error messages every new eBPF programmer eventually hits, in roughly the order they hit them:
- "R1 invalid mem access 'inv'" — you dereferenced a pointer the verifier could not narrow to a valid type, almost always because you forgot to null-check a bpf_map_lookup_elem result. Fix: add if (val == NULL) return 0; before the deref.
- "BPF program is too large; processed N insn" — you blew the verification budget, usually with too-large unrolled loops or too-deep nested if-chains. Fix: replace the loop with a bpf_loop() helper call, or move the heavy work to user-space.
- "infinite loop detected" — you wrote a loop the verifier could not bound. Fix: add an explicit if (i >= MAX) break; with a constant MAX, or use a bounded for loop with a clang-known trip count.
A practical workflow is to keep the verifier's log around — in modern libbpf you request it via the kernel_log_level field of bpf_object_open_opts (older releases used bpf_object__load_xattr with a log_level field) — and read it like a compiler diagnostic. The log shows the path the verifier took, the register types it inferred, and the exact instruction where the proof failed. After the third or fourth verifier rejection you start reading these logs the way C programmers read clang errors: as the most direct route to understanding what the verifier knows about your program.
The shape of the log is worth knowing in advance. Each block corresponds to one program path the verifier explored, and each instruction is annotated with the inferred register types after that instruction. A line like R1=ctx(off=0,imm=0) R2=map_value(off=0,ks=4,vs=8,imm=0) reads as "R1 is a pointer to the program's context; R2 is a pointer to a map value with key-size 4 and value-size 8". When the verifier hits an instruction whose operand types are incompatible with the operation, it prints the failing line and the diagnostic. The recipe: run bpftool -d prog load my_prog.bpf.o /sys/fs/bpf/my_prog 2>verifier.log (the -d flag asks bpftool for a full verifier log), open verifier.log, find the rejection message near the end, and walk backwards reading register types until you find the one that does not match what you expected. The mismatch is the bug. After a dozen rejections this becomes routine; the verifier stops feeling like an opaque adversary and starts feeling like a strict but consistent reviewer.
Reproduce this on your laptop
# Reproduce the retransmit-counting demo on a Linux box (5.8+ recommended for CAP_BPF).
sudo apt install bpftrace linux-tools-common
# (no pip install or venv needed — the script uses only stdlib)
sudo python3 trace_retransmits.py
# In another terminal, give curl something to hit: python3 -m http.server 8080
# Generate retransmits with: sudo tc qdisc add dev lo root netem loss 5%
# then curl localhost:8080 in a loop. Remove the qdisc with: sudo tc qdisc del dev lo root
When eBPF is not the right answer
The case for eBPF is strong enough that it is easy to forget the cases where it isn't. Three are worth knowing.
First, single-machine, single-binary debugging on your own laptop. If you own the source, you can rebuild, you have hours, and you want full-fidelity tracing — gdb, strace, perf record -g, or just printf are simpler tools and will produce a cleaner answer than spinning up a bpftrace one-liner. eBPF earns its keep when you do not own the source (the kernel, a vendor binary, a black-box service), or when the trace must run continuously in production at low overhead. On your dev box, gdb is still the right first reach for "why did this function return -1?".
Second, per-event delivery at very high event rates. If you genuinely need every kernel-event payload in user-space — a full tcpdump-equivalent capture for forensic analysis, a DTrace-style script that rejects aggregation as "lossy" — then BPF_MAP_TYPE_RINGBUF works up to a point but loses events under sustained pressure once the userspace reader falls behind the producer. For 25 Gbps NIC line-rate per-packet capture, the right answer is still hardware-assisted (DPDK, AF_XDP zero-copy) rather than ringbuf alone. eBPF's strength is aggregation in-kernel; if you cannot aggregate, you are using the tool against its grain.
Third, anything requiring blocking calls or memory allocation in-kernel at probe time. eBPF programs cannot sleep (with narrow exceptions for BPF_F_SLEEPABLE programs on tracing hooks), cannot allocate, and cannot acquire kernel locks held over context switches. A workload that genuinely needs to call into a kernel allocator or wait on a kernel mutex during a probe is asking for a kernel module, not eBPF. The good news: 95% of observability use cases do not need any of these things, which is why the verifier's restrictions cost almost nothing in practice.
Where this leads next
This chapter set up the architecture; the next seven chapters in Part 6 each take one piece of it and turn it into a working tool. Chapter 40 (bpftrace: the awk of production) covers the high-level scripting language that hides verifier and map mechanics behind a one-line program — it is what Aditi typed in the chapter opener. Chapter 41 (BCC toolchain) is the lower-level Python+C framework underneath bpftrace. Chapters 42 and 43 walk the hook taxonomy: kernel kprobes and tracepoints, then user-space uprobes and USDT. Chapter 44 (perf buffer vs ring buffer) is about the per-event delivery path when aggregation is not enough. Chapter 45 puts it all together with eBPF-driven latency histograms — the tool that bridges from Part 6 (eBPF) into Part 7 (tail latency).
The single insight to take from this chapter: eBPF is not magic, it is a sandboxed VM whose safety guarantees come from a static verifier and whose performance comes from a JIT. Once you internalise the trinity — verifier, JIT, maps — every error message, every helper restriction, every map-type choice in the rest of Part 6 lands as a consequence of a property you already understand.
A practical first step before reading the next chapter: log into a production-shaped Linux box (any reasonably modern AWS, Azure, or GCP instance in ap-south-1 / Mumbai will do) and run bpftool prog list. On any modern distribution, on any cloud provider, on any cluster running Datadog or Pyroscope or Cilium, the output will show dozens of already-loaded eBPF programs you have been running without realising. Each line is a program some daemon loaded with the architecture this chapter described — verifier passed, JIT compiled, maps allocated, hooks attached. The infrastructure is already on your machine; the next seven chapters teach you to operate it.
A closing note for SREs who carry the pager: every hour invested in becoming fluent with eBPF tooling pays back the first time it shortens an outage. The Razorpay opener of this chapter is not hypothetical — it happens, in some form, to every team running production Linux at scale, every month. The team that has eBPF muscle memory closes incidents in 11 minutes. The team that does not closes them in 90.
References
- Brendan Gregg, BPF Performance Tools (Addison-Wesley, 2019) — the canonical cookbook; chapters 2 and 4 cover the architecture from the operator's view.
- The Linux kernel's Documentation/bpf/bpf_design_QA.rst and verifier.rst — the maintainers' description of what the verifier does and why.
- Alexei Starovoitov, "BPF: the next big thing" (LWN, 2014) — the introduction of eBPF, written when it was still controversial.
- Andrii Nakryiko, "BPF CO-RE reference guide" — the practical guide to CO-RE relocations and BTF.
- The libbpf-bootstrap repository — the modern reference for writing CO-RE eBPF programs in C with libbpf.
- The bpftool reference manual — the operator's tool for introspecting loaded programs and maps.
- /wiki/continuous-profiling-in-production — the previous part's closing chapter, where eBPF first appears as the engine behind off-CPU profilers.
- /wiki/bpftrace-the-awk-of-production — the next chapter, where the tooling on top of this architecture becomes the day-to-day interface.