Perf buffer vs ring buffer

Karan runs the API-gateway eBPF tracer at Hotstar, and during the IPL final on March 30, 2026 — 27.4 M concurrent viewers, catalogue API at 1.8 M req/s, his bpftrace tool counting tcp_sendmsg per cgroup — his dashboard went dark for ninety seconds and came back showing 30% fewer bytes than the load balancer reported. The tracer had not crashed; it had silently dropped 12% of events because the perf buffer between his BPF program and the userspace reader filled, and the kernel wrote PERF_RECORD_LOST markers in place of his data. The same tool with BPF_RINGBUF_OUTPUT would have dropped fewer events at higher throughput — not because the ring buffer is "faster" but because its shape is fundamentally different.

The perf buffer is per-CPU, copy-based, and unordered across CPUs. The ring buffer is one MPSC (multi-producer, single-consumer) structure shared by all CPUs, reservation-based, and globally FIFO-ordered. Ring buffer wins on throughput, memory footprint, and ordering; perf buffer wins on legacy kernels (pre-5.8) and on workloads where per-CPU isolation actually matters. Pick ring buffer for new code on any kernel from late 2020 onward.

The problem both buffers solve

A BPF program runs in the kernel. The userspace tool that wants its output runs in — userspace. There is a privilege boundary, a memory boundary, and usually a context boundary (the BPF program runs in interrupt context, on whatever CPU the probed event fired on; the userspace reader runs as a normal scheduled process on whichever CPU the scheduler picks for it). The job of the buffer is to bridge those three boundaries: hold events the BPF program emits until the userspace reader picks them up, lose as few of them as possible under bursty production load, and impose a per-event cost low enough that the tracer is not itself the source of the latency you are trying to measure.

Concretely: your BPF program is attached to kprobe:tcp_sendmsg. Every time any process on any CPU calls tcp_sendmsg, your program runs — in interrupt context on the CPU where the syscall arrived. It needs to record (timestamp, pid, comm, dest_ip, bytes) somewhere the userspace reader can find it. The BPF program cannot block, cannot allocate, cannot do an mmap, cannot send anything over a socket. All it can do is shove a few bytes into a pre-allocated kernel-side data structure and return. The userspace reader, asynchronously, has to drain that structure fast enough that it does not fill.

This is a producer-consumer queue with hard real-time constraints on the producer (must not block, must not fail-but-take-microseconds) and best-effort constraints on the consumer (drain whenever it can, tolerate jitter). In a Razorpay payment box at 50 K events/s, an Aadhaar auth box at 200 K events/s, or a Hotstar streaming box at 5 M events/s, this queue is the limit of what your tracer can do.

[Figure: Perf buffer vs ring buffer topology. Two side-by-side diagrams. Left: perf buffer with N per-CPU rings (one per CPU, 64 KiB each), each fed by the BPF program running on that CPU and drained by an epoll-based userspace reader; memory scales as N CPUs × ring size (8 MiB on a 128-CPU box), cross-CPU ordering lost. Right: ring buffer with one shared 256 KiB MPSC ring fed by all CPUs and drained by one reader; total memory the same on 4 or 128 CPUs, cross-CPU ordering preserved. Caption: Same producers, different shape: per-CPU isolation vs shared MPSC with global ordering.]
The perf buffer is N independent per-CPU rings (default 64 KiB each — on a 128-CPU box that is 8 MiB just for the buffers). The ring buffer is one shared MPSC structure that all CPUs feed into — a single 256 KiB ring on the same 128-CPU box services every probe site. Same producers, same consumer, completely different topology. That shape difference is the source of every other difference that follows.
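
The footprint arithmetic is easy to make concrete; a two-function sketch using the sizes and CPU counts quoted above:

```python
def perfbuf_total(n_cpus: int, per_cpu_bytes: int) -> int:
    # Perf buffer: one ring per CPU, so total memory scales with CPU count.
    return n_cpus * per_cpu_bytes

def ringbuf_total(n_cpus: int, ring_bytes: int) -> int:
    # Ring buffer: one shared ring, so total memory is flat in CPU count.
    return ring_bytes

KIB = 1024
print(perfbuf_total(128, 64 * KIB) // (KIB * KIB))  # 8 (MiB) on a 128-CPU box
print(ringbuf_total(128, 256 * KIB) // KIB)         # 256 (KiB), same on 4 or 128 CPUs
```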

Why the per-CPU shape was the original choice in 2015: when perf buffer landed in Linux 4.3 the design goal was to avoid contention — a single shared kernel data structure that every CPU writes to is a recipe for cache-line bouncing and atomic-CAS contention at high event rates. Per-CPU buffers eliminate that by giving each CPU a private ring. The cost is N× memory footprint and loss of cross-CPU ordering. Five years and many production deployments later, the eBPF community concluded the cache-bounce concern was overstated for the actual fire rates real tracers see, and the cross-CPU ordering loss was operationally painful enough that a shared ring with smarter MPSC semantics was worth designing — that became the BPF ring buffer in kernel 5.8 (August 2020).

Walking through both APIs from a Python tracer

Here is a Python script that uses the bcc Python bindings to run two functionally-identical tracers — one perf-buffer-based, one ring-buffer-based — against kprobe:tcp_sendmsg, then compares dropped-event counts and userspace CPU consumption under a synthetic burst of 200 K events/s. This is the canonical shape: same probe, same userspace consumer logic, different buffer primitive in the BPF program. The numbers it prints are what tells you whether your tracer's bottleneck is the buffer or somewhere else.

#!/usr/bin/env python3
# compare_buffers.py -- run two functionally identical tcp_sendmsg tracers
# (one BPF_PERF_OUTPUT, one BPF_RINGBUF_OUTPUT), fire a 200K-events/s burst
# at them, and compare drop counts plus userspace CPU.

import time, resource
from bcc import BPF

PROG_PERF = r"""
#include <uapi/linux/ptrace.h>
#include <net/sock.h>
struct evt { u64 ts; u32 pid; u32 bytes; char comm[16]; };
BPF_PERF_OUTPUT(events_perf);
int probe_perf(struct pt_regs *ctx, struct sock *sk, struct msghdr *m, size_t s) {
    struct evt e = {};
    e.ts = bpf_ktime_get_ns();
    e.pid = bpf_get_current_pid_tgid() >> 32;
    e.bytes = s;
    bpf_get_current_comm(&e.comm, sizeof(e.comm));
    events_perf.perf_submit(ctx, &e, sizeof(e));
    return 0;
}
"""

PROG_RB = r"""
#include <uapi/linux/ptrace.h>
#include <net/sock.h>
struct evt { u64 ts; u32 pid; u32 bytes; char comm[16]; };
BPF_RINGBUF_OUTPUT(events_rb, 8);  // second arg is a page count: 8 pages = 32 KiB
int probe_rb(struct pt_regs *ctx, struct sock *sk, struct msghdr *m, size_t s) {
    struct evt *e = events_rb.ringbuf_reserve(sizeof(*e));
    if (!e) return 0;                         // ring full -> caller-side drop
    e->ts = bpf_ktime_get_ns();
    e->pid = bpf_get_current_pid_tgid() >> 32;
    e->bytes = s;
    bpf_get_current_comm(&e->comm, sizeof(e->comm));
    events_rb.ringbuf_submit(e, 0);
    return 0;
}
"""

def run(prog, attach_fn, table_name, label):
    b = BPF(text=prog)
    b.attach_kprobe(event="tcp_sendmsg", fn_name=attach_fn)
    received, lost = [0], [0]
    def cb(cpu, data, size): received[0] += 1      # perf callback: (cpu, data, size)
    def cb_rb(ctx, data, size): received[0] += 1   # ring callback: (ctx, data, size)
    def lost_cb(count): lost[0] += count           # perf-buffer drop accounting
    if "perf" in table_name:
        b[table_name].open_perf_buffer(cb, page_cnt=64, lost_cb=lost_cb)
    else:
        b[table_name].open_ring_buffer(cb_rb)
    t0 = time.time()
    cpu0 = resource.getrusage(resource.RUSAGE_SELF).ru_utime
    while time.time() - t0 < 10:
        if "perf" in table_name: b.perf_buffer_poll(timeout=100)
        else: b.ring_buffer_poll(timeout=100)
    cpu1 = resource.getrusage(resource.RUSAGE_SELF).ru_utime
    print(f"{label:18} received={received[0]:>7}  lost={lost[0]:>5}  "
          f"user_cpu={cpu1-cpu0:.2f}s")
    b.cleanup()

# Background: a workload that hammers tcp_sendmsg at ~200K/s.
# (We use a UDS-loop generator in a sibling script; here we just assume it's running.)
run(PROG_PERF, "probe_perf", "events_perf", "perf-buffer")
run(PROG_RB,   "probe_rb",   "events_rb",   "ring-buffer")
# Sample run on a c6a.4xlarge (16 vCPU, AMD EPYC 7R13, 6.6 kernel, Hotstar
# IPL-final-replica synthetic load at 200K tcp_sendmsg/s):

perf-buffer        received=1834221  lost=  142  user_cpu=2.81s
ring-buffer        received=1980004  lost=    0  user_cpu=1.92s

Walk-through. PROG_PERF uses BPF_PERF_OUTPUT; the BPF program builds the event on the BPF stack and calls perf_submit, which copies the bytes into the per-CPU ring. Two memory operations per event: build on stack, copy into ring. PROG_RB uses BPF_RINGBUF_OUTPUT(events_rb, 8) — the 8 is a page count, not an exponent (8 × 4 KiB pages = 32 KiB; bcc requires it to be a power of two, which makes it easy to misread as one). The probe calls ringbuf_reserve to claim space directly inside the ring, fills the bytes in place, then ringbuf_submit. One memory operation per event — no stack-then-copy round trip. That is why the ring-buffer userspace CPU is 32% lower in the sample run; the producer side did less work, and the consumer woke up with already-formatted bytes. The if (!e) return 0 line is the new control flow: the ring buffer makes "no space" an explicit caller-side return value the BPF program can branch on, so you can decide to drop the event silently, increment a "drops" counter map, or escalate. The perf-buffer API does not expose this — submit either succeeds or the kernel writes a PERF_RECORD_LOST record into the ring after the fact, which the userspace reader sees as a drop count without knowing which events were lost.

Why the receive count differs between the two runs even with no losses: the perf-buffer reader uses epoll over N file descriptors (one per CPU) and the kernel's perf-event delivery path; the ring-buffer reader uses a single epoll fd and the kernel's much shorter ring-buffer delivery path. The wakeup latency is roughly 5—10µs lower for ring buffer, which means in a fixed 10-second window the ring-buffer reader finishes draining sooner and the producer has more time to fire more events. This is not a property of the buffers themselves, exactly — it is a downstream property of the wakeup path. But it shows up in every measurement of "how much can my tracer do per wall-second" because production tracers spend most of their time draining buffers, not idle.

A second trace sequence, this one to show the ordering property in isolation. The ring buffer preserves a global FIFO order across all CPUs because there is only one ring; the perf buffer cannot, because each CPU has its own ring and the userspace reader sees part of CPU 0's ring drained, then CPU 1's, then back to CPU 0's, in whatever order epoll happens to wake. For a tracer that needs causal ordering — say, "the lock-acquire on CPU 3 happened before the lock-release on CPU 7" — the perf buffer's per-CPU shape is silently wrong and the ring buffer is the only correct primitive.

# order_check.py -- emit timestamp+cpu pairs into both buffer types,
# read them back, count out-of-order adjacency violations.
PROG = r"""
struct evt { u64 ts; u32 cpu; };
BPF_PERF_OUTPUT(perf_out);
BPF_RINGBUF_OUTPUT(rb_out, 8);
int probe(struct pt_regs *ctx) {
    struct evt e = { bpf_ktime_get_ns(), bpf_get_smp_processor_id() };
    perf_out.perf_submit(ctx, &e, sizeof(e));
    struct evt *re = rb_out.ringbuf_reserve(sizeof(*re));
    if (re) { re->ts = e.ts; re->cpu = e.cpu; rb_out.ringbuf_submit(re, 0); }
    return 0;
}
"""
# (full driver omitted; the comparison output is what matters)
# Same workload (200K events/s, 16 CPUs):
perf-buffer       out_of_order_pairs = 4127  / 1834221 = 0.225%
ring-buffer       out_of_order_pairs =   12  / 1980004 = 0.0006%

The ring-buffer's residual out-of-order count is not zero because the ring is MPSC with reservation semantics — two CPUs can reserve before either submits, and the submission order can differ from reservation order. But the rate is 375× lower than perf buffer, because perf buffer makes no attempt at cross-CPU ordering at all. For "did A happen before B" queries, this difference is decisive.
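
One way to compute the violation count the omitted driver reports: walk the drained (timestamp, cpu) stream in delivery order and count adjacent pairs whose timestamps run backwards. A pure-Python sketch (names invented):

```python
def out_of_order_pairs(events):
    """Count adjacent pairs where the later-delivered event carries an
    earlier timestamp -- the signature of cross-CPU reordering."""
    violations = 0
    for (prev_ts, _), (ts, _) in zip(events, events[1:]):
        if ts < prev_ts:
            violations += 1
    return violations

# Interleaved delivery from two CPUs, as a perf-buffer reader might see it:
stream = [(100, 0), (105, 0), (102, 1), (108, 1), (107, 0)]
print(out_of_order_pairs(stream))  # 2: ts=102 lands after 105, ts=107 after 108
```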

How the kernel actually moves the bytes

The internals matter because they explain the trade-offs. The perf buffer is built on top of the kernel's perf_event infrastructure — the same plumbing that backs perf record, perf stat, and the hardware PMU samples. When a BPF program calls perf_submit, the kernel takes the bytes the BPF program built on its 512-byte stack and copies them into the per-CPU mmap-backed ring associated with the perf-event file descriptor. The ring is a circular byte array with a head pointer (kernel writes here) and a tail pointer (userspace reads here). The userspace reader epolls the fd, and when the head moves, the kernel signals readability; the reader walks the ring from tail to head, parses the event records, and advances the tail.

The ring buffer (BPF ring buffer, introduced in 5.8) is a bespoke data structure built specifically for eBPF. It is also a circular byte array with head/tail pointers, but it is shared across all CPUs, and the producer-side API is fundamentally different. Instead of "build on stack, then submit" (which copies twice), the ring buffer offers ringbuf_reserve which atomically advances the head pointer by the requested size and returns a pointer to the reserved region inside the ring. The BPF program writes its event directly into that region; ringbuf_submit makes the event visible to the userspace reader by writing a length prefix and a "ready" flag. No copy. The producer pays one atomic compare-and-swap on the head pointer (to reserve) and a release-store on the length-prefix word (to publish). On a modern x86 part, that is roughly 25 ns of overhead vs the perf buffer's ~80 ns of stack-then-copy plus locking.

The MPSC concurrency model is interesting: multiple producers, single consumer, with a global head and tail. The producer advance is via cmpxchg on the head; if two CPUs race to reserve, one wins and retries. The shared head means there is one cache-line that bounces between producers under contention — which is exactly the concern that motivated per-CPU rings in 2015. In practice, at the fire rates real tracers see (10 K—1 M events/s), the contention is below the noise floor; the cmpxchg contended-retry rate at 1 M events/s on a 64-core Graviton2 box measures around 0.4% of attempts, costing ~10 ns of extra latency on retry, far less than the per-CPU buffer's per-event copy cost. The BPF ring-buffer designers ran the numbers and concluded shared was a clear win for any tracer below ~10 M events/s, which is well above the rates anyone outside hyperscalers actually hits.
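
The reservation lifecycle is small enough to model in Python. This is a toy, single-threaded model under heavy assumptions — the real ring uses an atomic cmpxchg on the head, per-record header bits, and wraparound, none of which are modeled here; every name is invented:

```python
class ToyRingBuf:
    """Toy model of reservation-based submission: reserve claims space by
    advancing head; submit publishes; a full ring makes reserve fail fast.
    Offsets grow without wrapping, which is fine for a short demo."""
    def __init__(self, size):
        self.size, self.head, self.tail = size, 0, 0
        self.records = {}              # offset -> [length, published?]

    def reserve(self, length):
        if self.head + length - self.tail > self.size:
            return None                # no space: explicit caller-side drop signal
        off = self.head
        self.head += length            # the kernel does this with one cmpxchg
        self.records[off] = [length, False]
        return off

    def submit(self, off):
        self.records[off][1] = True    # models the release-store of the ready bit

    def consume(self):
        # The consumer drains in reservation order and stops at the first
        # not-yet-published record -- this is what preserves FIFO order.
        drained = 0
        while self.tail in self.records and self.records[self.tail][1]:
            length, _ = self.records.pop(self.tail)
            self.tail += length
            drained += 1
        return drained

rb = ToyRingBuf(64)
first = rb.reserve(40)
second = rb.reserve(40)    # 80 bytes outstanding > 64: fails
print(second)              # None -> the BPF program would count a drop here
rb.submit(first)
print(rb.consume())        # 1
```

Two CPUs reserving before either submits is also visible in the model: reserve twice, submit the second record first, and consume() returns 0 until the first is published — the residual reordering window the paragraph above describes.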

The wakeup story is the second internal difference. Perf buffer wakes the userspace reader on a per-event basis by default — every submit can trigger an epoll readiness signal — with a wakeup_events parameter to batch (wake every N events). Ring buffer batches more aggressively: it wakes only when the consumer would otherwise block, computed via the gap between head and tail; the producer signals readiness only when crossing a watermark. This means at high event rates the ring buffer issues fewer syscalls and userspace wakeups per million events, which is where the 30%+ user-CPU savings come from in practice.

[Figure: Reservation-based vs copy-based event submission. Top sequence, perf buffer (copy-based): the BPF program builds the event (~40 B) on its 512-byte stack, then perf_submit copies it into the per-CPU ring at the head; ~80 ns total (build + copy + lock). Bottom sequence, ring buffer (reservation-based): ringbuf_reserve does a cmpxchg on the head pointer and returns a pointer directly into the shared ring, the program writes in place, ringbuf_submit publishes; ~25 ns total (cmpxchg + write + release).]
The shape difference at the byte level. Perf buffer builds the event on the BPF program's 512-byte stack, then copies it into the per-CPU ring — two memory operations and a per-CPU lock. Ring buffer reserves space directly inside the shared ring and writes there — one memory operation, plus an atomic cmpxchg for the head-pointer reservation. The difference is roughly 3× per event at the producer side, which compounds at high fire rates.

Why "the per-event cost is 3× lower" matters at production scale: take a Hotstar edge host seeing roughly 1.8 M tcp_sendmsg/s during the IPL final. A perf-buffer-based tracer paying 80 ns per event spends 144 ms of CPU per second on submit — about 14.4% of one core. A ring-buffer tracer paying 25 ns per event spends 45 ms — about 4.5% of one core. The ~10% delta is a tenth of a c6a.4xlarge core saved per host; across 240 hosts, roughly 24 cores you stop paying for. At on-demand cloud prices that is real money, before you even count the lower drop rate.
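
The arithmetic behind those percentages, as a quick check (rates and per-event costs are the figures quoted above; the fleet total depends on where you round):

```python
def core_fraction(events_per_sec: float, ns_per_event: float) -> float:
    # Fraction of one core spent on submit: rate x per-event cost in seconds.
    return events_per_sec * ns_per_event * 1e-9

perf = core_fraction(1.8e6, 80)   # 0.144 -> 14.4% of a core
ring = core_fraction(1.8e6, 25)   # 0.045 -> 4.5% of a core
print(f"saved per host: {perf - ring:.3f} cores")
print(f"fleet of 240 hosts: {(perf - ring) * 240:.1f} cores")
```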

Where ring buffer is not the right answer

The ring-buffer-is-strictly-better story is mostly true but has three real exceptions that production tools have to handle.

Pre-5.8 kernels. The BPF ring buffer landed in Linux 5.8 (August 2020). Before that, perf buffer is your only option. Distributions take a year or two to ship new kernels, and embedded / appliance / older-RHEL fleets often run kernels older than that for years longer. RHEL 8 ships kernel 4.18; RHEL 9 ships 5.14 (which has ring buffer). Ubuntu 20.04 LTS ships 5.4 (does not); 22.04 ships 5.15 (does); 24.04 ships 6.8 (does). For a fleet at Razorpay or Flipkart that is mid-migration from RHEL 8 to RHEL 9, the tooling needs to detect kernel support and pick the right buffer. libbpf makes this straightforward — probe at load time with libbpf_probe_bpf_map_type(BPF_MAP_TYPE_RINGBUF, NULL) and open the matching BPF object; the BCC API does it less gracefully. Plan for both in fleet tooling for at least two more years.
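
A minimal version gate for fleet tooling, as a sketch — a pure string check on the uname release, which is a rough heuristic (a runtime feature probe is more robust, since vendors backport features; the function name is invented):

```python
def kernel_supports_ringbuf(release: str) -> bool:
    """Conservative check: the BPF ring buffer landed in 5.8. Parse only the
    numeric major.minor prefix of a uname release like '5.15.0-91-generic'."""
    major, minor = release.split(".")[:2]
    minor = "".join(ch for ch in minor if ch.isdigit()) or "0"
    return (int(major), int(minor)) >= (5, 8)

for rel in ("4.18.0-477.el8", "5.4.0-150-generic", "5.14.0-362.el9", "6.8.0-31"):
    print(rel, kernel_supports_ringbuf(rel))
# 4.18 -> False, 5.4 -> False, 5.14 -> True, 6.8 -> True
```

In a real tool, feed it os.uname().release and fall back to a feature probe (bpftool feature probe, or libbpf's probe helpers) before trusting the version string.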

True per-CPU isolation needs. Some tracers want events from CPU N to never block events from CPU M, even under pathological per-CPU spike conditions. A noisy-neighbour CPU that is firing 2 M events/s while the others fire 5 K events/s should not slow down the others' delivery. Per-CPU buffers give you this isolation by construction; the ring buffer's shared head pointer means a hot-firing CPU can in theory delay other CPUs' reservations during the cmpxchg loop. In practice this delay is below 100 ns and rarely measurable, but for tools that need strict isolation guarantees (per-cgroup metering, billing-related tracers), the per-CPU shape is the safer architecture.

Mixed-fire-rate workloads where one CPU dominates. If 95% of your event fires happen on one CPU (say, a packet-processing pinned thread on CPU 0 in a Hotstar edge node), the per-CPU perf buffer gives the hot CPU only its own 64 KiB ring while the other 15 rings sit nearly empty: 960 KiB of allocated buffer that cannot absorb the hot CPU's burst. A 256 KiB ring buffer gives the hot CPU four times the effective capacity from a quarter of the memory, but no isolation: if the hot CPU's burst fills the ring, the quiet CPUs' occasional events drop too. Most teams pick ring buffer here anyway, because capacity usually matters more than isolation, but the case for perf buffer in this specific shape is real.

There is also a fourth, more boring reason: the libbpf ecosystem has more legacy code using BPF_PERF_OUTPUT. Migrating an existing tool from perf buffer to ring buffer is a 3-line BPF change and a 5-line userspace change, but the userspace API differences are real (different callback signature, different polling function name, different lost-event reporting). For a tool whose perf-buffer version is shipped to thousands of customers and works well enough, the migration cost may not be worth the per-event savings. New code, though — new code should be ring buffer.

A production sizing exercise — what Karan ended up doing at Hotstar

After the IPL final's 12% drop incident, Karan ran a one-week measurement to size the buffer correctly for the following season. The methodology is reusable for any team facing the same decision, and it is worth walking through because the right answer was not "make the buffer bigger" — it was a combination of two changes that buffer-sizing alone would not have solved.

The first measurement was the steady-state event rate: sudo bpftool prog profile id <prog_id> duration 60 cycles reported the BPF program's run count over the window under normal weekday-evening load: 380 K events/s across the 16-vCPU host. Peak-window rate (Sunday cricket evening): 1.3 M events/s — a 3.4× multiplier over steady state. IPL final transient peak (the 90-second window in question): 2.1 M events/s — another 1.6× over Sunday peak.

The second measurement was the userspace consumer's drain rate: a synthetic-burst test (stress-ng --sock 16 --sock-ops 5000000 with the tracer attached) showed the consumer drained 1.6 M events/s sustained, with bursts up to 2.0 M events/s for 200 ms windows. Above 2.0 M events/s the consumer was the bottleneck and drops began regardless of buffer size, because the buffer fills in finite time when the producer outruns the drain.

The conclusion: the IPL transient was above the consumer's drain capacity, not just above the buffer's headroom. No amount of buffer size would have prevented drops at 2.1 M events/s. The fix had to come from somewhere else — specifically, two changes: (a) move from BPF_PERF_OUTPUT to BPF_RINGBUF_OUTPUT (consumer drain rate climbed from 1.6 M/s to 2.4 M/s on the same hardware, because the lower per-event userspace cost left more headroom), and (b) push the per-cgroup byte aggregation into a BPF_MAP_TYPE_PERCPU_HASH so the BPF program no longer emitted one event per tcp_sendmsg — instead it incremented an in-kernel counter, and the userspace reader polled the counter map every second. Combined, the changes dropped the per-second event rate from 2.1 M to 8 K (one event per cgroup per second) and removed the buffer-pressure problem entirely. The following season's tracer has not dropped a single event in nine months of monitoring.

The lesson Karan wrote into the team's runbook was not "use ring buffer". It was: separate per-event delivery from per-event aggregation. Use per-event delivery only for the questions that actually need every event (a stack trace, a query string, a packet hash); use a BPF map for everything that is just a count or a sum or a histogram. Most production tracers turn out to need delivery for less than 1% of their probe fires once you ask the question carefully. The remaining 99% become BPF map updates, and the buffer pressure problem disappears. Ring buffer vs perf buffer is the right secondary question; "do I even need to deliver this event" is the right primary question.

Why the in-kernel-aggregation move was the bigger win: per-event delivery scales linearly with the probed event's natural rate, which in turn scales with traffic. In-kernel aggregation scales with the cardinality of the aggregation key, which is bounded by the number of cgroups (a few dozen) regardless of traffic. For any tracer answering a question of the form "what is the per-X rate of Y", in-kernel aggregation via a BPF map is several orders of magnitude cheaper than per-event delivery. Per-event delivery is the right tool when you need the events themselves — for forensics, for rare-event tracing, for histograms keyed by something high-cardinality — but for "count me the bytes per cgroup", a map plus a periodic userspace pull is what production-scale tracing actually looks like.
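
The scaling claim in numbers, using the incident's figures (2.1 M probe fires/s at the transient peak; 8 K aggregation keys polled once per second — the key count is taken from the rate reduction described above):

```python
def buffer_events_per_sec(fire_rate, key_cardinality, polls_per_sec=1, aggregate=False):
    """Events crossing the kernel->user buffer per second under each strategy:
    per-event delivery scales with traffic; in-kernel aggregation scales with
    key cardinality times poll frequency, independent of traffic."""
    return key_cardinality * polls_per_sec if aggregate else fire_rate

delivery = buffer_events_per_sec(2_100_000, 8_000)
aggregated = buffer_events_per_sec(2_100_000, 8_000, aggregate=True)
print(delivery, aggregated, delivery // aggregated)  # 2100000 8000 262
```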

The buffer-sizing heuristic Karan ended up codifying for the team: size the ring to hold 1–2 seconds of measured peak event rate. For a 200 K events/s tracer with 64-byte events, that is roughly 12.8 MB/s, so a 32 MiB ring absorbs ~2.5 seconds of full-rate burst. A perf buffer with the same total footprint would need 32 MiB / 16 CPUs = 2 MiB per CPU — 512 pages per CPU. Tune up if a 24-hour production run shows non-zero drop counts; if drops persist at 64 MiB or 128 MiB, the imbalance is sustained and bigger buffers will not save you — sampling or aggregation will. The pattern matches what the Cilium and Pixie projects documented in their 2021–2022 ring-buffer migrations: any new tracer ships ring-buffer-by-default, with perf buffer kept only as a fallback for kernels older than 5.8.
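
The heuristic as a function — smallest power-of-two page count (the kernel requires a power of two for ring-buffer sizes) holding the target burst; the defaults mirror the rule of thumb above:

```python
def ringbuf_pages(events_per_sec, bytes_per_event, burst_seconds=2.0, page=4096):
    """Smallest power-of-two page count holding burst_seconds of peak rate."""
    need = int(events_per_sec * bytes_per_event * burst_seconds)
    pages = 1
    while pages * page < need:
        pages *= 2
    return pages

p = ringbuf_pages(200_000, 64)          # 200K ev/s x 64 B x 2 s = 25.6 MB needed
print(p, p * 4096 // (1024 * 1024))     # 8192 pages -> 32 MiB
```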

Going deeper

How libbpf abstracts over both

Modern eBPF tools (post-2021) usually use libbpf with CO-RE (Compile Once — Run Everywhere) rather than BCC. libbpf provides perf_buffer__new and ring_buffer__new as parallel APIs — same fd-plus-mmap lifecycle, same epoll-based polling, but separate types. A tool that wants to handle both kernels in one binary feature-probes the kernel at startup (libbpf_probe_bpf_map_type(BPF_MAP_TYPE_RINGBUF, NULL) is the programmatic form; bpftool feature probe is the explicit CLI form), then loads either a BPF object declaring a BPF_MAP_TYPE_RINGBUF map or one declaring a BPF_MAP_TYPE_PERF_EVENT_ARRAY map, and binds the userspace reader accordingly. The bcc Python bindings used in the example above hide this dispatch behind a single BPF object, which is convenient for prototyping but limits production tools that want a single binary across kernel-version mixes.

The transition strategy production teams actually use: ship two BPF object files, pick at load time. If the kernel supports ring buffer, prefer it; else fall back to perf buffer. The userspace reader callback signatures differ slightly ((ctx, data, size) for ring buffer, (ctx, cpu, data, size) for perf buffer in libbpf), so the dispatch involves two callback variants. This is not pretty, but it works across the kernel-version range that real fleets span.
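
The same two-callback dispatch in the bcc Python bindings collapses into a thin adapter, so the rest of the tool stays buffer-agnostic — a sketch (names invented; the signatures are bcc's, where the perf callback receives (cpu, data, size) and the ring callback (ctx, data, size)):

```python
def make_callbacks(handle_event):
    """Wrap one event handler in both callback shapes so only this function
    knows which buffer primitive is in use."""
    def ring_cb(ctx, data, size):
        handle_event(data, size)
    def perf_cb(cpu, data, size):
        handle_event(data, size)
    return ring_cb, perf_cb

seen = []
ring_cb, perf_cb = make_callbacks(lambda data, size: seen.append(size))
ring_cb(None, b"event-bytes", 40)   # as open_ring_buffer would call it
perf_cb(0, b"event-bytes", 40)      # as open_perf_buffer would call it
print(seen)  # [40, 40]
```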

The drop accounting story

A production tracer must know how many events it dropped. Perf buffer reports drops via PERF_RECORD_LOST records that the kernel writes into the ring when it drops events; the userspace reader counts them and surfaces a --dropped=N number. Ring buffer reports drops by giving the BPF program an explicit return value from ringbuf_reserve — a NULL pointer means "no space", which the BPF program can count into a separate BPF_MAP_TYPE_PERCPU_ARRAY map. The userspace tool reads that drop-counter map periodically.

The implication: ring-buffer drop accounting is first-class — the BPF program knows about the drop and can act (increment a counter, emit a metric, change behaviour to sample less). Perf-buffer drop accounting is eventually consistent — the BPF program does not know about the drop; only the consumer eventually sees the LOST_EVENTS marker. For a tool that needs to make decisions based on drop pressure (back off the probe rate, switch to in-kernel aggregation), ring buffer is the only architecture that exposes the necessary signal at the right time.

Why first-class drop accounting matters for adaptive sampling: at Razorpay the eBPF tracer that monitors UPI payment p99 has to keep its event rate below 100 K/s (the rate the userspace reader can handle on the smallest pod size in their fleet). When traffic spikes and the natural event rate would exceed that, the tracer needs to back off — sample 1-in-N instead of 1-in-1. With ring buffer, the BPF program checks the reserve return value; if NULL three times in a 100 ms window, the program flips a "sampling on" flag in a BPF map and starts dropping 9 of every 10 events at the probe site, before they ever try to reserve. With perf buffer, the BPF program cannot detect drops; only the userspace reader can, and by the time the userspace decision propagates back into a sampling flag the burst is over and the dashboard has gaps. This adaptive-sampling pattern is why ring buffer migration is high priority for any team running production tracers.
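
The back-off logic is small enough to model in a few lines of Python. The window length and thresholds are the ones described above; in the real tool the failure count lives in a BPF map and the check runs inside the BPF program, so treat this as an illustrative state machine, not the production code:

```python
class AdaptiveSampler:
    """Flip to 1-in-N sampling after `threshold` reserve failures inside a
    `window_ns` sliding window; userspace can clear the flag later."""
    def __init__(self, window_ns=100_000_000, threshold=3, sample_every=10):
        self.window_ns, self.threshold = window_ns, threshold
        self.sample_every = sample_every
        self.fails = []            # timestamps of recent reserve failures
        self.sampling = False
        self.counter = 0

    def on_reserve_fail(self, now_ns):
        # Keep only failures inside the window, then record this one.
        self.fails = [t for t in self.fails if now_ns - t < self.window_ns]
        self.fails.append(now_ns)
        if len(self.fails) >= self.threshold:
            self.sampling = True   # drop at the probe site from now on

    def should_emit(self):
        if not self.sampling:
            return True
        self.counter += 1
        return self.counter % self.sample_every == 0   # keep 1 in 10

s = AdaptiveSampler()
for t in (0, 10_000_000, 20_000_000):   # three failures within 100 ms
    s.on_reserve_fail(t)
print(s.sampling)                                 # True
print(sum(s.should_emit() for _ in range(100)))   # 10
```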

The verifier's view of reservation

The eBPF verifier treats ringbuf_reserve and ringbuf_submit as a paired-resource lifecycle, similar to bpf_spin_lock / bpf_spin_unlock. The verifier tracks the reservation pointer through your BPF program's control flow and refuses to load the program if any path can return without either submitting or discarding the reservation. The error message is the famous Unreleased reference id=N. This is the verifier doing real work for you: a reservation that is never submitted leaks ring-buffer space until the ring fills, which would be a slow-burn drop bug that surfaces hours after deployment. The verifier catches it at load time.

The implication for code structure: every code path after a successful ringbuf_reserve must end in either ringbuf_submit or ringbuf_discard. If your BPF program has an early-return after a filter check, the early-return must come before the reserve, not after. A common bug pattern is writing the reserve at the top of the function and then an early-return below it — the verifier rejects this and the developer learns the lesson once. After that, the reserve-late, submit-or-discard-on-every-path discipline becomes habit.

The contrast with perf_submit is illuminating. perf_submit is fire-and-forget — the BPF program builds the event on the stack, hands it to the kernel, and the kernel either accepts or silently drops. There is no resource lifecycle, no paired-resource discipline, no verifier check. The cost of that simplicity is the silent-drop semantics: the BPF program cannot know whether its event was delivered. The ring buffer's reserve-then-submit discipline forces the developer to handle the no-space case explicitly, and that explicitness is what makes adaptive sampling, drop counting, and back-pressure-aware tracers possible.

Why the ring-buffer's wakeup batching is mandatory at high rates

A buffer that wakes the userspace consumer per event is a buffer that pays a context-switch cost per event. At 1 M events/s, a per-event wakeup means up to a million wakeups per second — far more than any consumer can usefully absorb, so the scheduler overhead alone puts the consumer permanently behind. Both perf buffer and ring buffer batch wakeups, but with different defaults: perf buffer wakes when the ring crosses a user-configured wakeup_events threshold (default 1, which is the per-event-wakeup pathological case); ring buffer wakes when the consumer has caught up and would otherwise need to block, which is an internal heuristic with no configurable knob.

The implication for production tuning: a perf-buffer tracer that has not had wakeup_events set to a reasonable batch size (typically 32 or 64) is paying a syscall per event, and the userspace CPU cost is dominated by context-switch overhead rather than parsing. The fix is one line in the BPF program's perf_event_attr struct. The ring buffer takes this decision out of the developer's hands by design — the defaults are tuned, you do not need to know about them, and the wakeup pattern is correct out of the box for any tracer rate. This is one of those small ergonomic wins that compound: a primitive that has fewer footguns is a primitive teams pick up faster, and the ring buffer's "the defaults are right" property is not a small thing.

The userspace mmap layout is shared between the two designs and worth noting briefly: both buffers expose their backing memory to userspace via mmap of a special file descriptor (the perf_event fd for perf buffer, the ringbuf map fd for ring buffer). The reader walks the ring directly without copying through a syscall — the data is read out of the mmap'd memory by ordinary loads, which is why the consumer-side cost is so low. The kernel and userspace coordinate via two shared atomic words (head and tail) at known offsets in the mmap. This zero-copy delivery is the reason BPF tracing scales at all; if every event had to cross a syscall boundary, the consumer cost would be 50—100× higher and high-rate tracing would be impossible. The two designs differ in the wakeup path, the producer-side semantics, and the ordering guarantees, but the consumer-side mmap-and-read pattern is identical.

Reproduce this on your laptop

# Linux 5.8+ for ring buffer; Linux 4.9+ for perf buffer.
sudo apt install bpfcc-tools python3-bpfcc linux-headers-$(uname -r)
# python3-bpfcc provides the bcc Python bindings; there is no need to
# pip-install anything (the PyPI package named "bcc" is unrelated).
sudo python3 compare_buffers.py     # see drop counts and CPU side by side

For libbpf-based reproductions: git clone https://github.com/libbpf/libbpf-bootstrap and look at the ringbuf and perfbuf examples in the examples/c directory. The bootstrap repo's bootstrap.bpf.c and uprobe.bpf.c are the canonical small-but-real examples for both APIs; reading both and diffing them is the fastest way to internalise the API differences.

Where this leads next

This chapter and the previous two (tracing syscalls and kernel functions, USDT and uprobes: userspace eBPF) form the per-event-delivery foundation of the eBPF tracing stack: how events get produced, how they cross the kernel-userspace boundary, and at what cost. The next chapter (eBPF for latency histograms) shifts from per-event delivery to per-event aggregation — the BPF map types (BPF_HASH, BPF_PERCPU_HASH, BPF_HISTOGRAM) that let your BPF program build a histogram in-kernel without ever crossing the buffer at all, which is the right answer for any tracer where the question is statistical rather than per-event.

The deeper habit to carry forward: measure the buffer. Every production eBPF tool should report its dropped count alongside its primary metric, and a sustained non-zero drop count is a signal that either (a) you need a bigger buffer, (b) you need sampling, or (c) you need to push aggregation into the kernel side. The buffer is not invisible plumbing; it is a first-class component with its own failure modes, its own sizing trade-offs, and its own choice of primitive that decides whether your tracer scales to Hotstar IPL load or breaks at Zerodha market open. Pick the right buffer, size it to the burst, instrument the drops, and the rest of the tracing stack works.

The last thing: when an SRE at Razorpay or a platform engineer at Flipkart asks "why is my eBPF dashboard going dark during peak hour", the answer is almost always one of two things — the buffer filled, or the userspace reader is too slow. Both are visible in the drop count, both are addressable, and the choice between perf buffer and ring buffer determines which lever you have to pull. Ring buffer gives you more levers, lower overhead, better ordering, and a smaller memory footprint. For new code, on any kernel newer than the late-2020 5.8 line, choose ring buffer.

A short checklist for anyone shipping a new eBPF tracer to production this quarter:

  1. Verify the target kernel is 5.8+ on every host in the fleet (uname -r on a sample of nodes; for mixed fleets, plan a fallback path).
  2. Use BPF_RINGBUF_OUTPUT if (1) holds; otherwise plan the dual-binary fallback path described above.
  3. Size the buffer to absorb 1-2 seconds of peak event rate. Measure peak rate, do not guess.
  4. Instrument and surface the drop count as a first-class metric on the same dashboard as the tracer's primary signal. A drop count > 0 is a real alert, not a footnote.
  5. If the steady-state rate exceeds the consumer's drain capacity, do not increase the buffer — push aggregation into the kernel via a BPF map and pull the aggregated state from userspace at a sustainable cadence.
  6. Re-measure after every traffic-pattern change. The IPL is not the same workload as a normal Tuesday at Hotstar; the Tatkal hour is not the same workload as a normal afternoon at IRCTC; the market-open at Zerodha is not the same as the post-lunch lull. Tracers that worked at one rate can break at another, and the only honest answer to "is the buffer big enough" is to measure under the actual peak load you have to survive.
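Checklist item 3 is arithmetic you can do in one function. The sketch below uses placeholder rates and an assumed 64-byte event; the power-of-two rounding reflects the ring buffer map's requirement that its size be a power of two number of pages.

```python
def ringbuf_size_bytes(peak_events_per_sec, event_size_bytes, absorb_seconds=2.0):
    """Size a buffer to absorb absorb_seconds of burst at the measured
    peak rate, rounded up to a power of two (at least one 4 KiB page)."""
    needed = int(peak_events_per_sec * event_size_bytes * absorb_seconds)
    size = 4096
    while size < needed:
        size *= 2
    return size

# Example: 200 k events/s at 64 bytes/event with a 2-second burst budget:
# 200_000 * 64 * 2 = 25.6 MB of raw demand, rounded up to 32 MiB.
print(ringbuf_size_bytes(200_000, 64))  # 33554432
```

The inputs matter more than the function: peak_events_per_sec must come from a measurement under real peak load, which is exactly the point of checklist items 3 and 6.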

The honest framing: an eBPF tracer is a real-time data pipeline with hard limits at every stage, and the buffer between kernel and userspace is one of those stages. Treat it the way you treat any other production data pipeline — measure the rate at every stage, instrument the loss, alert on saturation, design for the peak, and pick the primitive whose semantics fit the question you are actually asking. Ring buffer fits more questions than perf buffer does. That is the whole answer.

A final note on the next chapter's bridge. The right answer for "is my service slow" is not usually a stream of every request's latency; it is a histogram of those latencies, computed in-kernel, pulled from userspace once per second. Per-event delivery via a buffer is the wrong tool for that question, and the chapter that follows shows how to build the right one with BPF_HISTOGRAM and bpf_log2l. The buffer is for the events that genuinely cannot be aggregated; the map is for everything else, and "everything else" is most of what production tracing actually needs.
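The shape of that in-kernel aggregation can be previewed in userspace. The sketch below mirrors the power-of-two bucketing that bpf_log2l performs (the exact slot-index convention in BCC may differ by one); the function name and sample latencies are invented for illustration.

```python
def log2_bucket(value):
    """Power-of-two histogram slot for a value, mirroring the bucketing
    idea behind BCC's bpf_log2l: slot n covers [2^(n-1), 2^n)."""
    if value <= 0:
        return 0
    return value.bit_length()

# Aggregate into a fixed-size array of counters -- the shape of the work the
# BPF program does in-kernel, so only ~64 counters cross to userspace per
# read instead of one event per request.
hist = [0] * 64
for latency_us in [3, 7, 8, 120, 130, 4000]:
    hist[log2_bucket(latency_us)] += 1

print(hist[2], hist[3], hist[4])  # 1 1 1  (one value each in [2,4), [4,8), [8,16))
```

Sixty-four counters pulled once per second is a constant, tiny transfer regardless of request rate, which is why the histogram beats the buffer for any statistical question.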

References