Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

eBPF for latency histograms

Aditi at ParakhTrade is paged at 09:14 IST on a Wednesday: the order-match engine's p99.9 just jumped from 380 µs to 4.2 ms, twelve minutes before the cash-equity market opens, and her per-event eBPF tracer that streams every order-match latency to userspace has been silently dropping 18% of samples for the last hour. The dashboard is showing a confident, wrong p99.9 computed from a biased subsample. The fix is not a bigger buffer; it is to delete the per-event delivery entirely and replace it with a kernel-side histogram that her BPF program updates in place — one log-base-2 bucket increment per match — and pulled into userspace once per second. The 18% drop rate goes to zero, the userspace CPU drops by 90%, and the p99.9 readout becomes truthful again. This chapter is about that pattern: histograms in the kernel, the hist() map, and why this is the right shape for any latency question.

A latency histogram in the kernel is a BPF_MAP_TYPE_HASH (or BPF_MAP_TYPE_PERCPU_HASH) keyed by a log-base-2 bucket index, valued by a counter. The BPF program computes bpf_log2l(latency_ns) on each event and atomically increments the bucket. Userspace reads the whole map once per second and prints percentiles. No event-per-event delivery, no buffer pressure, no coordinated omission — the histogram cost is constant in event rate and proportional to bucket count, which is bounded.

Why per-event delivery is the wrong shape for latency questions

A latency histogram answers a statistical question: what is the distribution of request_latency_ns over the last second? The honest minimum information needed to answer it is one number per bucket per second — for a 64-bucket log2 histogram that is 64 8-byte counters, 512 bytes total per second. The per-event-delivery shape ships every individual sample to userspace, computes the histogram there, and then throws every individual number away. At ParakhTrade's order-match rate of 1.4 M matches/s during market open, that is 1.4 M × 16 bytes = 22 MiB/s of buffer traffic, every byte of which the userspace reader bins into the same 512 bytes of histogram state. You are paying 45000× the bandwidth you actually need.

The cost compounds in three places. First, the buffer itself fills under burst — market-open, IPL final, Tatkal hour — and starts dropping events, which means the histogram is computed from a biased subsample (the kernel drops the events it can fit least, which correlates with the events you most want to see). Second, the userspace CPU spent parsing event records is wasted work; the BPF program already had the latency value in a register, and userspace reconstructs it only to throw it away after one bucket increment. Third, the network of context switches between BPF and userspace adds wakeup latency that competes with the workload you are trying to measure — the tracer becomes the source of the slowdown it is paid to detect.

In-kernel aggregation flips every one of these. The BPF program keeps the histogram state in a BPF map; the increment is one atomic instruction; the userspace reader pulls the whole map's 64 counters once per second via a single syscall. Per-event cost: one cache-line-resident atomic. Per-second cost: 64 reads. The cost is decoupled from event rate. A tracer running at 100 events/s and a tracer running at 10 M events/s have nearly the same userspace cost.

Same probe site, same workload, two completely different costs. Per-event delivery (left) ships every latency sample through the buffer; the userspace reader bins them and throws the raw values away. The in-kernel histogram (right) bins the value at probe time, increments one counter in a BPF map, and userspace pulls the 64 counters once a second. The right side's cost is bounded by bucket count, not event rate.

Why log2 buckets are the right granularity for latency: latency distributions span many orders of magnitude — a fast cache hit is 100 ns, a slow disk seek is 10 ms, that is five orders apart. Linear buckets at 100 ns granularity would need 10⁵ buckets to cover the range, which is too many to read out per second. Log2 buckets give you 17 buckets to cover the same range (one per binary order of magnitude) with 50% relative error per bucket — good enough for percentile estimation, and small enough that the userspace pull stays cheap. HdrHistogram refines this to ~3 significant decimal digits with sub-buckets, but the log2 base case is what the BPF helper bpf_log2l gives you for free in 4 instructions, and that is what hist() in bpftrace uses under the hood.

Building one in Python with bcc

Here is a working tracer that measures block_rq_complete latency — how long each block-I/O request spent in the kernel queue plus the device — and prints a histogram once a second, with p50/p99/p99.9 derived from the bucket counters. It is the canonical shape of every latency tracer you will ever write with eBPF.

#!/usr/bin/env python3
# blk_lat_hist.py -- per-second block I/O latency histogram via in-kernel aggregation.
# Runs with: sudo python3 blk_lat_hist.py
import time, ctypes
from bcc import BPF

BPF_TEXT = r"""
#include <uapi/linux/ptrace.h>
#include <linux/blk-mq.h>

BPF_HASH(start, struct request *, u64);
BPF_HISTOGRAM(dist);                     // log2 bucket counters, 64 entries

int trace_start(struct pt_regs *ctx, struct request *rq) {
    u64 ts = bpf_ktime_get_ns();
    start.update(&rq, &ts);
    return 0;
}

int trace_complete(struct pt_regs *ctx, struct request *rq) {
    u64 *ts = start.lookup(&rq);
    if (!ts) return 0;                   // race or missed start probe
    u64 delta_us = (bpf_ktime_get_ns() - *ts) / 1000;
    dist.increment(bpf_log2l(delta_us));
    start.delete(&rq);
    return 0;
}
"""

b = BPF(text=BPF_TEXT)
b.attach_kprobe(event="blk_mq_start_request", fn_name="trace_start")
b.attach_kprobe(event="blk_account_io_done",  fn_name="trace_complete")

def percentiles_from_hist(hist_map):
    # bucket k holds count of samples whose value is in [2^k, 2^(k+1)) microseconds
    items = [(k.value, v.value) for k, v in hist_map.items() if v.value > 0]
    items.sort()
    total = sum(v for _, v in items)
    if total == 0: return {}
    seen, out, targets = 0, {}, {0.50: None, 0.99: None, 0.999: None}
    for k, v in items:
        seen += v
        for p, slot in list(targets.items()):
            if slot is None and seen >= total * p:
                # midpoint of the log2 bucket = 2^(k+0.5) us; rough but standard
                out[p] = int((1 << k) * 1.414)
                targets[p] = "done"
    return out, total

print("Tracing block I/O latency. Ctrl-C to stop.")
try:
    while True:
        time.sleep(1)
        pcts, total = percentiles_from_hist(b["dist"])
        print(f"--- {time.strftime('%H:%M:%S')}  samples={total} ---")
        b["dist"].print_log2_hist("usecs")
        if pcts:
            print(f"  p50={pcts.get(0.50, '?')} us  "
                  f"p99={pcts.get(0.99, '?')} us  "
                  f"p99.9={pcts.get(0.999, '?')} us")
        b["dist"].clear()
except KeyboardInterrupt:
    pass

# Sample run on a c6i.4xlarge (16 vCPU, NVMe, 6.6 kernel) under fio:
# fio --name=mix --rw=randread --bs=4k --iodepth=32 --runtime=60 --filename=/dev/nvme1n1

Tracing block I/O latency. Ctrl-C to stop.
--- 09:14:01  samples=23814 ---
     usecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 12       |                                        |
         8 -> 15         : 1843     |*****                                   |
        16 -> 31         : 14201    |****************************************|
        32 -> 63         : 6128     |*****************                       |
        64 -> 127        : 1389     |***                                     |
       128 -> 255        : 198      |                                        |
       256 -> 511        : 38       |                                        |
       512 -> 1023       : 5        |                                        |
  p50=22 us  p99=90 us  p99.9=181 us

Walk-through. BPF_HASH(start, struct request *, u64) is the start-time map: keyed by request pointer, valued by the nanosecond timestamp at which the request entered the queue. This is the standard "two-probe" pattern for latency — one probe records start, the other reads start and computes delta. BPF_HISTOGRAM(dist) is bcc's macro that expands to a BPF_MAP_TYPE_HASH keyed by a 64-bit bucket index, valued by a 64-bit counter, with a helper print_log2_hist that knows how to render it. bpf_log2l(delta_us) is the 4-instruction kernel helper that computes floor(log2(x)) for a 64-bit unsigned integer; on x86 it lowers to a single BSR (bit-scan-reverse) plus a constant subtract. dist.increment(bucket) atomically adds 1 to the bucket counter, using BPF_F_NO_PREALLOC semantics — no allocation in the hot path, the map's storage was reserved at load time. The userspace loop time.sleep(1) then b["dist"].clear() gives a per-second tumbling histogram; if you want a sliding window or a long-running cumulative one, drop the clear(). The percentile derivation scans the sorted buckets, runs a cumulative count, and reports the lower-edge value of the bucket containing each target percentile — standard log2-histogram percentile estimation, accurate to within 50% (i.e. the true p99 is somewhere between the reported value and 2× that value).

Why we record the timestamp on blk_mq_start_request rather than at submit time: blk_mq_start_request is the kernel's "I am about to dispatch this request to the driver" hook, which is the moment after queuing is done and before the device actually services it. The latency we measure is dispatch-to-completion, which is what the application-visible "how long did my I/O take" actually is. If we recorded on blk_mq_alloc_request we would also include the queue-wait time, which is interesting but a different question; for a "device latency" histogram, dispatch-to-complete is the standard convention used in biolatency from bcc's distribution. Pick the probe pair that matches the latency definition you care about, document it in the tool's --help, and your histogram will mean what people think it means.

A second example, more compact, using bpftrace syntax to make the same histogram in 3 lines. This is the form Brendan Gregg's bpf-perf-tools-book uses throughout, and it is the form you will see most often in production on-call docs at Indian fintechs.

sudo bpftrace -e '
kprobe:blk_mq_start_request { @start[arg0] = nsecs; }
kprobe:blk_account_io_done  { @us = hist((nsecs - @start[arg0]) / 1000);
                              delete(@start[arg0]); }
interval:s:1                { print(@us); clear(@us); }'

@us:
[8, 16)            1843 |@@@@@                                         |
[16, 32)          14201 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32, 64)           6128 |@@@@@@@@@@@@@@@@@@@@                          |
[64, 128)          1389 |@@@@                                          |
[128, 256)          198 |                                              |
[256, 512)           38 |                                              |
[512, 1024)           5 |                                              |

Same shape, same buckets, different syntax. hist() is the bpftrace builtin that wraps a BPF_HISTOGRAM-shaped map; print(@us) reads it into stdout, clear(@us) resets it. For an on-call SRE who needs a one-liner to characterise block I/O during an incident, this is the form. For a tracer that has to ship to production with proper drop accounting, dynamic bucket sizing, and integration with a metrics pipeline, the bcc/Python form above is the form.

Per-CPU vs shared maps — the contention question

Aditi's first attempt at the order-match histogram used BPF_HASH for the bucket map, which is a single shared hashmap across all CPUs. At 1.4 M increments/s during market open, the cmpxchg-on-one-cache-line contention across 16 cores cost her 4% of CPU on the cmpxchg retries alone. The fix was BPF_PERCPU_HASH — a per-CPU hashmap where each CPU has its own bucket-counter copy, no atomic needed for increments, and the userspace reader sums across CPUs at read time. The CPU cost dropped to 0.3%. The histogram is unchanged because addition is commutative; the per-CPU split is invisible to anyone reading the aggregate.

The trade-off with per-CPU maps is read-time cost. Userspace has to read N copies of every bucket and sum them, which is N× the syscall traffic and N× the parse cost — on a 64-core box, reading a 64-bucket map means 64 buckets × 64 CPUs = 4096 reads per second. For a histogram updated at 1.4 M/s and read at 1 Hz, the math still favours per-CPU by orders of magnitude (the alternative is 1.4 M atomic-cmpxchg per second), but for low-rate tracers the per-CPU overhead can dominate and shared maps are the right answer.

The crossover point is roughly 10 K events/s per CPU: below that, shared maps win because the read-side cost of summing per-CPU dominates. Above that, per-CPU wins because the increment-side contention dominates. Most production latency tracers run well above the crossover, so the rule of thumb is "use per-CPU; switch to shared only if a measurement shows per-CPU is more expensive." Both bcc's BPF_HISTOGRAM and bpftrace's hist() default to per-CPU storage internally, which is the right default for the rates real tracers see.

The shared map (left) gives one set of 64 bucket counters that all CPUs increment via atomic cmpxchg — cheap to read, expensive to write at high rates. The per-CPU map (right) gives every CPU its own 64-bucket array — expensive to read (sum N copies), free to write (no atomic). For latency tracers that fire above 10 K events/s per CPU, per-CPU is unambiguously the right answer.

How the percentiles get computed honestly

A log2 histogram with 64 buckets covers 0 to 2⁶⁴ microseconds, more than enough for any latency. The percentile estimation procedure is: sort the buckets by index, compute the cumulative count, find the bucket where the cumulative count first crosses total * p, and report a value within that bucket. Three choices for the within-bucket value are common:

Lower edge (2^k): conservative, always reports a value smaller than the true percentile. Useful when reporting "your p99 is at least this much".

Geometric midpoint (2^(k+0.5) ≈ 2^k * 1.414): the unbiased estimator under a log-uniform distribution within each bucket. The standard for bcc's percentile helpers and for back-of-envelope work.

Upper edge (2^(k+1) - 1): pessimistic, always reports a value larger than the true percentile. Useful when an SLO threshold needs to be guaranteed not violated.

The error bound is 50% relative under any of the three, which is the price of log2's coarse bucketing. For SLO monitoring this is usually acceptable — an SLO of "p99.9 < 200 ms" is met or violated by a wide margin most of the time, and a 50% bucket error rarely flips the decision. For SLA-bound payment processing where the threshold is tight and the violation cost is high, you want HdrHistogram's sub-bucketing, which trades CPU and memory for tighter buckets — we cover the BPF-side HdrHistogram pattern in the Going-deeper section below.

The honest-percentile property of in-kernel histograms is one of their underrated benefits. A coordinated-omission-corrupted percentile from a load tester is a real number that lies; a log2-bucket percentile from a BPF histogram is an honest range. When Aditi's incident report at ParakhTrade says "p99.9 was between 4.0 and 5.7 ms during the incident", that is a statement about the world's actual behaviour. When the dashboard says "p99.9 was 4.2 ms" without uncertainty, that is a fiction (the same fiction percentile dashboards usually show, but a fiction). The bucket bounds are the truth, and putting them in the postmortem is the discipline that distinguishes performance engineers from people who quote numbers from Grafana.

Why the bucket-edge convention matters for incident response: the SRE on-call at PaisaBridge during a payment-latency incident has minutes, not hours, to decide if a deploy needs rollback. They look at the dashboard, see p99.9 = 240 ms, threshold is 200 ms, decide to roll back. But the true p99.9 with log2 buckets is somewhere in [128, 256] ms — the SLO might or might not be violated. If the dashboard reports the lower edge (128 ms), they conclude "fine, no rollback" and a real violation slips through. If it reports the upper edge (256 ms), they roll back conservatively and lose deploy velocity. The midpoint (180 ms) is the right report for "make a calibrated decision under uncertainty"; the dashboard should also surface the bucket bounds for the rare case where the decision is on the edge. This is one of those small choices that compound into real production outcomes; getting it right is part of what mature performance engineering means.

Real systems — what biolatency, runqlat, and bitesize teach you

The bcc tools repository ships three foundational latency-histogram tracers that every Indian SRE on-call should have in their toolkit. They are not toys; they are how Brendan Gregg debugs production at Streamora, and the same patterns apply at SetuStream, ParakhTrade, and PaisaBridge.

biolatency measures block-I/O latency exactly as the example above does, with extras: per-disk filtering (-D), per-flag-bit filtering (read vs write vs flush), cumulative or incremental modes, and a --millisecs flag to bucket in ms instead of µs. The first thing to run when an incident points at "the disks are slow" is sudo /usr/share/bcc/tools/biolatency -mD 1 — it gives you a 1-second-tumbling histogram per disk in milliseconds, and from the bucket distribution you can tell within seconds whether the disks are actually slow or whether the filesystem is queuing.

runqlat measures scheduler runqueue latency — how long a runnable task waits before getting on-CPU. This is the latency you see in user-space request handlers when CPU saturation is the cause of slowdown. The probes are kprobe:ttwu_do_wakeup (start: when the task becomes runnable) and tracepoint:sched:sched_switch (complete: when the task is dispatched). At SetuStream's IPL final, runqlat showed p99 = 12 ms during the spike, which directly diagnosed CPU saturation as the cause of the user-visible slowdown — not network, not disk, not the application code. One histogram, one diagnosis.

bitesize measures block-I/O size distribution — not latency, but request size in KiB — using the same BPF_HISTOGRAM machinery. It is the tool that answers "are my I/Os small or large?", which in turn tells you whether the filesystem is doing readahead well, whether the application is using small synchronous writes, and whether the I/O scheduler is merging requests effectively. Same primitive, different metric.

The lesson across all three: the histogram-in-a-BPF-map pattern is one primitive that solves a wide class of latency-and-distribution questions. Once you internalise it — record start, compute delta on complete, log2 into a per-CPU map, pull from userspace once a second — you can write a tracer for any "how is this latency distributed" question in 30 lines. Network round-trip latency. TCP retransmit interval. Garbage collection pause durations. Lock-hold times. Page-fault service times. Each is a 30-line bcc script and a 1-second pull cadence away from a production-grade observability primitive.

The pattern's third payoff is composability. You can run a dozen of these histograms simultaneously on the same host with negligible overhead, because each one's cost is dominated by 64-bucket reads at 1 Hz, not by event traffic. At PaisaBridge, the platform team runs 14 BPF histograms continuously on every payment-API host (block-I/O latency, runqueue latency, TCP RTT, page-fault latency, allocation latency, lock-hold latency, GC pause, syscall latency for 6 hot syscalls). The aggregate userspace overhead is under 0.5% of one core. The corresponding per-event tracers would not fit on the same host at all — they would saturate the buffer at peak load and start dropping. The right primitive turns "we cannot afford to monitor that" into "we monitor that continuously", which is what mature observability looks like.

Common confusions

"BPF histograms are HdrHistograms." They are not. A BPF_HISTOGRAM is a log2-bucket histogram with 50% relative error; HdrHistogram has sub-buckets within each power-of-2 range giving ~3 significant decimal digits. HdrHistogram-in-BPF is possible (see Going deeper) but it costs more memory and more cmpxchg-retry contention; the default bpf_log2l-based histogram is the workhorse.
"Reading a per-CPU map gives you N entries to display." It does not. The userspace API (bpf_map_lookup_elem) for a per-CPU map returns an N-element array of values for one key, where N is the number of CPUs. You sum across N to get the aggregated count for that key. The map looks per-key shared from the userspace perspective; the per-CPU split is below the API.
"Clearing the histogram every second loses tail data." Only if your reader is faster than your tail. If you read the map at 1 Hz and the bucket counter accumulates between reads, every event in the second is in the map at read time. The clear()-then-sleep(1)-then-read sequence is correct; the only loss is for tail events that arrive at exactly the same nanosecond as the read, which is below noise for any practical use.
"bpf_log2l is bucketing in nanoseconds." It is not bucketing — it is computing floor(log2(x)) of whatever you pass in. If you pass nanoseconds, the buckets are powers of 2 nanoseconds; if you pass microseconds, the buckets are powers of 2 microseconds. The unit is a property of the value you pass, not of the helper. Pick the unit at the probe site.
"In-kernel histograms eliminate coordinated omission." They do not by themselves — CO is a property of how the load generator measures, not of how you record. But they make CO impossible at the measurement side by recording every event the kernel sees, regardless of any consumer's pace; the only events not in the histogram are events that did not happen. For workload-side CO (where the load generator pauses on slow responses), you still need wrk2/vegeta-style constant-rate generation.
"BPF histograms are atomic across the histogram." They are atomic per-bucket, not across the histogram. The userspace reader reading bucket K and then bucket K+1 may see updates that landed between the two reads, so the reported total may not equal the sum of buckets at any single instant. In practice this is irrelevant because the reader runs at 1 Hz and the rate of inter-read updates is small relative to the second's accumulation; for tools that need a snapshot, a per-second clear + read + reset pattern gives consistent slices.

Going deeper

HdrHistogram in BPF for sub-percent percentile error

Some workloads need percentile error tighter than 50%. Payment-flow SLOs at PaisaBridge say "p99.9 must be < 200 ms" with a 5 ms tolerance, which a log2 histogram (50% bucket error = 100 ms at the 200 ms range) cannot resolve. The solution is HdrHistogram's idea: within each power-of-2 range, subdivide into M linear sub-buckets. With M=128, the relative error drops to 1/128 = 0.78%; the bucket count grows from 64 to 64×128 = 8192, which is still small enough to ship over the userspace pull at 1 Hz.

Implementing HdrHistogram in BPF is straightforward: replace BPF_HISTOGRAM with a BPF_PERCPU_ARRAY of size 8192, and replace bpf_log2l(x) with a small function that computes (top_bit_index << 7) | sub_bucket(x). The sub_bucket is a few shifts and a mask. The whole thing is ~20 BPF instructions, well within the verifier's complexity limits. Cilium's monitoring stack uses exactly this pattern for connection-tracking latency; the scheduler at Meta uses it for runqueue latency. The pattern is two years old and battle-tested in production at hyperscaler scale.

The tradeoff is read-side cost: 8192 buckets × N CPUs at 1 Hz = up to 500 K reads/s on a 64-core box. Still cheap relative to per-event delivery, but no longer trivial. The right answer for tight-SLO workloads; the wrong answer for low-rate "rough characterisation" tracers. Pick based on the SLO width.

The verifier and the histogram code path

The eBPF verifier examines every code path before allowing the program to load, and BPF histograms are simple enough that they verify quickly — the path through bpf_log2l and BPF_HASH.increment is straight-line code with one map lookup, which the verifier handles in microseconds. But there is a non-obvious gotcha for the two-probe pattern: the start map's lifetime. If the start probe puts an entry in the map for a request and the complete probe never runs (because the request was cancelled, or the complete kprobe missed due to instruction patching, or kernel bugs in some 5.x stable kernel), the start entry leaks until the map fills.

The fix is bounded: use a map size large enough for the in-flight set (typically 10 K—100 K entries depending on workload), and rely on the eviction policy of BPF_MAP_TYPE_LRU_HASH rather than BPF_MAP_TYPE_HASH. The LRU variant evicts the least-recently-used entry when full, so a leaked entry from a missed-complete probe gets cleaned up automatically. The cost is one extra LRU-pointer update per insert, ~5 ns, negligible. This is the production-grade variant of every two-probe latency tracer; the bcc tool sources mostly use plain BPF_HASH because the missed-complete rate is tiny in practice, but for a tracer that runs continuously on a high-traffic host, LRU is the safer choice.

Why the LRU variant is not the bcc default: the LRU map type was added later (kernel 4.10) than the basic hash map (always supported), and the LRU has slightly more memory overhead per entry. For tools meant to run for short periods (a 60-second biolatency invocation during incident response), the leak is bounded by the duration and a regular hash works. For continuously-running production tracers, LRU is the right primitive; the bcc tool sources document this in their own production-deployment notes.

Kernel 6.6+: ringbuf-aware histograms

A 2024 patch added a hybrid pattern: a histogram backed by a per-CPU array, plus a ringbuf-style "wakeup userspace at watermark" trigger. When a bucket counter crosses a configured threshold (say, 10 events in the >1 ms bucket in any 10-second window), the BPF program emits one event into a small ring buffer, which the userspace reader uses as a wake-up signal to immediately pull the histogram and emit a metric or alert. Without the watermark, the userspace reader polls at 1 Hz; with it, the reader sleeps until something interesting happens.

This pattern is useful for long-running tracers where most seconds are quiet and you want to avoid the constant 1-Hz overhead. The cost is one extra map lookup per increment (to check the threshold), and the pattern is not yet wrapped by bcc or bpftrace — you write it directly with libbpf. It is documented in the Linux kernel's samples/bpf/ examples and in Daniel Borkmann's 2024 LPC talk on "BPF observability beyond polling".

Reproduce this on your laptop

# Linux 4.9+ for basic BPF_HASH, 5.8+ for ringbuf, 4.10+ for LRU.
sudo apt install bpfcc-tools python3-bpfcc linux-headers-$(uname -r)
python3 -m venv .venv && source .venv/bin/activate
pip install bcc

# Run the tracer in one terminal:
sudo python3 blk_lat_hist.py

# Drive load in another:
sudo apt install fio
sudo fio --name=t --rw=randread --bs=4k --iodepth=32 \
        --filename=/dev/nvme1n1 --runtime=60 --time_based

# You'll see the per-second histogram update with each fio run's load profile.

For the bpftrace one-liner version, install sudo apt install bpftrace and run the inline expression shown earlier. For the bcc tool variants:

sudo /usr/share/bcc/tools/biolatency -mD 1
sudo /usr/share/bcc/tools/runqlat 1
sudo /usr/share/bcc/tools/bitesize

These are pre-built tools in the bcc package; reading their source (/usr/share/bcc/tools/biolatency is a Python script, ~120 lines) is the fastest way to internalise the production-grade two-probe pattern.

Where this leads next

This chapter built the in-kernel histogram on top of two primitives we covered earlier: BPF maps as the storage substrate, and BPF probes (kprobe/tracepoint) as the event source. The next chapter, BPF maps as the data plane, zooms out from histograms to the full vocabulary of map types — HASH, ARRAY, LRU_HASH, RINGBUF, STACK, QUEUE, BLOOM_FILTER — and the trade-offs between them. After that comes in-kernel aggregation patterns beyond histograms: top-K, sliding windows, percentile sketches, and the Cilium/Pixie production patterns that combine these into observability stacks.

The deeper habit to carry forward: the right shape for a measurement question is the shape that minimises information loss at the measurement site. Per-event delivery is a high-bandwidth shape that throws away the question's structure, then reconstructs it expensively in userspace. In-kernel histograms are a low-bandwidth shape that incorporates the question's structure (a histogram) into the measurement primitive itself. Whenever you find yourself building a userspace pipeline that streams raw events to a histogram-aggregator, the first question to ask is: "could the kernel build the histogram for me?". Most of the time, the answer is yes, and the cost reduction is two to three orders of magnitude. Per-event delivery is for the questions that genuinely need every event — forensics, stack traces, rare-event tracing — and a tracer that uses it for everything else is a tracer that will not survive its first traffic spike.

The third habit: do not lie about percentiles. A log2 histogram tells you the bucket; the true percentile is somewhere in the bucket; the dashboard must reflect that uncertainty or the on-call will mis-decide. HdrHistogram for tight SLOs, log2 for everything else, and the bucket bounds in the postmortem — that is the discipline.

A short checklist for any latency-histogram tracer you ship to production this quarter:

Use BPF_PERCPU_HASH or BPF_PERCPU_ARRAY for the histogram storage; switch to shared only if a measurement shows per-CPU read overhead dominates.
Use BPF_MAP_TYPE_LRU_HASH for the start-time map in any continuously-running tracer; the regular BPF_HASH leaks slowly under missed-complete events and bounds the tool's runtime.
Pull the histogram at 1 Hz and report percentiles with bucket bounds, not single-point estimates. The dashboard should surface the uncertainty.
Document which probe pair defines the latency — "queue-wait + dispatch" vs "dispatch only" — so consumers of the metric know what it measures.
Before deploying, run the tracer at 2× expected peak event rate and verify the userspace CPU stays under 1% of one core. If it does not, switch to per-CPU storage or HdrHistogram-with-watermark.
Add the tracer's drop count (if any — the histogram itself does not drop, but the start map can fill) as a first-class metric. Sustained drops are a sign of a missed-complete bug, not a buffer-pressure bug, but they show up as data quality problems either way.

The honest framing: an eBPF latency tracer is a measurement primitive that can be either cheap and truthful or expensive and corrupted. The histogram-in-a-BPF-map shape is the cheap-and-truthful version. Use it everywhere a percentile is the question; reach for per-event delivery only when "every event" is genuinely the question. The ParakhTrade order-match tracer Aditi shipped after the 09:14 incident uses this pattern; it has run continuously for nine months without dropping a sample, and the dashboard's p99.9 has been within 0.78% of the true value every minute of that time.

References

Brendan Gregg, BPF Performance Tools (Addison-Wesley, 2019) — the canonical text for biolatency, runqlat, bitesize, and the histogram-in-BPF-map pattern. Chapters 8 (filesystems) and 16 (case studies) are the most relevant.
Brendan Gregg, "Latency Heat Maps" — the visualisation pattern that turns a stream of histograms into a 2-D heatmap, the natural display for the per-second tumbling histogram in this chapter.
Gil Tene, "How NOT to Measure Latency" — the coordinated-omission talk; required reading for anyone reporting percentiles. The in-kernel histogram pattern eliminates measurement-side CO; the talk explains why workload-side CO still matters.
HdrHistogram project — the sub-bucketed histogram library that motivates the BPF HdrHistogram pattern in the Going-deeper section.
bcc tools repository: biolatency, runqlat, bitesize sources — the production-grade reference implementations of the patterns in this chapter.
/wiki/perf-buffer-vs-ring-buffer — the per-event delivery primitive whose limitations motivate moving to in-kernel aggregation.
/wiki/coordinated-omission-and-hdr-histograms — the workload-side measurement discipline that complements in-kernel aggregation.
/wiki/tracing-syscalls-and-kernel-functions — the probe sources that feed the histogram's two-probe pattern.