Wall: kernel-level observability is a different world

It is 02:14 IST and the Razorpay payments-API on-call is staring at a Tempo span tree for a stuck UPI transaction. The HTTP handler span is 4,812 ms, but the inner spans (database 11 ms, NPCI outbound 38 ms, Redis idempotency 2 ms) add up to only 51 ms, leaving 4,761 ms with no spans inside it at all. The trace is not broken; every span the application emitted is present and accounted for. The missing 4.7 seconds simply happened in a place the application could not see — the kernel was scheduling, TCP was retransmitting, page faults were resolving, the memory cgroup was throttling — and none of those events ever crossed into userspace where the OpenTelemetry SDK could record them. That gap is the wall Part 8 is built around.

Userspace instrumentation can only see what your process chose to record between syscalls. Below the read/write/futex boundary lives a separate operating system — scheduler, network stack, page cache, locks, cgroup limits — that runs in privileged mode, dwarfs userspace in code volume, and routinely owns the milliseconds your traces cannot account for. The classical answer (kernel modules, /proc polling, strace, perf record) is either unsafe in production or too expensive to leave on. eBPF is the answer the rest of Part 8 is about; this chapter is the wall that explains why a fundamentally different observation primitive was needed.

The userspace ceiling: what your spans literally cannot see

A span records what your application code did between two timestamps it took itself. Both timestamps come from clock_gettime(CLOCK_MONOTONIC), both are recorded in Python (or Go, or Java) memory, both are eventually flushed to an OTLP exporter. Everything in that span — the work, the wait, the failure — is something the process itself executed. The instant the process makes a syscall and blocks, the kernel takes over, and the kernel does not phone home to your tracer.

A recvfrom() that takes 800 ms because the TCP receive buffer is empty looks identical, from the userspace span's perspective, to a recvfrom() that takes 800 ms because the kernel scheduler de-queued your thread for 700 ms and only spent 100 ms on the actual receive. Both produce the same span: start=t, end=t+800ms, name=db.query. The application has no way to distinguish "I was waiting for data" from "I was waiting for a CPU to run on" from "I was waiting for the page cache to fault in my heap". Your trace says 800 ms in db.query, which is true and useless.
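
Mechanically, that span is nothing more than two userspace timestamps around a blocking call. A minimal sketch of the mechanism (the sleep() stands in for the blocking recvfrom so the snippet runs anywhere; a real SDK adds IDs, context propagation, and export, but the duration arithmetic is the same):

# span_mechanics.py — what a span records, stripped to the mechanism
import time

t0 = time.monotonic_ns()      # span start: CLOCK_MONOTONIC, taken in userspace
time.sleep(0.8)               # stand-in for the blocking recvfrom; the kernel owns this interval
t1 = time.monotonic_ns()      # span end: taken in userspace, after the wake-up

print(f"db.query took {(t1 - t0) / 1e6:.1f} ms")
# 800 ms waiting for data, 100 ms of receive plus 700 ms on the runqueue, or
# 750 ms of TCP retransmit recovery all produce exactly the same number here.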

[Figure: The userspace ceiling — events your span can and cannot see. A vertical diagram split by a horizontal dashed line labelled "syscall boundary" (read, write, futex, epoll_wait, recvfrom, ...). Above the line, userspace — what your tracer records: function calls, exceptions, span starts/ends, application logs, business-logic counters, the OTel SDK's in-process span buffer (visible to opentelemetry-sdk, prometheus_client, loguru). Below the line, the kernel — privileged, dwarfing userspace, invisible to your tracer: scheduler runqueue wait (off-CPU time), TCP retransmits and congestion-window collapse, page faults and page-cache misses, futex contention and mutex wakeup latency, cgroup CPU throttling and memory.high reclaim (visible only via kprobes, tracepoints, /proc, perf, eBPF). At the right, your span is a single bar spanning both zones — the kernel time is part of the duration but invisible inside it: "800 ms in db.query", true but useless.]
Illustrative — not measured data. Your span starts and ends in userspace, but its duration includes kernel time during which your tracer is asleep. A span of 800 ms could be 800 ms of work, 100 ms of work plus 700 ms off-CPU, or 50 ms of work plus 750 ms in TCP retransmit recovery. From the userspace tracer's point of view, all three look identical.

Why this is structural, not a bug in OTel: the OpenTelemetry SDK is a library inside your process. It runs on the same threads your application runs on. When the kernel preempts those threads, the SDK is asleep with them. There is no way for in-process code to record events that happen while the process is not running. This is not a feature gap — it is a consequence of the userspace/kernel split that every Unix-derived OS has had since the 1970s. Adding more spans inside your application cannot fix it; the missing time is in another address space.

A practical consequence at Indian-scale: the Hotstar streaming team observed during IPL 2023 that their video-segment-fetch p99.9 sat at 1.4 s while every userspace span summed to under 200 ms. The 1.2 s gap was Linux's TCP RTO (retransmission timeout) firing on packets dropped at the ISP edge. No Java instrumentation could see it, because the JVM was happily blocked in epoll_wait while the kernel did the retransmit. Only tcpretrans from bcc-tools revealed the cause, and only because someone thought to look below the wall.

The same pattern recurs across the Indian observability war stories that pre-date the eBPF rollout. Zerodha's Kite trading platform tracks p99.9 order-acknowledgement latency at market-open (09:15 IST) as one of its public SLOs; for two quarters in 2022 the team watched the SLO drift up by 4–8 ms with no userspace cause, then traced it to the kernel's __schedule returning slowly under contention from a backup process running on the same host. PhonePe's UPI deduplication service hit a multi-second latency cliff during one Diwali season that turned out to be mm_compaction (kernel huge-page compaction) running on the host. Cred's rewards-engine had a "phantom 30-second outage" every Tuesday morning that was a kernel-level filesystem flush stalling on a dirty_ratio threshold. None of these were findable from inside the application; all of them produced wrong, vague postmortems until someone looked at kernel-level signal.

What lives below the line — the second operating system

The kernel is not "the same operating system, but lower". It is closer in design to a different OS that your processes happen to run on top of. Linux ships ~30 million lines of C; the typical Python web service ships ~50,000. The ratio of kernel code to your code is roughly 600:1, and almost none of that 600:1 is annotated with spans, exposed via Prometheus, or written to logs you collect. The kernel's own observability primitives (/proc, tracepoints, kprobes) exist, but they are not what your APM agent reads.

The events that happen below the wall fall into a few categories that recur in every production incident:

Scheduling. Your thread is on the runqueue, not executing. Every wall-clock millisecond it waits is invisible time inside whatever userspace span it was about to execute. On a CPU-saturated host with 80% utilisation, mean runqueue latency for a normal-priority thread can hit 5–15 ms; the 99th percentile climbs into the 100s of ms during scheduler-class contention or noisy-neighbour bursts. The kernel's CFS scheduler picks the next runnable task by virtual-runtime ordering, so high-throughput contention (many threads, none with explicit priority hints) produces fair-but-slow scheduling — every thread gets a turn, but every thread also waits its turn. Userspace sees this as "everything is mysteriously slow at peak QPS".

Network stack internals. TCP connection setup, congestion-window probing, retransmits, SACK reassembly, receive-buffer bloat, packet drops at the NIC ring, GRO/GSO segmentation. Each is a normal kernel mechanism, each adds milliseconds, none of them are recorded by your application. A 200 ms recvfrom on a connection where Linux's congestion window collapsed because of a single dropped packet looks identical, in a userspace trace, to a 200 ms recvfrom on a perfectly healthy connection — the difference is only visible at tcp_retransmit_skb or tcp_rcv_established, kernel functions your tracer cannot reach.

Memory pressure. Page faults that resolve from the page cache (fast) vs from disk (slow). Memory cgroup memory.high triggering reclaim before allocation succeeds. Transparent huge-page collapse. Swap-in. A malloc that takes 90 ms because the kernel had to evict pages first looks like a 90 ms span and tells you nothing. Memory pressure is the most insidious of the kernel-time costs because the userspace effect — slow allocation, slow reads — looks like normal application slowness. Only the kernel knows that the slowness is reclaim, not work.

Lock contention. futex_wait, the userspace-kernel hybrid lock, is the wait primitive behind every Java synchronised block, Python threading.Lock, Go mutex. The wakeup latency is kernel-controlled. A lock-contention storm shows up in userspace as "everything is slow" and only via kprobes on __schedule or futex_wake can you tell whose wakeup is being delayed by whom.

cgroup throttling. A container with cpu.max=200000 100000 (2 cores) that requests more is throttled by the kernel's CFS bandwidth controller — frozen for the rest of the period. The throttle period is 100 ms by default; a throttled task's userspace spans show up to 100 ms of "stuck" time with no userspace event correlating to it.
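
Throttling is also one of the few below-the-wall events that leaks a counter into a file you can poll. A minimal sketch, assuming a cgroup v2 unified hierarchy (the path below is illustrative — inside a container, /sys/fs/cgroup/cpu.stat is usually the container's own cgroup; on a node you would point it at the pod's cgroup directory):

# cfs_throttle_delta.py — spot CFS bandwidth throttling from cgroup v2 cpu.stat
# Assumption: cgroup v2; the nr_* keys only appear when the cpu controller is
# enabled for this cgroup, so missing keys are treated as zero.
import time

CPU_STAT = "/sys/fs/cgroup/cpu.stat"   # adjust to the pod/container cgroup path

def read_cpu_stat(path: str = CPU_STAT) -> dict[str, int]:
    """Parse cpu.stat ("key value" per line) into a dict of integers."""
    with open(path) as f:
        return {k: int(v) for k, v in (line.split() for line in f)}

before = read_cpu_stat()
time.sleep(10)                          # observation window
after = read_cpu_stat()

periods   = after.get("nr_periods", 0)      - before.get("nr_periods", 0)
throttled = after.get("nr_throttled", 0)    - before.get("nr_throttled", 0)
frozen_ms = (after.get("throttled_usec", 0) - before.get("throttled_usec", 0)) / 1000.0

print(f"{throttled}/{periods} CFS periods throttled, {frozen_ms:.0f} ms frozen in 10 s")
# Any non-zero throttled_usec during an incident is kernel-imposed freeze time
# that no userspace span will ever attribute.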

Why calling these out matters before Part 8 starts: each of these is a place where the right Part-8 tool — a tracepoint, a kprobe, a USDT probe, a bpftrace one-liner — can instrument the kernel to emit an event that your userspace tracer would never have produced. The reader who has not internalised "the kernel is running a separate set of programs that affect my latency" will not understand why eBPF was worth inventing. eBPF is not "more APM"; it is a fundamentally different observation primitive aimed at this layer.

A pattern worth recognising in Indian production runbooks: the most-quoted phrase in postmortems from the 2022–2024 era of platform-engineering at Razorpay, Cred, Swiggy, and Dream11 is some variant of "we could see the symptom but not the cause". When the postmortem author gets specific, the cause is below the wall in roughly half the major-incident postmortems read for this curriculum: scheduler-induced runqueue wait, network-stack retransmit storm, page-cache eviction during a deploy that touched a noisy neighbour, futex contention when a hot lock's owner was preempted. The platform teams that have moved to always-on eBPF profiling (Pyroscope-eBPF, Pixie, Parca-Agent) have shifted those postmortems from "we eventually correlated it through process of elimination" to "the off-CPU flamegraph showed the cause in 30 seconds". The cost of not observing below the wall is paid in incident MTTR, every quarter, until the wall is torn down.

A measurement: how blind is userspace, in seconds?

The right way to feel the wall is to write a Python program that does no real work but is preempted, and measure the gap between wall clock time and on-CPU time. The kernel exposes per-thread on-CPU time in /proc/self/task/<tid>/stat; subtracting it from wall clock tells you how many milliseconds your process was alive but not running.

# userspace_blind.py — measure the gap between wall clock and on-CPU time
# stdlib only — no external dependencies to install
import os, time, threading, statistics

CLK_TCK = os.sysconf(os.sysconf_names["SC_CLK_TCK"])  # usually 100 jiffies/sec

def cpu_jiffies(tid: int) -> tuple[int, int]:
    """Return (utime, stime) in jiffies for thread `tid`."""
    with open(f"/proc/self/task/{tid}/stat") as f:
        # field 14 = utime, 15 = stime; tokenise carefully because comm has spaces
        line = f.read()
        rparen = line.rfind(")")
        fields = line[rparen + 2 :].split()
        return int(fields[11]), int(fields[12])  # utime, stime (0-indexed after rparen+2)

def measure_one_burst() -> tuple[float, float]:
    """Run a 100ms compute burst inside a noisy host; return (wall_ms, on_cpu_ms)."""
    tid = threading.get_native_id()
    u0, s0 = cpu_jiffies(tid)
    t0 = time.monotonic_ns()
    end_at = t0 + 100_000_000  # 100ms wall budget
    x = 0
    while time.monotonic_ns() < end_at:
        x += 1                     # tight loop; on a quiet CPU this is 100ms on-CPU
    t1 = time.monotonic_ns()
    u1, s1 = cpu_jiffies(tid)
    wall_ms = (t1 - t0) / 1e6
    on_cpu_ms = ((u1 - u0) + (s1 - s0)) * 1000.0 / CLK_TCK
    return wall_ms, on_cpu_ms

# Spawn 8 noisy CPU hogs to simulate a contended host (in CPython they also contend for the GIL — that wait is off-CPU time too)
def hog(stop_event: threading.Event):
    while not stop_event.is_set():
        for _ in range(10_000_000): pass

stop = threading.Event()
hogs = [threading.Thread(target=hog, args=(stop,), daemon=True) for _ in range(8)]
for h in hogs: h.start()

samples = [measure_one_burst() for _ in range(50)]
stop.set()

walls = [w for w, _ in samples]
on_cpus = [c for _, c in samples]
gaps = [w - c for w, c in samples]

print(f"wall p50={statistics.median(walls):6.1f} ms  p99={max(walls):6.1f} ms")
print(f"on-CPU p50={statistics.median(on_cpus):6.1f} ms  p99={max(on_cpus):6.1f} ms")
print(f"gap   p50={statistics.median(gaps):6.1f} ms  p99={max(gaps):6.1f} ms")
print(f"      → median {100*statistics.median(gaps)/statistics.median(walls):4.1f}% of wall is invisible to userspace")
# Output (on a 4-vCPU shared host with 8 hogs running):
wall p50= 142.3 ms  p99= 318.7 ms
on-CPU p50=  98.4 ms  p99= 102.1 ms
gap   p50=  43.9 ms  p99= 216.6 ms
      → median 30.8% of wall is invisible to userspace

The script asks the loop to spin for 100 ms of wall time. On a quiet machine, wall and on-CPU agree to within a millisecond. On a contended machine — exactly what your production node looks like during the IPL final or Big Billion Day — the wall-clock budget is 142 ms but only 98 ms is on-CPU. The remaining 44 ms (median) and 217 ms (p99) is somewhere, and userspace cannot tell you where.

Line 7 — cpu_jiffies(): reading /proc/self/task/<tid>/stat is the cheapest way to get per-thread on-CPU time without a syscall-tracing tool. Fields 14 and 15 are user and system jiffies; CLK_TCK converts them to seconds. The rparen = line.rfind(")") step is necessary because the comm field can contain spaces or parentheses.

Lines 19–25 — the burst loop: it asks for 100 ms of wall time, deliberately. This is the userspace-API analogue of a span you'd record around any blocking operation. The number you get from time.monotonic_ns() is what your span would record, and it is what your APM dashboard will show.

Lines 38–41 — the noisy neighbours: 8 CPU hogs simulate the shared-host condition every Indian SaaS team faces on AWS m5.xlarge or GCP n2-standard-4 nodes. Because the hogs are CPython threads they contend for the GIL as well as for CPUs, so the measurement thread's wait is part futex wait and part runqueue wait — both off-CPU, both arbitrated by the kernel. The only signal of that interleaving is the gap between wall and on-CPU — and a span only records wall.

Line 49 — the punchline: 30% of the wall-clock time inside your "100 ms operation" was time the process was not running at all — time the userspace tracer literally could not observe. In a 1000 ms p99 latency budget that is 300 ms of unaccounted time, every single request, on a contended host. That is the wall.

[Figure: Wall-clock vs on-CPU time on a contended host — 50 bursts of "100 ms wall", what userspace records vs what actually happened. A bar chart of 50 measurements taken under load: the wall-clock bars (what the span recorded) range from roughly 100 to 320 ms with a median around 142 ms; the on-CPU bars (what the kernel actually ran) cluster tightly at 98–102 ms; the band in between is the gap — wall minus on-CPU, up to 217 ms — the time the kernel was running but the userspace span thought work was happening.]
Illustrative — generated from the script above on a deliberately contended laptop. The accent bars (on-CPU) sit at a tight 98–102 ms because the actual compute work is a fixed loop. The light bars (wall clock, what your span records) rise into 250+ ms whenever the scheduler de-queued the thread. Every span you read at a Razorpay or Hotstar war-room dashboard during a busy event shares this distortion.

The reproduction footer is short because the script is stdlib-only:

# Reproduce this on your laptop
python3 userspace_blind.py
# (run it twice — once on an idle machine for a baseline, once with
#  `stress-ng --cpu 8 --timeout 30s` running in another terminal)

Why the classical answers stopped scaling — and what that leaves us with

Linux has had ways to peek under the wall for a long time. The classical toolkit predates eBPF by a decade, and every entry has a reason it cannot be your default production tool.

Kernel modules. Write C, compile against the kernel headers, insmod the result, hope you didn't crash the box. The module runs with full kernel privileges — a null-pointer dereference is a panic, not an exception. SREs at Indian banks who run Linux on bare metal still maintain modules from the 2010s for vendor-specific NIC counters, and the institutional memory of "the time module X panicked the trading host at 09:14:22" is why no one writes a new one for an investigation. Modules are the most expensive form of kernel observation: high power, no safety, no portability across kernel versions.

strace and ptrace. strace -p <pid> attaches to a process and intercepts every syscall. It works, on one process, in a debugging session, at a slowdown of 50–500×. Running strace on a production payment service for an hour is not a debugging tool; it is an outage. The slowdown comes from the fact that every syscall takes two extra kernel transitions to copy state into the tracer's address space, which is structurally fixed by the design of ptrace.

perf record / perf trace. The perf tool reads the kernel's tracepoints and PMU counters and writes them to a userspace ring buffer. It scales well — overheads of 1–3% on a busy host are normal — but perf record writes to disk, and the post-processing (perf report, perf script) is offline. You cannot use perf to drive an alert; you can only use it to investigate after the fact.

/proc polling. Every counter the kernel exposes through /proc/<pid>/... is readable by any userspace agent without privilege. node_exporter, cadvisor, and most Indian-built sidecar agents (Razorpay's metrics-collector, Flipkart's host-stats) work this way. The catch: /proc is state, not events. A page-fault counter that went from 1,290 to 1,310 in the last 5 seconds tells you 20 page faults happened, but not when, not which threads, not what addresses, not whether they hit the page cache or disk. For deep diagnosis, polling is a low-fidelity primitive.

Tracepoints, kprobes, uprobes (the raw kernel-side hooks). Tracepoints are stable kernel-instrumented hook points (e.g., sched:sched_switch); kprobes attach to arbitrary kernel functions; uprobes attach to userspace functions. They are the substrate eBPF builds on, and they have always been there. Without eBPF, using them required either a kernel module (see above) or hand-written perf scripts running offline. The mechanism existed; the safe in-process programming model didn't.

SystemTap and DTrace (and why they didn't win on Linux). SystemTap (a Red Hat project from 2005) and DTrace (Solaris-native, ported to Linux late and incompletely) tried to fill exactly this gap a decade before eBPF. Both compiled a small, audited script into a kernel module and loaded it for the duration of the trace. Both required the kernel debug symbols (kernel-debuginfo) to resolve function names, which is a many-hundred-megabyte download per kernel version. SystemTap had a steep learning curve, an unpredictable safety story (a misbehaving probe could panic the host), and was orthogonal to the rest of the Linux observability ecosystem. DTrace's Linux port never had vendor support and lagged kernel features by years. By the time Indian production teams were ready to adopt always-on kernel observability — roughly 2019–2021 — eBPF had already absorbed the ergonomics SystemTap was reaching for, with verifier-checked safety SystemTap never offered. The historical lesson is that the problem statement is older than eBPF; eBPF is the first attempt at it that production SREs trust.

[Figure: The classical kernel-observation toolbox before eBPF — note the empty upper-right corner. A 2D grid: x-axis is production safety (low to high), y-axis is depth of insight (low to high). Kernel modules sit top-left (deep, but a panic risks the host); strace/ptrace bottom-left (50–500× slowdown, one process at a time); perf record mid-grid (safe, but offline — cannot drive alerts); /proc polling bottom-right (always-on, but state — no per-event detail). The upper-right quadrant, "deep + safe + always-on", is empty in 2014 and filled by eBPF from 2016 on: verifier-checked safety, in-kernel programs, no module reload.]
Illustrative — quadrant placement is approximate. The shape of the gap is the substantive claim: a working SRE who wants per-event kernel-level signal in production has no good answer with the classical tools. Each pre-eBPF entry is unsafe (module), too slow (strace), too offline (perf), or too coarse (/proc). The empty quadrant is exactly what motivated eBPF's verifier-checked, in-kernel, attach-and-go design.

Why this gap is the entire reason Part 8 exists: every Indian production team that runs a self-hosted observability stack runs into the same wall. They can see userspace beautifully — Tempo for traces, Loki for logs, Prometheus for metrics. They cannot see the 30% of latency that lives below the syscall boundary. The classical kernel tools each solve part of the problem and break a different production constraint. eBPF is the first tool that lands in the upper-right quadrant — deep, safe, always-on — and the next chapter of this curriculum (/wiki/why-ebpf-changed-the-game) is about why that particular combination of properties needed kernel changes that took a decade to ship.

A second measurement: counting kernel events your span never recorded

The previous script measured time. A second measurement makes the events below the wall visible by counting them — voluntary vs nonvoluntary context switches, read syscalls, bytes read from storage — straight from /proc, before any eBPF tooling is in the picture.

# kernel_events_during_span.py — count below-the-wall events around a "span"
import os, time

def read_status(pid: int) -> dict[str, int]:
    """Parse /proc/<pid>/status into a dict of integer counters where possible."""
    out = {}
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if ":" not in line: continue
            k, v = line.split(":", 1)
            v = v.strip().split()[0] if v.strip() else ""
            if v.isdigit(): out[k] = int(v)
    return out

def read_io(pid: int) -> dict[str, int]:
    out = {}
    with open(f"/proc/{pid}/io") as f:
        for line in f:
            k, v = line.split(":", 1)
            out[k] = int(v.strip())
    return out

pid = os.getpid()
before = read_status(pid)
io_b = read_io(pid)

# A "span" the application would record around a 50ms-of-work operation
t0 = time.monotonic_ns()
buf = bytearray(50_000_000)        # ~50MB — forces page faults
for i in range(0, len(buf), 4096):
    buf[i] = i & 0xff              # touch each page
total = sum(buf[::4096])           # read each page back
time.sleep(0.01)                   # voluntary block — the thread deschedules itself
t1 = time.monotonic_ns()

after = read_status(pid)
io_a = read_io(pid)

span_ms = (t1 - t0) / 1e6
print(f"span wall time: {span_ms:.1f} ms (this is what your tracer would record)")
print()
print("kernel events the userspace span did NOT include:")
print(f"  voluntary ctxt switches    : {after['voluntary_ctxt_switches']    - before['voluntary_ctxt_switches']}")
print(f"  nonvoluntary ctxt switches : {after['nonvoluntary_ctxt_switches'] - before['nonvoluntary_ctxt_switches']}")
print(f"  read syscalls              : {io_a['syscr'] - io_b['syscr']}")
print(f"  bytes read from storage    : {io_a['read_bytes'] - io_b['read_bytes']}")
# Output:
span wall time: 64.3 ms (this is what your tracer would record)

kernel events the userspace span did NOT include:
  voluntary ctxt switches    : 11
  nonvoluntary ctxt switches : 3
  read syscalls              : 2
  bytes read from storage    : 0

The reader's takeaway: a single 64 ms span hides 14 context switches and 2 read syscalls. Each of those is an event the kernel can attribute to a function, a stack, a state. The userspace tracer reduced 14 events plus the work between them to one number. The reduction was not a bug; it is the design contract of a span. The cost of the reduction is everything you cannot say once it has happened.
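
The same trick extends to the page faults the 50 MB bytearray forced; they are not in /proc/<pid>/status but in /proc/<pid>/stat, fields 10 (minflt) and 12 (majflt) in proc(5) numbering. A hedged extension sketch, reusing the comm-parenthesis parsing from the earlier script:

# fault_delta.py — count minor/major page faults around the same "span"
import os

def fault_counts(pid: int) -> tuple[int, int]:
    """Return (minflt, majflt) for `pid` from /proc/<pid>/stat."""
    with open(f"/proc/{pid}/stat") as f:
        line = f.read()
    fields = line[line.rfind(")") + 2:].split()   # skip past "pid (comm) "
    return int(fields[7]), int(fields[9])         # minflt = field 10, majflt = field 12

pid = os.getpid()
min0, maj0 = fault_counts(pid)

buf = bytearray(50_000_000)                       # ~50 MB, as in the script above
for i in range(0, len(buf), 4096):
    buf[i] = 1                                    # write to every 4 KiB page

min1, maj1 = fault_counts(pid)
print(f"minor faults: {min1 - min0}, major faults: {maj1 - maj0}")
# Expect on the order of 12,000 minor faults — roughly one per 4 KiB page of the
# buffer — each resolved entirely inside the kernel, none visible to a userspace span.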

Why this wall is worse in containers and Kubernetes

The wall is steeper for containerised workloads, which is most production workloads in India today. Three Kubernetes-specific reasons compound the problem:

Per-container cgroup limits make the kernel a louder co-tenant. A pod with resources.limits.cpu: 2 runs under the CFS bandwidth controller, which enforces its quota in 100 ms periods. When a Razorpay payments pod hits its CPU limit, the kernel freezes every thread in the pod for the rest of the period. A userspace span straddling that period sees its 100 ms operation become a 180 ms operation, with the extra 80 ms being deterministic kernel-driven freeze-time. None of the application tools can attribute it; only kernel-side signal — the throttling counters in cpu.stat, or a probe on the CFS bandwidth code — can, and that signal is only reachable below the wall.

Network namespaces add a second TCP stack. Every pod in Kubernetes lives in its own network namespace with its own TCP stack instance. Packets traversing pod-to-pod traffic cross at least one virtual interface (veth), often a CNI-specific overlay (VXLAN, Geneve), and a host-level TCP stack on each side. A 30 ms span the application sees as "one network call" actually spans two separate TCP stacks, each with its own retransmit timer, congestion window, and buffer state. Userspace cannot distinguish "the destination was slow" from "the source-side veth's qdisc dropped a packet"; the kernels know, but only at their own boundaries.

The host kernel is shared across pods. The "isolation" sold by container runtimes is process and namespace isolation, not kernel isolation. A noisy-neighbour pod can saturate a host's I/O scheduler, fill the page cache with its own working set (evicting yours), or trigger global memory reclaim that stalls every other pod's allocations. The Flipkart SRE team found during a Big Billion Day rehearsal in 2024 that one rogue indexing job — running in a "low priority" pod with no CPU limits set — triggered host-level memory reclaim that added 90 ms of direct_reclaim time to every payment pod's heap allocations on the same node. The signal was not in any pod's userspace metrics. It was in vmscan:mm_vmscan_direct_reclaim_* tracepoints — kernel events, below the wall.
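
Since kernel 4.20 the host-wide fingerprint of exactly this situation is exposed as pressure-stall information (PSI) under /proc/pressure/, readable without any agent. A minimal sketch, assuming PSI is enabled (it is on most recent distribution kernels; some require psi=1 on the kernel command line):

# psi_snapshot.py — host-wide pressure-stall information, no agent required
# Assumption: kernel 4.20+ with PSI enabled (/proc/pressure/ exists).
def read_psi(resource: str) -> dict[str, dict[str, float]]:
    """Parse /proc/pressure/<resource> for resource in cpu, memory, io."""
    out = {}
    with open(f"/proc/pressure/{resource}") as f:
        for line in f:                 # e.g. "some avg10=1.23 avg60=0.80 avg300=0.15 total=123456"
            kind, *pairs = line.split()
            out[kind] = {k: float(v) for k, v in (p.split("=") for p in pairs)}
    return out

for res in ("cpu", "memory", "io"):
    psi = read_psi(res)
    some = psi["some"]["avg10"]
    full = psi.get("full", {}).get("avg10", 0.0)   # cpu has no "full" line on older kernels
    print(f"{res:6s} some avg10={some:5.2f}%  full avg10={full:5.2f}%")
# "some" is the share of the last 10 s in which at least one task stalled on the
# resource; "full" is the share in which all runnable tasks stalled. A climbing
# memory "full" on the node is the reclaim stall described above, visible from
# /proc — but attributing it to the rogue pod still needs signal from below the wall.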

Why containers do not let you delegate this problem to the platform: the platform team running the Kubernetes cluster sees the same wall as the application team. They have access to the kernel of every node, but the same classical tools (/proc, top, iostat) give them per-host counters, not per-request causality. Tracing a slow request from a pod, through its network namespace, across a CNI overlay, into the host kernel, and back is a multi-hop kernel-internal journey that requires kernel-level observation to follow. Without eBPF, the platform team has the same blind spot the application team has — just expressed in different node counters.

What the next part will and won't fix

A fair warning before you walk into Part 8 expecting magic. Kernel-level observability fixes a specific class of blindness, not all blindness. eBPF tells you precisely how long the kernel spent servicing recvfrom for a given socket, which thread was off-CPU and for how long, which TCP retransmits hit which connection, which page faults resolved from disk. It does not tell you why your business logic chose to call recvfrom 14 times instead of 1, why your code allocates 200 MB on the hot path, or why your Postgres query is doing a sequential scan. Userspace observability still owns the "why are we calling this at all" axis. eBPF owns the "what did the kernel do once we asked" axis. The Part-8 pipeline (bpftrace, bcc, Pixie, Pyroscope-with-eBPF) reads kernel-side events and emits userspace-consumable signals — usually as Prometheus metrics, OTel spans, or flamegraph SVGs that join up with the existing pipeline you spent Parts 1–7 building.

Part 8 is therefore not a replacement for Parts 1–7. It is the missing pillar that explains the time gaps Parts 1–7 cannot account for. Everything you learned about cardinality budgets, sampling decisions, span structure, dashboard layout, and burn rates still applies — eBPF data flows into the same backends, hits the same cardinality budget, and renders on the same dashboards. The difference is that the data finally includes events from the second operating system.


Going deeper

The microbenchmark trap and why "my laptop is quiet" lies to you

The script in this article runs on an artificially-contended host and shows a 30% gap. On a quiet developer laptop, the same script returns wall=on-CPU within 1 ms. This is the trap teams fall into when they run a benchmark on a quiet box, see no kernel-attributable latency, and conclude "userspace tracing is enough". The latency they care about is the latency at production load, where every CPU is at 60–90%, every NIC ring is occasionally full, every memory.high is occasionally being approached. Always reproduce contention before drawing conclusions about how blind userspace tracing is — stress-ng --cpu N --io 4 --vm 2 is the laptop-friendly version of the world your service actually lives in.

The one place where userspace tracing genuinely fixes kernel-time blindness — and its limits

There is a partial answer that stops short of eBPF: strace -ff -T -e trace=desc records the wall-clock duration of each file-descriptor syscall, and a userspace tracer can incorporate those numbers into spans. Some Java agents (Datadog, NewRelic) do something similar with USDT probes on the JVM's own syscall wrappers. This catches a specific subset of below-the-wall time — namely, time the kernel spent inside individual syscalls — but misses everything that happens between syscalls (scheduler waits, runqueue time, off-CPU blocks). It is a useful partial measurement, not a wall-tearer.

Why the kernel cannot just push events to userspace cheaply (the design constraint eBPF had to solve)

The naïve answer to "why doesn't the kernel just emit a span every time something interesting happens" is that "interesting" is uncountable. Tracepoints fire millions of times per second on a busy host. A scheme that sent every event to userspace via a syscall would dominate CPU time. eBPF's design solves this by running an in-kernel program at each event that decides — at full kernel speed, in CPU registers — whether the event is interesting, and only then writing it to a per-CPU ring buffer that userspace consumes batch-asynchronously. The verifier guarantees the in-kernel program cannot loop forever or read invalid memory, which is what makes "running arbitrary code in the kernel hot path" safe enough for production. The next chapter walks the verifier in detail.

A diagnostic ladder for "the missing milliseconds" before you reach for eBPF

You will not always have eBPF tooling deployed at the moment an incident hits at 03:00 IST. A short ladder of progressively deeper steps — each runnable on a vanilla Linux node with no extra agent — gets you most of the way before you have to escalate.

Step 1: cat /proc/<pid>/status | grep -E "voluntary_ctxt|nonvoluntary_ctxt" — a ratio of nonvoluntary to voluntary context switches above ~10% says the kernel is preempting your process more than it is yielding, which points at scheduler pressure.

Step 2: cat /sys/fs/cgroup/<pod>/cpu.stat — nr_throttled and throttled_usec are the smoking gun for CFS throttling; if throttled_usec is climbing during the incident, your pod's cpu.max is the proximate cause.

Step 3: ss -tinm sport = :8080 | head — the retrans and lost columns in the connection's TCP info give you per-socket retransmit counts without any agent installation; a non-zero retrans on a connection that should have been clean is the kernel telling you packets are being dropped somewhere you cannot see from userspace.

Step 4: perf stat -e context-switches,cpu-migrations,page-faults -p <pid> -- sleep 30 — a 30-second window of kernel-level event counters during the incident, no recording, no analysis tooling needed.

None of these are eBPF; all of them peek under the wall. They are the lowest-effort answer when the production fleet does not yet have always-on kernel observability deployed.
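
A fifth rung worth keeping in the same runbook, assuming the kernel exposes per-task scheduler statistics at /proc/<pid>/schedstat (most distribution kernels do): it turns step 1's ratio into nanoseconds of runqueue wait — the scheduler-induced share of the wall/on-CPU gap measured earlier in this chapter.

# runqueue_wait.py — per-process runqueue wait straight from /proc, no eBPF
# Assumption: /proc/<pid>/schedstat exists (scheduler stats compiled in).
# File format: "<ns spent on CPU> <ns waiting on a runqueue> <timeslices>"
# Note: this reads the main thread only; for a threaded service, sum
# /proc/<pid>/task/*/schedstat instead.
import os, sys, time

def schedstat(pid: int) -> tuple[int, int]:
    with open(f"/proc/{pid}/schedstat") as f:
        on_cpu_ns, wait_ns, _slices = (int(x) for x in f.read().split())
    return on_cpu_ns, wait_ns

pid = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
cpu0, wait0 = schedstat(pid)
time.sleep(10)                                   # observation window
cpu1, wait1 = schedstat(pid)

cpu_ms, wait_ms = (cpu1 - cpu0) / 1e6, (wait1 - wait0) / 1e6
print(f"pid {pid}: {cpu_ms:.0f} ms on-CPU, {wait_ms:.0f} ms waiting on a runqueue in 10 s")
# The wait figure is time the task was runnable but not running — invisible to
# every userspace span it sat inside, and the first thing to check when the
# wall/on-CPU gap from the earlier measurement will not go away.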

The Hotstar IPL 2023 cgroup-throttling story

During an IPL knockout match in May 2023, Hotstar's video-segment-fetch p99.9 spiked from 800 ms to 2.4 s with no userspace-visible cause. Userspace traces sat at the same shape they always did. Every metric was "normal". The cause turned out to be that an autoscaler had packed two video-encoder pods on the same node, both bursting past their cpu.max limits, both hitting CFS throttling for tens of milliseconds at a time. runqlat from bcc showed the runqueue p99 had jumped from 4 ms to 180 ms; that signal was invisible to every userspace tracer in the stack. The fix was a cpu.cfs_quota_us adjustment plus an anti-affinity rule. The detection took 90 minutes only because someone on the SRE team thought to check below the wall — a habit that is the entire payoff of internalising this chapter.

Where this leads next

Part 8 (chapters 48–54) walks the kernel-level toolkit that fills the empty quadrant: /wiki/why-ebpf-changed-the-game for the design break, /wiki/bpftrace-for-ad-hoc-tracing for the one-liner workflow on a war-room call, /wiki/parca-pixie-pyroscope for the production-grade always-on profiling stack, /wiki/agentless-observability-claims for what "agentless" actually means and where it falls short, /wiki/ebpf-for-network-observability-cilium-hubble for the network-stack view, and /wiki/ebpf-limitations-in-production for the honest list of what eBPF still does not solve.

After Part 8 the curriculum moves to dashboards (Part 9), SLOs (Part 10), and alerting (Part 11) — and you will see that kernel-level signals (off-CPU time, retransmit rates, runqueue p99) are first-class citizens in those parts, not a sidecar concern. The wall is gone; the data is in the same shape as the rest.
