Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Live debugging without stopping the world

Riya is on call for SetuStream's streaming-platform team. It is 21:47 IST during the 2026 IPL final, the dashboard shows 24.3 million concurrent viewers, and the catalogue-API pods are returning p99 = 1.8s against a 1.5s SLO. The flamegraph from this morning's pre-game capture looks fine. The CPU on every pod is sitting at 62%. Nothing is dying, nothing is OOM-ing, no log line is screaming. Riya's instinct from /wiki/heap-dumps-and-core-dumps is to capture state — but a gcore pause is 4 seconds, and 4 seconds during the IPL final means a reconnect storm of 96 thousand viewers slamming Akamai's edge. She cannot stop the world. What she can do is run py-spy record -o flame.svg --pid 18472 --duration 30 --rate 250 and bpftrace -e 'profile:hz:99 /pid == 18472/ { @[ustack] = count(); }' in two terminals and watch real call stacks scroll past at 250 Hz, with the sampling overhead costing the pod less than 0.4% of CPU. Within ninety seconds she sees a tight loop in validate_subtitle_track that nobody noticed because the unit test ran with one subtitle track and production runs with thirty-seven. The fix is one if statement; she ships it without restarting any pod, and p99 falls back to 1.1s by 22:03. She never paused the process for a single millisecond.

Live debugging is the set of techniques that observe a running process without stopping it: sampling profilers (py-spy, async-profiler, rbspy), kernel-side tracers (perf record, bpftrace), and dynamic instrumentation (uprobes, USDT, JFR). All of them trade tiny overhead (typically 0.1–1% of CPU) for partial information — you sample stacks at 100–999 Hz instead of recording every function call. The discipline is choosing the right sampling rate, the right vantage point (user-space vs kernel), and the right output format so that when the incident hits you can ask the running process a clean question and get a useful answer in under 60 seconds.

Why "without stopping the world" is the load-bearing constraint

The previous chapter showed that capturing state — a heap dump, a core dump — costs the process a stop-the-world pause whose length scales with RSS. Four seconds for a 47 GB JVM, sub-second for a 4 GB Python service, but always non-zero. For a payments-reconciliation worker at 03:14 IST that pause is acceptable; for a streaming pod during the IPL final it is not. The customer-visible blast radius of a pause is the difference between "the on-caller captures evidence" and "the on-caller ships an incident on top of the incident".

Live debugging replaces the pause with sampling. Instead of asking "what is in every byte of process memory right now", you ask "what call stack is each thread on right now, sampled 250 times per second, for 30 seconds". The output is statistical — you do not see every function call, just the ones that were on-CPU long enough to be sampled — but for performance investigations that is exactly the right resolution. A function consuming 80% of CPU appears in roughly 80% of samples; a function consuming 0.1% lands in about 1 in 1,000 samples (seven or eight of the 7,500 samples in a 30-second capture at 250 Hz), a sliver too narrow to distract anyone reading the flamegraph. The information you trade away is precisely the information you do not need for "where is the time going" investigations.

[Figure: stop-the-world capture vs sampling — the same 30-second window under two capture strategies. Top row, stop-the-world (gcore / jmap): a single 4-second PAUSE block during which requests pile up in the kernel RX queue; customer-visible outage of 4 seconds, Akamai's edge times out at 3s, a reconnect storm follows. Bottom row, sampling (py-spy / async-profiler at 250 Hz): thin tick marks every 4 ms, one per stack sample, 7,500 samples in 30s, 0.4% CPU overhead, 0 ms customer-visible. Illustrative — not measured data; production runs use 99-999 Hz depending on tool.]
The trade is overhead-vs-completeness. A pause-based capture sees every byte but stops the world; a sampling capture sees a statistical fraction at 0.4% overhead. For "where is the time going" the sampling answer is sufficient; for "what is in this specific object" it is not.

Why sampling is statistically sound for performance debugging: the population you care about is "instructions that were executed during the window". A function that ran for 1 second out of a 30-second sampling window contributes ≈3.3% of the population. At 250 Hz over 30s you have 7500 samples; the expected number of samples landing in that function is ≈250, which gives a binomial standard error under 1% of total samples. You will not miss a hot path, and you will not wrongly elevate a cold one. The sampling theorem here is the same one the audio world's CD format uses — sample at twice the highest signal frequency you care about, and aliasing disappears. For "find the function eating 5%+ of CPU" a 99 Hz sampler is plenty; for "find the function eating 0.5%" you need 999 Hz and a longer window.
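To make the arithmetic concrete, here is the binomial check as a few lines of Python (the numbers match the paragraph above; the three CPU fractions are hypothetical functions, nothing here is tool-specific):

# sample_math.py — expected sample counts and standard errors for a
# 30-second capture at 250 Hz, for three hypothetical functions.
import math

rate_hz, window_s = 250, 30
total = rate_hz * window_s                    # 7,500 samples
for cpu_frac in (0.80, 1 / 30, 0.001):        # CPU hog, 1 s of 30 s, 0.1% sliver
    expected = total * cpu_frac
    se = math.sqrt(total * cpu_frac * (1 - cpu_frac))
    print(f"{cpu_frac:7.2%} of CPU -> {expected:6.1f} expected samples "
          f"+/- {se:5.1f}  (SE = {se / total:.2%} of all samples)")

Running it prints roughly 6,000 ± 35 samples for the 80% hog, 250 ± 15.5 for the one-second function, and 7.5 ± 2.7 for the 0.1% sliver: the "you will not miss a hot path" guarantee, quantified.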

The cost model of "live" tools is therefore a function of three knobs: sampling frequency (Hz), stack-walk depth (frames), and vantage point (user-space, kernel-space, or both). Higher Hz catches shorter functions; deeper stack-walking gives you call-context but costs more per sample; kernel-space sees blocked threads and syscalls but loses managed-runtime detail. A well-configured live debugger picks the cheapest combination that answers the specific question. A flamegraph asking "where is on-CPU time going" runs at 99 Hz with 25-frame stacks, costing ≈0.1% CPU. A flamegraph asking "where are off-CPU stalls" runs as a kernel-side bpftrace watching sched:sched_switch, costing ≈0.3% CPU. A real production-debugging session cycles between three or four such queries in five minutes and walks away with answers without ever pausing the process.

The four families of live-debugging tools

The tooling fragments by language runtime and by vantage point. Memorise the families; they map cleanly onto the failure modes you will hit on call.

Family 1: language-specific sampling profilers. Each managed runtime has a sampling profiler that walks its own stack representation: py-spy for CPython, async-profiler for the JVM, rbspy for Ruby, pprof (built-in) for Go, dotnet-trace for .NET, 0x for Node.js. The shared mechanism is process_vm_readv — they ptrace-attach for a microsecond, copy the relevant runtime structures, and detach. They do not stop the world; the target process pauses for ≈10 µs per sample, well under any latency SLO. The output is a flamegraph in the runtime's own symbolic vocabulary: Python function names, Java method signatures, Ruby class names. Why per-runtime tools win for managed languages: a generic perf record against a Python process sees the CPython interpreter's eval loop in 90% of samples — _PyEval_EvalFrameDefault is the hot box, and you learn nothing. py-spy reads CPython's frame stack out of the interpreter's own data structures (PyFrameObject, PyThreadState) and reports the Python-level call stack, which is what the engineer actually wants to see. Same for the JVM with async-profiler walking JavaThread and frame structs.
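The process_vm_readv mechanism is small enough to show directly. A minimal sketch (Linux-only; the file name and demo buffer are hypothetical, and this is not py-spy's actual code): real samplers do this against the target's PID hundreds of times per second and then parse runtime structs such as PyThreadState out of the returned bytes.

# remote_read_sketch.py — the syscall at the heart of family-1 samplers.
import ctypes, ctypes.util, os

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
libc.process_vm_readv.restype = ctypes.c_ssize_t

class IOVec(ctypes.Structure):
    _fields_ = [("iov_base", ctypes.c_void_p), ("iov_len", ctypes.c_size_t)]

def read_remote(pid: int, addr: int, length: int) -> bytes:
    # Copy `length` bytes at `addr` out of process `pid`, without pausing it.
    buf = ctypes.create_string_buffer(length)
    local = IOVec(ctypes.cast(buf, ctypes.c_void_p), length)
    remote = IOVec(ctypes.c_void_p(addr), length)
    n = libc.process_vm_readv(pid, ctypes.byref(local), 1,
                              ctypes.byref(remote), 1, 0)
    if n < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return buf.raw[:n]

# Demo: read a buffer out of our own process by PID. A real sampler reads
# another process's thread states and frame chain the same way.
target = ctypes.create_string_buffer(b"a PyThreadState would live here")
print(read_remote(os.getpid(), ctypes.addressof(target), len(target.value)))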

Family 2: kernel-side sampling and tracing. perf record -F 99 -p <pid> samples the on-CPU instruction pointer at 99 Hz from the kernel. bpftrace -e 'profile:hz:99 { @[ustack] = count(); }' does the same via eBPF. Both work for any process, regardless of language, and see kernel-side activity (syscalls, page faults, scheduler context-switches). They lose runtime-level detail unless you enable JIT-symbol resolution — the JVM's -XX:+PreserveFramePointer and perf-map-agent, Node's --perf-basic-prof, the V8 perf integration. Without those you see [unknown] in the flamegraph wherever the JIT is running. The kernel-side family is also where off-CPU profiling lives: bpftrace watching sched:sched_switch shows you which threads are blocked and why, which a user-space sampler cannot see because blocked threads are not on-CPU and so are not sampled.

Family 3: dynamic instrumentation via probes. uprobes (user-space probes) and kprobes (kernel probes) let you attach a 12-instruction eBPF program to any function entry or return of a running process, without recompiling, restarting, or even pausing. bpftrace -e 'uprobe:./payments-svc:hot_path { @start[tid] = nsecs; }' -e 'uretprobe:./payments-svc:hot_path /@start[tid]/ { @lat = hist(nsecs - @start[tid]); }' gives you a histogram of how long hot_path took, measured in production at full traffic, with overhead under 1%. USDT (user statically-defined tracing) is the same idea but uses pre-defined trace points compiled into the binary by the runtime itself — Python's function__entry/function__return, the JVM's hotspot:method__entry, MySQL's query__start/query__done. Why dynamic instrumentation belongs in the live-debugging set: sampling tells you where time is spent on average; uprobes tell you the distribution of latency for a specific function under real traffic. Knowing that validate_subtitle_track is 5% of CPU on average is one signal; knowing that it has a tail at 800 ms p99.9 is a different one, and only a uprobe-based latency histogram can give you that. They compose: a sampling profiler points at the suspect function, a uprobe quantifies its tail.

Family 4: in-process flight recorders. Java Flight Recorder (JFR), .NET EventPipe, Linux LTTng's userspace tracer, and continuous-profiling SaaS tools (Grafana Pyroscope, Polar Signals, Datadog Continuous Profiler) all run an in-process ring buffer that records events continuously at sub-1% overhead and lets you dump the last N seconds on demand. JFR records ~80 event types — allocation samples, lock contention, GC events, IO operations — into a bounded ring buffer (200 MB in the configuration shown later) that you dump on demand. These are the "always-on, query-on-demand" tools: when an incident hits, you do not need to attach a sampler — you already have 5 minutes of pre-incident data sitting in the ring buffer, which is often the only way to debug a transient spike that has already ended.
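The ring-buffer pattern itself fits in a few lines. A toy flight recorder (hypothetical; nothing like JFR's binary format) that records continuously into a bounded deque and dumps the recent past on SIGUSR1, analogous to what jcmd <pid> JFR.dump does for the JVM:

# ring_recorder_sketch.py — toy family-4 flight recorder (hypothetical).
import collections, json, os, signal, time

RING = collections.deque(maxlen=100_000)      # bounded: oldest events fall off

def record(event: str, **fields) -> None:
    RING.append({"t": time.time(), "event": event, **fields})

def dump_last(signum=None, frame=None, seconds: float = 300.0) -> None:
    cutoff = time.time() - seconds            # keep only the last 5 minutes
    recent = [e for e in RING if e["t"] >= cutoff]
    path = f"/tmp/flight_{os.getpid()}.json"
    with open(path, "w") as f:
        json.dump(recent, f)
    print(f"[recorder] dumped {len(recent)} events -> {path}")

signal.signal(signal.SIGUSR1, dump_last)      # kill -USR1 <pid> to dump

if __name__ == "__main__":
    for i in range(1_000):                    # steady-state "service" work
        record("request", req_id=i, latency_ms=5 + (i % 7))
    dump_last()                               # on-demand dump, like JFR.dump

The load-bearing property is the maxlen: steady-state recording costs one append, old events fall off for free, and the dump is the only expensive operation, which is why always-on recording at sub-1% overhead is feasible.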

The four families are not redundant. Each one captures evidence the others miss:

  • "Where is on-CPU time going in this Python service?" Best: family 1 (py-spy record). Why the others fail: perf record shows only interpreter dispatch; family 4 needs to be installed before the incident.
  • "Why is this thread blocked for 200 ms?" Best: family 2 (off-CPU eBPF). Family 1 doesn't sample blocked threads; family 3 needs you to know which function in advance.
  • "What is the p99.9 latency of validate_subtitle_track under real traffic?" Best: family 3 (uprobe latency histogram). Sampling gives averages, not distributions; flight recorders may not have it as an event type.
  • "What was the JVM doing 30 seconds ago when the spike happened?" Best: family 4 (JFR ring buffer). Families 1, 2, and 3 require attaching during the spike; if it has ended, you missed it.

Indian-scale production teams typically run family 4 always-on (Pyroscope or Grafana's continuous profiler at 0.5% overhead), reach for family 1 during incidents (py-spy or async-profiler from a sidecar pod), and reach for families 2 and 3 when the question is specific enough to formulate as a probe. PaisaBridge's payments core, SetuStream's streaming platform, and ParakhTrade Kite's order-match all run this layered setup; the configuration is unglamorous infrastructure work that pays off the first time an incident lands and the on-caller has the answer in 90 seconds instead of 90 minutes.

A worked artefact — sampling a Python service while it serves traffic

A runnable demo: a Python service with a hidden hot loop, a driver thread that hits it in a tight loop, and a sampling profiler that finds the hot path while the driver is still running. No pauses, no restarts.

# live_sampler_demo.py — drive a "service" while a sampling profiler watches it.
# Demonstrates the live-debugging loop: load running, profiler samples, no pause.
# Run: python3 live_sampler_demo.py
#
# Requires: pip install py-spy
#   (py-spy needs ptrace_scope=0 on Linux, or sudo; attaching on macOS typically requires sudo too.)

import os, subprocess, threading, time

# --- the "service": one fast path, one accidentally-slow path -----------
def parse_subtitle_track(track_id: int) -> int:
    # Looks innocent. The "validation" walks every UTF-8 codepoint in a
    # ~70 KB caption payload because the unit test uses a 1 KB file.
    # Production runs with 37 tracks per video; QA tests with one.
    payload = ("subtitle line " * 5000).encode("utf-8")   # 14 B x 5000 = ~70 KB
    seen = 0
    for byte in payload:                  # O(N) per track; 37 tracks = 37x
        if byte < 0x80:
            seen += 1
    return seen

def serve_request(req_id: int) -> int:
    # Fast path: the cheap work (a list comprehension, so it shows up as
    # <listcomp> in the flamegraph below).
    base = sum([i * i for i in range(2_000)])
    # Slow path: validate every subtitle track for this video.
    for track in range(37):
        base += parse_subtitle_track(track)
    return base

# --- driver: hit the service in a tight loop -----------------------------
stop_flag = threading.Event()

def driver():
    n = 0
    t0 = time.perf_counter()
    while not stop_flag.is_set():
        serve_request(n)
        n += 1
    elapsed = time.perf_counter() - t0
    print(f"[driver] served {n} requests in {elapsed:.1f}s -> {n/elapsed:.1f} rps")

# --- main: kick off the driver, attach py-spy as a child, wait -----------
def main():
    driver_thread = threading.Thread(target=driver, daemon=True)
    driver_thread.start()
    time.sleep(0.5)                       # warm up so py-spy sees real load

    out_svg = "/tmp/live_sample.svg"
    py_spy = subprocess.Popen([
        "py-spy", "record",
        "-o", out_svg,
        "--pid", str(os.getpid()),
        "--duration", "10",
        "--rate", "250",
        "--idle",
    ])
    py_spy.wait()

    stop_flag.set()
    driver_thread.join()
    print(f"[main] flamegraph at {out_svg}")
    print(f"[main] size: {os.path.getsize(out_svg)/1024:.1f} KB")

if __name__ == "__main__":
    main()
# Sample run on a 16-GB MacBook (Python 3.11, py-spy 0.3.14):
[driver] served 412 requests in 10.5s -> 39.2 rps
[main] flamegraph at /tmp/live_sample.svg
[main] size: 18.3 KB

# Reading /tmp/live_sample.svg in a browser shows:
#   serve_request                               99.4% wide
#     parse_subtitle_track                      94.1%   <-- the leak
#       (loop body, byte < 0x80 comparison)    93.8%
#     <listcomp> in serve_request                5.1%
#   driver                                       0.4%

Walk through the load-bearing lines:

  • py_spy = subprocess.Popen(["py-spy", "record", ...]): py-spy is launched as a child process and attaches to the PID of the parent that launched it. The driver thread keeps running the entire time — every 4 ms (250 Hz) py-spy does a ≈10 µs process_vm_readv to copy PyThreadState and walk the frames, then sleeps. The driver hits 39 rps throughout; without the sampler it hits ≈40 rps. The sampling overhead is ≈2.5%, higher than the 0.1–1% production target because this demo is single-threaded and CPU-bound. Why production overhead is lower: real services are not 100% on-CPU — they spend most of their time waiting on network or disk. A 250 Hz sampler over a service that's 60% on-CPU costs 250 × 10 µs × 0.6 = 1.5 ms of work per second per thread, which is 0.15% of one core. Across the 8 threads of an 8-core pod doing 200 RPS at 60% CPU, that is 12 ms of sampler work per second — 1.2% of a single core, or 0.15% of the pod — perfectly acceptable for a 30-second capture. The demo's 2.5% is worst-case; the arithmetic is sketched as code after this list.
  • --rate 250: sampling frequency in Hz. The default is 100; payments and trading services use 99 (one less than 100, to avoid harmonic aliasing with periodic background work — same trick the kernel uses). 250 Hz catches functions as short as 4 ms; 999 Hz catches functions as short as 1 ms but doubles overhead. Pick the lowest rate that still names your suspect.
  • --idle: include threads that are blocked on I/O (sleeping, in select, etc.) in the output. By default py-spy only shows on-CPU threads. For a service that's mostly waiting on the network, --idle is essential; otherwise the flamegraph shows 5% of activity (the on-CPU work) and you miss the 95% that's blocked on Kafka or HTTP.
  • --duration 10: how long to sample. Best practice is 30 seconds for a stable picture, 90 seconds if you suspect a periodic effect (a GC every 60s, a flush every 30s). Do not sample less than 10 seconds — the sample size is too small to distinguish a 5% function from a 1% function.
  • The flamegraph output: parse_subtitle_track is 94.1% wide. That single number tells you the entire optimisation path: eliminate that function's cost and the service gets ≈17× faster (Amdahl: 1 / (1 − 0.941) ≈ 17). No other function matters until that one is fixed.
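The overhead arithmetic from the first bullet, as a sketch. This is a hypothetical model: it counts only the per-sample memory-copy time, which is why it predicts ≈0.25% for this demo while the measured figure is ≈2.5% — the gap is the per-sample bookkeeping the model ignores.

# overhead_model.py — back-of-envelope sampler cost: rate x walk time
# x on-CPU fraction x threads, expressed as CPU consumed per second.
def sampler_cost_ms_per_s(rate_hz: int, walk_us: float,
                          on_cpu_frac: float, threads: int) -> float:
    return rate_hz * (walk_us / 1000.0) * on_cpu_frac * threads

demo = sampler_cost_ms_per_s(250, 10, 1.0, 1)    # this script: one hot thread
prod = sampler_cost_ms_per_s(250, 10, 0.6, 8)    # 8 threads, 60% on-CPU
print(f"demo:       {demo:4.1f} ms/s = {demo / 10:.2f}% of one core")
print(f"production: {prod:4.1f} ms/s = {prod / 10:.2f}% of one core "
      f"= {prod / 80:.2f}% of an 8-core pod")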

The shape of this script — load running, sampler attaches without pause, output is a flamegraph in the runtime's vocabulary — is the production-debugging template for any managed runtime. Substitute async-profiler -d 30 -f /tmp/flame.svg <pid> for the JVM, go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30 for Go, rbspy record --pid <pid> for Ruby, and the workflow is identical: arm the sampler, watch the flamegraph, find the fat box.

Reading the flamegraph — the "fat-box" algorithm

A flamegraph is a stacked bar chart: the x-axis is sample count, the y-axis is stack depth. Each box's width is "fraction of samples that had this function on the stack" (a sketch that computes exactly this from folded stacks follows the list below). The algorithm for reading one is mechanical:

  1. Top-down or bottom-up? Flamegraphs are conventionally drawn with main at the bottom and the deepest call at the top. The fat boxes near the top are the leaves — the actual on-CPU work. The fat boxes lower down are call sites that lead to the work. Both are useful; new readers should look at the top first.
  2. Find the widest top-level box. That is your hot leaf. In the demo above, the top of the stack is the loop body of parse_subtitle_track at 93.8% — the leaf where CPU is actually being burned.
  3. Walk down to find the cause. From the hot leaf, walk down the stack. If the parent (parse_subtitle_track at 94.1%) is also fat, the function itself is the issue. If the parent is thin and only one child is fat, the issue is in one specific call site. If the parent has many fat children, the issue is in the function that calls all of them.
  4. Cross-check with the "histogram" view. Most flamegraph viewers (Brendan Gregg's flamegraph.pl, speedscope.app, Pyroscope's UI) let you click a function name and see all its instances aggregated. Useful when the same function is called from multiple paths.
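Box widths are computable directly from collapsed stacks — the "frame;frame;frame count" text format that flamegraph.pl's stackcollapse scripts and py-spy's raw output emit. A sketch, with illustrative counts chosen to match the demo's flamegraph:

# fold_widths.py — compute flamegraph box widths from collapsed stacks.
# Width = fraction of samples with that frame anywhere on the stack.
from collections import Counter

def box_widths(folded: list[str]) -> tuple[dict[str, float], int]:
    hits, total = Counter(), 0
    for line in folded:
        stack, _, count = line.rpartition(" ")
        n = int(count)
        total += n
        for frame in set(stack.split(";")):   # a frame counts once per sample
            hits[frame] += n
    return {f: c / total for f, c in hits.items()}, total

demo = [                                      # illustrative counts, total 2,500
    "serve_request;parse_subtitle_track 2353",
    "serve_request;<listcomp> 127",
    "serve_request 5",
    "driver 10",
    "<idle> 5",
]
widths, total = box_widths(demo)
for frame, frac in sorted(widths.items(), key=lambda kv: -kv[1]):
    print(f"{frame:24s} {frac:6.1%}")         # serve_request 99.4%, parse 94.1%...

Note the set(): a frame is counted once per sample even if it appears twice on a recursive stack, which is what keeps widths at or below 100%.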
+--------------------------------------------------+
|  serve_request                            99.4%  |   <-- driver loop
+----------------------+---------------------------+
|  parse_subtitle      |  <listcomp>          5.1% |   <-- 37x calls vs 1
|     _track    94.1%  |                           |
+----------------------+---------------------------+
|  for byte in         |                           |
|     payload  93.8%   |                           |   <-- the actual leaf
+----------------------+---------------------------+

The ASCII sketch above renders as a 720 × 320 SVG below. Notice how the eye is drawn to the widest top box: the loop body of parse_subtitle_track. That is what makes flamegraphs work — they exploit the human visual system's fast-area-comparison instinct, the same instinct that lets you spot the biggest slice of a pie chart in 200 ms.

[Figure: flamegraph of the live_sampler_demo run — wider = more samples = more CPU; the top box wins. serve_request (99.4% of samples) spans the bottom; parse_subtitle_track (94.1%) sits above it with a 5% <listcomp> box to its right; the loop body "for byte in payload: if byte < 0x80" (93.8%) is the widest top box, the hot leaf; driver (0.4%) is a thin sliver on the far right. Reading order: pick the widest top box (the leaf), walk down to find the call site responsible. A fix at the leaf shrinks the entire stack above it; a fix at serve_request only shrinks the 5% sibling. Illustrative — generated by py-spy at 250 Hz over 10s.]
The flamegraph compresses 2,500 samples into a 4-row image. The eye finds the widest top box in under a second; the algorithmic search for the hot path takes less than a minute. The remaining 6% of the image — the listcomp, the driver — is correctly de-emphasised.

A trap that catches juniors on their first flamegraph: width is samples, not wall-clock. A function that runs in parallel on 8 cores at 100% CPU appears 8× wider per wall-clock-second than one running on 1 core. Two flamegraphs of the same workload sampled on differently-sized hosts are not directly comparable unless you normalise by core count. The --threads flag in async-profiler and py-spy --threads show per-thread breakdowns that disambiguate this. For multi-threaded production services, always sample with per-thread output and confirm the hot path is in the threads you expected.

A second trap: flamegraphs hide off-CPU time. A service that spends 90% of its time blocked on Kafka and 10% on-CPU shows the 10% as if it were 100% — the on-CPU view is normalised. To see the 90% you need an off-CPU flamegraph (Brendan Gregg's offcputime.bt), which uses kernel scheduler events to capture which stacks were blocked and for how long. Off-CPU flamegraphs look identical to on-CPU ones but tell a completely different story; the discipline is to capture both during an incident and reconcile what each one says.
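You can feel this trap without any profiler at all: compare CPU time to wall time for a request that is mostly blocked (a hypothetical 90/10 split below). An on-CPU flamegraph of this workload would show the busy-wait loop at 100% — technically true, and 90% misleading.

# offcpu_shape.py — the "low CPU, high latency" shape (hypothetical demo).
# Each request blocks for 90 ms and computes for 10 ms; an on-CPU profiler
# sees only the compute, so its flamegraph looks "100% busy".
import time

def handle_request() -> None:
    time.sleep(0.090)                        # blocked: socket, lock, disk...
    deadline = time.perf_counter() + 0.010   # on-CPU: 10 ms of real work
    while time.perf_counter() < deadline:
        pass

wall0, cpu0 = time.perf_counter(), time.process_time()
for _ in range(20):
    handle_request()
wall = time.perf_counter() - wall0
cpu = time.process_time() - cpu0
print(f"wall {wall:.2f}s  cpu {cpu:.2f}s  on-CPU fraction {cpu / wall:.0%}")
# -> roughly: wall 2.00s  cpu 0.20s  on-CPU fraction 10%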

Common confusions

  • "Sampling profilers miss short functions" Only if the function is shorter than 1/sampling-rate. At 999 Hz a 1 ms function appears in roughly 1 in 1 sample — it is correctly captured. What sampling cannot do is count invocations: a function called 1 million times for 1 µs each appears the same in a flamegraph as one called 1 time for 1 second. Use uprobes or USDT for invocation counts; use sampling for "where is the time".
  • "py-spy and perf record produce the same flamegraph for a Python service" They produce different flamegraphs. py-spy walks Python's frame stack and shows parse_subtitle_track. perf record walks the C stack and shows _PyEval_EvalFrameDefault — the interpreter's eval loop, called once per Python frame. For Python investigations always use py-spy. For C extension investigations use perf record -p <pid> and --call-graph dwarf.
  • "Live profiling has zero overhead" It has small but measurable overhead: 0.1–1% for sampling profilers, up to 3% for high-rate (999 Hz) sampling on CPU-bound services, sub-1% for eBPF tracers. The right framing is "the overhead is below your latency SLO's noise floor" — not "zero". Tools that claim zero overhead are either lying or running so coarse they are not useful.
  • "async-profiler and jstack give you the same information" jstack is a snapshot — one thread-stack dump at the moment you ran it. async-profiler is a sampler — thousands of stacks aggregated over a window. jstack is great for "what is each thread doing right now" (especially deadlock investigations); async-profiler is great for "where is CPU going on average". Use them together: jstack for instantaneous state, async-profiler for distribution.
  • "Continuous profilers replace incident-time sampling" They cover most cases — but a continuous profiler running at 100 Hz over a one-second spike captures only 100 samples, often too few to find the cause of the spike. During an incident you want to increase the sampling rate temporarily to 999 Hz on the affected pods. Continuous profiling is the baseline; incident-time sampling is the zoom-in.
  • "bpftrace is for kernel-only investigations" bpftrace reaches into user-space via uprobe and uretprobe. You can attach it to any function in any binary — a Python C extension, a Go binary's crypto/tls.Conn.Read, MySQL's query_done USDT probe — and trace events from production with sub-1% overhead. It is the most underrated user-space debugger in the toolkit.

Going deeper

Why sampling rates of 99 and 999 Hz (not 100 / 1000)

The Linux kernel's tick-driven background work runs at the configured CONFIG_HZ, typically 100, 250, or 1000. If your sampler runs at exactly 1000 Hz on a 1000-Hz kernel, every sample lands at the same phase of the tick, and you systematically over-count whatever runs at the tick boundary (timer interrupts, scheduler housekeeping, accounting). Choosing 999 Hz means the sample phase drifts by 1 / 1000 of a tick per sample, decorrelating the sampler from the tick after ~1000 samples. Brendan Gregg's perf record -F 99 and bpftrace's profile:hz:99 are this trick at the lower frequency. The defaults of 99 and 999 are not arbitrary; they are anti-aliasing.
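The aliasing effect is easy to simulate. A sketch with a hypothetical 0.5 ms handler on a 10 ms tick (a true cost of 5% of one core); integer microseconds keep the phase arithmetic exact:

# alias_demo.py — why 99 Hz and not 100: a 100 Hz sampler is phase-locked
# to a 10 ms tick and lands on the tick handler every time; 99 Hz drifts
# across all phases and recovers the true cost.
def sampled_fraction(rate_hz: int, seconds: int = 30,
                     tick_us: int = 10_000, handler_us: int = 500) -> float:
    period_us = 1_000_000 // rate_hz      # 100 Hz -> 10000 us; 99 Hz -> 10101 us
    n = seconds * 1_000_000 // period_us
    hits = sum((i * period_us) % tick_us < handler_us for i in range(n))
    return hits / n

for hz in (100, 99):
    print(f"{hz:>3} Hz: handler appears in {sampled_fraction(hz):6.1%} "
          f"of samples (true cost 5.0%)")
# -> 100 Hz reports ~100%; 99 Hz reports ~5%. The round number lies.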

This matters more than it sounds. A team at BharatBazaar once spent two days investigating a "20% CPU in ktimer_schedule" finding from a 1000-Hz perf run, only to discover the function was over-sampled by exactly the harmonic and accounted for less than 2% of real CPU. Switching to 999 Hz changed the flamegraph's shape entirely. The discipline costs nothing and prevents days of phantom investigations; never use round-number Hz for a sampling profiler.

Off-CPU profiling and the bpftrace offcputime script

On-CPU sampling cannot see threads that are not running — sleeping, blocked on a mutex, waiting on I/O. For a service with p99 = 1.8s where CPU is at 30%, the time is not on the CPU; it is somewhere blocked. Off-CPU profiling captures this by hooking the kernel's scheduler. Every time a thread is descheduled (sched:sched_switch), an eBPF program records its stack and the time. When the thread is scheduled back, the program records the duration. The output is a flamegraph where width = blocked-time, not on-CPU-time.

The canonical tool is bcc/bpftrace's offcputime.bt. Run offcputime.bt -p <pid> 30 for 30 seconds and you get a flamegraph of "where threads were blocked". For Riya's IPL incident, the on-CPU flamegraph showed parse_subtitle_track; an off-CPU flamegraph would have shown the kafka-consumer-thread blocked on sock_recvmsg (the network stall) — both are real, both contribute to p99, and only the union of the two views explains the incident. Why off-CPU profiling matters for tail latency: a service can have low CPU and high p99 simultaneously, which is the most common shape of latency incidents. The on-CPU profile is uninformative — there's not much CPU to see. The off-CPU profile shows you which lock, which syscall, which queue is the bottleneck. Without it, half of all incident investigations stall at "but CPU is fine, so what's wrong?".

JFR, Pyroscope, and the always-on continuous-profiling pattern

Java Flight Recorder ships with the JVM and runs an in-process ring buffer at typically 1% overhead. Enable it with -XX:StartFlightRecording=duration=0,maxsize=200m,filename=/var/dumps/jfr.jfr,settings=profile. duration=0 means "record forever"; maxsize=200m caps the circular buffer at 200 MB. When an incident hits, jcmd <pid> JFR.dump writes the last 5–10 minutes of every recorded event to a file you can ship off-host. The events include CPU samples (every 20 ms by default), allocation samples, thread-park events, GC phases, lock-contention waits, and ~80 other categories.

The continuous-profiling SaaS tools (Grafana Pyroscope, Polar Signals, Datadog) generalise this pattern across runtimes: they install an agent that runs py-spy, async-profiler, or pprof continuously, ships flamegraphs to a backend at 10-second cadence, and stores them keyed by service / version / pod. Indian-scale production teams (PaisaBridge, SetuStream, ParakhTrade, BhojanBox, KhelKing) running this pattern can answer questions like "what was the flamegraph 5 minutes before the spike?" by querying historical data — which is impossible with on-demand sampling, because by the time you noticed the spike it had already ended.

The cost is roughly ₹15-30 per pod per month for SaaS continuous profilers (volume-priced), or roughly an engineering-week to wire up an open-source Pyroscope deployment internally. For services running ₹50,000+/day in revenue (a PaisaBridge payments core, a ParakhTrade order-match), the ROI is a single avoided 30-minute outage.

uprobes in production: latency histograms without recompiling

uprobes are user-space breakpoints that a kernel-side eBPF program intercepts, runs a tiny handler against, and then resumes the user thread — overhead per uprobe is roughly 600 ns on a modern x86 part. To measure the latency distribution of a function in production:

bpftrace -e '
  uprobe:/usr/bin/python3:_PyEval_EvalFrameDefault { @start[tid] = nsecs; }
  uretprobe:/usr/bin/python3:_PyEval_EvalFrameDefault /@start[tid]/ {
    @lat = hist(nsecs - @start[tid]);
    delete(@start[tid]);
  }'

This gives you a power-of-2 histogram of how long every Python frame takes, across every thread of every Python process on the host, in real time. One caveat: uprobe overhead scales with call frequency, so probing a function called a few thousand times per second costs well under 1%, while probing something as hot as the interpreter's frame-eval loop (as in the example above) can be expensive — treat that one-liner as a demonstration, not a production default. For a service-specific question — "what is the p99.9 of validate_subtitle_track under real traffic?" — the same pattern with --pid and the function's symbol gives a precise answer at negligible cost.

The reader who internalises this pattern will, the first time they reach for it, realise that production code has a measurement surface a hundred times richer than what their unit tests can offer. Most performance bugs Indian production teams fix in a quarter could be diagnosed faster by reaching for uprobes first; the only reason teams do not is that the tooling is unfamiliar. A 30-minute team session writing five bpftrace one-liners against a staging service is enough for the next on-caller to reach for it under pressure.

Reproduce this on your laptop

# Linux:
sudo apt install linux-tools-common linux-tools-generic bpftrace
echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope    # allow py-spy attach
python3 -m venv .venv && source .venv/bin/activate
pip install py-spy
python3 live_sampler_demo.py                            # writes /tmp/live_sample.svg
xdg-open /tmp/live_sample.svg                           # view in browser

# macOS (attaching typically requires root; run the demo under sudo):
python3 -m venv .venv && source .venv/bin/activate
pip install py-spy
sudo python3 live_sampler_demo.py
open /tmp/live_sample.svg

Where this leads next

Live debugging is the observation side of Part 15's production toolkit, the complement to the capture side covered in /wiki/heap-dumps-and-core-dumps. The next chapters walk further into the observation arc:

  • /wiki/flame-graphs-in-production — turning the captured profile into actionable diagnoses, reading wide vs deep boxes, diffing two flamegraphs to find regressions.
  • /wiki/tracepoints-and-dynamic-instrumentation — the full power of bpftrace and bcc: kprobes, uprobes, USDT, BPF maps, building custom one-off observability without redeploying.
  • /wiki/perf-record-and-perf-script-the-survival-kit — the kernel-side sampler that works on any process regardless of language, the foundation underneath bpftrace's profile probe.
  • /wiki/continuous-profiling-in-production — running a sampling profiler always-on across the fleet, querying historical flamegraphs by version and pod, the "what was the system doing 30 minutes ago" superpower.

The arc is: capture (heap/core dumps) → observe (this chapter, sampling) → instrument (eBPF/uprobes) → understand (flamegraph reading). A senior on-caller cycles through all four in a single 90-minute incident; the discipline is knowing which rung to reach for at which point in the diagnostic ladder of /wiki/wall-debugging-live-systems-is-its-own-skill. The chapter you are reading now is the second rung — the one most engineers reach for first because it is the cheapest to attempt and the fastest to inform the next decision.

A practical cultural pattern worth flagging: teams that succeed at live debugging do not learn it during incidents. They run a monthly flamegraph reading club — pick a 30-second py-spy capture from a recent production sample, project it on the wall, and have an engineer narrate the hot path while others ask questions. After six sessions, every engineer on the rotation can read a flamegraph in under a minute and form a hypothesis. After twelve sessions, the team's mean-time-to-hypothesis on real incidents drops noticeably. The ROI is enormous and the cost is one hour per month; teams that resist this practice often discover, the first time a real incident hits, that "everyone knows how to read a flamegraph" was an organisational fiction.

A second cultural pattern: invest one quiet sprint in the sidecar profiler image. Build a container image that ships py-spy, async-profiler, bpftrace, bcc-tools, and perf together, with a tiny runbook script that asks "what runtime is in the target pod?" and dispatches to the right tool. The on-caller during an incident runs kubectl debug -it <pod> --image=internal/sidecar-profiler and gets a pre-loaded shell with every tool ready, instead of apt install-ing during a fire. PaisaBridge's payments SRE team published their sidecar image's README in 2025; the bottom of that document says "the first incident this image saved was 38 minutes of a Big Billion Day spike where the on-caller would otherwise have spent 25 of those minutes installing bpftrace". The unglamorous infrastructure week pays off the first day it is needed.
