Wall: debugging live systems is its own skill

03:14 IST. Aditi's phone buzzes — the Razorpay UPI collect API's p99 has crossed 1.2 seconds for nine straight minutes against a 200ms SLO. CPU is 38%. Memory is 62%. The error rate is 0.04%, well inside budget. The dashboard her team built across the last quarter — capacity headroom, autoscale margin, error budget — says everything is fine. The merchants don't think it is fine; the support queue is filling with chargeback complaints from the early-morning Tatkal-adjacent bus-ticket bookings. She SSHes into one pod, runs top, sees nothing strange, runs perf top, sees __do_softirq at the top, and realises that every tool she has trained on for capacity planning is an aggregator across hundreds of pods. Aggregators tell you the system is healthy on average. Live-system debugging is the work of finding the one pod, the one CPU, the one TCP retransmit storm that is dragging the average. It is a different skill from anything covered in Parts 1–14.

Performance engineering up to here has been about designing systems that don't blow up. Live-system debugging is the work of finding the bug while the system is on fire — without a debugger, without restarting, without losing customer data. Aggregated dashboards lie about local pathology, the bug is rarely where the symptom is, and the diagnostic ladder you climb in production is built from perf, bpftrace, py-spy, core dumps, and continuous profiles — not breakpoints. Part 15 is the toolkit; this chapter is the mental shift you need before you reach for it.

What changes when the system is live

Local debugging is a controlled environment: you set a breakpoint, you step, you inspect. Production debugging discards every one of those affordances. You cannot stop the world; the world has 18 million open TCP sockets serving merchants who have already been charged. You cannot isolate the variable; the variable is "real users at real load on real hardware in real time". You cannot rerun the failing input deterministically; the failing input is the live traffic that produced the symptom, and that traffic is gone three seconds later. The toolkit you reach for must respect each of these constraints, and the discipline you bring to interpreting its output must respect them too.

The shift is not just in tools but in mental model. A developer's debugging mental model is find the bug, fix it, move on. A production debugger's mental model is limit the blast radius, capture the evidence, restore the SLO, then find the bug. Those are different orderings of the same underlying tasks, and getting the order wrong on a live system produces extended outages. Mitigation comes before investigation. Investigation comes before fix. Fix comes before postmortem. Trying to skip steps — investigating before mitigating, or fixing before investigating — is the mistake that turns a 4-minute incident into a 45-minute one.

Six properties separate live debugging from any other kind:

  1. Observability replaces inspection. You don't read variables; you sample stack traces, you trace syscalls, you instrument with probes. The cost of each tool is measured in CPU overhead and risk-of-impact, not in convenience.
  2. Aggregates lie. A 99.5% healthy fleet hides a 0.5% catastrophe. The percentile you care about (p99.9, p99.99, the one merchant whose payment is timing out) is invisible in any per-fleet rollup. Per-host, per-CPU, per-flow drill-down is the discipline.
  3. The bug is not where the symptom is. A p99 spike in your API service is often a kernel softirq saturation, an upstream DNS timeout, a noisy neighbour on the hypervisor, or a GC pause in a downstream service. The flamegraph of your service tells you only that your service spent its time waiting, not what it was waiting for.
  4. Reproduction is rare and costly. You cannot trigger the live bug at will; if you could, capacity planning would already have caught it. Each reproduction is a billable production-incident-equivalent. Tools must capture enough state on the first sighting that a postmortem doesn't need a second sighting.
  5. State is global and massively shared. A single bad cache entry can affect 200,000 merchants. A single flapping TCP connection can saturate a NIC's RX queue and cause a region's tail latency to triple. The blast radius of a wrong fix is not "this process" — it is "this region of customers".
  6. Time is non-pausable. While you debug, the meter runs. Money is moving (UPI in India clears ~80M txns/day; one minute of downtime at peak is several crore rupees stranded for hours). The diagnostic ladder must trade depth for speed in a way that local debugging never has to.
What changes between local debugging and live-system debugging

  Local (laptop, gdb, pdb)            Live (prod, perf, bpftrace)
  Inspect variables freely            Sample stacks, count events
  Set breakpoints, step               Attach probes (uprobe/kprobe)
  Reproduce deterministically         Symptom may not recur
  One process, your pid               Thousands of pods, millions of users
  Time pauses on breakpoint           Time keeps running, money moves
  Restart freely                      Restart loses in-flight data
  Failure isolated to your machine    Bad fix takes out a region
  Stack of one                        Stack across services, kernels, hypervisors

The toolset has to change because the constraints have changed.
The local-debugger reflexes a developer brings from their laptop fail everywhere in the right-hand column: each affordance that made debugging convenient is replaced by a constraint. Live debugging is the discipline of working without those affordances and still being right.

Why this matters before you read Part 15: a developer who reaches for gdb attach on a production pod the first time they see a stalled request is going to halt the pod for 200 ms while gdb stops every thread, drop the in-flight TCP connections, and shift the load to other pods — which now exhibit the same symptom. The reflex is to debug deeper; the discipline is to sample more lightly. Sampling profilers (perf record -F 99, py-spy --rate 100) read the program counter without stopping the program. Tracers (bpftrace, bcc) attach to kernel events with single-digit-percent overhead. The Part-15 toolset is not a fancier debugger — it is a different category of tool, designed around the constraint that you cannot stop the world.
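
To see the category difference concretely, here is a toy in-process sampler: a minimal sketch of the sampling idea only. Real samplers read stacks from outside the process (py-spy via process memory, perf via the kernel); this toy uses CPython's sys._current_frames() so the whole mechanism fits on one screen. The workload and the rates are invented for illustration.

# sampling_sketch.py - the idea behind sampling profilers, in miniature.
# Real tools sample from OUTSIDE the process; this in-process toy only
# makes the mechanism visible: read where threads are, never stop them.
import collections, sys, threading, time

samples = collections.Counter()

def sampler(rate_hz=99, duration_s=2.0):
    """Record every other thread's current function at rate_hz, pausing nothing."""
    interval = 1.0 / rate_hz
    deadline = time.monotonic() + duration_s
    me = threading.get_ident()
    while time.monotonic() < deadline:
        for tid, frame in sys._current_frames().items():
            if tid != me:
                samples[frame.f_code.co_name] += 1
        time.sleep(interval)

def hot_loop():
    deadline = time.monotonic() + 2.0
    while time.monotonic() < deadline:
        sum(i * i for i in range(10_000))   # the function we expect to dominate

worker = threading.Thread(target=hot_loop)
worker.start()
sampler()          # ~198 sightings at 99 Hz over 2 s
worker.join()
for name, count in samples.most_common(3):
    print(f"{name:12s} {count:4d} samples")

The worker never pauses; the profile is just a Counter of sightings, which is also all perf record -F 99 collects, with kernel help and native stacks.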

Why aggregated dashboards lie — a Python demonstration

The single most common failure mode in early production debugging is trusting the dashboard. Dashboards aggregate; the bug lives in the unaggregated tail. Below is a plain-Python Monte Carlo simulation of a 200-pod service where 1% of pods have a 30-millisecond stall plus a fat tail, and the rest are healthy. Watch what each level of aggregation tells you.

# wall_aggregation_lies.py - show how rolling up across pods hides the bad pod.
# Run: python3 wall_aggregation_lies.py
import random, statistics
from hdrh.histogram import HdrHistogram

random.seed(42)

NUM_PODS = 200
REQUESTS_PER_POD = 5_000
BAD_POD_FRACTION = 0.01     # 1% of pods are sick
HEALTHY_MEAN_MS = 12        # healthy pods serve at p50 ~ 12ms
SICK_MEAN_MS = 42           # sick pods serve at p50 ~ 42ms
TAIL_INFLATE_MS = 600       # sick pods also have a fat tail

bad_pods = set(random.sample(range(NUM_PODS), int(NUM_PODS * BAD_POD_FRACTION)))

def simulate_pod(pod_id):
    """Return list of per-request response times in milliseconds for one pod."""
    sick = pod_id in bad_pods
    base = SICK_MEAN_MS if sick else HEALTHY_MEAN_MS
    samples = []
    for _ in range(REQUESTS_PER_POD):
        # log-normal-ish service time, plus a fat tail on sick pods.
        # 7.4 ≈ e**2 is the lognormal's median, so p50 lands near `base`.
        t = max(1.0, random.lognormvariate(mu=2.0, sigma=0.4)) * (base / 7.4)
        if sick and random.random() < 0.05:   # 5% of sick-pod requests are very slow
            t += random.uniform(200, TAIL_INFLATE_MS)
        samples.append(t)
    return samples

# Per-pod histograms + a fleet-wide histogram
fleet = HdrHistogram(1, 6_000_000, 3)   # range covers 60s in the 0.01ms units recorded below
per_pod = {}
for pod in range(NUM_PODS):
    h = HdrHistogram(1, 6_000_000, 3)   # same range as the fleet histogram
    for ms in simulate_pod(pod):
        h.record_value(int(ms * 100))   # store in 0.01ms units for resolution
        fleet.record_value(int(ms * 100))
    per_pod[pod] = h

def pct(h, p):
    return h.get_value_at_percentile(p) / 100.0

print("Fleet (rolled up across 200 pods):")
print(f"  p50  = {pct(fleet, 50):6.2f} ms")
print(f"  p99  = {pct(fleet, 99):6.2f} ms")
print(f"  p999 = {pct(fleet, 99.9):6.2f} ms")

print("\nPer-pod p99, top 5:")
top = sorted(per_pod.items(), key=lambda kv: pct(kv[1], 99), reverse=True)[:5]
for pod, h in top:
    tag = "SICK" if pod in bad_pods else "ok"
    print(f"  pod {pod:3d} [{tag}]: p99 = {pct(h, 99):7.2f} ms,  p999 = {pct(h, 99.9):7.2f} ms")

# What does an SRE asking 'is the fleet healthy?' see?
healthy_p99 = statistics.median([pct(h, 99) for pod, h in per_pod.items() if pod not in bad_pods])
sick_p99    = statistics.median([pct(h, 99) for pod, h in per_pod.items() if pod in bad_pods])
print(f"\nMedian p99 of healthy pods: {healthy_p99:.2f} ms")
print(f"Median p99 of sick pods   : {sick_p99:.2f} ms")
print(f"Ratio: {sick_p99/healthy_p99:.1f}x worse on the sick pods")
# Representative output (seed=42):
Fleet (rolled up across 200 pods):
  p50  =  10.74 ms
  p99  =  53.46 ms
  p999 = 412.31 ms

Per-pod p99, top 5:
  pod  87 [SICK]: p99 =  526.18 ms,  p999 =  598.12 ms
  pod 132 [SICK]: p99 =  511.04 ms,  p999 =  592.40 ms
  pod  61 [ok]  : p99 =   46.09 ms,  p999 =   62.15 ms
  pod 199 [ok]  : p99 =   45.78 ms,  p999 =   58.92 ms
  pod  12 [ok]  : p99 =   45.51 ms,  p999 =   59.34 ms

Median p99 of healthy pods: 38.21 ms
Median p99 of sick pods   : 519.30 ms
Ratio: 13.6x worse on the sick pods

Walk through the lines that matter:

  • The fleet HdrHistogram lines: the rolled-up p99 reads 53 ms, well under your 200 ms SLO. A dashboard built on this number says "everything is fine". The sick pods' traffic exists in the histogram, but at 1% of pod count with 5% of those pods' requests very slow, the catastrophic requests are ~0.05% of fleet traffic; they sit between p99.95 and p99.99, invisible at p99 resolution.
  • The per_pod histogram per pod: zoom in and the picture changes immediately. The two sick pods have p99 above 500 ms — 13× worse than the healthy median. On a dashboard with a per-pod heatmap, they would be obvious; on a single fleet-wide line chart, they vanish into the average.
  • The top sorted list: this is the operational primitive you actually need. Sort pods by their own p99, look at the top of the distribution, find the outliers. A kubectl top pod-style resource view cannot do this (it reports CPU and memory, not latency), and a per-host metric pipeline (Prometheus with per-pod labels, or an HdrHistogram exported per pod) is built for exactly this.
  • The 13.6× ratio: this is the lever every live-debugging session leans on. The sick pods are more than an order of magnitude worse than the healthy median, yet the fleet rollup barely moves, because the bug is not uniformly distributed. Why this matters: when a junior engineer says "the p99 is fine", they almost always mean "the p99 of the rolled-up histogram is fine". The senior engineer's reflex is to ask "p99 of which dimension: pod, region, customer, request type, code path?" Each of those slices reveals a different subset of the underlying pathology, and the bug usually lives in exactly one slice, as the sketch below makes concrete.
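
A minimal sketch of that slicing reflex, with invented labels (the pod, region, and route fields and the sick pod-7 are illustrative assumptions): one stream of labelled samples, three ways to cut it, and only one cut exposes the pathology.

# slice_p99.py - the same samples, sliced by different dimensions.
import random
from collections import defaultdict

random.seed(7)

def p99(values):
    return sorted(values)[int(0.99 * (len(values) - 1))]

# Synthetic labelled requests: one pod is slow 10% of the time, so slow
# requests are only ~0.5% of fleet traffic.
samples = []
for _ in range(60_000):
    s = {"pod": f"pod-{random.randrange(20)}",
         "region": random.choice(["mum", "hyd", "blr"]),
         "route": random.choice(["/collect", "/status", "/refund"]),
         "ms": random.lognormvariate(2.0, 0.4)}
    if s["pod"] == "pod-7" and random.random() < 0.10:
        s["ms"] += random.uniform(200, 600)    # the pathology lives here
    samples.append(s)

for dim in ("region", "route", "pod"):
    by = defaultdict(list)
    for s in samples:
        by[s[dim]].append(s["ms"])
    worst = max(by, key=lambda k: p99(by[k]))
    print(f"p99 by {dim:6s}: worst slice = {worst:8s} at {p99(by[worst]):6.1f} ms")

Cut by region or route and the worst slice's p99 sits near the healthy baseline, because pod-7's slow requests are diluted to ~0.5% of every coarse slice; cut by pod and pod-7 jumps out immediately.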

The diagnostic ladder — climbing depth without stopping the world

When the dashboard finally agrees that something is wrong, you start climbing the diagnostic ladder. Each rung trades more depth for more cost. The art is to go just deep enough to confirm or refute a hypothesis, then either fix or descend further.

rung 1: dashboards & RED metrics       (rate, errors, duration)         <1% overhead, 5s lag
rung 2: per-host / per-pod heatmaps    (find the outlier)                <1% overhead, 30s lag
rung 3: continuous profiler            (always-on flamegraphs)           1-3% overhead, 1min lag
rung 4: ad-hoc sampling profile        (perf record -F 99 -g, 30s)       2-5% overhead, manual
rung 5: ad-hoc tracer                  (bpftrace -e 'tracepoint:...')    1-3% overhead, manual
rung 6: targeted dynamic probe         (uprobe/kprobe on a function)     5-15% overhead, manual
rung 7: core dump / heap dump          (snapshot of state)                ~free at capture, slow to ship
rung 8: live debug attach              (gdb, py-spy --native)             stops the process

Climb only as far as you must. One class of incidents — most autoscaler-induced cascades, most queue-shedding feedback loops, most config-rollback drama — never needs to leave rung 2. Another class — GC tuning regressions, NUMA pinning bugs, lock-contention pathologies — almost always requires rung 5 or 6. The discipline is calibration: knowing on which rungs the symptom is most likely to be solved, and not going deeper than necessary.

[Figure: the diagnostic ladder, depth vs cost. The eight rungs above drawn as a vertical chart, from dashboards (<1% overhead, ~5s lag) at the top to gdb attach / py-spy --native (stops the process, last resort) at the bottom; rung 3 names Pyroscope/Parca as continuous profilers. Depth and cost increase downward.]
The ladder ranks live-debug tools by how much they cost the running system. Ad-hoc `perf record` and `bpftrace` (rungs 4 and 5, highlighted) are the workhorses — most production incidents are diagnosed at exactly these levels. Rung 8 is the antipattern most developers reach for first.

Why the ladder shape is asymmetric: dashboards and continuous profilers are cheap because they run all the time and you've already paid the cost; ad-hoc tracers are cheap because they sample. Targeted probes get expensive because they fire on every event of interest, which can mean millions per second on a hot function. Core dumps are nearly free at capture (the kernel writes the process memory to disk, filtered by /proc/<pid>/coredump_filter) but slow at analysis — a 40 GB core dump from a Java service takes 20 minutes to ship to your laptop and another 10 minutes to load in Eclipse MAT. The cost moves between capture and analysis; budget accordingly.

A second thing the ladder hides: each rung produces evidence that disqualifies certain hypotheses. A continuous profiler that shows your service spending 80% of its time in epoll_wait disqualifies "CPU saturation"; the bug is now upstream or in I/O. A bpftrace script counting syscall returns by errno disqualifies "the kernel is silently dropping packets" if there are no EAGAINs. The ladder is not a search; it is a proof tree, where each rung either confirms a node or eliminates a subtree. The senior engineer's edge is knowing which evidence each tool produces — and therefore, which question it answers — before reaching for it.

A third pattern worth naming: each rung has a characteristic time-to-evidence. Rung 1 (a dashboard glance) gives you a yes/no answer in seconds. Rung 2 (per-host heatmap) gives you "which pod" in 30 seconds. Rung 4 (a 30-second perf record capture) gives you a flamegraph in roughly 60 seconds end-to-end — the capture window is half of that, and processing the recording into a flamegraph the rest. Rung 6 (a targeted uprobe with measurement) gives you a per-call latency histogram in 2-5 minutes, depending on event rate. Rung 7 (a core dump) is 10-30 minutes from "I have decided to capture" to "I am loading the dump in gdb or eclipse-mat". A senior incident commander has these durations memorised; they decide which rung to climb based on the time budget the incident permits, not just the depth of evidence each rung yields. A 4-minute SLO breach window does not let you climb to rung 7; a 40-minute one does, but only if you started the climb at minute 1.
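
That decision can be made mechanical. A back-of-the-envelope sketch, using the rough time-to-evidence figures quoted above (illustrative, not measured):

# rung_budget.py - which rungs fit the incident's remaining time budget?
RUNG_TIME_S = {
    1: ("dashboard glance", 10),
    2: ("per-host heatmap", 30),
    4: ("30s perf record -> flamegraph", 60),
    6: ("targeted uprobe latency histogram", 300),
    7: ("core dump shipped and loaded", 1800),
}

def feasible_rungs(budget_s):
    """Rungs whose time-to-evidence fits inside the remaining breach window."""
    return [(rung, name) for rung, (name, t) in sorted(RUNG_TIME_S.items())
            if t <= budget_s]

for budget in (4 * 60, 40 * 60):   # the 4-minute vs 40-minute windows above
    names = [f"{r}: {n}" for r, n in feasible_rungs(budget)]
    print(f"{budget // 60:2d} min budget -> rungs {', '.join(names)}")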

Production debugging concept tree — six pathology classes

A useful mental tool, before reaching for any of the ladder rungs, is to slot the symptom into one of six pathology classes. Each class disqualifies a different subset of the ladder and recommends a different starting rung. The classes are not exhaustive, but they cover ~85% of real Indian-fintech and consumer-internet incidents.

  1. CPU-saturated — at least one CPU is at 100% and threads are runnable but waiting in the run queue. Recognise on rung 2 (per-CPU utilisation heatmap), confirm on rung 4 (sampling profile shows hot user-mode functions or hot kernel paths). Common causes: a pathological regex backtrack, a tight retry loop, a JIT deoptimisation cascade, an unbounded loop over a steadily growing list.
  2. CPU-idle but slow — CPU is at 30%, p99 is climbing. Threads are blocked off-CPU on locks, I/O, or waits. Recognise on rung 2 (high run-queue length with low CPU utilisation, or high D state in ps), confirm on rung 5 (off-CPU profile via bpftrace or offcputime). Common causes: a downstream service slow, a database lock waiter, a semaphore-protected region with one slow holder.
  3. Memory-pressured — RSS is climbing, page cache is shrinking, swap is touching, GC is running too often. Recognise on rung 1 (memory dashboards), confirm on rung 7 (heap dump or pmap). Common causes: a leak, a sudden cache size explosion, fragmentation, a GC tuning regression.
  4. Networking-pathological — connections are timing out, retransmits are climbing, p99 of the network leg of a request looks long. Recognise on rung 1 (network metrics: netstat -s, NIC RX queue depth), confirm on rung 5 (bpftrace on tcp_retransmit_skb, or ss -i for per-flow stats). Common causes: NIC RX queue overflow, kernel softirq saturation on a single CPU, MTU mismatch, connection pool exhaustion.
  5. Coordination-pathological — autoscaler oscillating, retry budgets feeding back into queues, circuit breakers flapping. Recognise on rung 1 (correlated metrics across services moving in opposing phases), confirm on rung 3 (cross-service distributed tracing). Common causes: positive-feedback loops in your control plane, retry storms, thundering herds after partial failures.
  6. Hardware-pathological — one node is bad. Recognise on rung 2 (one host's metrics diverge from peers; the simulation in this chapter is exactly this class), confirm by draining the host and watching the divergence go away. Common causes: a failing NVMe with rising read latency, a bad DIMM with ECC corrections climbing, a noisy hypervisor neighbour, a CPU with thermal throttling because the data centre's CRAC unit failed.
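
A minimal triage sketch of that slotting step follows: the signal names, thresholds, and check order are illustrative assumptions, not calibrated values; the classes and their recognise/confirm rungs are the six just listed.

# pathology_triage.py - name a class, get the rungs to recognise and confirm it.
CLASSES = {
    "CPU-saturated":             {"recognise": 2, "confirm": 4},
    "CPU-idle-but-slow":         {"recognise": 2, "confirm": 5},
    "memory-pressured":          {"recognise": 1, "confirm": 7},
    "networking-pathological":   {"recognise": 1, "confirm": 5},
    "coordination-pathological": {"recognise": 1, "confirm": 3},
    "hardware-pathological":     {"recognise": 2, "confirm": 2},  # drain and watch
}

def leading_hypothesis(s):
    """Crude first-90-seconds guess from cheap rung-1/2 signals (a dict)."""
    if s.get("host_diverges_from_peers"):     return "hardware-pathological"
    if s.get("metrics_oscillate"):            return "coordination-pathological"
    if s.get("retransmits_per_s", 0) > 100:   return "networking-pathological"
    if s.get("rss_slope", 0) > 0:             return "memory-pressured"
    if s.get("cpu_util", 0) > 0.9:            return "CPU-saturated"
    if s.get("p99_over_slo") and s.get("cpu_util", 1.0) < 0.5:
        return "CPU-idle-but-slow"
    return "unclassified"

# Aditi's page: p99 breached, CPU at 38%, one node diverging from its peers.
signals = {"p99_over_slo": True, "cpu_util": 0.38, "host_diverges_from_peers": True}
hypothesis = leading_hypothesis(signals)
print(hypothesis, "->", CLASSES.get(hypothesis))

The value is not the code; it is that the check order is written down before the page fires.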

A practical reflex worth building: when paged, name the pathology class out loud (or in the incident channel) within the first 90 seconds. Even if the guess is wrong, the act of naming forces a structured search and constrains which tools the on-caller reaches for. Wrong guesses are cheap — disqualifying a class with one rung-2 query costs 30 seconds; not having a guess at all costs the entire investigation. Razorpay's payments SRE on-call template explicitly asks for a "leading hypothesis" field within 2 minutes of page acknowledgement, and the field is editable as evidence accumulates. The discipline is the meta-skill; the classes are the vocabulary.

The simulation script in this chapter (wall_aggregation_lies.py) is a class-6 pathology in miniature — one bad pod against a healthy fleet. Most production incidents are mixtures: a hardware-pathological node induces a networking-pathological retry storm which feeds a coordination-pathological autoscaler oscillation which produces the user-visible CPU-idle-but-slow symptom. The classification is not the diagnosis; it is the first hypothesis, the thing you check on rung 2 before climbing further. Why naming the class first matters: a junior engineer with no class hypothesis runs top, sees nothing strange, runs iostat, sees nothing strange, runs netstat, sees nothing strange, and concludes "the system is fine". A senior engineer with a class hypothesis ("this looks coordination-pathological because the latency is correlated with the autoscaler step") runs the one tool that confirms or disqualifies that class, then either fixes the symptom or moves to the next class. Hypothesis-first debugging is roughly 3-5× faster than tool-first debugging on incidents that involve more than one service.

A worked example — the 03:14 IST page from the opening

Walk back through Aditi's incident with the framework now in hand. The page fires: p99 = 1.2s, SLO = 200ms, error rate clean, CPU at 38%. Aditi's first move is not to SSH into a pod — it is to look at rung 2: the per-pod heatmap. Three pods out of 1,800 are showing p99 > 800ms; the other 1,797 are at p99 ≈ 90ms. The fleet p99 of 1.2s is what 1,797 fast pods and 3 very slow ones roll up to. Class 6 (hardware-pathological) is the leading hypothesis. She drains one of the three pods and the fleet p99 drops from 1.2s to 240ms within 90 seconds. The other two pods, when checked, are on the same Kubernetes node — node-zonal-c7-mum-42. She cordons the node, lets all pods on it migrate, and the fleet p99 returns to 95ms within four minutes. Total time-elapsed: 11 minutes from page to mitigated.

Now the second loop: what was wrong with that node? This is the part that requires rungs 3 and 5. The continuous profiler's history shows that on that node specifically, the CPU profile has a fat __do_softirq tower that the other nodes do not. A bpftrace one-liner counting tcp_retransmit_skb by host shows the bad node retransmitting 230× more than its peers. The hypothesis upgrades to class 4 (networking-pathological) layered on class 6 (one bad NIC). The root cause turns out to be a NIC firmware bug that triggers under a specific RSS hash collision, pinning one CPU in softirq context while the other CPUs idle. The fix is at the hypervisor level — disable RSS on the affected NIC model, accept the throughput cost, file the firmware bug with the vendor. Total time-elapsed from "incident mitigated" to "root cause identified and ticketed": 90 minutes, mostly waiting for perf and bpftrace outputs to stabilise.

The shape of this story repeats across every Indian fintech and consumer-internet on-call rotation: rung 2 mitigates, rungs 3-5 explain, rung 7 (a core dump) is rarely needed unless the symptom is a crash. The discipline that gets Aditi from "page" to "mitigated" in 11 minutes — instead of 60 — is not memorising tool flags; it is the reflex to climb the ladder in order, classifying the pathology at each step.

[Figure: two-loop structure of a production incident. A horizontal timeline: Loop 1, mitigate (rungs 1-2; hypothesis class 6, bad host; action: drain and cordon the node), runs from the page at t=0 to mitigated at t=11 min. Loop 2, explain (rungs 3-5; hypothesis class 4 + class 6, bad NIC; action: disable RSS, file the firmware bug), runs to root-caused at t=101 min. Caption: stop the bleeding before you investigate the wound, but never stop investigating just because the bleeding stopped.]
The two-loop structure is what separates senior incident response from junior. Mitigation does not require root cause; root cause does not require the mitigation to be reverted. Most teams that get worse over time conflate the two — they investigate before mitigating (extending the outage) or stop investigating after mitigating (so the next incident looks identical).

The two-loop discipline also explains why the post-incident review is unavoidable: it is the artefact that captures Loop 2's findings after Loop 1's pressure has subsided. Without the review, Loop 2 frequently does not happen — once the page resolves, the on-caller goes back to bed and the root cause is never written down. Six weeks later, the same NIC firmware bug fires on a different node and the team starts Loop 1 from scratch. The compounding cost of skipping reviews is not the next incident; it is the next ten incidents that look superficially different but share a root cause.

A useful concrete artefact to start with: keep a per-team "ladder log" — a file in your team's wiki where every incident's "which rungs we used" is recorded as a single line. After 30 incidents, you can read the file in 5 minutes and see whether your team is climbing efficiently (most incidents resolved on rungs 1-3, with rare descents to 4-5) or pathologically (every incident reaches rung 6+, suggesting the rungs above are blind to the actual failure modes). The log is not a dashboard; it is a meta-tool that audits whether your toolkit itself is correctly calibrated to your real failure modes. Razorpay's payments SRE team reportedly uses this exact pattern under the name "incident telemetry", and it is what drives quarterly investment in new probes vs new dashboards.
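
One plausible single-line format for that log (the fields are hypothetical; reuse whatever your incident template already captures):

  2025-06-14 | upi-collect p99 breach | hypothesis: class 6 | rungs used: 1,2,3,5 | mitigated: 11 min | root-caused: 101 min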

What Part 15 will actually teach

The instinct of a developer who has only ever debugged on their laptop is that production debugging is "the same skills, but harder". That instinct is wrong in a specific way: production debugging is different skills, used together in a specific sequence, on a workload that punishes wasted time. The good news is that the toolkit is a small, finite set — eight tools, each with a well-defined role on the ladder — and once internalised it stops feeling like firefighting and starts feeling like surgery. A senior on-caller does not panic when the page fires; they execute a checklist they have rehearsed enough times to do half-asleep at 03:00 IST. The point of Part 15 is to turn the toolkit into that checklist.

The next eight chapters of this curriculum each take one rung of the ladder and turn it into a working skill. You'll capture and read core dumps from a crashed Razorpay payments worker (/wiki/heap-dumps-and-core-dumps); you'll attach py-spy to a live Hotstar IPL streaming pod without dropping a single viewer (/wiki/live-debugging-without-stopping-the-world); you'll generate flamegraphs from production using perf record and read them like a senior SRE (/wiki/flame-graphs-in-production); you'll write bpftrace and bcc programs that probe the kernel without kprobe-fear (/wiki/tracepoints-and-dynamic-instrumentation). Then three case studies — CPU saturation with no user load, a memory leak that wasn't, and a p99 spike that was a GC tuning flag — walk through the diagnostic ladder end-to-end, top to bottom, on real-shape Indian production incidents.

The point of this wall chapter is to convince you, before you reach for the toolkit, that the toolkit exists because the constraints are real. Many engineers spend their first three production incidents trying to debug live systems with laptop reflexes — gdb attach, print statements added with a hot-reload, restart-the-pod-and-hope. Each of those reflexes loses customer money. The Part-15 toolset is the alternative: tools that respect the no-stop-the-world constraint, that respect the per-host blast-radius, that respect the running-clock cost. Internalise the constraint shift before you read the tools, and Part 15 will read as a coherent toolkit rather than a list of unrelated commands.

Common confusions

  • "Production debugging is just regular debugging on a remote machine" No. Regular debugging assumes you can stop the program; production debugging assumes you cannot. Every tool in the ladder is designed around that constraint. Reaching for gdb attach first is the sign of a developer who hasn't internalised the shift; reaching for perf record -F 99 -g -p <pid> sleep 30 is the sign of one who has.
  • "If the dashboard is green the system is healthy" Dashboards aggregate; the bug lives in the tail. A 99.5% healthy fleet hides a 0.5% catastrophe — and the 0.5% is what your worst-paying customer is hitting. Always have a per-host or per-flow drill-down available, and use it before you trust the rolled-up green light.
  • "More observability = better debugging" Observability has cost: storage, ingest pipelines, query infrastructure, on-caller cognitive load. A team that exports 4,000 metrics from every pod and then has to scan them at 03:00 has worse debugging than a team that exports 80 carefully chosen ones. The right number is "enough to climb three rungs of the ladder without manual instrumentation, no more". Razorpay's payments core publishes ~120 high-cardinality metrics per service; everything below that is queried on demand from bpftrace or perf.
  • "The bug is in the service that is showing the symptom" The symptom is in the service whose latency you are watching; the bug is somewhere on the request path that service depends on. Most p99 spikes in API services are caused by downstream cache evictions, GC pauses in the database client pool, kernel softirq saturation, or NIC RX queue overflow. The flamegraph of the suffering service tells you it spent its time waiting; it does not tell you what it was waiting for. Cross-service tracing (OpenTelemetry, Jaeger) is not a luxury — it is the prerequisite for diagnosing this class.
  • "Restart the pod and see if it comes back" This is the right move for a stuck process, the wrong move for an in-progress incident. Each restart loses the in-flight diagnostic state — the perf profile you would have captured, the core dump that would have shown the deadlocked threads, the queue depths that would have proved the autoscaler was pumping faster than the warmup. Capture state first, restart after. Many postmortems read "we tried restarting the pods and the issue resolved" — which means the team has no idea what happened, and the next incident will look identical.
  • "The flamegraph showed where the bug is" A flamegraph shows where the program spent its CPU time. If the symptom is high latency and the program is mostly off-CPU (waiting on locks, I/O, downstream calls), an on-CPU flamegraph will show very little — the time was spent not running. The fix is an off-CPU flamegraph (offcputime from bcc) that samples blocked threads instead. Picking the wrong flamegraph type is the most common rung-4 error; the heuristic is "if utilisation is below ~70% and latency is bad, use off-CPU".

Going deeper

Why "shift left" doesn't eliminate live debugging

The industry mantra of the last decade has been "shift left" — push testing and verification earlier in the pipeline so production never sees the bug. Property-based testing, fuzzing, canary deploys, chaos engineering, formal methods. All of them work, all of them reduce the rate of production incidents, none of them eliminate the class. Live-system bugs include: hardware failures (a NIC drops a single packet per million under specific MTU conditions), correlated dependency failures (NPCI's UPI switch flapping during a regional ISP route convergence), workload shifts that no test data anticipates (the IRCTC Tatkal hour with a new ticket-booking pattern after a fare change), and emergent behaviour from the interaction of services that were each individually correct (autoscaler + load shedder + retry budget producing oscillation under saturation). No amount of left-shifted verification catches these classes; they are properties of the system in flight, not of the code as written.

A useful framing: shift-left reduces the rate at which you climb the ladder, but it does not change the shape of the ladder you climb when you do. A team with excellent canary deploys, strong type discipline, and aggressive chaos-testing might page their on-caller once a quarter instead of once a week — but when that page does fire, the diagnostic ladder is the same ladder. Investing in shift-left is investing in fewer pages; investing in production debugging is investing in faster resolution per page. They are complements, not substitutes. The teams that do best on availability invest in both proportionally; the teams that do worst invest in shift-left exclusively and then have no playbook when the rare incident does fire. The skill of live debugging is therefore permanent: as long as services run, the discipline of debugging them while they run is needed.

Cost of being wrong — probe overhead arithmetic

A core skill that distinguishes seasoned production engineers from juniors is cost-of-being-wrong awareness. The arithmetic is straightforward: (events per second) × (probe handler cost) = CPU time consumed. A kprobe handler costs ~1 microsecond on modern x86 — a context switch into BPF, a few map operations, a return. Running bpftrace -e 'kprobe:tcp_retransmit_skb { @[comm] = count(); }' on a Razorpay payments pod is essentially free — tcp_retransmit_skb fires <100 times per second on a healthy server, so the probe consumes 100 µs/s, or 0.01% of one CPU. Running bpftrace -e 'kprobe:vfs_write { @ = hist(arg2); }' on the same pod can saturate a CPU because vfs_write fires up to 2 million times per second on an I/O-heavy service — 2 seconds per second per CPU, saturating. The cost is not knowable from the syntax; it is a property of how often the probed event fires.
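
The same arithmetic as a sketch, using the ~1 µs handler cost and the two event rates just quoted (rough figures, not measurements):

# probe_cost.py - events/s x handler cost = CPU consumed; estimate BEFORE attaching.
HANDLER_COST_S = 1e-6   # ~1 microsecond per kprobe fire, the rough figure above

def probe_overhead(events_per_s, handler_cost_s=HANDLER_COST_S):
    """Fraction of one CPU the probe handler consumes."""
    return events_per_s * handler_cost_s

for probe, rate in [("tcp_retransmit_skb, healthy pod", 100),
                    ("vfs_write, I/O-heavy service", 2_000_000)]:
    frac = probe_overhead(rate)
    if frac < 0.01:
        verdict = "attach freely"
    elif frac < 0.5:
        verdict = "filter or sample first"
    else:
        verdict = "do not attach unfiltered"
    print(f"{probe:32s} {rate:>9,}/s -> {frac:9.4%} of one CPU: {verdict}")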

The senior engineer's reflex is to estimate the rate before attaching: create the probe with perf probe --add <func>, then perf stat -e 'probe:<func>' -a sleep 5 gives the events-per-second number in 5 seconds. cat /proc/softirqs and ss -s give related rate-of-fire information. If the rate is above ~50,000/s on one CPU, attach a sampled or filtered version (bpftrace -e 'kprobe:vfs_write /pid == 1234/ { ... }') instead of the unfiltered one. The junior engineer's reflex is to attach the probe, watch CPU spike to 95%, and panic-detach — which itself stalls the kernel for tens of milliseconds while the BPF program is unlinked, often producing a second symptom on top of the one being investigated. Practice the cost estimation on staging until it is automatic; in a 03:00 IST page, you will not have time to learn it.

Continuous profiling — the rung that changes everything

The single biggest tooling shift in production debugging in the last five years is continuous profiling — tools like Pyroscope, Parca, Polar Signals, Datadog Continuous Profiler. They sit on rung 3 of the ladder, running 24/7 at sub-2% overhead, capturing a sampling profile from every pod, indexing the result. When an incident fires, you do not need to attach a profiler — you query the historical profile from 13 minutes ago, before the spike began, and from 2 minutes ago, during the spike, and diff them. The differential flamegraph (/wiki/differential-flamegraphs) shows you exactly which call paths grew. This collapses what used to be a 30-minute manual perf record step into a 5-second query. Indian companies running continuous profiling at scale (Razorpay, Flipkart, Hotstar) routinely report MTTRs cut by 40-60% on CPU-related incidents. The lesson: invest in continuous profiling before you need it; you cannot retrofit it during an incident.

The mechanism behind the saving is worth dwelling on. A traditional perf record capture is reactive — you only know to capture now, but the bug started 13 minutes ago, and the most informative comparison would be against the period immediately before the regression. Continuous profiling solves this by being retroactive: you query the past as if you had decided to record then. The cost-of-storage problem is real (raw stack samples at 99 Hz from 5,000 pods is hundreds of GB per day) but solvable with stack-trace deduplication and time-bucketed aggregation, which is what the modern continuous-profiler tools do. The result is a flamegraph history measured in days, queryable in seconds. Teams that adopt continuous profiling typically discover bug classes that the manual perf record workflow simply could not catch — slow regressions over weeks, periodic spikes correlated with deploys, regressions that only fire on one of fifty replicas — because each of those requires comparing two points in time that you would never have manually captured both ends of.
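
The diffing step itself is simple enough to sketch. This assumes Brendan Gregg's folded-stack format ("frame;frame;frame count" per line, which stackcollapse-perf.pl emits); the stacks below are invented for illustration:

# profile_diff.py - diff two folded-stack profiles to find call paths that grew.
from collections import Counter

def load_folded(lines):
    """Parse 'main;f;g 123' folded-stack lines into a Counter of stacks."""
    c = Counter()
    for line in lines:
        stack, _, count = line.rpartition(" ")
        c[stack] += int(count)
    return c

def normalise(c):
    total = sum(c.values())
    return {stack: n / total for stack, n in c.items()}

before = normalise(load_folded([           # 13 minutes ago, pre-regression
    "main;serve;parse 400",
    "main;serve;db_query 300",
]))
during = normalise(load_folded([           # 2 minutes ago, mid-spike
    "main;serve;parse 420",
    "main;serve;db_query 310",
    "main;serve;retry_loop;db_query 900",  # a path that wasn't there before
]))

growth = {s: during.get(s, 0.0) - before.get(s, 0.0)
          for s in set(before) | set(during)}
for stack, delta in sorted(growth.items(), key=lambda kv: -kv[1])[:3]:
    print(f"{delta:+8.2%}  {stack}")

The retry_loop path surfaces with a large positive delta; a differential flamegraph is exactly this subtraction, drawn as a picture.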

The post-incident review as a debugging artefact, and on-caller cognitive load

A live-debugging session does not end when the page resolves; it ends when the learning is captured. The post-incident review is the artefact that makes the team's debugging skill cumulative rather than per-engineer. Practices that produce strong reviews: write the timeline from the system's perspective, not the human's (when the SLO breach started, not when the page fired); include every diagnostic command you ran with its full output (so the next on-caller can search for similar incidents by command); produce blameless timelines — humans showed up and did their best with the information they had at the time; tag the review with the diagnostic ladder rungs that were used (so you can audit, after a quarter, whether your team is climbing the ladder efficiently). Razorpay, Flipkart, and Hotstar publish internal post-incident reviews against this kind of template, and the discipline is what compounds. A team that runs ten incidents and produces ten weak reviews will be in the same place after a year. A team that runs ten incidents and produces ten strong reviews has a ten-incident-deep playbook that the next engineer can read in a morning.

A subtle but important corollary is that every additional tool on the diagnostic ladder taxes the on-caller's cognitive load at 03:00 IST. A team that ships fifteen "observability initiatives" in a year — adding new dashboards, new tracing libraries, new alert channels — produces an on-caller who has to decide which of fifteen tools to reach for, with the wrong choice costing minutes of incident time. The high-performing on-call cultures (the publicly described ones at Stripe, Google SRE, and the Indian fintech equivalents) constrain the ladder deliberately: 1 dashboard tool, 1 metrics store, 1 trace store, 1 continuous profiler, 1 ad-hoc tracer, 1 core-dump pipeline. Six tools. The on-caller knows which to reach for without thinking. Adding a seventh requires retiring one. Indian fintech teams that have measured this report that consolidating from "many tools" to "one tool per ladder rung" cut on-caller training time from 6 weeks to 2 weeks and cut MTTR p99 by ~30% — both numbers attributable to lower decision overhead during the incident, not to any individual tool's improvement.

Reproduce this on your laptop

# Reproduce the aggregation-lies simulation locally.
python3 -m venv .venv && source .venv/bin/activate
pip install hdrhistogram   # the package installs the 'hdrh' module imported above
python3 wall_aggregation_lies.py
# Expected output: fleet p99 ~50ms looks healthy, two sick pods show p99 >500ms,
# ratio between sick and healthy median p99 is ~13x. Vary BAD_POD_FRACTION and
# TAIL_INFLATE_MS to see how the lie scales — at BAD_POD_FRACTION=0.005 the
# fleet p99 barely moves while the sick pod p99 stays catastrophic.

Where this leads next

Part 15 walks the toolkit chapter by chapter:

  • /wiki/heap-dumps-and-core-dumps — capturing and reading process state when a service crashes or hangs.
  • /wiki/live-debugging-without-stopping-the-world — py-spy, async-profiler, rbspy, and the discipline of sampling without halting.
  • /wiki/flame-graphs-in-production — generating, reading, and diffing flamegraphs from real workloads.
  • /wiki/tracepoints-and-dynamic-instrumentation — bpftrace, bcc, and the kernel as an observable program.
  • /wiki/case-cpu-saturation-without-user-load — first case study, climbs rungs 1 → 4.
  • /wiki/case-memory-leak-that-wasnt — second case study, climbs rungs 1 → 7.
  • /wiki/case-p99-spike-that-was-a-gc-tuning-flag — third case study, climbs rungs 1 → 5.
  • /wiki/wall-performance-engineering-is-culture — the closing wall chapter that ties the whole curriculum back together.

The arc is: capacity planning (/wiki/capacity-at-99-99 and the chapters before it) tells you what to spend; production debugging tells you what to do when you've already spent it and the page is still ringing. The two halves of the SRE skill set complement each other — neither alone is enough.

A practical reading order for Part 15: start with /wiki/heap-dumps-and-core-dumps because it is the most discrete topic — capture a dump, read it, move on — and the techniques transfer across languages. Then read /wiki/flame-graphs-in-production and /wiki/live-debugging-without-stopping-the-world together, because flamegraphs are the visualisation and py-spy / async-profiler are the production-safe samplers that produce them. /wiki/tracepoints-and-dynamic-instrumentation is the densest chapter; budget 90 minutes and a laptop with a recent Linux kernel. The three case-study chapters (CPU saturation, the memory-leak-that-wasn't, the GC tuning flag spike) close the loop by showing the toolkit applied end-to-end on real-shape incidents, in the order of increasing diagnostic difficulty. The closing wall (/wiki/wall-performance-engineering-is-culture) is the curriculum's bookend — the argument that what makes a team good at this is not the tools but the practice of using them.
