CPU, heap, lock profiles in prod: three different lies you can run continuously
It is 02:14 IST. Karan, an SRE at a Bengaluru fintech, is staring at three browser tabs. The first, a CPU flamegraph from Pyroscope, says the payments-api is 38% in json.dumps — surprising but plausible. The second, a heap profile from tracemalloc, says the same service has held a steady 480 MB in cachetools.LRUCache._Link for the last six hours and the curve is flat — suspicious because the cache was supposed to top out at 200 MB. The third, a lock-contention profile dumped via py-spy --threads, says 71% of all wall time is spent waiting on _thread.lock inside logging.Handler.handle — a number so large it cannot be ignored. All three profiles are correct. They are answering three different questions and each is hiding a different blind spot. Karan's job, in the next twenty minutes, is to figure out which of the three is pointing at the bug — and not be misled by the other two.
Continuous profilers in production ship three orthogonal profile types: CPU (where cycles burn), heap (where memory accumulates), and lock/contention (where threads wait). Each samples a different signal, costs different overhead, and has a different blind spot — CPU misses off-CPU stalls, heap misses short-lived allocations, lock profiles miss everything below the runtime's mutex layer. You run all three because they cannot substitute for each other; you keep them all under 1% per machine because the production fleet veto is real.
Three profiles, three questions, three blind spots
A CPU profile answers "where are my CPUs spending cycles right now?" by sampling the running thread's stack at a fixed interval (usually 99 Hz for perf-style, 100 Hz for Pyroscope's default, 10 Hz for some always-on agents). Threads that are alive but not running — blocked on a syscall, parked on a futex, waiting on a network read — produce zero samples while they are off-CPU. So a service that spends 95% of wall time blocked on Postgres and 5% of wall time formatting JSON will show a CPU profile that is 100% JSON formatting. This is not a bug in the profiler. It is the signal the CPU profile is designed to surface, and it is the most common reason engineers misread their first flamegraph: the profile does not show "where time goes". It shows "where CPU goes when CPU is being spent".
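You can feel this blind spot without a profiler by comparing CPU time against wall time for a handler that is mostly blocked. The sketch below is illustrative only: a sleep stands in for the Postgres round-trip and the payload is invented, but the shape is the point. The CPU/wall ratio comes out in the low single digits, and a CPU sampler only ever fires inside that sliver.
# cpu_vs_wall.py: why a mostly-blocked service profiles as "100% JSON formatting"
import json, time

def handle_request():
    time.sleep(0.095)                     # ~95 ms "blocked on Postgres" (no CPU samples here)
    return json.dumps({"order": "x" * 512, "items": list(range(200))})  # a few ms of CPU

wall0, cpu0 = time.perf_counter(), time.process_time()
for _ in range(30):
    handle_request()
wall, cpu = time.perf_counter() - wall0, time.process_time() - cpu0
print(f"wall {wall:5.2f}s   cpu {cpu:5.2f}s   cpu/wall {cpu/wall:.1%}")
# cpu/wall lands in the low single digits: the CPU profiler samples only during
# that sliver, so its flamegraph is ~100% json.dumps even though the service
# spends ~95% of wall time waiting.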
A heap profile answers "what is currently on the heap, and which call sites allocated it?" Two related but different signals come out of this: live heap (what is currently allocated and reachable) and allocation rate (bytes allocated per second by call site, including objects that have already been freed). Live-heap profiles are sampled at GC time or by walking the heap on a schedule; allocation-rate profiles are sampled by intercepting the allocator (Go's runtime.MemProfileRate, Python's tracemalloc, the JVM's allocation-event sampler). The two signals diverge violently: a service that allocates 80 GB/sec of short-lived strings and frees them just as fast has a tiny live-heap profile and a huge allocation-rate profile. Most "memory leak" hunts need live-heap; most "GC pressure" hunts need allocation-rate. Confusing them is the second-most-common rookie mistake.
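A minimal way to watch the two signals diverge, using nothing beyond the stdlib (the byte counts are arbitrary): run a churn loop and a retention list under the same tracemalloc session and compare what each leaves behind.
# churn_vs_leak.py: allocation churn vs a leak, seen through live-heap eyes
import tracemalloc

tracemalloc.start()

# churn: allocate tens of MB of short-lived strings, each dropped on the next iteration
for i in range(100_000):
    s = ("payload-%d" % i) * 100
churn_live, churn_peak = tracemalloc.get_traced_memory()

# leak: retain ~10k similar strings in a list that never gets cleared
retained = [("payload-%d" % i) * 100 for i in range(10_000)]
leak_live, _ = tracemalloc.get_traced_memory()

print(f"after churn: live {churn_live/1e6:5.1f} MB (peak {churn_peak/1e6:5.1f} MB)")
print(f"after leak : live {leak_live/1e6:5.1f} MB")
# The churn loop allocated far more bytes in total than the leak, yet its live
# footprint is near zero; only an allocation-rate view would flag it. The leak
# is the mirror image: modest rate, steadily growing live heap.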
A lock-contention profile answers "which threads are waiting on which locks, for how long, called from where?" Languages with first-class threads (Go, Java, C#) ship this natively (runtime.SetMutexProfileFraction, JFR's MonitorWait event, .NET's Microsoft-Windows-DotNETRuntime/Contention). Python is the awkward case — the GIL itself is the dominant lock, and most third-party lock primitives wrap it, so a Python "lock profile" is mostly a GIL-contention profile with a thin layer of threading.Lock on top. The signal is sharper than people expect: a service with a noisy 99th-percentile latency that the CPU profile cannot explain almost always has a contention profile that points at the answer in seconds.
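CPython has no runtime-native mutex profile, so per-call-site wait times come either from sampling thread state from outside (py-spy, later in this chapter) or from wrapping the locks you own. The sketch below is the wrap-it-yourself version, a stand-in for what Go's SetMutexProfileFraction or JFR's contention events give you natively; ProfiledLock and wait_by_site are invented names for illustration.
# lock_wait_sketch.py: record how long each call site waits to acquire a lock
import sys, threading, time, collections

wait_by_site = collections.defaultdict(float)   # "file:line" -> seconds spent waiting
_sites_guard = threading.Lock()                 # the counter dict needs its own guard

class ProfiledLock:
    def __init__(self):
        self._lock = threading.Lock()
    def __enter__(self):
        caller = sys._getframe(1)               # the frame that wrote `with lock:`
        site = f"{caller.f_code.co_filename}:{caller.f_lineno}"
        t0 = time.perf_counter()
        self._lock.acquire()
        waited = time.perf_counter() - t0
        with _sites_guard:
            wait_by_site[site] += waited
        return self
    def __exit__(self, *exc):
        self._lock.release()

shared = ProfiledLock()                         # stands in for e.g. a logging handler lock

def worker():
    for _ in range(50):
        with shared:
            time.sleep(0.002)                   # hold the lock; every other thread queues

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads: t.start()
for t in threads: t.join()

for site, sec in sorted(wait_by_site.items(), key=lambda kv: -kv[1]):
    print(f"{sec:6.2f}s waited at {site}")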
Why "all three or none" is the right mental model: each profile's blind spot is precisely where the other two are sharpest. CPU is blind to off-CPU waits, which are exactly what the lock profile sees. Heap is blind to short-lived allocations, which the CPU profile catches as time spent in the allocator. Lock is blind to non-runtime contention (kernel, DB rows), which the CPU profile catches as time stalled in the syscall. Picking only one is picking the lie that hurts you most. Continuous profiling means running all three at sub-1% per machine, not picking the cheapest one.
The cost of running all three together is, perhaps surprisingly, dominated by the heap profile when configured naively. Go's default MemProfileRate=524288 (512 KB) is cheap, around 0.3% overhead for typical workloads, but MemProfileRate=1 (record every allocation) routinely hits 30%+ and breaks production. Python's tracemalloc is more expensive — fully on with deep stacks it can double allocation cost — but tracemalloc.start(8) (record 8-frame stacks per allocation) keeps overhead in the mid-single-digit percent range on most services. The JVM's execution sampler stays below 1% with Java Flight Recorder's default profile settings; add -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints so samples inside JIT'd code attribute to the right lines. Lock profiles are cheapest by far when sampled — Go's SetMutexProfileFraction(100) records 1 in 100 contention events at <0.5% overhead — but setting the fraction to 1 (record every event) during heavy contention can cost 5%+. The configuration knobs matter.
Running all three on a Python service, with real overhead numbers
Python is the hardest language to profile in production because its three profilers each have a different blind spot and you usually need at least two of them. The CPU profile comes from py-spy (a sampling profiler that reads stacks via process_vm_readv, no in-process agent required), the heap profile comes from tracemalloc (Python's stdlib allocation tracker, requires program cooperation), and the lock/wait picture comes from repeated py-spy dump snapshots of per-thread state (CPython has no native lock profiler). There is no single tool that gives you all three. The script below runs them concurrently on a synthetic service that has all three pathologies — a CPU hot path, a slow leak, and logging-lock contention — and measures the actual overhead each profiler imposes.
# three_profiles.py — emit a synthetic service that exhibits all three pathologies,
# attach CPU/heap/lock profilers, and measure overhead. Indian-fintech context:
# this is the shape of a checkout-API at Razorpay: RSA verify on each request,
# a misconfigured cache that grows without bound, and a logging handler that
# serialises across threads through a single file descriptor.
# pip install py-spy psutil
import threading, time, hashlib, json, logging, os, sys, tracemalloc, subprocess
from collections import OrderedDict
# --- pathology 1: CPU hot path (RSA verify proxy) ---
def verify_signature(payload: bytes) -> str:
"""Stand-in for the real RSA verify — does CPU work proportional to size."""
h = hashlib.sha256(payload).hexdigest()
for _ in range(2000):
h = hashlib.sha256(h.encode()).hexdigest()
return h
# --- pathology 2: heap leak (LRU that does not actually evict) ---
class BrokenLRU:
"""Looks like an LRU. Capacity check is wrong. Grows forever."""
def __init__(self, cap: int):
self.cap = cap
self.d = OrderedDict()
def put(self, k: str, v: bytes):
# bug: we evict only when len > cap*100 (typo from cap), so cache
# grows 100x its intended bound before noticing.
self.d[k] = v
if len(self.d) > self.cap * 100:
self.d.popitem(last=False)
def get(self, k: str): return self.d.get(k)
cache = BrokenLRU(cap=200)
# --- pathology 3: logging-lock contention (single FileHandler across N threads) ---
log = logging.getLogger("checkout")
log.setLevel(logging.INFO)
h = logging.FileHandler("/tmp/checkout.log") # default Handler.handle takes a lock
h.setFormatter(logging.Formatter("%(asctime)s %(threadName)s %(message)s"))
log.addHandler(h)
# --- the synthetic workload ---
def worker(worker_id: int, n_requests: int):
for i in range(n_requests):
payload = (f"order-{worker_id}-{i}-" + "x" * 256).encode()
sig = verify_signature(payload)
cache.put(f"sig:{worker_id}:{i}", payload) # leaks
log.info(f"verified order {worker_id}-{i}") # contends
time.sleep(0.001)
def run_workload(n_workers=8, n_requests=2000):
threads = [threading.Thread(target=worker, args=(w, n_requests),
name=f"w{w}") for w in range(n_workers)]
t0 = time.perf_counter()
for t in threads: t.start()
for t in threads: t.join()
return time.perf_counter() - t0
if __name__ == "__main__":
# 1) baseline — no profiler
base = run_workload()
print(f"baseline (no profiler): {base:.2f}s")
# 2) tracemalloc on (heap profile, 8-frame stacks)
tracemalloc.start(8)
t = run_workload()
snap = tracemalloc.take_snapshot()
top = snap.statistics("lineno")[:5]
print(f"with tracemalloc(8): {t:.2f}s overhead {(t/base-1)*100:.1f}%")
print(" top heap allocators:")
for s in top:
print(f" {s.size/1024:7.1f} KB {s.traceback[0]}")
tracemalloc.stop()
# 3) py-spy attached out-of-process for CPU profile (no in-process cost)
pid = os.getpid()
p = subprocess.Popen(
["py-spy", "record", "-o", "/tmp/cpu.svg", "-d", "10", "-r", "100",
"-p", str(pid), "--nonblocking"],
stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
t = run_workload()
p.wait()
print(f"with py-spy CPU (100Hz): {t:.2f}s overhead {(t/base-1)*100:.1f}%")
# 4) py-spy --threads dump for lock/wait profile
subprocess.run(["py-spy", "dump", "-p", str(pid)],
stdout=open("/tmp/threads.txt", "w"))
print("thread dump written to /tmp/threads.txt — grep for 'lock' or 'acquire'")
Sample run on a 4-core M2 laptop, Python 3.11:
baseline (no profiler): 14.83s
with tracemalloc(8): 15.91s overhead 7.3%
top heap allocators:
18432.4 KB three_profiles.py:24
2148.1 KB three_profiles.py:43
612.7 KB three_profiles.py:18
311.0 KB /usr/lib/python3.11/logging/__init__.py:432
168.4 KB /usr/lib/python3.11/logging/__init__.py:1086
with py-spy CPU (100Hz): 14.96s overhead 0.9%
thread dump written to /tmp/threads.txt — grep for 'lock' or 'acquire'
tracemalloc.start(8) — record at most 8 stack frames per allocation. The cost is roughly linear in frame count: tracemalloc.start(1) is ~3% on this workload, tracemalloc.start(25) is ~14%. The 7.3% you see at frames=8 is in the realistic production band. Critically, tracemalloc traces every allocation (Python has no Go-style MemProfileRate sampling); the cost is paid as a C-level hook on every PyObject allocation while tracing is on.
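If you want the frame-count cost curve for your own service rather than this synthetic one, the same measurement is a dozen lines. The workload below is a placeholder; substitute a representative slice of your request path.
# frame_cost_sketch.py: measure tracemalloc frame-count overhead on your own workload
import time, tracemalloc

def workload():
    # placeholder allocation-heavy work; replace with a representative request
    return [{"k": i, "v": "x" * 64} for i in range(200_000)]

def timed() -> float:
    t0 = time.perf_counter()
    workload()
    return time.perf_counter() - t0

base = timed()                              # baseline, no tracing
for frames in (1, 8, 25):
    tracemalloc.start(frames)
    t = timed()
    tracemalloc.stop()
    print(f"nframes={frames:2d}: {(t / base - 1) * 100:5.1f}% overhead vs baseline")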
py-spy record -d 10 -r 100 -p $PID --nonblocking — sample stacks at 100 Hz from outside the process via process_vm_readv. Because py-spy lives in a separate process and reads the target's memory directly (and --nonblocking tells it never to pause the target to get a consistent read), the in-process overhead is effectively zero. The 0.9% overhead you see is measurement noise on this workload — py-spy's actual contribution is below the noise floor.
py-spy dump — one-shot snapshot of every thread's stack with its current state (running, sleeping, GIL-blocked). It is not a continuous profile — you call it on demand — but stacking three or four snapshots taken seconds apart during contention gives you a poor-engineer's lock profile that catches the most common GIL-and-Lock pathologies.
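A few lines of glue turn those on-demand snapshots into the aggregated view. The sketch below is an assumption-level helper: it expects py-spy on PATH, takes the target PID as an argument, and greps for lock-ish frame names rather than parsing py-spy's output properly.
# stack_dumps.py: poor man's lock profile from repeated py-spy dumps
# usage: python3 stack_dumps.py <pid>   (may need sudo / CAP_SYS_PTRACE)
import collections, subprocess, sys, time

pid = sys.argv[1]
hits = collections.Counter()
for _ in range(4):                                    # four snapshots, 2 s apart
    out = subprocess.run(["py-spy", "dump", "-p", pid],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        if any(w in line.lower() for w in ("acquire", "lock", "wait")):
            hits[line.strip()] += 1
    time.sleep(2)

for frame, n in hits.most_common(10):
    print(f"{n}/4 snapshots: {frame}")
# A frame that recurs in most snapshots across most threads (say,
# logging/__init__.py handle -> acquire) is where the wall time is going.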
The tracemalloc output above is doing real work: line 24 (the cache.put site that stores the 256-byte payload) is the leak source, accumulating 18 MB in the test run. The corresponding flamegraph from py-spy (/tmp/cpu.svg, not shown) is dominated by verify_signature — the SHA-256 inner loop — and shows almost no time in cache.put because the leak is cheap to create, expensive to retain. This is the canonical "CPU and heap disagree, both are right" situation: the CPU profile points at the hot path; the heap profile points at the leak; you need both to see the full picture.
Why the overhead numbers matter for the production decision: a 7.3% overhead from tracemalloc(8) is fine for SRE incident response (turn on, query, turn off) but unacceptable for always-on continuous profiling at fleet scale. The production-realistic always-on configuration is tracemalloc(1) (single-frame, ~2-3% overhead) plus py-spy continuous CPU sampling at 100 Hz (~0.5%) plus on-demand py-spy dump for thread states. The JVM equivalent (JFR started with -XX:StartFlightRecording using the profile settings) keeps all three under 1% combined because the JVM has dedicated sampling infrastructure Python lacks. This is one of several reasons why JVM-shop SREs find Python services harder to run at scale: the amount of profiling you can afford to leave always-on is genuinely smaller in Python.
When the three profiles disagree, the disagreement is the signal
The most useful production skill is reading three profiles together and noticing when they tell different stories. Each disagreement points at a specific pathology shape, and recognising the shape cuts incident time from hours to minutes.
CPU low, lock high → contention, not work. A service whose CPU profile is bored (5% utilisation) but whose lock profile shows 70% wall time on a single mutex is doing nothing useful — it is queuing. This is the Zerodha-Kite at 09:15 IST scenario: market-open generates a thundering herd of order requests, every request takes the same OrderBook lock for the price-time-priority insertion, and 90% of threads spend 50ms each in OrderBook.lock.acquire while 10% do actual matching. The CPU profile says "we have spare capacity"; the lock profile says "no, we are throughput-bound on this single critical section". The fix is either fine-grained per-symbol locks or lock-free data structures; the diagnostic is reading the two profiles side by side.
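The shape of the fine-grained-locks fix, in this chapter's Python terms (StripedOrderBook and its methods are invented for illustration, not any real matching engine):
# lock_striping.py: one lock per symbol instead of one global OrderBook lock,
# so independent symbols stop queueing behind each other
import threading
from collections import defaultdict

class StripedOrderBook:
    def __init__(self):
        self._locks = defaultdict(threading.Lock)   # one lock per symbol
        self._locks_guard = threading.Lock()        # protects the lock dict itself
        self._books = defaultdict(list)

    def _lock_for(self, symbol: str) -> threading.Lock:
        with self._locks_guard:                     # short critical section
            return self._locks[symbol]

    def insert(self, symbol: str, order: dict):
        with self._lock_for(symbol):                # contention only within one symbol
            self._books[symbol].append(order)       # price-time-priority insert elided

book = StripedOrderBook()
book.insert("RELIANCE", {"px": 2931.5, "qty": 10})
book.insert("INFY", {"px": 1562.0, "qty": 5})       # does not wait on RELIANCE's lock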
CPU high in allocator (malloc, runtime.mallocgc, PyObject_Malloc), heap allocation-rate high, live-heap flat → GC pressure, not memory leak. A service eating 30% of its CPU in runtime.mallocgc with an allocation-rate profile that says "you allocate 80 GB/sec across these 4 call sites" but whose live heap is steady at 2 GB has no leak — it has a churning workload that is starving its GC. The fix is reducing allocation rate (object pooling, slice reuse, bytes.Buffer.Reset); the diagnostic is both the CPU profile (showing time in the allocator) and the allocation-rate heap profile (showing where bytes are coming from). Looking at only the live-heap profile would tell you "no leak" and you'd close the ticket; looking at only the CPU profile would tell you "lots of malloc" without telling you why. Both together name the problem.
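The Python-flavoured version of "pool buffers, reuse slices" is a borrowed scratch buffer per request instead of a fresh allocation. The sketch below is illustrative only; pool size, buffer size, and the handler names are made up.
# buffer_pool_sketch.py: borrow a reusable scratch buffer instead of allocating per request
import queue

POOL = queue.Queue()
for _ in range(16):                       # pre-allocate 16 x 64 KB scratch buffers
    POOL.put(bytearray(64 * 1024))

def handle_request_pooled(payload: bytes) -> int:
    buf = POOL.get()                      # borrow (blocks if all 16 are in use)
    try:
        n = min(len(payload), len(buf))
        buf[:n] = payload[:n]             # work happens in the reused buffer
        return n
    finally:
        POOL.put(buf)                     # return: no allocation on the hot path

def handle_request_churny(payload: bytes) -> int:
    buf = bytearray(64 * 1024)            # fresh 64 KB allocation on every request
    n = min(len(payload), len(buf))
    buf[:n] = payload[:n]
    return n

print(handle_request_pooled(b"order" * 100), handle_request_churny(b"order" * 100))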
CPU low, heap rising linearly, lock low → memory leak, no other symptom. This is the trap incident — there is no incident yet. The leak surfaces as an OOM kill three days later, by which point logs from the moment the leak started have rotated away. The only profile that catches it in real time is the live-heap profile, which is precisely the one teams disable first when chasing CPU overhead. Continuous live-heap profiling at low cost (runtime.MemProfileRate=2097152 for Go's 2 MB sampling, tracemalloc(1) for Python) is cheap insurance.
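The cheap-insurance version for a Python service is a dozen lines: single-frame tracemalloc plus a background thread that logs the live-heap slope and names the top allocators when the slope crosses a threshold. The sketch below is a starting point under stated assumptions; the interval, threshold, and logger name are placeholders.
# heap_watcher.py: low-cost always-on live-heap slope watch for a Python service
import logging, threading, time, tracemalloc

log = logging.getLogger("heap-watch")

def start_heap_watch(interval_s: int = 300, alert_slope_mb_h: float = 50.0):
    tracemalloc.start(1)                          # 1 frame: lowest-overhead mode
    last = tracemalloc.get_traced_memory()[0]

    def _watch():
        nonlocal last
        while True:
            time.sleep(interval_s)
            current = tracemalloc.get_traced_memory()[0]
            slope_mb_h = (current - last) / 1e6 * (3600 / interval_s)
            last = current
            log.info("live heap %.1f MB, slope %.1f MB/h", current / 1e6, slope_mb_h)
            if slope_mb_h > alert_slope_mb_h:     # name the suspects while the leak is live
                for stat in tracemalloc.take_snapshot().statistics("lineno")[:3]:
                    log.warning("top allocator: %s (%.1f MB)", stat.traceback[0], stat.size / 1e6)

    threading.Thread(target=_watch, daemon=True, name="heap-watch").start()

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    start_heap_watch(interval_s=2, alert_slope_mb_h=1.0)   # tight values just to demo
    leak = []
    for _ in range(20):                                    # synthetic leak to trigger the alert
        leak.append(bytearray(512 * 1024))
        time.sleep(0.5)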
All three high simultaneously → cascading failure. When all three profiles spike together, the service is in a feedback loop: high CPU (work), high allocation-rate (GC pressure adding to CPU), high lock contention (queueing because the work is slow). This is the IRCTC Tatkal-hour pattern at 10:00 IST: the load surge pushes the service past the GC's healthy point, GC pauses cause requests to queue, queueing exhausts the per-request memory pool which triggers more GC, and the cascade is fed by its own symptoms. You cannot fix this from the inside; you need to shed load. Recognising the pattern from three profiles together is the cue to enable the rate-limiter, not to keep optimising hot paths.
# disagreement_classifier.py — given snapshots of (cpu_share_in_alloc,
# heap_alloc_rate_mb_s, live_heap_mb, lock_wait_pct), classify the pathology.
# Indian-fintech examples for each output class.
# pip install pandas
import pandas as pd
def classify(cpu_alloc_pct: float, alloc_rate_mb_s: float,
live_heap_mb: float, live_heap_slope_mb_h: float,
lock_wait_pct: float, cpu_user_pct: float) -> str:
"""Return a one-line diagnosis string. Real thresholds from production
pyroscope+otel data; tune to your workload."""
if cpu_user_pct < 15 and lock_wait_pct > 60:
return ("contention-bound (no actual work). "
"Pattern: Zerodha 09:15 OrderBook lock — fine-grain or lock-free.")
if cpu_alloc_pct > 25 and alloc_rate_mb_s > 50 and live_heap_slope_mb_h < 5:
return ("GC pressure (allocation churn, no leak). "
"Pattern: Hotstar IPL JSON serdes — pool buffers, reuse slices.")
if cpu_user_pct < 30 and live_heap_slope_mb_h > 50 and lock_wait_pct < 10:
return ("slow leak (no other symptom — will OOM in days). "
"Pattern: a misconfigured cache; check eviction policy.")
if cpu_user_pct > 70 and alloc_rate_mb_s > 100 and lock_wait_pct > 50:
return ("cascading failure (load > capacity, feedback loop). "
"Pattern: IRCTC Tatkal 10:00 — shed load NOW, do not optimise.")
if cpu_user_pct > 60 and lock_wait_pct < 15 and live_heap_slope_mb_h < 5:
return ("CPU-bound on real work (the profile points at the hot path). "
"Pattern: Razorpay RSA verify — optimise the hot function.")
return "no clear pattern — collect more data, do not act on noise"
# Snapshots from four real(-shaped) incidents
incidents = [
# cpu_alloc, alloc_rate, live_heap, slope, lock, cpu_user
("Zerodha 09:15 IST", 2, 8, 500, 0.2, 72, 10),
("Hotstar IPL serdes", 34, 180, 1200, 1.0, 8, 82),
("checkout slow leak", 3, 12, 3400, 90.0, 5, 18),
("IRCTC Tatkal cascade", 22, 240, 2800, 30.0, 62, 78),
("Razorpay verify hot", 4, 6, 400, 0.5, 3, 73),
]
rows = []
for name, ca, ar, lh, slope, lw, cu in incidents:
rows.append({
"incident": name, "cpu_alloc_%": ca, "alloc_MB/s": ar,
"live_heap_MB": lh, "slope_MB/h": slope, "lock_wait_%": lw,
"cpu_user_%": cu,
"diagnosis": classify(ca, ar, lh, slope, lw, cu),
})
df = pd.DataFrame(rows)
print(df.to_string(index=False))
Sample run:
incident cpu_alloc_% alloc_MB/s live_heap_MB slope_MB/h lock_wait_% cpu_user_% diagnosis
Zerodha 09:15 IST 2 8 500 0.2 72 10 contention-bound (no actual work). Pattern: Zerodha 09:15 OrderBook lock — fine-grain or lock-free.
Hotstar IPL serdes 34 180 1200 1.0 8 82 GC pressure (allocation churn, no leak). Pattern: Hotstar IPL JSON serdes — pool buffers, reuse slices.
checkout slow leak 3 12 3400 90.0 5 18 slow leak (no other symptom — will OOM in days). Pattern: a misconfigured cache; check eviction policy.
IRCTC Tatkal cascade 22 240 2800 30.0 62 78 cascading failure (load > capacity, feedback loop). Pattern: IRCTC Tatkal 10:00 — shed load NOW, do not optimise.
Razorpay verify hot 4 6 400 0.5 3 73 CPU-bound on real work (the profile points at the hot path). Pattern: Razorpay RSA verify — optimise the hot function.
if cpu_user_pct < 15 and lock_wait_pct > 60 — the contention-bound check. User-CPU is low because threads are waiting; lock-wait is high because threads are waiting. Both indicators reinforce each other, which is why classifying on a single profile would miss it. The Zerodha row hits this case: 10% user-CPU, 72% lock-wait — the OrderBook is the bottleneck.
if cpu_alloc_pct > 25 and alloc_rate_mb_s > 50 and live_heap_slope_mb_h < 5 — the GC-pressure check. CPU spent in the allocator is high, allocation rate is high, but live heap is flat — so the bytes are getting freed promptly but the churn cost is killing performance. The Hotstar row hits this: 34% in allocator, 180 MB/s allocated, only 1 MB/h growth. The fix is allocation reduction, not memory expansion.
if cpu_user_pct > 70 and alloc_rate_mb_s > 100 and lock_wait_pct > 50 — the cascade check. Three high signals at once. The classifier explicitly tells you to shed load rather than optimise — because optimising while in a cascade extends the cascade. This is the most-violated rule in production: engineers see a CPU spike, deploy a "fix" that reduces CPU 5%, and the cascade keeps going.
Why a classifier-shaped readout is more useful than a flamegraph in a 03:00 incident: at 03:00 IST the on-call is sleep-deprived and reading three flamegraphs in three browser tabs is cognitively expensive. A single one-line diagnosis from a classifier — backed by the three profile inputs that produced it — collapses the decision tree. Real production teams (especially at smaller Indian fintechs that cannot afford a Pyroscope+Datadog+Honeycomb stack) build exactly this kind of classifier on top of their three free-tier profilers, and it is more useful than any single high-end APM tool because the signal is in the combination, not in any one tool's polish.
Common confusions
- "A CPU profile that is 80% in
epoll_waitmeans epoll is slow" — wrong.epoll_waitis where the thread parks waiting for an event; the time you see is wait time, not work. A modern profiler that conflates wait-time with work-time is broken; most profilers do separate them, but the oldergprofstyle does not. The fix: use a profiler that distinguishes on-CPU samples (perf record -e cpu-clock) from wall-clock samples (perf record -e task-clock), and read the on-CPU profile when looking for hot work. - "Heap profile and live-heap profile are the same" — different. Heap allocation profile is rate-of-bytes-allocated (including bytes already freed); heap live profile is what is currently held. A service can have huge allocation rate and tiny live heap (GC churn) or huge live heap and tiny allocation rate (slow leak). Most "memory leak" tools default to one or the other and mislead in the wrong case. Go:
pprof -alloc_spacevspprof -inuse_spaceare the two; you need both. - "Lock contention is the same as GIL contention" — overlapping but not the same. The GIL is one specific lock, held by exactly one Python thread at a time; contention on the GIL is what
py-spy --threadsmostly shows. A non-Python service has no GIL but can have severe contention on async.Mutex(Go),ReentrantLock(Java), orpthread_mutex_t(C). The principles transfer; the specific lock primitive does not. - "You can run all three profilers always-on at zero cost" — false. The honest cost on a Python service is 2-4% CPU overhead for low-frequency CPU + sampled tracemalloc + on-demand thread dumps; on a Go service is 1-2% combined for the three pprof endpoints at default sampling rates; on a JVM is under 1% for JFR's
profile.jfc. Anyone selling you "1% combined" on Python is either lying or misconfiguring tracemalloc; the language overhead is real. - "The flamegraph that shows the most time is the bug" — sometimes. The flamegraph shows where the most time goes, which is the bug only if the bug is "we spend too much time here". A bug that says "we never reach this fast path because the cache check is wrong" shows up as time not spent where you expected, not as time spent somewhere unusual. Diff profiles (build A vs build B) catch these; single-snapshot flamegraphs do not.
- "Lock profiles tell you about distributed locks" — false. A lock profile in Go, Java, or Python tracks in-process contention only. Distributed locks (Redis SETNX, Zookeeper ephemeral nodes, etcd leases, Postgres advisory locks) appear as time spent in the network call, not in the language's mutex profile. To see distributed-lock contention you need a tracing tool (Tempo, Jaeger) with span attributes for the lock-acquire latency, not a profiler.
Going deeper
How Go's three pprof handlers actually sample, and why the defaults work
Go's net/http/pprof ships three endpoints with very different sampling models. /debug/pprof/profile is CPU sampling at 100 Hz over a 30-second window via runtime.SetCPUProfileRate(100); the runtime's signal handler walks the goroutine stack on every SIGPROF and writes to a profile-record buffer. /debug/pprof/heap is allocation-sampled at MemProfileRate bytes (default 524288, i.e. one record per 512 KB allocated); the sampling is done inline in runtime.mallocgc with a per-mcache nextSample countdown. /debug/pprof/mutex records 1-in-SetMutexProfileFraction contention events (default 0, meaning off; production-typical: 100). The three endpoints are not "always on" by default — mutex and block profiles must be enabled by the program — and forgetting to enable them is the most common reason a Go service has no lock profile when needed.
Java Flight Recorder's "execution sample" event versus async-profiler
JFR's jdk.ExecutionSample event samples thread stacks at roughly 10 ms intervals (with the profile settings), but it attributes each sample using the JIT's debug metadata, and without -XX:+DebugNonSafepoints that metadata exists only at safepoints, so cycles spent inside JIT'd code get smeared onto the nearest safepoint. Async-profiler (Andrei Pangin's tool) uses AsyncGetCallTrace plus perf_events to sample at safepoint-independent moments, capturing the code paths JFR mis-attributes. Both are valid; they answer slightly different questions. JFR's number for "time in String.intern" might be 8% while async-profiler reports 14% on the same workload — the gap is the safepoint-attribution bias, and the async-profiler answer is closer to ground truth. Production Java teams typically run JFR continuously (it is built in, costs <1%) and pull async-profiler on demand for hot-path verification.
Native-heap blind spots: cgo, mmap, and the "memory leak Go pprof cannot see"
A Go service that uses cgo to call a C library (e.g. SQLite, librdkafka, libxml) allocates memory in the C heap via malloc, which the Go runtime's heap profile does not track. A service that mmaps a 4 GB file region similarly is invisible to pprof -inuse_space. The kernel's RSS is the truth; Go's heap profile is a subset. When kubectl top pod shows 8 GB used and pprof -inuse_space shows 800 MB, the missing 7.2 GB is in cgo, mmap, the goroutine stack pool, or the runtime's internal allocators (mheap, mspanCache). The diagnostic ladder is: pprof for Go-allocated, gops memstats for runtime-internal, pmap -X $PID for mmap, valgrind or bcc memleak for cgo. Skipping the ladder leaves you with "the profiler says 800 MB, the OS says 8 GB, who do I trust" and no answer.
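Python has the same gap in miniature: tracemalloc only sees allocations routed through Python's allocator, so kernel RSS minus traced memory is the interpreter, C-extension, and allocator-fragmentation share. A quick way to see it on your own machine (psutil is already in this chapter's pip installs; the ~50 MB workload is arbitrary):
# rss_vs_traced.py: the Python analogue of "pprof says 800 MB, the OS says 8 GB"
import os, tracemalloc, psutil

tracemalloc.start(1)
work = [b"x" * 1024 for _ in range(50_000)]          # ~50 MB of Python-visible heap

traced_mb = tracemalloc.get_traced_memory()[0] / 1e6
rss_mb = psutil.Process(os.getpid()).memory_info().rss / 1e6
print(f"tracemalloc sees {traced_mb:6.1f} MB")
print(f"kernel RSS is    {rss_mb:6.1f} MB")
print(f"unaccounted      {rss_mb - traced_mb:6.1f} MB  (interpreter, C extensions, allocator overhead)")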
Python's GIL is itself a lock-profile target — and the most common one
The Global Interpreter Lock is held by exactly one Python thread at a time. Threads waiting for the GIL show up in a py-spy dump as idle even though they have work queued, and py-spy record --gil restricts sampling to whichever thread currently holds the GIL. A Python service with N>1 threads doing CPU-bound work has GIL contention by construction — the work serialises onto a single interpreter — and this shows up as a wall-time-vs-CPU-time gap. A 4-thread service that pegs roughly one core's worth of CPU on a 4-core box but completes only 1.05× the work of a 1-thread baseline is GIL-bound. The fix is multiprocessing (concurrent.futures.ProcessPoolExecutor), C extensions that release the GIL (numpy, most pandas ops), or Python 3.13's experimental free-threaded build (PEP 703). The diagnostic is comparing wall time and CPU time and looking for the ratio gap; no profiler tells you this directly.
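The ratio check is small enough to keep as a script. In the sketch below (hashlib as a stand-in for pure-Python CPU work, iteration counts arbitrary), four threads take roughly four times the wall clock of one thread while the process's CPU/wall ratio stays near 1x (the GIL signature), and four processes bring wall time back down.
# gil_gap.py: wall-time vs CPU-time ratio check for GIL-bound services
import time, hashlib, threading
from concurrent.futures import ProcessPoolExecutor

def burn(n: int = 200_000):
    h = b"seed"
    for _ in range(n):
        h = hashlib.md5(h).digest()       # small buffers: the GIL stays held

def timed(label, fn):
    w0, c0 = time.perf_counter(), time.process_time()
    fn()
    w, c = time.perf_counter() - w0, time.process_time() - c0
    print(f"{label:10s} wall {w:5.2f}s  cpu {c:5.2f}s  cpu/wall {c/w:4.1f}x")

def four_threads():
    ts = [threading.Thread(target=burn) for _ in range(4)]
    for t in ts: t.start()
    for t in ts: t.join()

def four_procs():
    with ProcessPoolExecutor(4) as ex:
        list(ex.map(burn, [200_000] * 4))

if __name__ == "__main__":
    timed("1 thread", burn)
    timed("4 threads", four_threads)      # ~4x the wall time, cpu/wall still ~1x: GIL-bound
    timed("4 procs", four_procs)          # wall drops back (parent cpu stays low; children do the work)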
Production safeguards: when NOT to enable a profile
Three concrete cases where enabling a profile in production is wrong. First, enabling MemProfileRate=1 on a Go service that already shows GC pressure: the per-allocation overhead worsens the GC pressure, and the profile you collect is contaminated by the profiler's own bookkeeping. Use MemProfileRate=4096 (one record per 4 KB) or higher for a high-allocation-rate service. Second, enabling JFR's profile.jfc (the heavier settings) on a JVM already running close to its container memory limit: JFR's recording buffers live in native memory outside -Xmx, so the extra couple hundred MB of RSS can push a near-limit pod into an OOM kill. Use default.jfc (lighter) or a smaller buffer and retention. Third, enabling deep-stack tracemalloc on a Python service that is itself memory-constrained: tracemalloc keeps a traceback record for every live allocation, and on a service holding millions of small objects that bookkeeping can add a double-digit percentage to RSS, enough to tip a near-limit pod over. Use tracemalloc.start(1) so each record carries a single frame, or keep heap tracing on-demand rather than always-on. The "always-on profiling is free" mantra has these three exceptions and they all hurt.
Where this leads next
Chapter 59 — differential profiling — picks up the question of how to compare two profiles (build A vs build B, before-deploy vs after-deploy, healthy region vs unhealthy region) without being misled by the profile's own statistical noise. The technique is unintuitive: you cannot just subtract two flamegraphs because the per-function counts have different denominators; you need a pooled statistical test or a likelihood-ratio comparison.
Chapter 60 — profile storage and query patterns — picks up the question that follows from running three profile types continuously: how to store them at fleet scale, how to query them at sub-second latency, and how to bound retention without losing the long-tail data that catches slow leaks. The storage shape inherits from the GWP paper but the modern columnar engines (FrostDB, Phlare-style blocks) added new tricks the paper did not have.
For the specific tooling-by-language map, /wiki/pyroscope-and-parca-architectures compares two complete continuous-profiler systems and shows which profile types each handles natively versus which require an extra agent.
For the historical context, /wiki/google-wide-profiling-paper covers GWP's design — which is CPU-only by the paper, and which the open-source descendants extended to heap and lock independently.
References
- Brendan Gregg — Systems Performance: Enterprise and the Cloud (2nd ed., Pearson, 2020) — chapters 6 (CPUs) and 7 (memory) cover the on-CPU vs off-CPU distinction the chapter rests on.
- Russ Cox — Profile-guided optimization in Go 1.20+ (and what pprof actually samples) — Go's CPU/heap/mutex/block profile semantics, written by the language's design team.
- Andrei Pangin — async-profiler — the JVM sampler that avoids the safepoint-attribution bias JFR's execution sampler suffers from.
- Python Software Foundation — tracemalloc — Trace memory allocations — stdlib heap-profiling, with the frame-count trade-off documented.
- Ben Frederickson — py-spy: Sampling profiler for Python programs — out-of-process Python sampler, the closest thing Python has to perf.
- Brendan Gregg — Off-CPU Analysis — the canonical write-up on what CPU profiles miss and how to recover it with eBPF.
- Charity Majors, Liz Fong-Jones, George Miranda — Observability Engineering (O'Reilly, 2022) — chapter on profiles as the fourth pillar.
- /wiki/google-wide-profiling-paper — chapter 57 of this curriculum, on GWP's CPU-first design and how heap/lock came later.
# Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install py-spy psutil pandas
python3 three_profiles.py
python3 disagreement_classifier.py