Heap dumps and core dumps

Karan's pager fires at 03:14 IST. A Razorpay payments-reconciliation worker — the one that matches NPCI's UPI settlement files against the internal ledger — has stopped consuming from its Kafka topic. Resident memory is 47 GB on a 64 GB pod, CPU is at 0.0%, the JVM is alive (jps lists it), but no progress for nine minutes. There are no logs because the worker's last log line was "starting reconciliation batch" at 03:05. The on-call's instinct is to kubectl delete pod. Karan's instinct, after three years of production debugging, is different: capture state first, restart after. He runs jcmd <pid> GC.heap_dump /tmp/recon.hprof, waits 38 seconds while the JVM writes 47 GB to disk, runs gcore -o /tmp/recon.core <pid> for good measure, and only then deletes the pod. By 03:21 the queue is draining again. By 11:00 the same morning, he's loaded recon.hprof in Eclipse MAT on his laptop and found a 41 GB HashMap<String, byte[]> — a cache that was never meant to grow past 200 MB. The bug had been there for six weeks; only a heap dump could show it.

A heap dump is a snapshot of every live object in a managed runtime (JVM, CPython, Go); a core dump is a snapshot of every byte of process memory plus register state. Both are cheap at capture time (the kernel or the runtime writes them while the process pauses briefly) and slow to analyse later (gigabytes loaded into MAT, gdb, or eu-readelf). They are the only Part-15 rung that survives the process being killed — which is exactly what the on-call wants to do at 03:14 IST. The discipline is to capture before you restart, ship to your laptop, and read the dump as a database of dead state.

What a dump actually contains

Two flavours, two audiences, two file formats — and confusing them is the most common reason juniors fail to read their own captures.

A core dump is the kernel's snapshot of a Unix process: every mapped page, every register, every open file descriptor's bookkeeping, every thread's stack. The format is ELF (the same format as executables), with PT_LOAD segments for memory and PT_NOTE segments for thread state. It is language-agnostic: the kernel does not know whether the process is a Python interpreter, a JVM, or a hand-written C binary. It just dumps bytes. The reader (gdb, lldb, eu-readelf, crash) then reconstructs symbol-level meaning by mapping addresses against the binary's debug info — which is why a core dump without the matching binary and .debug info is mostly useless. A core dump from a 47 GB JVM pod is roughly 47 GB on disk; capture time is bounded mainly by the disk's sequential write rate (typically 8–30 seconds for 50 GB on NVMe).

A heap dump is the managed runtime's own snapshot of its object graph. Every live Java object, with its class name, fields, and references, gets written in HPROF format (for the JVM); other runtimes have their own equivalents (tracemalloc snapshots for CPython, pprof heap profiles for Go, .heapsnapshot for V8). Unlike a core dump, the heap dump has semantic structure: you can ask "what is the dominator tree of this 41 GB HashMap?" without reverse-engineering the JVM's internal layout. The cost is that heap dumps require runtime cooperation (the JVM's HotSpotDiagnosticMXBean, CPython's tracemalloc and gc.get_objects(), Go's runtime/pprof WriteHeapProfile) and so cannot be taken from a hung or crashed process — at least not safely. If the runtime's attach or signal handling is wedged, jcmd GC.heap_dump returns an error, and you fall back to a core dump and post-process it — for the JVM, by reconstructing an HPROF from the core with jhsdb, as described below.

Core dump vs heap dump — what each one captures (figure): same process, two snapshots — different audience, different reader.
  • Core dump (ELF, kernel-written): every mapped page (PT_LOAD), every thread's registers (PT_NOTE), stack pointers and signal mask, the open file-descriptor table. Format: ELF, language-agnostic. Reader: gdb, lldb, eu-readelf, crash. Needs: the binary plus debug info. Size: roughly the RSS of the process. Capture: gcore, or the kernel via core_pattern.
  • Heap dump (HPROF, runtime-written): every live object (class and fields), the reference graph between objects, static fields and GC roots, the class-loader hierarchy. Format: HPROF, runtime-specific. Reader: Eclipse MAT, jhat, VisualVM. Needs: runtime cooperation. Size: roughly the live set of the heap. Capture: jcmd GC.heap_dump.
Core dumps are taken by the kernel and read like raw memory; heap dumps are taken by the runtime and read like a database of objects. A hung JVM that won't respond to `jcmd` still gives you a core dump. A crashed C++ binary will never give you a heap dump. Pick the one that fits the failure mode.

Why this distinction matters at 03:14 IST: a junior who knows only "heap dump" tries jcmd GC.heap_dump, gets Unable to open socket file: target process not responding, panics, and runs kubectl delete pod — destroying the evidence. A senior who knows the difference falls back to gcore -o /tmp/recon.core <pid> — which goes through the kernel and works regardless of whether the JVM's signal handler is blocked — and then post-processes the core dump with jhsdb jmap --binaryheap --core /tmp/recon.core --exe /usr/lib/jvm/.../bin/java, reconstructing an HPROF from the core. The path is two-step instead of one, but it always works. Knowing the second path is the difference between rescuing the evidence and losing it.

There is also a temporal asymmetry worth internalising before you reach for either format: dumps are cheap to capture and expensive to read. A gcore invocation is 5–30 seconds; the analyst's session in MAT or gdb to interpret the result is often 90 minutes to 4 hours, longer if the bug is subtle. The on-caller's job during the incident is to capture good dumps quickly so that the analyst (frequently the same engineer the next morning, after sleep) has a clean artefact to read. A bad dump — captured too late, captured from the wrong pod, truncated by a core-size limit — wastes the analyst's window. Treat the capture as the expensive operation, even though it is the cheap one in wall-clock terms.

A third format worth naming briefly: mini-dumps (the Windows / .NET heritage, also adopted by breakpad for Chromium and Mozilla) are subset core dumps — register state plus a configurable subset of memory regions — designed for crash-reporting at scale, where you cannot upload a 47 GB core but you can afford a 30 MB minidump per crash. Indian companies running native crash-reporting (game studios, video-call apps like JioMeet, the native modules in Hotstar's Android player) routinely deploy mini-dumps for client crashes. The trade-off is that the missing memory regions cannot be recovered later — if the bug lives in a region the minidump filter excluded, you are out of luck. For server-side production debugging the discipline is "capture full, prune later"; for client-side crash reporting the discipline is "capture minimum, ask for more if needed".

Triggering a dump — the four real-world paths

You will end up using all four of these paths within your first year on call. Each fits a different failure mode.

Path 1: a crash with core_pattern. When a process dies with SIGSEGV, SIGABRT, SIGFPE, SIGBUS, or any other fatal signal, the Linux kernel consults /proc/sys/kernel/core_pattern. If the pattern starts with | it pipes the core to a handler (systemd-coredump, apport, or your own script); otherwise it writes a file at the matched path. On a stock Ubuntu container with systemd-coredump installed, the core lands at /var/lib/systemd/coredump/core.<exe>.<uid>.<bootid>.<pid>.<time>.zst — compressed, indexed, retrievable via coredumpctl list. This is the only path that runs after the process is dead — no other capture method works once the process has exited, because the address space has been reclaimed.
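
A quick preflight check makes Path 1 concrete. The following is a sketch (not part of any standard tooling) that a service or its init container could run to confirm a crash on this host would actually leave a core behind; the /proc path and RLIMIT_CORE are the real Linux interfaces, the warning wording is illustrative.

# core_preflight.py — sketch: will a crash on this host actually leave a core?
# The /proc path and RLIMIT_CORE are real Linux interfaces; the messages are illustrative.
import resource

def core_capture_preflight() -> list[str]:
    problems = []

    # 1. Where would the kernel send a core? A leading '|' means a pipe handler
    #    (systemd-coredump, apport); anything else is a literal path template.
    with open("/proc/sys/kernel/core_pattern") as f:
        pattern = f.read().strip()
    if pattern.startswith("|"):
        print(f"cores piped to handler: {pattern[1:].split()[0]}")
    elif pattern:
        print(f"cores written to path template: {pattern}")
    else:
        problems.append("kernel.core_pattern is empty — cores land in the cwd or nowhere")

    # 2. A soft RLIMIT_CORE of 0 is the classic reason no core file ever appears.
    soft, _hard = resource.getrlimit(resource.RLIMIT_CORE)
    if soft == 0:
        problems.append("RLIMIT_CORE soft limit is 0 — raise it with "
                        "`ulimit -c unlimited` or LimitCORE= in the unit file")
    return problems

if __name__ == "__main__":
    for warning in core_capture_preflight():
        print("WARNING:", warning)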

Path 2: a live process via gcore / jmap / py-spy. When the process is alive but pathological — hung, leaking, or producing wrong output — you attach to it. gcore -o /tmp/svc.core <pid> writes a core dump without killing the process (it ptrace-attaches, freezes threads briefly, dumps memory, detaches). jmap -dump:format=b,file=/tmp/svc.hprof <pid> does the same for a JVM heap dump. py-spy dump --pid <pid> snapshots Python thread stacks (not full heap, but enough for hung-thread diagnosis). All three impose a stop-the-world pause whose length scales with RSS — figure 200 ms per GB of resident memory on NVMe-backed pods, longer on slower storage.

Path 3: a runtime-triggered dump on OOM or threshold. Configure the JVM with -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/dumps/oom-%p.hprof and the JVM writes a heap dump before the process dies of OutOfMemoryError. Configure CPython with tracemalloc.start() and a watchdog that calls tracemalloc.take_snapshot().dump('/var/dumps/oom.snap') when RSS crosses a threshold. Configure Go with debug.SetMemoryLimit and a runtime.WriteHeapProfile trigger from your soft-OOM handler. The pattern is the same: arm the trigger before the incident, and the dump is captured even if you are asleep.
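
For the CPython leg of Path 3, here is a minimal sketch of that arm-in-advance pattern — the Python analogue of -XX:+HeapDumpOnOutOfMemoryError. The threshold, dump path, and poll interval are illustrative choices, not defaults of any library; the mechanism is the stdlib tracemalloc plus a watchdog thread.

# soft_oom_watchdog.py — sketch of an armed-in-advance dump trigger (Path 3).
# The threshold, dump path, and poll interval are illustrative choices.
import os
import threading
import time
import tracemalloc

RSS_LIMIT_BYTES = 4 * 1024**3          # soft limit: 4 GB, well below the pod limit
DUMP_PATH = "/var/dumps/soft-oom-{pid}.snap"
POLL_SECONDS = 10

def _rss_bytes() -> int:
    # VmRSS is reported in kB in /proc/<pid>/status (Linux only).
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) * 1024
    return 0

def _watchdog() -> None:
    while True:
        if _rss_bytes() > RSS_LIMIT_BYTES:
            # Capture while the process is still alive and before the kernel
            # OOM killer fires; then let normal alerting take over.
            snap = tracemalloc.take_snapshot()
            snap.dump(DUMP_PATH.format(pid=os.getpid()))
            return
        time.sleep(POLL_SECONDS)

def arm_soft_oom_dump() -> None:
    """Call once at process startup, before the incident."""
    tracemalloc.start(10)               # modest frame depth keeps overhead low
    threading.Thread(target=_watchdog, daemon=True).start()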

Path 4: a scheduled "always-on" sample. Continuous heap profiling — pprof for Go, async-profiler for the JVM, memray --aggregate for Python — runs forever at sub-1% overhead, sampling allocation paths. You query the historical profile retroactively, like continuous CPU profiling (/wiki/wall-debugging-live-systems-is-its-own-skill). This is not strictly a "dump" but it covers the same diagnostic surface: where is memory going, and which call sites are responsible. Indian-scale production systems treat continuous heap profiling as table-stakes; Razorpay's payments core, Hotstar's IPL streaming pods, and Zerodha's order-match all run continuous heap profiles in production.
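
A pure-stdlib sketch of the same always-on idea for a Python service (real deployments use memray, pprof, or async-profiler as named above; the interval, frame depth, and output directory here are illustrative): a daemon thread writes a small aggregated allocation sample on a fixed schedule, so you get the trajectory over days rather than a single crisis snapshot.

# continuous_heap_sampler.py — sketch of Path 4 in pure Python: a scheduled,
# always-on allocation sample. The interval and output paths are illustrative.
import threading
import time
import tracemalloc

INTERVAL_SECONDS = 600                  # one sample every 10 minutes
TOP_N = 20

def _sample_forever() -> None:
    while True:
        time.sleep(INTERVAL_SECONDS)
        snap = tracemalloc.take_snapshot()
        stamp = time.strftime("%Y%m%dT%H%M%S")
        # Write only the aggregated top-N sites, not the full snapshot:
        # kilobytes per sample, so weeks of history stay cheap to keep.
        with open(f"/var/profiles/heap-{stamp}.txt", "w") as f:
            for stat in snap.statistics("lineno")[:TOP_N]:
                frame = stat.traceback[0]
                f.write(f"{stat.size}\t{stat.count}\t{frame.filename}:{frame.lineno}\n")

def start_continuous_sampling() -> None:
    tracemalloc.start(1)                # depth 1: lowest overhead for always-on use
    threading.Thread(target=_sample_forever, daemon=True).start()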

The four paths are not redundant. Each one captures evidence the others cannot:

Failure mode → best path → why the others fail:
  • Process crashed (segfault) → Path 1 (core_pattern). Path 2 needs a live PID; path 3 needs a runtime-level trigger; path 4 only sees pre-crash samples.
  • Process hung (deadlock, GC pause) → Path 2 (gcore). Path 1 fires on death only; path 3 needs runtime cooperation; path 4 captures snapshots, not full state.
  • Process heading for a kernel OOM kill → Path 3 (threshold trigger / HeapDumpOnOutOfMemoryError). The OOM killer's SIGKILL cannot be caught and produces no core, so the trigger has to fire before the kill; path 1 never gets a chance.
  • Slow leak over weeks → Path 4 (continuous profiling). The dump at the moment of crisis is too late; you need the trajectory across days.

A team that sets up all four paths during a quiet quarter is a team that does not lose evidence at 03:14 IST. A team that sets up path 1 only — the default on most Linux distros — captures crashes but loses every other failure shape.

A subtle-but-load-bearing point: the trigger condition and the capture mechanism are independent design choices. The trigger can be a signal (SIGSEGV), a runtime threshold (-XX:+HeapDumpOnOutOfMemoryError), a wall-clock schedule (cron invoking gcore weekly on a canary pod), an external request (an SRE running kubectl exec -- gcore), or a metric breach (Prometheus alertmanager firing a webhook that triggers jcmd). The capture mechanism is one of the four paths above. Pairing them flexibly — for example, "if RSS exceeds 80% of the pod limit, fire a webhook that triggers jcmd GC.heap_dump and uploads to S3 with the pod label as the key" — is what production-grade dump infrastructure looks like. Razorpay's payments core has a dump-on-symptom Kubernetes operator that watches per-pod metrics and triggers dumps before the OOM killer arrives; Hotstar's IPL streaming pods have a dump-on-deploy cron that captures a baseline heap dump from one pod after every release, which becomes the t=0 reference for the dominator-tree-diff workflow described later in this chapter.
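
A minimal sketch of that pairing: an external watcher that treats "RSS above 80% of the limit" as the trigger and an on-demand jcmd heap dump plus upload as the capture. The threshold, paths, and bucket name are illustrative; jcmd and aws s3 cp are the standard CLIs.

# dump_on_symptom.py — sketch of pairing a metric trigger with a capture path.
# The threshold, paths, and bucket name are illustrative.
import subprocess
import sys
import time

RSS_FRACTION_TRIGGER = 0.80             # dump when RSS exceeds 80% of the limit

def rss_bytes(pid: int) -> int:
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) * 1024
    return 0

def capture_and_ship(pid: int, dump_path: str) -> None:
    # Trigger and capture are independent: the trigger was an RSS breach,
    # the capture mechanism is a Path 2-style on-demand jcmd heap dump.
    subprocess.run(["jcmd", str(pid), "GC.heap_dump", dump_path], check=True)
    subprocess.run(["aws", "s3", "cp", dump_path,
                    f"s3://example-dumps/{dump_path.rsplit('/', 1)[-1]}"], check=True)

def watch(pid: int, limit_bytes: int) -> None:
    while True:
        if rss_bytes(pid) > RSS_FRACTION_TRIGGER * limit_bytes:
            capture_and_ship(pid, f"/var/dumps/rss-breach-{pid}-{int(time.time())}.hprof")
            return
        time.sleep(30)

if __name__ == "__main__":
    watch(int(sys.argv[1]), int(sys.argv[2]))   # usage: dump_on_symptom.py <pid> <limit-bytes>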

A worked artefact — capture, ship, analyse a Python heap dump

A runnable Python example that demonstrates the capture-ship-analyse loop. We will deliberately leak memory in a Razorpay-shaped reconciliation script, capture a heap dump using tracemalloc, ship it to a "laptop" (a separate analysis script), and read it.

# heap_dump_demo.py — a leak you can capture, ship, and analyse.
# Uses only the Python standard library (tracemalloc, pickle, gzip);
#   no third-party packages are needed.
# Run: python3 heap_dump_demo.py

import tracemalloc, pickle, time, os, gzip

# Simulate a payments-reconciliation worker with a "cache" that grows
# unbounded — a real bug pattern from the Karan story in the lead.
SETTLEMENT_BATCH = 5_000        # rows per batch
BATCHES_TO_PROCESS = 60         # 60 batches × 5,000 rows = 300,000 cached rows

def fetch_npci_settlement_batch(batch_id):
    """Pretend to read a batch of UPI settlement rows from NPCI."""
    return [
        {"txn_id": f"TX{batch_id:05d}{i:05d}",
         "amount_paise": 1000 + i,
         "merchant": f"M{i % 200:04d}",
         "raw_payload": b"x" * 4096}     # 4 KB raw per row → leaky
        for i in range(SETTLEMENT_BATCH)
    ]

# THE LEAK: settlement_cache is keyed by txn_id and never evicted.
settlement_cache: dict[str, dict] = {}

def reconcile_batch(batch):
    for row in batch:
        # Looks innocent; in real Razorpay code the "cache" was supposed to
        # hold the last 200 MB of settlement rows for cross-day refunds.
        # The eviction code shipped behind a feature flag that was off in prod.
        settlement_cache[row["txn_id"]] = row

# --- begin tracking ---
tracemalloc.start(25)             # 25-frame backtrace per allocation site
t0 = time.perf_counter()
for b in range(BATCHES_TO_PROCESS):
    reconcile_batch(fetch_npci_settlement_batch(b))
elapsed = time.perf_counter() - t0

snap = tracemalloc.take_snapshot()
size_mb = sum(stat.size for stat in snap.statistics('filename')) / (1024**2)
rss_mb = (
    int(open('/proc/self/status').read().split('VmRSS:')[1].split()[0]) / 1024
    if os.path.exists('/proc/self/status')
    else -1.0
)
print(f"elapsed: {elapsed:.2f}s, tracemalloc: {size_mb:.1f} MB, RSS: {rss_mb:.1f} MB")
print(f"settlement_cache entries: {len(settlement_cache):,}")

# --- write the heap dump ---
# A real production tool would write an HPROF or pprof profile; with the
# stdlib we gzip-pickle the tracemalloc snapshot (Snapshot.dump() would also
# work). The shape is identical: a serialisable snapshot you can ship off-host.
out_path = "/tmp/recon.snap.gz"
with gzip.open(out_path, "wb") as f:
    pickle.dump(snap, f)
print(f"heap dump written to {out_path} ({os.path.getsize(out_path)/1024:.1f} KB)")

# Top 5 allocation sites — what an "analyser laptop" would print
print("\nTop 5 allocation sites by retained size:")
for stat in snap.statistics('lineno')[:5]:
    frame = stat.traceback[0]
    print(f"  {stat.size/1024:>8.1f} KB  ({stat.count:>6,} allocs)  {frame.filename}:{frame.lineno}")
# Sample run on a 16 GB Linux laptop (Python 3.11):
elapsed: 1.84s, tracemalloc: 1284.7 MB, RSS: 1342.1 MB
settlement_cache entries: 300,000
heap dump written to /tmp/recon.snap.gz (412.3 KB)

Top 5 allocation sites by retained size:
   1230592.0 KB  (300,000 allocs)  heap_dump_demo.py:18
      4192.4 KB  ( 60 allocs)      heap_dump_demo.py:13
      1844.0 KB  ( 60 allocs)      heap_dump_demo.py:14
       208.5 KB  (  3 allocs)      <frozen importlib._bootstrap>:241
       143.2 KB  (  1 alloc)       /usr/lib/python3.11/tracemalloc.py:558

Walk through the load-bearing lines:

  • tracemalloc.start(25): turns on per-allocation backtraces with 25 frames of depth. The 25 is the cost-vs-utility knob — depth 1 is cheap (~3% overhead) and tells you only which line allocated; depth 25 is more expensive (~10% overhead) and gives you the full call chain. Why depth matters: the dominant site in the sample output, heap_dump_demo.py:18, falls inside the row-construction expression in fetch_npci_settlement_batch — the dict literal whose 4 KB raw_payload accounts for the bulk of each row. With depth 1 you see only that line; with depth 25 you can walk up the recorded stack and confirm that every one of those allocations is reached from the reconciliation loop in __main__, not from any other code path. That distinction is what tells you which feature flag to flip. Note that tracemalloc attributes memory to the allocation site; the separate question of who keeps the rows alive (settlement_cache, never evicted) is the dominator-tree question taken up in the next section.
  • snap.statistics('filename') and snap.statistics('lineno'): two views of the same data. Filename rolls up by source file (good for "which module is leaking"); lineno rolls up by exact source line (good for "which allocation is leaking"). The output above shows line 18 accounting for 1.23 GB across 300,000 allocations — the leaked rows are allocated at exactly that line, not somewhere in tracemalloc.py or in importlib.
  • pickle.dump(snap, f) with gzip: this is the "ship it" step. The compressed pickle is 412 KB for a 1.23 GB heap snapshot — tracemalloc already aggregates allocations by site, so the wire size is bounded by the number of distinct allocation sites, not the number of allocations. You can email this dump, attach it to a Jira ticket, or upload it to S3 for postmortem analysis. Why this matters in production: a real heap dump from a 47 GB JVM pod is 47 GB; copying it off-pod over your 1 Gbps egress link takes ~6 minutes, and your egress bill goes up by ₹40-80 per dump (AWS ap-south-1 outbound). The aggregated form — tracemalloc for Python, async-profiler --alloc for the JVM, pprof --inuse_objects for Go — is ~10,000× smaller and almost as informative for diagnosing leaks. Save the full dump for "the dump is the question" cases (graph traversal needed) and prefer the aggregated form for "where is memory going" cases.
  • /proc/self/status VmRSS: the OS-level RSS reading is 1342 MB while tracemalloc reports 1285 MB. The 57 MB gap is tracemalloc's own bookkeeping plus interpreter overhead plus glibc fragmentation — exactly the kind of overhead Part 12 (/wiki/wall-some-overheads-are-invisible) talks about. The gap is normal; if it grows unboundedly that itself is a bug (allocator fragmentation, native-extension leak invisible to tracemalloc).

The shape of this script — capture under load, snapshot at peak, ship as a small file, analyse later — is the production-debugging template. Substitute tracemalloc for jcmd GC.heap_dump, gcore, pprof, or memray and the workflow is identical: arm the trigger, wait for the symptom, capture, ship, read.
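
The "laptop side" of that loop, as a separate script: a sketch that assumes the gzip-pickled snapshot written by heap_dump_demo.py above, and prints the same top-N view plus the full call chain for the largest site.

# analyse_snap.py — the laptop side of capture → ship → analyse.
# Reads the gzip-pickled tracemalloc snapshot written by heap_dump_demo.py.
# Run: python3 analyse_snap.py /tmp/recon.snap.gz
import gzip
import pickle
import sys

def load_snapshot(path):
    with gzip.open(path, "rb") as f:
        return pickle.load(f)           # a tracemalloc.Snapshot object

def main() -> None:
    snap = load_snapshot(sys.argv[1])
    print("Top 10 allocation sites in", sys.argv[1])
    for stat in snap.statistics("lineno")[:10]:
        frame = stat.traceback[0]
        print(f"  {stat.size/1024:>10.1f} KB  {stat.count:>9,} allocs  "
              f"{frame.filename}:{frame.lineno}")
    # The 25-frame tracebacks captured at the source let you print the full
    # call chain for the biggest site, without access to the original host.
    top = snap.statistics("traceback")[0]
    print("\nCall chain for the largest site:")
    for line in top.traceback.format():
        print(" ", line)

if __name__ == "__main__":
    main()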

Reading a dump — the dominator-tree mental model

Loading a 41 GB HPROF in Eclipse MAT is one thing; understanding what you're looking at is another. The single concept that makes heap analysis tractable is the dominator tree.

A node A dominates node B if every path from a GC root to B goes through A. The dominator of an object is therefore "the thing that, if collected, would also collect this object". Eclipse MAT builds an explicit dominator tree (its "Dominator Tree" pane), and most heap-analysis tools expose the same retained-size idea in some form; all of them let you ask "which node, near the GC roots, holds the most retained heap?". For Karan's leak the answer was a single HashMap<String,byte[]> that retained 41 GB — meaning if that one map went away, 41 GB of heap would be reclaimable. Dominator-tree views are why heap-dump analysis is fast: instead of reading 300 million live objects one by one, you read the top 20 dominator-tree nodes and find the leak in 90 seconds.

[GC root: Thread "main"]
    │
    ├─ ReconciliationWorker @0x7f88_... (32 bytes shallow, 41.3 GB retained)
    │   │
    │   └─ HashMap<String,byte[]> settlement_cache (96 bytes shallow, 41.3 GB retained)
    │       │
    │       ├─ HashMap.Node[] table (16,777,216 buckets, ~134 MB shallow, 41.3 GB retained)
    │       │   │
    │       │   ├─ Entry "TX0000000001" -> byte[4096]   (4128 bytes retained)
    │       │   ├─ Entry "TX0000000002" -> byte[4096]   (4128 bytes retained)
    │       │   └─ ... 9.99 million more entries ...
    │       │
    │       └─ load_factor (0.75)
    │
    └─ KafkaConsumer (small retained; not the leak)

The view above is from Eclipse MAT's "Dominator Tree" pane on a real heap dump from a Razorpay-shape leak. Reading it: the root retains 41 GB; the immediate dominator of "all that retained heap" is settlement_cache. Click into it and you see the HashMap.Node[] array with 10 million entries — that is your leak. No other path needs to be explored. The dominator tree is doing the work of pruning the 300-million-object graph to the 20 nodes that matter.

Dominator tree of a leaky reconciliation worker (figure): GC root (Thread "main", retains 41.3 GB) → ReconciliationWorker (shallow 32 B, retained 41.3 GB) → HashMap settlement_cache, the leak (shallow 96 B, retained 41.3 GB) → HashMap.Node[] table (9.99M entries, 41.3 GB). Reading the top four nodes finds the leak; the other 300 million objects never need to be inspected.
The dominator tree compresses the heap into a navigable hierarchy. Each node's "retained" number is what would be reclaimed if that node went away. The leak almost always lives at a node where retained-size jumps by 10×+ from its parent — exactly the `settlement_cache` row above.
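
To make the retained-size arithmetic concrete, here is a toy sketch of the computation MAT performs — the textbook iterative dominator algorithm plus a retained-size roll-up, on a small hand-built object graph. The graph, names, and shallow sizes are invented for illustration; real tools use much faster algorithms (Lengauer–Tarjan) over hundreds of millions of nodes.

# dominator_toy.py — sketch: dominators and retained sizes on a toy object graph.
# The graph and shallow sizes are invented; Eclipse MAT computes the same two
# quantities (with a much faster algorithm) over hundreds of millions of objects.

# Edges point from referrer to referee, starting at a single GC root.
EDGES = {
    "gc_root":  ["worker"],
    "worker":   ["cache", "consumer"],
    "cache":    ["table"],
    "table":    ["entry1", "entry2"],
    "consumer": [],
    "entry1":   [],
    "entry2":   [],
}
SHALLOW = {"gc_root": 0, "worker": 32, "cache": 96, "table": 8_000_000,
           "entry1": 4128, "entry2": 4128, "consumer": 512}

def dominators(edges, root):
    """Dom(n) = {n} ∪ ⋂ Dom(p) over predecessors p, iterated to a fixpoint."""
    nodes = set(edges)
    preds = {n: {p for p, succs in edges.items() if n in succs} for n in nodes}
    dom = {n: set(nodes) for n in nodes}
    dom[root] = {root}
    changed = True
    while changed:
        changed = False
        for n in nodes - {root}:
            inter = set.intersection(*(dom[p] for p in preds[n])) if preds[n] else set()
            new = {n} | inter
            if new != dom[n]:
                dom[n], changed = new, True
    return dom

def retained_sizes(edges, shallow, root):
    """Retained size of d = total shallow size of everything d dominates."""
    dom = dominators(edges, root)
    retained = {n: 0 for n in edges}
    for n, doms in dom.items():
        for d in doms:              # n's shallow size counts toward each of its dominators
            retained[d] += shallow[n]
    return retained

if __name__ == "__main__":
    for node, size in sorted(retained_sizes(EDGES, SHALLOW, "gc_root").items(),
                             key=lambda kv: -kv[1]):
        print(f"{node:10s} retained {size:>10,} bytes")

Running it prints cache and table retaining nearly everything while consumer retains only its own 512 bytes — the same "retained size jumps at the leak node" signal the caption above describes.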

Why the dominator-tree view beats MAT's other views (histogram, duplicate classes, leak suspects, GC roots): the histogram tells you which class is using memory but not why it isn't being collected. A 41 GB byte[] total is unhelpful — it could be a thousand 41-MB image buffers, a million 41-KB request bodies, or one 41-GB cache. The dominator tree tells you who is keeping the references alive, which is the question you actually need to answer to fix the leak. The "leak suspects" view is a heuristic dominator-tree analysis MAT runs automatically, picking the top 1–3 retained-size nodes and writing a one-paragraph summary. It is correct ~70% of the time; the dominator tree itself is what you fall back on when the heuristic guesses wrong.

A practical mental anchor: every leak's dominator path bottoms out at a holding structure — a HashMap, an ArrayList, a ConcurrentHashMap, a LinkedBlockingQueue, a static field, a thread-local. Memorise the seven shapes (cache without eviction, queue without consumer, map keyed by request-id, listener-list never unsubscribed, ThreadLocal in a thread pool, classloader holding old classes, JNI global ref) and your eyes go straight to the right node. Most production leaks in Indian fintech and consumer-internet code are one of these seven; in two years of reading dumps you will see each shape several times, and the eighth shape — when it appears — is worth a postmortem all on its own. The discipline of cataloguing leak shapes across the team's dumps over a year compounds: a senior reading their hundredth dump diagnoses a leak in 4 minutes from the dominator tree alone, while a junior reading their first takes 90 minutes and three wrong hypotheses.

A second view worth knowing: dominator-tree diff between two dumps. Take a heap dump at t=0 (when the service is healthy) and at t=2h (when memory has grown 30 GB), then ask MAT to diff the two dominator trees. The output is exactly the nodes that grew — usually one or two map/list/cache structures, with the growth quantified in bytes and entry-count. This is the technique that finds slow leaks (megabytes per hour, invisible in any single snapshot) by treating the heap dumps as a time series. Indian production teams running quarterly leak audits — Flipkart's catalogue services, Swiggy's dispatch pipeline — use exactly this two-snapshot diff workflow on a scheduled cadence, catching leaks that take 4–6 weeks to manifest.
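
For the Python worked example, the same two-snapshot diff is a single call — tracemalloc's compare_to. A sketch assuming two snapshots captured and gzip-pickled the way heap_dump_demo.py does, one at t=0 and one after the growth:

# snap_diff.py — sketch: the two-snapshot diff workflow for tracemalloc dumps.
# Assumes two snapshots written the way heap_dump_demo.py writes them.
# Run: python3 snap_diff.py /tmp/recon-t0.snap.gz /tmp/recon-t2h.snap.gz
import gzip
import pickle
import sys

def load(path):
    with gzip.open(path, "rb") as f:
        return pickle.load(f)

def main() -> None:
    old, new = load(sys.argv[1]), load(sys.argv[2])
    # compare_to returns StatisticDiff objects ordered by the size of the change:
    # exactly "the nodes that grew" between the two captures.
    diffs = new.compare_to(old, "lineno")
    print("Top 10 growers between the two snapshots:")
    for d in diffs[:10]:
        frame = d.traceback[0]
        print(f"  {d.size_diff/1024:>+10.1f} KB  {d.count_diff:>+8,} allocs  "
              f"{frame.filename}:{frame.lineno}")

if __name__ == "__main__":
    main()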

Common confusions

  • "A core dump and a heap dump are the same thing" No. A core dump is a kernel-written byte-by-byte snapshot of process memory + registers, language-agnostic, readable by gdb. A heap dump is a runtime-written object-graph snapshot, language-specific, readable by Eclipse MAT / pprof / pyrasite. Use a core dump when the runtime is dead or unresponsive; use a heap dump when you want object-level semantic analysis on a live or recently-live runtime.
  • "jmap -dump is safe to run on production" It pauses the JVM for the duration of the dump — typically 200 ms per GB of heap, scaling linearly. On a 32 GB heap that is a 6-second stop-the-world pause, longer than most p99 SLOs. Always run it on a drained or low-traffic pod. The live mode (jmap -dump:live,...) triggers a full GC before dumping, which doubles the pause but produces a smaller, GC-clean dump.
  • "If I have a core dump I don't need the binary" A core dump without the matching binary and debug-info is mostly unreadable — gdb shows you raw addresses, not function names or struct fields. Always preserve the binary that produced the core, ideally with eu-unstrip --core <core> --executable <bin> to embed the build-id. Containers solve this neatly: the same image that produced the core can be re-pulled to read it.
  • "Heap dumps catch all memory issues" Heap dumps catch managed-memory leaks. Native memory leaks — JVM off-heap (DirectByteBuffer, JNI), CPython C-extension memory, Go cgo allocations — are invisible to the heap dump and require a core dump (with pmap-style analysis) or a native allocator profiler (jemalloc --enable-prof, tcmalloc --heap-profile). A JVM with 8 GB on-heap and 24 GB off-heap leaking will show a clean heap dump and a 32 GB RSS — which is exactly the moment juniors give up.
  • "Always upload the dump to S3 for the team" The dump contains every byte of process memory — including PII, payment tokens, secrets, JWT signing keys, and the cache of customer phone numbers. Treat dumps as Tier-1-sensitive artefacts: encrypt at rest, retain for the minimum useful window (Razorpay's policy is 14 days), restrict read access to the on-call rotation, and run strings | grep -E '(password|secret|token)' over them as a sanity check before sharing widely.
  • "The leak is in the class with the largest histogram entry" Not always. Many leaks live in byte[] or char[] or String — generic types that show up at the top of every heap's histogram regardless of leak status. The dominator tree is the right entry point; the histogram is for understanding the size distribution of allocations, not for finding leaks.

Going deeper

core_pattern and the systemd-coredump pipeline

Linux's kernel.core_pattern is the choke point for every crash dump on the host. Read it with sysctl kernel.core_pattern. On stock Ubuntu/Debian containers it is |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h, meaning the kernel pipes the core to systemd-coredump, which compresses with zstd, indexes by build-id, and stores in /var/lib/systemd/coredump/. List dumps with coredumpctl list; retrieve with coredumpctl dump <pid> or coredumpctl debug <pid> (which loads gdb on the dump). The pipeline is invisible until you need it — and then you discover it has been silently truncating cores at 2 GB because of the ProcessSizeMax=2G default in /etc/systemd/coredump.conf. Why this defaults to 2 GB: in 2014 when systemd-coredump shipped, 2 GB was a generous bound. Modern JVM pods have 32–64 GB heaps, and a truncated 2 GB core is unreadable — the JVM headers point to addresses that aren't in the file. Set ProcessSizeMax=infinity and ExternalSizeMax=infinity in /etc/systemd/coredump.conf.d/large.conf, and verify with coredumpctl info that the dump is complete. The number of teams who only discover the truncation when they need a real dump is non-trivial.

The Kubernetes-shaped problem is harder: core_pattern is a host-level kernel tunable, but pods are in containers. Each pod inherits the host's core_pattern, which means cores from a crashed pod land on the host's /var/lib/systemd/coredump/ — not on the pod's filesystem, not in the pod's logs, not in any pod-mounted volume. Production-grade Kubernetes setups configure core_pattern to write to a host-mounted directory (/var/dumps/<pod_name>-%t.core) and run a sidecar Job that ships dumps to S3 / GCS with the pod's labels. Without this, a crashed pod simply vanishes — and the on-caller has no dump to read.

How jhsdb reads a JVM core into a heap dump

The bridge between core dumps (kernel-written, language-agnostic) and heap dumps (runtime-written, JVM-specific) is jhsdb (formerly hsdb and jstack -F). When the JVM is hung and jcmd GC.heap_dump cannot complete, the path is: gcore -o /tmp/jvm.core <pid> to capture the bytes, then jhsdb jmap --binaryheap --core /tmp/jvm.core --exe $JAVA_HOME/bin/java --dumpfile /tmp/jvm.hprof to walk the JVM internals out of the core and produce a HPROF as if jmap had succeeded. The mechanism: jhsdb knows the JVM's internal structures (Universe, Generation, OopMap) at every supported version, attaches its own debugger to the core file, walks the heap roots, and emits HPROF records.

The catch: the jhsdb binary used to read the dump must match the JVM version that produced it. A core from OpenJDK 17.0.5 read with jhsdb from 21.0.2 will fail with cryptic offsets or silently produce a wrong HPROF. The discipline is to ship the entire JVM artefact alongside the core — at Razorpay's payments core, a sidecar container called dump-uploader watches /var/dumps, gzips the core together with $JAVA_HOME (a 200 MB tarball), and uploads both to S3. The on-caller analysing the dump pulls the bundle, extracts to a workstation, and runs the bundled jhsdb against the bundled core. Without this discipline a core captured from a 17.0.5 JVM and analysed three months later (after the production fleet has rolled to 21.0.2) becomes unreadable.

Live-process attach: the ptrace cost model

gcore, jmap -dump, py-spy dump, and gdb attach -p all use the same kernel mechanism: ptrace(PTRACE_ATTACH, pid). The kernel sends SIGSTOP to the target, waits for the target to enter the stopped state, and then the tracer reads memory via process_vm_readv (zero-copy on modern kernels) or PEEKTEXT/PEEKDATA (one syscall per word, slow). For a 47 GB process the read is bounded by process_vm_readv's ~12 GB/s throughput on a typical NVMe-backed host, so dumping completes in roughly 4 seconds — but the process is fully paused for the entire duration. Threads in EPOLLIN waits stay where they are; threads holding mutexes still hold them; in-flight TCP connections stack up in the kernel's RX queue.

The blast radius of this pause is what makes "always run gcore on production" wrong. A 4-second pause on a Hotstar streaming pod during the IPL final drops 100 K live viewer connections — Akamai's edge times out at 3 seconds, refuses the reconnect during the next 30 seconds, and your viewer churn metric spikes. The discipline is to drain the pod before gcore: take it out of rotation (no new traffic), wait for in-flight requests to finish (typically 10 seconds), then dump. On Kubernetes that means failing the readiness probe or relabelling the pod so its Service stops routing to it, with terminationGracePeriodSeconds=60 and a preStop hook to trigger the dump if the pod is being terminated anyway. The total wall-clock from "I want a dump" to "I have a dump" is ~70 seconds; the customer-visible pause is zero.

The diagnostic value of pmap and /proc/<pid>/smaps before you dump

Before you spend 6 minutes shipping a 47 GB core to your laptop, spend 30 seconds asking the kernel what's in the process. pmap -X <pid> (or its newer cousin pmap -XX <pid>) lists every memory region in the process: the heap, each thread stack, each mmap'd file, the JVM's compressed-classes space, the GC's card tables, every native shared library. /proc/<pid>/smaps is the same data with finer columns (shared/private, dirty, swapped, anonymous, file-backed). Sort by Pss (proportional set size) and you immediately see whether the leak is on-heap (one giant anonymous region of ~heap_size), off-heap-direct-buffers (multiple anonymous regions of 1 GB each), native-library leak (a .so's data segment growing), or shared-cache-page-thrash (file-backed regions consuming more than expected).

For Karan's reconciliation worker, a pmap -X would have shown a single anonymous region of 41 GB labelled [anon] — a clear "managed heap is the problem" signal. For a hypothetical native-library leak, pmap -X would show a libcrypto.so data segment growing by 10 MB/hour — a signal that no heap dump can detect. The 30 seconds of pmap analysis before the dump narrows the dump strategy: heap dump for managed-memory growth, core dump + native-allocator analysis for native growth. Why this triage matters: a senior engineer who runs pmap first picks the right capture method on the first try. A junior who skips this step often captures a 47 GB heap dump for a native leak — finds nothing actionable in MAT, wastes two hours, and only then runs pmap to discover the leak is in libcrypto. Cheap triage tools come before expensive captures; this is the same principle as the diagnostic ladder in /wiki/wall-debugging-live-systems-is-its-own-skill.
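
A sketch of that 30-second triage as a script: it rolls up the Pss column of /proc/<pid>/smaps by mapping name, so anonymous (managed-heap) growth versus growth attributed to a particular .so or file is visible at a glance. The smaps field names are the real Linux format; the grouping and the "[anon]" label are simplifications.

# smaps_triage.py — sketch: roll up /proc/<pid>/smaps by mapping before you dump.
# Run: python3 smaps_triage.py <pid>      (Linux only; may need the same UID or root)
import sys
from collections import defaultdict

def pss_by_mapping(pid: int) -> dict[str, int]:
    """Sum the Pss column (kB) per mapping name; '[anon]' groups unnamed maps."""
    totals: dict[str, int] = defaultdict(int)
    current = "[anon]"
    with open(f"/proc/{pid}/smaps") as f:
        for line in f:
            if line[0].islower() or line[0].isdigit():
                # Mapping header: "start-end perms offset dev inode [pathname]"
                parts = line.split()
                current = parts[5] if len(parts) >= 6 else "[anon]"
            elif line.startswith("Pss:"):
                totals[current] += int(line.split()[1])   # value is in kB
    return totals

if __name__ == "__main__":
    rollup = pss_by_mapping(int(sys.argv[1]))
    # Largest mappings first: one huge anonymous region points at the managed
    # heap; a growing .so data segment points at a native leak.
    for name, kb in sorted(rollup.items(), key=lambda kv: -kv[1])[:15]:
        print(f"{kb/1024:>10.1f} MB  {name}")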

Encrypting and shipping dumps from regulated environments

For Indian fintech and healthtech (Razorpay, PhonePe, CRED, Practo, mfine), the heap dump is regulated data the moment it leaves the production VPC. Reserve Bank of India's payment-aggregator guidelines and the Digital Personal Data Protection Act (DPDP, 2023) treat the contents of a payments-process memory image as personally-identifiable financial information. The minimum-viable workflow for shipping a dump out of production is: encrypt with a key the analyst's laptop alone holds, upload to a bucket with object-lock and 14-day expiry, log every download access in an audit trail, and never persist the unencrypted form on shared infrastructure. The standard pattern uses age (or gpg, or aws kms encrypt backed by a hardware security module) to wrap the dump before upload — age -r $ANALYST_PUBKEY -o /tmp/dump.age /tmp/recon.hprof && aws s3 cp /tmp/dump.age s3://razorpay-dumps/... — and the analyst decrypts only on their laptop with age -d -i ~/.age-key.txt /tmp/dump.age. The 30 seconds of extra ceremony at capture time prevent a 30-day fire-drill if the bucket is ever compromised. The same discipline applies inside the company too: an SRE who downloads a payments heap dump should know they have a customer-PII liability on their laptop until they delete it, and the team's runbook should include the deletion step explicitly.

Reproduce this on your laptop

# Reproduce the heap-dump demo above on a Linux laptop. (The Python script also
# runs on macOS, but the /proc RSS read and the gcore step below are Linux-only.)
python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip                 # tracemalloc is stdlib, no install needed
python3 heap_dump_demo.py                 # produces /tmp/recon.snap.gz
ls -lh /tmp/recon.snap.gz                 # ~400 KB

# Optional: capture a real core dump of the running script.
ulimit -c unlimited
python3 -c "import time; time.sleep(60)" &
PID=$!
gcore -o /tmp/sleep.core $PID             # writes /tmp/sleep.core.<pid>
file /tmp/sleep.core.$PID                 # ELF 64-bit core file
kill $PID

Where this leads next

Heap and core dumps are the capture side of Part 15's production-debugging toolkit. The next chapters cover the live-process side and the analysis-flow side:

  • /wiki/live-debugging-without-stopping-the-world — py-spy, async-profiler, and rbspy: sampling profilers that capture state without the ptrace pause. When a 70-second drain isn't acceptable.
  • /wiki/flame-graphs-in-production — turning the captured profile into a flamegraph, reading the fat boxes, diffing two flamegraphs to find regressions.
  • /wiki/tracepoints-and-dynamic-instrumentation — bpftrace and bcc: capturing events instead of state, with sub-percent overhead.
  • /wiki/case-memory-leak-that-wasnt — case study walking through a heap dump that looked like a leak but turned out to be an off-heap DirectByteBuffer accumulation, with pmap as the disambiguating tool.

The arc is: capture (this chapter) → analyse (the case studies) → instrument (eBPF) → fix (the runtime / language chapters in Part 13). A senior on-caller can navigate all four of these in a single 90-minute incident; the discipline is knowing which tool to reach for at each rung of /wiki/wall-debugging-live-systems-is-its-own-skill's diagnostic ladder.

A practical next step worth flagging: set up the four capture paths (kernel core_pattern, on-demand gcore/jmap, runtime-triggered OOM dump, continuous heap profiling) in your team's quietest infrastructure week, not during an incident. The week-of-quiet investment is unglamorous — there is no demo at the next sprint review, no metric improves visibly — but it shifts your team's incident-response posture from "improvise capture under pressure" to "execute runbook from muscle memory". The compounding payoff lands six months later when the first real production leak hits and the on-caller produces a clean heap dump in their first 10 minutes, rather than thrashing for an hour while inventing a capture pipeline at 03:14 IST. Each path takes about a day to wire up — the host configuration, the sidecar uploader, the S3 bucket policy, the on-caller runbook entry. By the time you need any of them, the runbook should read "trigger a dump" not "figure out how to trigger a dump". The Razorpay payments SRE team's published incident metrics show the median heap-dump capture-to-analysis time dropping from 47 minutes to 9 minutes after they invested one week in this setup — the bulk of the saving was not in faster capture, but in not having to figure out the capture path during the incident.

A second next step, equally important: schedule a dump-reading drill once a quarter. Pick an old heap dump from a resolved incident, hand it to a junior on-caller without context, and ask them to find the leak in 30 minutes using only Eclipse MAT and the team's runbook. The drill exposes gaps in the runbook (missing tool installation steps, stale links to dump archives) and builds the muscle memory needed for the real incident. A team that drills heap-dump reading quarterly carries the skill across rotations; a team that only reads dumps during incidents loses the skill every time someone leaves the team. Indian on-call cultures that have invested in this drill — Razorpay's payments SRE rotation, Hotstar's streaming-platform team — report that the time-to-first-correct-hypothesis on real leaks drops by roughly a factor of three, and the drill itself takes 2 hours per quarter per engineer, an extremely cheap investment for that compounding return.
