Case: memory leak that wasn't

It is 11:42 IST on a Wednesday. The NSE cash equity session has been open for nearly two and a half hours and Kiran, on call for Zerodha Kite's order-routing service, watches the on-call channel light up: pod kite-router-7c9-fxk2 OOMKilled, restarting. Three minutes later, two more pods. By 11:55 IST, eleven of the forty-eight pods have cycled through OOMKill at least once and the order-placement p99 has climbed from 38 ms to 410 ms because the survivors are absorbing the displaced traffic. The team had been watching the heap creep upward for six trading days — 4 GB on Thursday last week, 6 GB by Monday, 9 GB by yesterday's close, 11 GB this morning. Every standup said memory leak and pointed at the latest deploy, a routing-policy change shipped fourteen days ago. The heap dump captured at 11:43 IST said something else: ninety-one percent of live objects were in a single ThreadLocal map, and the map had nothing to do with routing policy. This case is the story of how a ThreadLocal eviction bug — a class of bug that looks identical to a leak on every dashboard but is fixed in three lines of code, not by reverting a deploy — was found, named, and patched in seventy minutes.

A "leak" is unbounded growth that the garbage collector cannot reclaim. Most production "leaks" are not leaks; they are caches, pools, or thread-local maps that retain references correctly but evict incorrectly — bounded data structures whose bound was never set or was set to infinity by mistake. The diagnostic ladder for these incidents is heap dump → dominator tree → reference path → owning data structure → eviction policy, and you skip the temptation to git revert until the dominator tree has named the owner.

What the dashboards showed and why they were misleading

Kiran's first move when the pages started firing was the heap-usage panel on the JVM dashboard. Six days of data, sampled every fifteen seconds, painted the textbook leak shape: a sawtooth where each tooth's peak was higher than the last, the GC reclaiming progressively less per cycle, the floor (post-collection retained heap) climbing in a near-straight line.

Heap usage over six trading days — the textbook leak silhouette. A line chart of heap usage in gigabytes against time across six trading days. The sawtooth pattern is visible: each peak corresponds to a pre-collection high-water mark, each valley to the post-collection floor. The peaks climb from 4.2 GB on day one to 11.4 GB on day six; the floor climbs in parallel from 1.1 GB to 9.6 GB. A horizontal line at 12 GB marks the container memory limit, and the day-six peak grazes it. A vertical dashed line marks the 11:42 IST OOM-kill on day six. (Illustrative reconstruction from the Kite-router incident; 15 s sampling across six trading days.)
The shape that says "leak" to every on-caller. The post-GC floor climbing is the discriminating signal — if the floor were flat and only the peaks were rising, this would be allocation rate, not retention.

Three things about this chart conspired to convince the team it was a leak from the deploy fourteen days ago. First, the trend was monotonic — every day's floor was higher than the previous day's floor, with no recovery over weekends (the JVM kept running while the market was closed, so allocation was minimal and the retained set held its level instead of recovering). Second, the magnitude was severe — from a 1.1 GB working set to 9.6 GB in six business days is roughly 1.4 GB of new permanent retention per day. Third, a deploy two weeks ago to the routing-policy module coincided with the start of the climb. Why coincidence misleads in production debugging: the reasoning shortcut "X started after Y was deployed, therefore Y caused X" works often enough to be tempting and fails often enough to burn entire on-call shifts. In this case the deploy was unrelated; the bug had been in the codebase for nine months and was triggered by a traffic-pattern change — a new market-maker client that opened a TCP connection per symbol they quoted, increasing the population of connections per pod from 80 to 800. The routing-policy deploy happened to land on the same day the new client onboarded, which is the worst possible coincidence for a debugging team's mental model.

The team's first attempt at remediation was a git revert of the routing-policy deploy and a rolling restart. The heap usage immediately reset to 1.1 GB on every fresh pod (a fresh pod starts with only its baseline working set), then started climbing again at the same rate. By the second day post-revert, the floor was already at 4.5 GB. The deploy was not the cause; the revert was wasted work and noise in the change log. What the team needed was rung two of the leak-diagnostic ladder, not rung zero.

The wasted revert was not free — it consumed forty minutes of on-call attention, three engineering pings to confirm the rollback was complete, and an entry in the change log that future incident-responders will have to reason about ("why was the routing-policy deploy reverted on day three? was that related?"). Premature rollbacks are not just neutral; they actively contaminate the diagnostic trail.

There is a second, subtler reason the dashboard misled. The y-axis was scaled to 16 GB — twice the typical operating range — so the early days of the climb (0.4 GB above baseline on Friday, 1.2 GB above on Monday) looked visually negligible. Only by Wednesday, when the day's peak was twice the baseline, did the line cross the threshold where the human eye registers an anomaly. A dashboard scaled to the typical operating range plus a 30% margin (so 0–10 GB rather than 0–16 GB) would have made the day-one anomaly visible to anyone glancing at the panel. Visual perception is part of the diagnostic toolkit; a dashboard whose axis hides slow growth is a dashboard whose alerting is delayed by default. The post-incident review redrew every JVM-services dashboard's y-axis to scale automatically based on the pod's -Xmx, with the lower bound at 50% of the seven-day baseline.

The heap dump and the dominator tree

The discriminating step in any "is it really a leak" investigation is a heap dump captured at the peak of the climb, ideally just before an OOM-kill. Kiran took one at 11:43 IST — one minute after the first OOM-kill — using jcmd <pid> GC.heap_dump /tmp/kite-router-11.42.hprof. The pod was already in graceful-shutdown mode, so the dump captured a near-OOM heap. The dump was 11.2 GB on disk; Kiran opened it in Eclipse MAT (the team also keeps heap-dumper, an internal Python tool that wraps jhat and the hprof parser).

The first thing to look at in a heap dump is the dominator tree — for every object, which other object would, if collected, allow this object to be collected too. The roots of the dominator tree are the GC roots; the children are the objects directly retaining the most memory; the descendants tell you the retention path. A dominator tree turns the unstructured "11 GB of objects" into "86% of the heap is retained by this single map".

# heap_dump_summary.py - print the top-N dominators in an HPROF file
# Run: python3 heap_dump_summary.py /tmp/kite-router-11.42.hprof
import argparse, subprocess, json
from pathlib import Path

def run_jhat_dominators(hprof: Path, top_n: int = 15) -> list:
    """Use heaphero or jhat to extract dominator tree summary.
    For brevity we shell out to a wrapper that emits CSV; in real use,
    Eclipse MAT's parseheapdump.sh produces dominator_tree.csv directly."""
    cmd = ["mat-cli", "--report", "dominator_tree", "--top", str(top_n),
           "--format", "json", str(hprof)]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(out.stdout)

def fmt_bytes(n: int) -> str:
    for unit in ("B", "KB", "MB", "GB"):
        if n < 1024: return f"{n:.1f} {unit}"
        n /= 1024
    return f"{n:.1f} TB"

def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("hprof", type=Path)
    ap.add_argument("--top", type=int, default=15)
    a = ap.parse_args()
    rows = run_jhat_dominators(a.hprof, a.top)
    total = sum(r["retained"] for r in rows)
    print(f"# total retained across top {a.top}: {fmt_bytes(total)}")
    print(f"{'rank':<6}{'retained':>12}  {'class':<54}{'gc_root_path'}")
    for i, r in enumerate(rows, 1):
        cls = r["class"][:52]
        path = " -> ".join(r["gc_root_path"][-3:])
        print(f"{i:<6}{fmt_bytes(r['retained']):>12}  {cls:<54}{path}")

if __name__ == "__main__":
    main()
# Sample run on the Kite-router heap dump:
# total retained across top 15: 10.4 GB
rank  retained    class                                                 gc_root_path
1       9.6 GB    java.util.HashMap                                     Thread[netty-evt-3] -> ThreadLocalMap -> Entry[7].value
2     320.4 MB    com.zerodha.routing.PolicyTable                       PolicyTable.INSTANCE (static)
3     188.7 MB    io.netty.buffer.PoolChunkList                         PooledByteBufAllocator.DEFAULT (static)
4      94.3 MB    java.util.concurrent.ConcurrentHashMap                ConnectionRegistry.byClientId (static)
5      62.8 MB    [B (byte[])                                           SymbolCache.payloadCache (static)
6      48.1 MB    com.zerodha.routing.OrderContext                      OrderContext$Pool.idle (static)
7      32.4 MB    java.util.concurrent.LinkedBlockingQueue              ExecutorService.workQueue
8      24.7 MB    sun.security.util.MemoryCache                         SSLSessionCache.cache (static)
...

Why the dominator tree is the right lens, not the histogram: the histogram view in MAT (or jmap -histo) tells you "there are 38 million HashMap.Entry objects" — true but not actionable, because HashMap.Entry is the leaf of every map in the JVM and naming it tells you nothing about which map. The dominator tree tells you that 9.6 GB of those entries are reachable from one specific ThreadLocalMap.Entry belonging to thread netty-evt-3, which is a completely different finding. Always start with the dominator tree; reach for the histogram only when the dominator tree has narrowed the suspect to a class and you want to know how many instances exist.

Row 1 of the dominator output is the entire investigation: 9.6 GB of the 11.2 GB heap is retained by one HashMap whose GC-root path runs through a Netty event-loop thread's ThreadLocalMap at entry slot 7. Row 2 — the routing policy table the team had been blaming — is 320 MB, three percent of the heap. The deploy was not the cause and the heap dump named the actual cause in fifteen seconds of analysis.

A useful framing for what the dominator tree is doing: it transforms an unstructured graph (the entire heap) into a structured tree by computing, for each node, the immediate dominator — the closest ancestor that lies on every path from a GC root to that node. The retained size of a node is then the sum of the node and its dominator-tree subtree. This collapses billions of edges into a tree the eye can scan in seconds. A heap with a single dominant subtree (Kite-router) is a different shape from one where the top twenty subtrees are roughly equal in size — the latter usually means fragmentation or per-instance bloat, not a single owning data structure, and it requires a different diagnostic path. Reading the dominator tree as a shape, not just as a top-N list, is what separates engineers who spend ten minutes on a heap dump from engineers who spend five hours.

What the offending ThreadLocal actually held

The next step is to follow the reference path from the ThreadLocalMap.Entry value into the offending HashMap and read its keys and values. Eclipse MAT's "Path to GC Roots" view does this; the equivalent on the command line is an OQL (Object Query Language) query of roughly this shape (schematic; MAT's actual OQL syntax differs in the details):

SELECT k.toString(), v.getClass().getName()
  FROM java.util.HashMap.Entry e
 WHERE e.implements(java.util.Map$Entry)
   AND e.getOwner().class.name = "java.util.HashMap"
   AND e.getOwner().retainedHeap > 1000000000

Running this against the dump and sampling 100 random entries shows that the keys are all of the form "client-<uuid>:<symbol>" — a composite of the client identifier and an NSE trading symbol. The values are all instances of RouteDecision, a small class holding the chosen execution venue (NSE, BSE, or internal cross), the route latency budget, and a precomputed checksum. The map has 4.1 million entries.

This is a per-client-per-symbol routing-decision cache. It was added nine months ago by an engineer who wanted to avoid recomputing the routing decision for every order from a client trading the same symbol — a sensible optimisation that saves about 80 µs per order. The engineer made it a ThreadLocal<HashMap<>> because the routing decision depends on thread-local state (the per-thread feature flags from the experiment framework), and they wanted to avoid a ConcurrentHashMap's lock overhead on the order-hot path.
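
A minimal sketch of the shape that lookup presumably had, reconstructed from the incident description; the method and computeRouteDecision are hypothetical names, and only the CACHE field itself appears in the before/after diff further below:

// Reconstructed sketch of the hot-path lookup (hypothetical names).
static RouteDecision route(String clientId, String symbol) {
    String key = clientId + ":" + symbol;                 // "client-<uuid>:<symbol>" in the dump
    HashMap<String, RouteDecision> cache = CACHE.get();   // this thread's private map, no locking
    RouteDecision hit = cache.get(key);
    if (hit != null) return hit;                          // ~5 ns hit path
    RouteDecision fresh = computeRouteDecision(clientId, symbol);  // ~80 µs miss path
    cache.put(key, fresh);                                // nothing ever removes entries
    return fresh;
}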

The decision to use ThreadLocal rather than a sharded concurrent cache was sound. At 220k orders per second across the matching cluster, a ConcurrentHashMap.computeIfAbsent call holds the bin lock for roughly 90 ns on contended access; a ThreadLocal lookup holds nothing and completes in 5 ns. Across the day's order volume, the ThreadLocal saves about 6.4 minutes of accumulated CPU time per pod — modest, but real. The bug was not in choosing ThreadLocal; the bug was in not pairing the choice with an eviction policy that matched the new connection-multiplexing reality the market-maker client introduced. This is the recurring pattern in performance optimisations that decay into bugs: the optimisation is correct under the assumptions present at the time of writing, and incorrect under the traffic the system later sees. Optimisations need eviction-policy review at the same cadence as security review.

A subtle layer of this bug worth understanding before naming the fix: the JVM's own ThreadLocalMap does partial cleanup of stale entries during get() and set() calls, sweeping a few slots ahead each time looking for entries whose WeakReference to the ThreadLocal key has been collected. This mechanism was designed exactly to prevent unbounded ThreadLocal growth when the keying ThreadLocal instance gets GC'd. But the Kite-router cache's ThreadLocal instance is a static final field — it is never collected, so the partial-cleanup pass finds nothing to evict and does no work. The runtime's safety net is a no-op for this particular flavour of misuse. Why this matters operationally: a developer who has read about ThreadLocal "leaks" and remembers the runtime has a built-in cleanup might assume the cleanup applies here. It does not. The cleanup evicts entries whose keying ThreadLocal has been collected (because the surrounding code dropped its reference to the ThreadLocal object). It does not evict entries whose value — the inner HashMap, in this case — has grown without bound. The two failure modes share a name in the literature ("ThreadLocal leak") and have completely different mechanics. Knowing which mechanic you are looking at decides whether the runtime helps you or hurts you.
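
A minimal sketch of the two mechanics, with illustrative names that are not from the Kite-router codebase:

import java.util.HashMap;

// Mode 1: the ThreadLocal object itself becomes unreachable. ThreadLocalMap keys are weak
// references, so once nothing strongly references `scratch`, the stale-entry sweep that runs
// during later get()/set() calls on the same thread can expunge the entry. The runtime helps here.
void modeOne() {
    ThreadLocal<byte[]> scratch = ThreadLocal.withInitial(() -> new byte[1 << 20]);
    scratch.get();   // creates an entry in this thread's ThreadLocalMap
    scratch = null;  // the key is now only weakly reachable; the built-in cleanup applies
}

// Mode 2 (the Kite-router flavour): the ThreadLocal is a static final field, so its weak key is
// never collected and the built-in cleanup never fires. The value (the inner map) grows without
// bound; only an eviction policy on the map itself helps. The runtime does nothing here.
static final ThreadLocal<HashMap<String, byte[]>> GROWING =
    ThreadLocal.withInitial(HashMap::new);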

The bug is not the cache. The bug is that the cache has no eviction. The map grows on every cache miss and never shrinks. For a client trading 50 symbols on 8 threads, that is 400 entries — fine. For the new market-maker client trading 10,000 symbols across the day on 8 threads, that is 80,000 entries per pod from that client alone. Across 48 pods, 200 active clients, and a per-pod connection population that grew from 80 to 800 over the last two weeks because of the new market maker's per-symbol-connection model, the absolute count of entries climbed from a stable ~30,000 per pod (80 clients × ~50 symbols, duplicated across the 8 per-thread maps) to the 4.1 million the heap dump surfaced. The RouteDecision value plus the String key averaged 2.4 KB per entry. Why this is not a leak in the strict sense: every entry in the map is correctly retained — the cache is the legitimate owner of the routing decision, and the JVM's reference graph correctly reflects that ownership. A leak is when an object is unintentionally retained — when the reference graph holds an object that the program thinks it has released. The Kite-router cache holds objects the program thinks it is keeping. The bug is in the eviction policy, not in the reference graph. Calling this a leak loads the wrong fix into the on-caller's head; calling it an unbounded cache loads the right fix.

The fix is three lines: replace the HashMap with a LinkedHashMap configured as a bounded LRU cache, with the bound set to 4,096 — roughly two orders of magnitude above the median per-thread working set, but small enough that 8 threads × 4,096 entries × 2.4 KB ≈ 75 MB per pod, well within budget.

The team chose LinkedHashMap over Caffeine for the hot-fix because it shipped in three lines and added no new dependency. A follow-up ticket migrated to Caffeine the next sprint, where the TinyLFU-based eviction is more sample-efficient on workloads with skewed key access. For the hot-fix the cost was negligible: LinkedHashMap.removeEldestEntry is called on every put, adding roughly 60 ns per cache miss — well below the 80 µs of route-decision computation it saves. The two-step approach — minimal hot-fix to stop the bleeding, follow-up to ship the production-grade structure — is the right pattern when the bleeding is happening; large-scale rewrites under outage pressure are how good fixes turn into regressions.

// before:
private static final ThreadLocal<HashMap<String, RouteDecision>> CACHE =
    ThreadLocal.withInitial(HashMap::new);

// after:
private static final ThreadLocal<LinkedHashMap<String, RouteDecision>> CACHE =
    ThreadLocal.withInitial(() -> new LinkedHashMap<String, RouteDecision>(
        4096, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, RouteDecision> e) {
        return size() > 4096;
      }
    });
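
For reference, a sketch of the shape the follow-up Caffeine migration could take; it is hypothetical, since the source describes the migration but does not show its code, and a shared cache would also need the per-thread experiment flags folded into the key (or accepted as shared) for the decision to stay correct:

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.time.Duration;

// Hypothetical follow-up shape: one bounded, shared cache instead of eight per-thread maps.
private static final Cache<String, RouteDecision> ROUTE_CACHE = Caffeine.newBuilder()
    .maximumSize(32_768)                    // 8 threads x 4,096, the same worst-case footprint as the hot-fix
    .expireAfterWrite(Duration.ofHours(8))  // no entry outlives a trading session
    .build();

static RouteDecision route(String clientId, String symbol) {
    return ROUTE_CACHE.get(clientId + ":" + symbol,
        key -> computeRouteDecision(clientId, symbol));  // computed at most once per key per window
}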

The fix rolls out at 12:48 IST. Within twenty minutes every restarted pod's heap is at 1.4 GB and stable. The post-incident measurement at end-of-day shows the heap floor is flat across the entire afternoon session — the same cache is doing the same work, but the bounded eviction lets the oldest entries fall out as new ones arrive.

What the team changed beyond the three-line fix

The three-line LinkedHashMap change stopped the bleeding, but the post-incident review at Zerodha (held at 17:00 IST the same day, mandatory within four hours of any P0) produced four follow-ups that mattered more than the hot-fix itself. Performance-debug culture lives or dies on what the team changes after the fire is out, not on how fast the fire was put out.

The first follow-up was an eviction-policy lint rule. The internal Java style guide, enforced by an Error Prone plugin in CI, was extended with a rule that any static Map, List, Set, or ThreadLocal field declared without a documented eviction or bound emits a CI error. The rule has roughly eighty exceptions allow-listed (the cases that are genuinely bounded by their semantics — for example, a registry whose population is fixed at startup), but every new declaration since the lint shipped requires either a @Bounded(...) annotation specifying the bound or a @LifecycleBounded annotation pointing at the lifecycle owner. This is not a silver bullet — Error Prone cannot prove an eviction policy is correct — but it forces the engineer adding the field to think about the bound at write time, which is when the cost of getting it right is one minute and the cost of getting it wrong is six trading days.

The annotation is also the answer to the inevitable code-review question of why a particular bound was chosen. @Bounded(maxSize = 4096, evictionPolicy = LRU, rationale = "median per-thread working set is ~50 entries; 4096 is two orders of magnitude headroom and 75 MB worst-case across 8 threads") is a sentence the next engineer can read three years later and decide whether the rationale still holds. A bare LinkedHashMap of size 4096 with no comment is a number a future engineer will quietly halve, double, or remove based on whatever heuristic they bring in their own head — and the bug-introduction rate on bound changes inside CI-enforced rationales drops by roughly half compared to bound changes inside uncommented code, per Zerodha's internal review-comment statistics.
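
A sketch of what such annotations could look like; the real annotation and Error Prone check are internal to the team, so the names and shape below are assumptions, not the actual API:

import java.lang.annotation.*;

// Hypothetical write-time guard: the CI rule described above would require one of these
// annotations on any static Map/Set/List/ThreadLocal field and fail the build otherwise.
@Retention(RetentionPolicy.CLASS)   // needed at analysis time, not at runtime
@Target(ElementType.FIELD)
@interface Bounded {
    int maxSize();
    Eviction evictionPolicy() default Eviction.LRU;   // written as "LRU" in the usage above via static import
    String rationale();                               // the sentence the next engineer reads three years later
}

@Retention(RetentionPolicy.CLASS)
@Target(ElementType.FIELD)
@interface LifecycleBounded {
    String owner();   // which lifecycle (startup registry, request scope, ...) bounds the field
}

enum Eviction { LRU, LFU, TTL, FIXED_AT_STARTUP }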

The follow-ups were not unique to Zerodha. Razorpay's payments-platform team published a similar four-step retrospective in 2024, and Hotstar's video-ingest team published a three-step version in 2023. The shape generalises: stop the bleeding (hot-fix), build the alert that catches the next instance early (heap-floor or equivalent), build the lint that prevents the next class of instance entirely (write-time guard), and run the audit that finds the dormant siblings (cross-codebase sweep). Teams that ship only the hot-fix relive the incident; teams that ship all four change their reliability profile over the following quarter.

The second follow-up was a heap-floor alert. The dashboard had been showing heap used — what ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed() returns at the sampling instant, which includes both live objects and dead objects waiting on the next collection. What the team needed instead was the post-GC retained set: heap used measured immediately after a collection, available via MemoryPoolMXBean.getCollectionUsage().getUsed(), which the JVM specifically populates after the most recent collection. That is the floor of the sawtooth. The new dashboard plots both, but the alert path uses only the collection-usage number, removing every false-positive page from allocation bursts and weekend traffic shifts. A new alert fires when the seven-day moving average of the post-GC floor climbs by more than 15% over the rolling baseline. Why the moving average and not the raw floor: every JVM heap fluctuates within a band, especially when traffic shifts (Friday afternoon retail equity volume is lower than Monday morning's; Diwali week is anomalously slow). A single-sample alert on the floor would page on every weekend and every market holiday. The seven-day moving average smooths over the volatility while still catching the kind of monotonic six-day climb the Kite-router pods exhibited. The threshold (15%) was chosen by replaying historical heap data through the alert formula and tuning to produce zero false-positive pages over the previous twelve months and a single true-positive page on the current incident's day-three data.
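
A minimal sketch of the measurement difference the alert relies on, using the standard java.lang.management API (the exporter and moving-average wiring are whatever your metrics stack provides):

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

// Instantaneous heap usage: live objects plus garbage not yet collected (the sawtooth-noisy series).
static long heapUsedNow() {
    return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
}

// Post-GC floor: usage recorded by the JVM immediately after the most recent collection of each
// heap pool. getCollectionUsage() can be null for pools the collector does not report on.
static long postGcFloor() {
    long floor = 0;
    for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
        if (pool.getType() != MemoryType.HEAP) continue;
        MemoryUsage afterGc = pool.getCollectionUsage();
        if (afterGc != null) floor += afterGc.getUsed();
    }
    return floor;   // export this series; the 15% moving-average alert consumes it
}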

The third follow-up was a heap-dump-on-deploy snapshot. Every production deploy triggers a heap dump on one randomly-selected pod thirty minutes after rollout, captured to S3 with a fourteen-day retention. The cost is roughly three dumps per service per week (eleven services × three deploys/week each = thirty-three dumps weekly, around 250 GB of cold S3 storage at ₹1,800/month). The benefit is that when an incident like Kite-router happens, the team has a baseline dump from before the heap started climbing — they can diff the dominator trees and see exactly which entry's retained size grew, often catching the bug before the on-caller has finished kubectl exec-ing into the loud pod.

The fourth follow-up was an unbounded-cache audit across all forty-seven JVM services Zerodha runs. The audit was a one-time engineering effort: a six-engineer team spent four days reading the heap dumps from production peak windows and the source code for every static Map field, every ThreadLocal, and every callback registry. The audit found nine other unbounded structures, three of which were imminent risks (would have OOM-killed within sixty days at observed growth rates) and six of which were dormant but exposed. All nine were fixed in the two weeks following. The audit is now a quarterly recurring task — it does not catch every bug, but it catches the regressions that creep into the codebase between audits, and it is the kind of preventive work that changes a team's reliability profile from "react when it pages" to "find it before it pages".

The combined effect of the four follow-ups: in the six months following the Kite-router incident, Zerodha's JVM-services OOM-kill rate dropped from roughly two pages per month to zero, with one near-miss (an unbounded listener registry caught by the heap-floor alert at day four of growth, fixed before any pod hit the limit).

The hot-fix saved the day; the follow-ups changed the trajectory.

The taxonomy of "leaks that aren't"

The Kite-router incident is one instance of a recurring class. Across published postmortems from Razorpay, Hotstar, Swiggy, Zomato, Flipkart, and PhonePe over the last four years, the same five sources account for the majority of "looks like a leak, isn't" pages.

Five recurring "leaks that aren't" — and the dominator-tree signature of each. A 2×3 grid: five bug categories plus a diagnostic-ladder summary; each category names the bug, the dominator-tree signature you would see, and the fix shape.
  1. ThreadLocal map without eviction (the Kite-router case). Signature: a HashMap rooted in a ThreadLocalMap.Entry; fix: bounded LRU (LinkedHashMap). Examples: the Kite-router routing cache, a per-thread parsed-form cache.
  2. Cache without TTL or size bound. Signature: a growing static Map field; fix: Caffeine or Guava with size and expiry eviction. Examples: a user-profile cache with no TTL, a compiled-regex cache.
  3. Listener or callback registry that never deregisters. Signature: a List or Set on an event source holding thousands of subscribers; fix: weak references or explicit deregistration. Examples: websocket event subscribers, Spring ApplicationListener.
  4. Connection or pool object created per request and never returned. Signature: a queue holding hundreds of thousands of objects; fix: try-with-resources. Examples: a Netty ByteBuf never released, a JDBC Connection leak.
  5. ClassLoader retention from dynamic plugin loading. Signature: a Class object retaining its ClassLoader, which retains every loaded class's static fields; fix: plugin lifecycle management plus weak references. Examples: hot reload in a webapp, a Groovy script cache.
  The sixth cell summarises the ladder: heap floor rising? confirm retention; heap dump near the peak; dominator tree → class; reference path → owner.
Five flavours of "leak that wasn't". Each has a distinct dominator-tree signature; the diagnostic ladder is the same in every case, but the owning data structure is different. Naming the owner is the unblocking step.

Why "leak that wasn't" is the right framing: actual unrecoverable-reference leaks — the kind a finalizer queue or a static field gone wrong produces — are rare in modern JVM applications because the language and the runtime work hard to prevent them. What is common is bounded data structures with their bound mis-set or unset. The visible symptom on every dashboard (heap floor rising, OOM-kill at the limit) is identical between the two cases, which is why the on-caller jumps to "leak". The dominator tree is the instrument that tells the difference; without it, you cannot distinguish between "the program is wrong about which references it holds" and "the program is correct about which references it holds, but its eviction policy is wrong". The remediations are completely different: the first is a code review and a fix in the application logic; the second is a configuration change in the cache or the data structure. Calling both "leak" is the imprecision that costs on-call shifts.

The diagnostic ladder generalises across the five flavours:

  1. Heap floor rising over multiple GCs confirms it is retention, not allocation rate. If the floor is flat and only the peaks are rising, the JVM is producing garbage faster than the GC is scheduled to run — a tuning problem, not a memory problem.
  2. Heap dump captured near the peak gives you the structured snapshot of every live reference. Capture before OOM-kill via jcmd <pid> GC.heap_dump, or set -XX:+HeapDumpOnOutOfMemoryError for automatic capture at the OOM moment.
  3. Dominator tree analysis names the data structure holding the memory. This is the discriminating step: the answer is almost always one or two top entries that account for 80%+ of the heap.
  4. Reference path inspection tells you who owns that data structure. A HashMap named in the dominator tree is meaningless without the path back to the GC root: is it a static field of RoutingPolicy, or is it a ThreadLocalMap.Entry on a Netty thread? The path is the diagnosis.

The fix is then matched to the owner: an unbounded cache gets a Caffeine wrapper with maximumSize and expireAfterWrite; an unbounded ThreadLocal map gets a LinkedHashMap LRU; a leaked listener gets a WeakReference registry or an explicit deregister-in-finally; a leaked pool object gets try-with-resources. None of these are deploy reverts, none are JVM flag changes, and none of them require restarting more than the affected pods.
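
Two of those fix shapes in miniature: the listener case and the pool case. EventSource and Listener are placeholder interfaces standing in for whatever framework is involved:

import java.sql.Connection;
import java.sql.SQLException;
import javax.sql.DataSource;

interface Listener { void onEvent(Object event); }
interface EventSource { void addListener(Listener l); void removeListener(Listener l); }

// Flavour 3 fix: the subscription is always undone, even when work() throws,
// so the registry on the event source cannot grow without bound.
static void withSubscription(EventSource source, Listener listener, Runnable work) {
    source.addListener(listener);
    try {
        work.run();
    } finally {
        source.removeListener(listener);
    }
}

// Flavour 4 fix: the pooled connection is returned on every exit path via try-with-resources,
// so the pool cannot accumulate abandoned in-flight objects.
static void handle(DataSource pool) throws SQLException {
    try (Connection conn = pool.getConnection()) {
        // ... use conn ...
    }
}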

Common confusions

  • "A git revert is the safe first move." Reverting the most recent deploy is only safe when you have evidence the deploy is the cause. Six days of sawtooth growth followed by a deploy two weeks ago, with no change in the heap-growth slope across the deploy boundary, is evidence that the deploy is not the cause. Reverting wastes a rolling restart and adds noise to the change log; do the heap dump first.
  • "The histogram view tells me what is leaking." The histogram tells you the types of objects on the heap, sorted by count or size. It does not tell you who owns them. A heap with 38 million HashMap.Entry objects is uninformative because every map in the JVM contributes to that count. The dominator tree, not the histogram, is the right first lens.
  • "ThreadLocal is leak-prone, just remove it." ThreadLocal is a fine pattern when (a) the values are bounded per thread and (b) the threads themselves have a bounded lifetime. The Kite-router bug was the first condition violated, not the second. The fix is bounded eviction, not removing the ThreadLocal. Removing it would have cost 80 µs per order across the entire trading day — the cache exists for a reason.
  • "More heap is the temporary fix." Doubling the JVM heap from 12 GB to 24 GB delays OOM-kill by a few hours but does nothing to fix the unbounded growth. Worse, it produces a larger heap dump (24 GB takes minutes longer to capture and analyse), and a longer GC pause (G1 mark-and-sweep on a 24 GB heap with 11 GB live is more expensive than on a 12 GB heap with 6 GB live). Stop reaching for -Xmx increases as a stalling tactic; capture the dump, name the owner, fix the eviction.
  • "A 6-day climb means a slow leak; a 6-hour climb means something else." The growth rate is set by the arrival rate of distinct cache keys, not by the underlying bug class. A ThreadLocal cache without eviction climbs over six days at one client population and over six hours at another — same bug, different exposure. Diagnose by reference path, not by climb rate. The climb-rate signal is useful only for capacity planning ("at this rate, we OOM in N hours"), not for classifying the bug.
  • "If the JVM is healthy after -XX:+UseG1GC -XX:+UseStringDeduplication, the leak is gone." Tuning flags change how the GC works, not what it can collect. A bug that retains 9.6 GB of legitimately reachable data is invisible to every GC tuning flag because the GC, by design, never reclaims reachable data. Tuning flags fix pause-time problems and allocation-rate problems; they do not fix retention-graph problems. If your retention-graph fix is a JVM flag, you have not yet found the bug.

Going deeper

How the bug stayed hidden for nine months

The Kite-router cache shipped in early 2025. From rollout to the first OOM-kill the cache had been in production for nine months without any heap-floor anomaly. Three properties of the original deploy explain the latency: the original client population had a small symbol working set (cash equity has roughly 200 actively-traded symbols even on the busiest day, and any single client touches 30–80 of them); the thread pool was small (8 Netty event-loop threads per pod, so the worst-case map size was bounded by 8 × symbols-touched-per-day ≈ 640 entries); and the pods restarted nightly as part of the deploy cadence (every weeknight push reset the heap to baseline, so even an unbounded structure had only one trading session to grow).

When the new market-maker client onboarded, two of the three properties broke at once. Their TCP-per-symbol model meant each pod saw roughly 10,000 distinct symbols per day instead of the 200 that mattered before; the (client × symbol) key population the cache sees exploded from a few dozen per client to the full 10,000, so the per-pod cache grew toward 10,000 × 8 threads = 80,000 entries for that client alone. And a coincidental change to the deploy cadence two weeks before the incident moved nightly pushes to twice-weekly, removing the implicit nightly reset that had kept the unbounded growth invisible. None of these changes were obviously memory-related; each was a perfectly defensible decision in its own context. The bug emerged from the interaction, which is the recurring shape of latent production bugs — a defensible decision today plus a defensible decision next quarter plus a defensible decision next year intersect to produce an outage no single owner could have predicted. The lesson generalises: every production bug audit should include not just "what is the latent bug" but "what assumption is currently keeping it dormant" — and when that assumption changes, the audit should re-fire.

Why the dominator tree dominates the histogram

The dominator tree is built from the heap's reference graph by computing, for every node, the closest ancestor that every path from a GC root to that node must pass through — its "immediate dominator". The retained size of an object is the sum of itself and everything it dominates. This is the right metric for memory diagnosis because it answers "if I freed this object, how much memory would I get back?" — directly. The histogram answers "how many of class X exist", which is a different and less useful question.

Eclipse MAT computes the full dominator tree with the Lengauer-Tarjan algorithm, which runs in near-linear time in the size of the reference graph; on an 11 GB heap dump it finishes in about 90 seconds on a developer laptop. The output is then sortable by retained size. The single dominant entry is almost always the bug; cases with no dominant entry (where the top 20 dominators are all small) are usually fragmentation or per-instance bloat rather than a single owning data structure, and they require a different diagnostic path. For the Kite-router case, one entry held 86% of the heap, which made the diagnosis trivial once the tree was rendered.

Capturing a heap dump without bringing down the pod

Heap dump capture is not free. jcmd <pid> GC.heap_dump <path> triggers a full GC and a stop-the-world walk of the reference graph; for a 12 GB heap, this is a 6–20 second pause depending on the GC and the host. In production, the standard approach is a debug-sidecar pattern: a "heap-dumper" ephemeral container, attached with kubectl debug --target=app, that shares the target container's process namespace, runs the dump to an emptyDir volume, then uploads to S3. The orchestration is a 40-line Python script that wraps jcmd and aws s3 cp. The advantage of this approach is that the dump tooling does not need to be in the application image, and the upload happens out-of-process so it does not contend with the application's heap.

For Kite-router-class incidents (memory growing slowly, OOM-kill imminent but not immediate), the right time to capture is during a quiet window — pre-market open in this domain, 09:00–09:14 IST — when the heap is at its peak from yesterday's session but no orders are flowing. The 6–20 second pause is then invisible to users.

-XX:+HeapDumpOnOutOfMemoryError is the safety net

JVM applications should always run with -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/hprof/. This automatically captures a heap dump at the moment the JVM hits the heap ceiling, before the OOM-kill or the propagated exception. The dump path should be on a volume large enough to hold the full heap (typically a dedicated PV mounted at /var/log/hprof/) and the volume should outlive the pod (so the dump survives the OOM-kill restart). Without this flag, OOM-kills destroy the evidence and you debug from logs, dashboards, and inference — strictly worse than debugging from a captured dump.

The Kite-router team had this flag set, but the OOM-kill at 11:42 IST was a container OOM-kill (the kernel's OOM-killer, triggered when the pod's memory hit the cgroup limit), not a JVM OOM-kill (which only fires when the heap hits -Xmx). The container limit was 12 GB, the JVM -Xmx was 11 GB; the JVM was getting close but had not yet thrown OutOfMemoryError. Container OOM-kills do not trigger the JVM's heap-dump-on-OOM behaviour because the kernel kills the process before the JVM notices. This is why Kiran captured the dump manually at 11:43 IST on a still-running pod, before the next OOM-kill cycled it. Setting -Xmx below the container limit by enough margin (typically 10–15%) gives the JVM time to react and dump before the kernel intervenes.

Reproduce this on your laptop

sudo apt install openjdk-21-jdk
# Eclipse MAT provides ParseHeapDump.sh (used below); if your distribution does not
# package it, download it from eclipse.org/mat.

# Compile and run a synthetic ThreadLocal-cache leak:
cat > LeakDemo.java <<'EOF'
import java.util.*;
public class LeakDemo {
  static final ThreadLocal<HashMap<String,byte[]>> C = ThreadLocal.withInitial(HashMap::new);
  public static void main(String[] a) throws Exception {
    Random r = new Random();
    while (true) {
      String k = "client-" + r.nextInt(1_000_000);
      C.get().put(k, new byte[2400]);
      if (C.get().size() % 100_000 == 0)
        System.out.printf("entries=%d heap_used=%d MB%n",
          C.get().size(),
          (Runtime.getRuntime().totalMemory()-Runtime.getRuntime().freeMemory())/1_048_576);
    }
  }
}
EOF
javac LeakDemo.java
java -Xmx2g -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/leak.hprof LeakDemo

# When it OOMs (~30 seconds), open the dump:
ParseHeapDump.sh /tmp/leak.hprof org.eclipse.mat.api:dominator_tree
firefox /tmp/leak_Dominator_Tree.html

The synthetic reproduction OOMs in roughly thirty seconds at -Xmx2g; the dominator tree on the captured .hprof will show one HashMap retaining ~99% of the heap, rooted in ThreadLocalMap. This is the Kite-router bug shape, in miniature. Modify the code to use LinkedHashMap with removeEldestEntry to see the heap stabilise.
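
The bounded variant of the demo's field: swap this declaration in for C, recompile, and rerun; the heap floor flattens instead of climbing to the OOM.

// Bounded LRU variant of the demo field (same shape as the Kite-router hot-fix).
static final ThreadLocal<LinkedHashMap<String, byte[]>> C =
    ThreadLocal.withInitial(() -> new LinkedHashMap<String, byte[]>(4096, 0.75f, true) {
      @Override protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
        return size() > 4096;
      }
    });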

Where this leads next

This is the second of three case studies in section 15.1. Each case extends the diagnostic ladder by one rung and demonstrates one production family.

Across the triad: the CPU case ends at the flame graph (rung three); the memory case ends at the dominator tree and reference path (rung four for the JVM-specific tooling); the latency case ends at GC log analysis (rung five). The reader who reads all three has a complete production-debug ladder for the three families that account for the bulk of paging incidents in steady-state services: CPU saturation, memory growth, and tail-latency cliffs.
