Wall: some overheads are invisible

Karan runs the rewards engine at CRED. The service does one job: take a transaction event from the Razorpay-lookalike rail it consumes, look up the user's reward state, decide whether the transaction earns coins, write the new state back. He benchmarked his Rust allocator on his MacBook last week — jemalloc at 28 ns per allocation, p99 at 84 ns, no surprises. He pushed the new build to the c6i.4xlarge production fleet on Tuesday at 14:30 IST. By 14:34 the dashboard p99 had climbed from 6 ms to 31 ms. CPU was unchanged. The allocator's own self-reported stats said 31 ns per allocation — essentially the laptop number. Karan opened a flamegraph and found 18% of CPU time inside the kernel: __handle_mm_fault, clear_page_erms, __rmqueue_pcplist. None of these names appeared in his microbenchmark. They had been there the whole time on the laptop too — but at the laptop's allocation rate of a few thousand per second, the per-call overhead was zero. At production's 4.2 million allocations per second across 64 cores, the same per-call overhead became the single largest line item on the profile.

A microbenchmark measures what a function does; production measures what the function asks the kernel and the hardware to do on its behalf. The gap is the invisible overhead — TLB misses, first-touch page zeroing, VMA lock contention, NUMA round-trips, and the cache fills that profilers attribute to the wrong frame. Knowing the allocator is necessary, but knowing the costs it hides from you is what separates a microbenchmark from a production fix.

Five overheads your microbenchmark cannot see

A microbenchmark in a tight loop on one core, with a warm cache and a warm TLB, measures one thing: the cost of the allocator's own code path. A real service running at 4 million allocations per second across 64 cores measures something else: the sum of the allocator's code path plus every cost the allocator pushes downstream. The two numbers diverge by 50–500× in production, and the divergence is structural, not accidental.

TLB misses on freshly-touched pages. The allocator returns a pointer to a page the CPU has never accessed. The first instruction that touches the page generates a TLB miss — the page-walk takes 30–100 cycles even for a 4 KB page. Without 2 MB huge pages (the common case for an untuned heap), a 64 MB working set touches 16,384 pages, each a potential TLB miss. On Skylake-X the second-level TLB holds 1,536 entries; you start thrashing it the moment your working set exceeds 6 MB.
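
The arithmetic is worth making concrete. A few lines of Python reproduce the paragraph's numbers — the 1,536-entry figure is the second-level TLB size quoted above and varies by microarchitecture, so treat it as an illustrative parameter, not a probed value.

# tlb_coverage.py — back-of-the-envelope TLB coverage vs working-set size.
# Entry count is illustrative (second-level TLB sizes differ per CPU); swap in
# the figure from your part's optimisation manual.
PAGE_4K, PAGE_2M = 4 * 1024, 2 * 1024 * 1024
STLB_ENTRIES = 1536                      # the Skylake-X figure used in the text
WORKING_SET  = 64 * 1024 * 1024          # the 64 MB working set from the text

def coverage(entries: int, page_size: int) -> int:
    """Bytes the TLB can map before it starts evicting entries."""
    return entries * page_size

def pages(working_set: int, page_size: int) -> int:
    """Distinct pages a working set spans — each first touch is a potential miss."""
    return -(-working_set // page_size)  # ceiling division

print(f"coverage, 4 KB pages : {coverage(STLB_ENTRIES, PAGE_4K) / 2**20:.0f} MB")
print(f"coverage, 2 MB pages : {coverage(STLB_ENTRIES, PAGE_2M) / 2**30:.0f} GB")
print(f"64 MB working set    : {pages(WORKING_SET, PAGE_4K):,} 4 KB pages to touch")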

First-touch page zeroing. When mmap(MAP_ANONYMOUS) returns a page, the kernel hasn't physically allocated anything — the page table entry points to the global zero page. The first write triggers a minor page fault: __handle_mm_fault allocates a real page, zeros it (clear_page_erms — a 4 KB streaming write), and updates the PTE. The cost is ~1.5 µs per page on a c6i. Your allocator, which "took 30 ns", just queued up 1.5 µs of kernel work that runs at the next memory access.
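
The fault bill is easy to watch arriving with nothing but the standard library. A minimal sketch, Linux-only, with an arbitrary 64 MB buffer: ru_minflt from getrusage counts the minor faults the process has taken so far, so bracketing a first-write loop with two readings shows roughly one fault per fresh page.

# first_touch_faults.py — count minor faults caused by first-touching fresh pages.
# Linux-only sketch; 64 MB is an arbitrary illustration size.
import mmap, resource

SZ = 64 * 1024 * 1024                         # 64 MB of fresh anonymous memory
buf = mmap.mmap(-1, SZ)                       # mapped, but no physical pages yet

before = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
for off in range(0, SZ, 4096):                # write one byte per 4 KB page
    buf[off] = 1
after = resource.getrusage(resource.RUSAGE_SELF).ru_minflt

print(f"pages touched : {SZ // 4096:,}")
print(f"minor faults  : {after - before:,}")  # roughly one fault per page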

VMA lock contention on mmap. Every process has a tree of virtual memory areas (VMAs) describing its address-space layout. mmap and munmap mutate this tree under mm->mmap_lock (formerly mm->mmap_sem). At low concurrency the lock is uncontended. At 64 threads all expanding their per-thread arenas, the lock serialises them. Glibc's ptmalloc releases memory back via munmap aggressively under fragmentation; the resulting lock storm shows up as kernel CPU at 30–50%, all of it spinning or sleeping on mmap_lock.
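
Reproducing the storm from Python takes a little care because of the GIL, but ctypes releases the GIL around foreign calls, so threads calling mmap/munmap directly do enter the kernel concurrently and contend on mmap_lock — throttled by the interpreter, but visibly. A sketch, with arbitrary sizes and flag constants hard-coded for x86-64 Linux:

# mmap_lock_storm.py — put concurrent pressure on mm->mmap_lock from Python.
# Flag constants are hard-coded for x86-64 Linux; sizes and counts are arbitrary.
import ctypes, threading

libc = ctypes.CDLL("libc.so.6", use_errno=True)
libc.mmap.restype = ctypes.c_void_p
libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                      ctypes.c_int, ctypes.c_int, ctypes.c_long]
libc.munmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t]

PROT_READ, PROT_WRITE      = 0x1, 0x2
MAP_PRIVATE, MAP_ANONYMOUS = 0x02, 0x20
SZ, ROUNDS, THREADS        = 1 << 20, 20_000, 16   # 1 MB maps, 16 threads

def storm() -> None:
    for _ in range(ROUNDS):
        p = libc.mmap(None, SZ, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0)
        libc.munmap(p, SZ)       # every mmap/munmap pair mutates the VMA tree under mmap_lock

threads = [threading.Thread(target=storm) for _ in range(THREADS)]
for t in threads: t.start()
for t in threads: t.join()
# Observe it with the wrapper the chapter already uses:
#   perf lock contention -- python3 mmap_lock_storm.py
# or watch perf top — rwsem/down_write symbols climbing is the mmap_lock storm.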

NUMA cross-socket round-trips. On a two-socket EPYC box, memory allocated on socket 0 and accessed from a thread on socket 1 costs ~190 ns per cache line vs ~78 ns local. The allocator's "first-touch" policy means whichever core wrote the page first owns the page's NUMA placement. If your service threads migrate across sockets — the default on most schedulers — half your accesses become remote. Your microbenchmark, pinned to one core on the laptop, never sees this.
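
numastat -p is the quick answer; when you want the raw placement, each line of /proc/<pid>/numa_maps carries N<node>=<pages> tokens that can be summed directly. A hedged sketch assuming that file format (the file exists only on NUMA-enabled kernels, and reading another user's pid needs the usual ptrace-level permission):

# numa_placement.py — sum a process's resident pages per NUMA node.
# Assumes the N<node>=<pages> token format of /proc/<pid>/numa_maps; the file
# is only present on NUMA-enabled kernels. Usage: python3 numa_placement.py <pid>
import re, sys
from collections import Counter

pid = sys.argv[1] if len(sys.argv) > 1 else "self"
pages = Counter()
with open(f"/proc/{pid}/numa_maps") as f:
    for line in f:
        for node, count in re.findall(r"\bN(\d+)=(\d+)", line):
            pages[int(node)] += int(count)

total = sum(pages.values()) or 1
for node in sorted(pages):
    print(f"node {node}: {pages[node]:>10,} pages  ({100 * pages[node] / total:4.1f}%)")
# A service pinned to node 0 with a large share of pages on node 1 is paying
# the ~190 ns remote round-trip on every access to those pages.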

Cache pollution attributed to the wrong frame. perf record samples on cycles. When the allocator returns a pointer and the caller writes 64 bytes to it, the cache miss that fetches the line is attributed to the caller's instruction pointer, not the allocator's. The flamegraph shows a fat frame in your business logic; the truth is that the allocator's policy of returning fresh, cold pages is what made that line cold. Fixing the business logic does nothing; pre-touching or warming the allocator pool removes the miss.

The five overheads share one structural property: each is an asynchronous deferred cost. The allocator's return statement looks atomic to the caller — a pointer comes back in 30 ns — but it has implicitly registered work that the kernel, the MMU, and the cache hierarchy will perform later, on the caller's own time. The cost isn't hidden because someone wanted to hide it. It is hidden because the abstraction the allocator presents — "give me memory" — does not have a vocabulary for "and please report the costs my asking will create over the next million cycles". Production engineering is largely the practice of restoring that vocabulary, one tool at a time.

Figure: Five invisible overheads — what the microbenchmark sees vs what production pays. Left panel (microbenchmark: single core, warm cache, warm TLB): alloc 30 ns; total 30 ns. Right panel (production: 64 cores, cold pages, first-touch, NUMA): the same 30 ns alloc bar at the bottom, with first-touch zero 1,500 ns, VMA lock 400 ns, NUMA remote 110 ns, TLB miss 80 ns, and the cache fill on first write ~80 ns (attributed to the caller frame) stacked on top; total ~2,200 ns, roughly 70× the alloc. The 30 ns the allocator reports is honest; the other ~2,170 ns are the costs it pushed downstream. Illustrative — not measured data.
The allocator's self-reported time is 30 ns in both columns — the function's code path is identical. Production pays 70× more because TLB miss, first-touch zero, VMA lock, NUMA round-trip, and the cache fill on the caller's first write are all costs the allocator queues up rather than spends. Illustrative — not measured data.

Why these costs are described as "invisible" rather than "unobservable": every one of them is observable with the right tool. perf stat -e dTLB-load-misses,minor-faults,page-faults exposes TLB and zeroing. perf lock contention -- python3 service.py exposes mmap_lock. numastat -p $(pidof service) exposes NUMA placement. The reason they go uncounted is that the standard development feedback loop — write a function, microbenchmark it on your laptop, ship — never invokes those tools. The cost is hidden by the absence of the right measurement, not by any property of the underlying system. The wall is methodological, not physical.

Measuring the gap with one Python script

The clearest way to see the wall is to run the same allocation workload twice — once cold (fresh pages, cold TLB, no warmup), once warm (pool pre-populated, pages pre-touched, TLB warm) — and compare both to what perf stat reports about kernel work, page faults, and TLB misses. The Python driver wraps perf stat and parses its stderr.

# invisible_overhead_demo.py — show the gap between alloc cost and total cost
# by running cold vs warm against the same workload, with perf stat counters.
# Run: python3 invisible_overhead_demo.py
import ctypes, os, re, subprocess, sys, time

N      = 200_000        # 200k allocations
SZ     = 4096           # 4 KB each — one page; forces a touch per alloc
ITERS  = 5

libc = ctypes.CDLL("libc.so.6")
libc.malloc.restype = ctypes.c_void_p
libc.malloc.argtypes = [ctypes.c_size_t]
libc.free.argtypes = [ctypes.c_void_p]
libc.memset.argtypes = [ctypes.c_void_p, ctypes.c_int, ctypes.c_size_t]

def cold_run() -> tuple[float, list[int]]:
    """Each iteration allocates fresh — pages are cold, TLB cold."""
    t0 = time.perf_counter_ns()
    sizes = []
    for _ in range(ITERS):
        ptrs = [libc.malloc(SZ) for _ in range(N)]
        for p in ptrs:
            libc.memset(p, 1, SZ)         # first-touch: kernel must zero, then we write
        sizes.append(sum(1 for _ in ptrs))
        for p in ptrs:
            libc.free(p)
    return (time.perf_counter_ns() - t0) / 1e9, sizes

def warm_run() -> tuple[float, list[int]]:
    """Pre-allocate and pre-touch a pool once; reuse across iterations."""
    pool = [libc.malloc(SZ) for _ in range(N)]
    for p in pool:                        # pre-touch — pay the cost up front
        libc.memset(p, 0, SZ)
    t0 = time.perf_counter_ns()
    sizes = []
    for _ in range(ITERS):
        for p in pool:
            libc.memset(p, 1, SZ)         # write into already-mapped pages
        sizes.append(len(pool))
    elapsed = (time.perf_counter_ns() - t0) / 1e9
    for p in pool:
        libc.free(p)
    return elapsed, sizes

if __name__ == "__main__":
    if "--inner" in sys.argv:
        mode = sys.argv[sys.argv.index("--inner") + 1]
        if mode == "cold":
            t, _ = cold_run()
        else:
            t, _ = warm_run()
        print(f"INNER {mode}: {t:.3f} s  ({(t*1e9)/(N*ITERS):.1f} ns/op)")
        sys.exit(0)
    # Outer driver: wrap each mode in perf stat and parse the kernel counters
    EVENTS = "dTLB-load-misses,minor-faults,page-faults,context-switches"
    for mode in ("cold", "warm"):
        proc = subprocess.run(
            ["perf", "stat", "-e", EVENTS, "--",
             sys.executable, __file__, "--inner", mode],
            capture_output=True, text=True)
        print(f"\n=== {mode.upper()} ===")
        print(proc.stdout.strip())
        # perf stat writes its counters to stderr
        for line in proc.stderr.splitlines():
            m = re.search(r"^\s*([\d,]+)\s+(\S+)", line)
            if m and m.group(2) in EVENTS.split(","):
                print(f"  {m.group(2):<22} {m.group(1):>15}")

Sample run on a c6i.4xlarge (Ice Lake, kernel 6.5, glibc 2.35):

=== COLD ===
INNER cold: 4.812 s  (4812.0 ns/op)
  dTLB-load-misses             18,492,310
  minor-faults                  1,001,184
  page-faults                   1,001,184
  context-switches                     42

=== WARM ===
INNER warm: 0.612 s  (612.0 ns/op)
  dTLB-load-misses                284,015
  minor-faults                        212
  page-faults                         212
  context-switches                     38

The wall sits in the gap between 612 ns/op and 4,812 ns/op — a 7.9× ratio for the same allocator, same workload, same machine. The only difference is whether the pages are fresh. Why minor-faults dropped from 1,001,184 to 212: in the cold run, every one of the 200,000 × 5 = 1,000,000 allocations triggered a first-touch page fault (the small remainder is interpreter start-up and stdlib initialisation). In the warm run, the pool was pre-touched once before the timer started, so no faults occur during the timed region. The kernel work didn't disappear — it moved out of the measurement window. Production has no such window. The dTLB-miss collapse — 18.5M down to 284k — comes from the same cause: the warm pool stays mapped across iterations, so there is no fault-path page-table work inflating the miss count; the cold workload releases its pages back to the kernel every iteration and pays to rebuild the page tables and TLB entries each time.

The 4,812 ns figure includes ~30 ns of allocator code (the _int_malloc path), ~1,500 ns per page of zeroing in the kernel, ~80 ns for the TLB miss on first access, and the rest in cache fills, the syscall path, and mmap_lock traffic. The microbenchmark you write tomorrow morning will show 30 ns. Your service tomorrow afternoon will pay 4,812. Both are correct measurements of different things.

Two implementation details worth flagging in the script. First, the cold path writes memset(p, 1, SZ) rather than memset(p, 0, SZ). A fresh anonymous page stays backed by the kernel's shared zero page until something writes to it, so a workload that only read the buffer — or any layer clever enough to elide an all-zero write — would never trigger the copy-on-write allocation, and the comparison would silently lie. Writing a non-zero byte to every page guarantees a real allocation and a real zeroing. Second, both modes run the identical interpreter and ctypes path, and the timer starts only after setup, so Python overhead appears equally in both per-op figures and cancels out of the comparison; the difference between 612 ns and 4,812 ns is kernel-side work, not interpreter noise.

A useful exercise after running the script: re-run the cold path with MAP_POPULATE enabled (via mmap directly through ctypes, since malloc doesn't expose the flag) and watch the per-op time collapse to roughly the warm number while the mmap syscall itself becomes seconds-long. The total work done is identical; what changes is whether the cost falls inside or outside the timed region. Production has the equivalent decision to make at every layer — should the cost happen at boot time, at warmup, on the first request after deploy, or on every request forever? The answer is almost always "earlier than the request path", and "almost always" is the part the microbenchmark can't tell you.
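
A minimal sketch of that exercise, using Python's mmap module rather than ctypes — mmap.MAP_POPULATE is exposed on Linux from Python 3.9, and the getattr fallback below is the x86-64 Linux constant for older interpreters; the footprint mirrors the demo script's N × SZ:

# populate_vs_lazy.py — pay the first-touch bill inside mmap() instead of on access.
# mmap.MAP_POPULATE needs Python 3.9+ on Linux; the getattr fallback is the
# x86-64 Linux value for older interpreters. Footprint mirrors N × SZ above.
import mmap, time

SZ_TOTAL     = 200_000 * 4096                 # same footprint as the demo script
MAP_POPULATE = getattr(mmap, "MAP_POPULATE", 0x008000)
BASE_FLAGS   = mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS

def map_and_touch(flags: int) -> tuple[float, float]:
    t0 = time.perf_counter()
    buf = mmap.mmap(-1, SZ_TOTAL, flags=flags)
    t_map = time.perf_counter() - t0          # population cost lands here
    t0 = time.perf_counter()
    for off in range(0, SZ_TOTAL, 4096):      # first write to every page
        buf[off] = 1
    t_touch = time.perf_counter() - t0        # fault cost lands here if lazy
    buf.close()
    return t_map, t_touch

for name, flags in (("lazy", BASE_FLAGS), ("MAP_POPULATE", BASE_FLAGS | MAP_POPULATE)):
    t_map, t_touch = map_and_touch(flags)
    print(f"{name:>12}: mmap {t_map*1e3:8.1f} ms   touch loop {t_touch*1e3:8.1f} ms")
# The sum stays roughly constant — the flag only moves the cost out of the loop.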

What the kernel-work flamegraph looks like

Once you accept that the invisible costs sit below your allocator, the next question is what they look like in the place engineers actually spend most of their debugging time — the flamegraph. The shape is distinctive enough that once you've seen it, you recognise it instantly.

Figure: Illustrative flamegraph — kernel-side allocator overhead during the cold path (CRED rewards engine, 30 s sample at 14:34 IST). main (100%) → request_handler (94%) → parse_event and write_to_buffer (82%); beneath write_to_buffer, the user-space malloc frame is 4% wide and the kernel page-fault path is 78%, split across __handle_mm_fault (32%), clear_page_erms (28%, the 4 KB streaming write), and __rmqueue_pcplist (12%), with do_anonymous_page, __alloc_pages, and a narrow down_write(mmap_lock) path beneath them. User-space malloc is 4%; kernel-side first-touch and free-list work is 78% — the "allocator overhead" is the kernel, not glibc. Illustrative — not measured data.
The flame the engineer expected — the user-space malloc frame — is 4% wide. The flame they didn't expect — the kernel's page-fault handler doing the actual physical-page allocation and the 4 KB zeroing — is twenty times wider. The fix is never in glibc; it is in changing how often the service forces the kernel down this path. Illustrative — not measured data.

A subtler observation: the flamegraph's user-space frame write_to_buffer includes the user-instruction that triggered the page fault, but not the kernel work the fault induced. The kernel work shows up under a separate top-level frame (or under entry_SYSCALL_64 for syscalls). When kernel-space sampling is disabled — the default for unprivileged perf record — the kernel block is missing entirely from the flamegraph, and the engineer sees only the user-space sliver. The first time a junior engineer sees the same flamegraph captured with perf record --call-graph dwarf as root vs unprivileged, the contrast is striking: the unprivileged version says the service is mostly idle; the privileged version says the kernel is doing 78% of the work. The wall is not just methodological; it is also permissions-shaped. Many production environments restrict perf to root, and engineers who don't know to ask for the privilege spend weeks chasing user-space bottlenecks that aren't there.
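
Whether the kernel half will even appear is checkable before wasting a capture. A small sketch that reads the two sysctls governing unprivileged visibility — the thresholds follow the kernel's documented perf_event_paranoid semantics:

# perf_visibility.py — check whether unprivileged perf on this box can see the
# kernel half of the flamegraph. A value of perf_event_paranoid <= 1 permits
# per-process kernel-space sampling for ordinary users.
from pathlib import Path

def sysctl(name: str, default: int = -1) -> int:
    p = Path("/proc/sys/kernel") / name
    return int(p.read_text()) if p.exists() else default

paranoid = sysctl("perf_event_paranoid")
kptr     = sysctl("kptr_restrict")
print(f"perf_event_paranoid = {paranoid}   kptr_restrict = {kptr}")
if paranoid > 1:
    print("unprivileged perf cannot sample kernel frames — the kernel block "
          "will be missing from your flamegraph; run as root or lower the sysctl")
elif kptr != 0:
    print("kernel addresses are hidden — symbols may show up as [unknown]")
else:
    print("kernel-side sampling should be visible to unprivileged perf")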

The recurring kernel symbols are worth memorising. __handle_mm_fault is the top-level page-fault handler. do_anonymous_page is its branch for MAP_ANONYMOUS allocations (the case for malloc's underlying memory). __alloc_pages is the buddy allocator returning a physical page from the free list. clear_page_erms (or clear_page on older CPUs) is the actual zeroing — erms stands for Enhanced REP MOVSB, the modern x86 streaming-write path. __rmqueue_pcplist is the per-CPU page list the buddy allocator pulls from to avoid global lock contention. When these names dominate a flamegraph, the conversation isn't "tune the allocator" — it's "stop forcing the kernel to do this work on the request path".

Three production walls Indian engineers have hit, and what fixed them

Karan's CRED rewards story isn't unusual. The same wall shape recurs across Indian production with different fingerprints; the fix is always to pay the invisible cost upfront, in a controlled place, so the request path doesn't pay it under load.

Razorpay payment-status API: VMA lock contention at 8 cores → 64 cores. A team migrated from 8-core to 64-core hosts to handle a 7× traffic ramp during the GST filing deadline. Throughput grew 4×, not 7×, and p99 doubled. Flamegraph showed 31% in __down_write on mmap_lock. Glibc's per-thread arenas were producing mmap/munmap storms because each request's lifetime spanned the threshold where ptmalloc decides to release memory back to the kernel. The fix: switch to jemalloc via LD_PRELOAD (jemalloc holds memory longer in its bin caches and mmaps less often) and set MALLOC_ARENA_MAX=2 for the cases where ptmalloc had to stay. p99 dropped from 24 ms to 9 ms; the VMA lock disappeared from the profile. The kernel was correct the entire time — mmap_lock is a cathedral lock by design — but the application was asking it to serialise 64 threads.

Hotstar IPL ingest: first-touch zeroing during the toss-to-first-ball burst. During a 2025 IPL final, the manifest-rewrite service experienced a 90-second p99 spike from 18 ms to 240 ms exactly between toss and first ball — the moment when concurrent viewers jumped from 8M to 22M. CPU was unsaturated; the spike showed as kernel time in clear_page_erms. The service was scaling out new pods, each starting with an empty heap, all simultaneously calling mmap for fresh pages and triggering kernel zeroing on first write. The fix: add a warmup probe that performs a memset over the heap before the pod is added to the load balancer. The 5-second warmup shifted 240 ms of per-request zeroing into a controlled startup cost. p99 stabilised at 22 ms across the burst.
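
The warmup probe itself is nothing exotic. A minimal sketch of the idea, with a hypothetical pool size — in a real pod this runs (and finishes) before the readiness probe flips, and the pool is whatever buffers the request path will actually reuse:

# warmup_probe.py — pre-touch a working-set-sized pool before taking traffic.
# Pool size is hypothetical; in a real pod this completes before the readiness
# probe reports healthy, so the first-touch faults never land on a request.
import time

POOL_BYTES = 512 * 1024 * 1024          # illustrative working-set size
PAGE       = 4096
_pool      = None                       # keep a reference so the pages stay resident

def warmup() -> float:
    global _pool
    t0 = time.perf_counter()
    _pool = bytearray(POOL_BYTES)       # zero-filling already writes every page
    for off in range(0, POOL_BYTES, PAGE):
        _pool[off] = 1                  # explicit second pass: one write per page
    return time.perf_counter() - t0

if __name__ == "__main__":
    print(f"warmup took {warmup():.2f} s — flip the readiness probe now")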

The same incident produced an internal runbook line that has aged well: "If MALLOC_ARENA_MAX was the fix, the bug was the kernel's lock, not glibc's design." Glibc's per-thread arenas are a correct answer to single-process concurrency on small core counts. They are the wrong answer to 64-core service hosts because the original design (Wolfram Gloger, 2006) targeted a hardware regime that no longer exists in production. The point is not that glibc is bad; the point is that the defaults of any decade-old system component need to be re-validated against this decade's hardware.

A common follow-up question is "could Razorpay have caught this in load testing?". The honest answer is yes, but only if the load test ran at production concurrency on production-shaped hardware for long enough to fragment glibc's arenas. A 5-minute wrk2 run on a single 8-core staging box would not have surfaced the lock storm. The wall is partly about microbenchmarking; it is also about short benchmarking — tests that don't run long enough for the allocator's steady-state behaviour to develop. Production walls of this shape often show up 30 minutes into a sustained load, never in the first 5 minutes. The remedy is the same shape as the chapter: the test harness must measure the distribution of behaviour over time, not the average across a short run.

Zerodha matching engine: NUMA misses after a kernel scheduler change. A kernel upgrade from 5.15 to 6.1 changed the default scheduler tunables; the matching-engine threads, previously sticky to socket 0, started migrating across sockets every few hundred milliseconds. Latency p99.9 went from 800 µs to 3.2 ms with no code change. numastat -p showed 47% of the process's memory sitting on the node its threads were no longer running on, where it had been <2%. The fix: pin the threads explicitly with numactl --cpunodebind=0 --membind=0 and taskset in the systemd unit. Latency returned to baseline. The wall here was not the allocator at all — but the kernel's first-touch placement policy is what made NUMA placement matter for memory allocated months ago by threads that had never moved.

The shared diagnostic pattern: when the allocator's reported numbers are healthy but the service is slow, the cost is in the layers below the allocator that the allocator's API can't expose. Move down the stack one level — kernel page faults, VMA lock waits, NUMA placement, TLB occupancy — and the missing time is always there.

A second pattern across the three stories: the fix in each case was operational, not algorithmic. Karan didn't change a data structure; he changed a kernel boot parameter and added a warmup script. The Razorpay team didn't change their request handling code; they added an LD_PRELOAD line to their systemd unit. The Hotstar team didn't rewrite their parser; they pre-touched the heap before adding the pod to the load balancer. None of these fixes would have been findable by reading the application's source code, because none of them are in the application's source code. They are in the seam between the application and the kernel — the seam the application's vocabulary does not name and that the application's tests do not exercise. Operational fixes for invisible costs are the highest-ROI work in production performance because the fix is small (often a few lines) and the impact is large (often 5–10× p99 reduction). The barrier is not engineering effort; it is the diagnostic skill to find the seam.

There is a recurring surprise in each of these stories: the engineer who eventually fixed the bug had to learn a tool they didn't use day-to-day. Karan had never read numastat output before the CRED incident; the Razorpay team had never used perf lock contention; the Hotstar team had never traced clear_page_erms in a flamegraph. Every invisible-cost wall expands the toolbox by one. After three or four such walls, the engineer stops being surprised — they have built the diagnostic instinct that says "the allocator looks fine; therefore the cost is below the allocator", and they reach for the right tool first. This instinct is what separates a senior performance engineer from a senior backend engineer who happens to know perf top. The walls are how the instinct gets built.

A fourth shape, less common but worth naming, shows up at PhonePe, Paytm, and any high-frequency UPI dispatcher: CPU steal time inside containers. Kubernetes nodes pack 30–60 pods onto a 64-vCPU host. The CFS bandwidth controller throttles a pod that exceeds its CPU quota by sleeping it for the rest of the period, even if the host has idle cores. To the application, this looks like a 10–80 ms pause that no in-process tool can explain. perf stat against the process shows clean counters; cat /sys/fs/cgroup/cpu/cpu.stat shows nr_throttled climbing. The fix is either to raise the quota, switch to cpuManagerPolicy=static for the pod, or accept the throttling as part of the SLO. The wall here isn't allocator-related at all — but the symptom is identical: a tail latency that no in-process measurement explains, because the cause sits in a layer the application can't see.
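
The counter is easy to read without any tooling. A sketch that handles both cgroup v1 and v2 layouts — the two hierarchies name the fields slightly differently, and which path exists depends on how the node is configured:

# throttle_check.py — read CFS throttling counters for the current cgroup.
# Covers cgroup v2 and the v1 cpu controller; v2 reports throttled_usec while
# v1 reports throttled_time in nanoseconds.
from pathlib import Path

CANDIDATES = [
    Path("/sys/fs/cgroup/cpu.stat"),          # cgroup v2 (typical inside a container)
    Path("/sys/fs/cgroup/cpu/cpu.stat"),      # cgroup v1 cpu controller
]

def read_stat() -> dict[str, int]:
    for path in CANDIDATES:
        if path.exists():
            stat = {}
            for line in path.read_text().splitlines():
                key, value, *_ = line.split()
                stat[key] = int(value)
            return stat
    raise FileNotFoundError("no cpu.stat found — not running under cgroups?")

stat      = read_stat()
periods   = stat.get("nr_periods", 0)
throttled = stat.get("nr_throttled", 0)
wait_s    = stat.get("throttled_usec", 0) / 1e6 or stat.get("throttled_time", 0) / 1e9
print(f"periods={periods}  throttled={throttled} "
      f"({100 * throttled / max(periods, 1):.1f}%)  time_throttled={wait_s:.2f}s")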

A useful lens for the "operational fix" pattern: every invisible cost can be paid earlier, paid in a different place, or paid at a different cadence — but it cannot be made to vanish. Pre-touching pays the first-touch cost at startup instead of on the request. MALLOC_ARENA_MAX pays per-arena lock cost instead of mmap_lock cost. Pinning to NUMA nodes pays a flexibility cost (a thread can't migrate to an idle core on the other socket) instead of a remote-access cost. Each operational knob is a relocation, not an elimination. The skill is knowing which relocation your specific workload tolerates — which is a measurement question, not a design question.

What "invisible" means in practice — three calibration scenarios

Before the diagnostic ladder, it helps to have intuition for what "invisible cost is the bottleneck" looks like in the wild. Three calibration scenarios that recur often enough to memorise:

Scenario A — The benchmark looks great, production is slow. Microbenchmark on a developer laptop reports 30 ns per allocation, p99 within 2× of median. Production p99 is 50–500× slower than median, with no measurable allocator-internal contention. Diagnosis: the production environment exposes invisible costs the laptop cannot. Almost always first-touch zeroing or VMA-lock contention; sometimes NUMA. Fix: pre-touch on startup, cap arenas, pin to NUMA nodes — in that order of likelihood.

Scenario B — The benchmark and production both look slow, but for different reasons. Microbenchmark p99 is high (say 800 ns) and production p99 is also high (say 8 ms). Easy mistake: assume the production slowness scales with the benchmark slowness. It doesn't. The benchmark is paying allocator-internal cost (bin search, lock acquisition); production is paying additional cost on top from the kernel/MMU. Diagnosis: subtract the benchmark cost from the production cost; the residual is the invisible component, and the residual is what the diagnostic ladder targets.

Scenario C — The flamegraph is full of __libc_malloc but the fix isn't a faster allocator. This is the most common misdiagnosis. The flamegraph correctly shows __libc_malloc as fat, so the engineer reaches for jemalloc. p99 improves by 3% and the engineer concludes the wall is intractable. The truth: __libc_malloc was fat because it was being called too often, not because it was slow per call. The fix is reducing call frequency (pooling, arena-per-request) — which is a workload change, not an allocator change. Always check the call rate before changing the allocator.

There is also a fourth scenario worth flagging — the case where the wall is the allocator, but you can only see it because you have eliminated the other walls. This appears at the end of long optimisation projects: after pre-touching, NUMA pinning, MALLOC_ARENA_MAX tuning, and huge pages, the remaining 20% latency tail genuinely is allocator-internal. At that point a switch to mimalloc or a custom slab is the right fix. The rule is sequential: address invisible-cost walls first, then allocator walls. Reverse the order and the allocator change you ship looks ineffective because the kernel is still bottlenecking.

The triage rule that emerges: when production p99 is more than 10× the microbenchmark p99 on the same allocator, you are in scenario A or scenario C, and the answer is almost never "ship a faster allocator". When production p99 is 2–4× the microbenchmark p99, you are in scenario B, and the allocator's own internals genuinely matter — but only as one term in the sum. Calibrating against these three shapes before opening any tool turns the diagnostic ladder from a checklist into instinct, and is the difference between fixing the wall in 30 minutes and fixing it in three days.

Common confusions

Going deeper

The diagnostic ladder when "the allocator looks fine"

When a service runs slow and your allocator's stats say it isn't the cause, work the ladder in this order. The mental model engineers usually bring is wrong: they associate "page fault" with "swapping", and their service has no swap configured, so they ignore the page-fault counter. Minor faults are not about disk — they are about the kernel doing the allocation work the application asked for implicitly. A service at steady state should have minor faults proportional to its allocation rate; when minor faults run at 1M/sec and your allocation rate is 50k/sec, the allocator is returning the same pages as fresh repeatedly because something in your workload is forcing them to be released and re-acquired.

  1. perf stat -e minor-faults,major-faults,page-faults,dTLB-loads,dTLB-load-misses,context-switches -p $(pidof svc) sleep 30. If minor-faults divided by allocation rate is greater than 1, the service is touching fresh pages on every alloc — first-touch zeroing is the bill. If dTLB-load-misses exceeds 0.5% of dTLB-loads, the working set has overflowed the TLB and you need huge pages or a denser layout.
  2. numastat -p $(pidof svc). Look at how the process's memory is spread across the Node columns. If more than ~5% of it sits on a node the service's threads don't run on, a large fraction of accesses are going remote — the fix is numactl pinning or NUMA-aware allocation, not the allocator itself.
  3. perf record -F 99 -g -p $(pidof svc) -- sleep 30 && perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg. Look for kernel frames — __handle_mm_fault, down_write, clear_page_erms. Their width is your invisible-cost budget.
  4. bpftrace -e 'kprobe:__handle_mm_fault { @[comm] = count(); } interval:s:10 { print(@); clear(@); exit(); }'. Tells you exactly how many minor faults each process is causing per second.
  5. /proc/<pid>/status — read VmRSS (resident bytes) and VmData (anonymous mapped). If VmData is 5× VmRSS, you have a lot of mapped-but-not-resident memory; first access will fault. Pre-touch or MAP_POPULATE removes the cliff.

Why the order is sequential, not parallel: each step rules out a class of cause. Skipping straight to the flamegraph is tempting but produces a flamegraph you cannot interpret without knowing which counter is hot. Start with perf stat; the answer is usually in the first 30 seconds, and the remaining steps refine the diagnosis rather than discover it.
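
Steps 1 and 5 don't even need perf if you only want a first look. A sketch that samples a pid's minor-fault rate from /proc/<pid>/stat and the VmData-to-VmRSS ratio from /proc/<pid>/status — crude compared to the ladder's tools, but enough to decide whether to keep climbing:

# ladder_snapshot.py — quick versions of ladder steps 1 and 5 using /proc only.
# Usage: python3 ladder_snapshot.py <pid> [seconds]
import sys, time

def minflt(pid: str) -> int:
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    fields = data.rsplit(")", 1)[1].split()   # comm may contain spaces/parens
    return int(fields[7])                     # minflt is the 10th field overall

def vm(pid: str) -> dict[str, int]:
    out = {}
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith(("VmRSS:", "VmData:")):
                out[line.split()[0].rstrip(":")] = int(line.split()[1])   # kB
    return out

pid  = sys.argv[1]
secs = float(sys.argv[2]) if len(sys.argv) > 2 else 10.0
f0 = minflt(pid); time.sleep(secs); f1 = minflt(pid)
v  = vm(pid)
print(f"minor faults : {(f1 - f0) / secs:,.0f}/s over {secs:.0f} s")
print(f"VmRSS {v.get('VmRSS', 0):,} kB   VmData {v.get('VmData', 0):,} kB   "
      f"ratio {v.get('VmData', 0) / max(v.get('VmRSS', 1), 1):.1f}x")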

The MAP_POPULATE escape hatch and its allocator-equivalent flags

Linux's mmap accepts a MAP_POPULATE flag that asks the kernel to pre-fault every page of the mapping at mmap time, instead of lazily on first access. This collapses the first-touch invisible cost into a single explicit cost paid at map-time. The trade-off: the mmap call itself blocks for the duration of the population — at ~1.5 µs per 4 KB page, roughly 0.4 ms per MB, so a 2 GB allocation stalls the call for the better part of a second. That is unacceptable on a request path but ideal at startup. The pattern in production C++ services: mmap(NULL, sz, ..., MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0) for arena buffers allocated during initialisation. Java's equivalent is -XX:+AlwaysPreTouch, which forces the JVM to touch every page of the heap at startup. Go's runtime has no first-class equivalent but can be approximated by allocating large buffers at startup and explicitly writing to each page. The cost of not using these flags is usually paid in the first few hundred requests after a deploy — exactly the requests that the load balancer is using to validate the new pod's health, which is why bad luck with first-touch faults often produces "the deploy looked healthy in canary, then p99 spiked when we promoted to 100%" stories.

Why first-touch is the right default, and how it shifts when you switch allocators

It is tempting to ask "why does the kernel zero pages lazily? Why not zero them eagerly in the background?" Mainline Linux does not keep a pool of pre-zeroed free pages (hardening options like init_on_free zero at free time, and background-zeroing patches have been proposed but not merged), and it couldn't sensibly pre-zero everything anyway because (a) it doesn't know which process will need pages next, (b) zeroing dirties cache lines that may evict useful working-set data, and (c) under memory pressure pre-zeroed pages are a luxury. First-touch policy is the right default because it places memory close to the thread that will use it (NUMA-locally) and avoids paying for memory that was never written. The same defence applies to lazy TLB filling, the single global VMA tree, and NUMA first-touch — each is locally optimal; the wall is what happens when many locally-optimal defaults compose against a workload they were not tuned for.

Switching from glibc's ptmalloc to jemalloc or tcmalloc does not eliminate these costs — it redistributes them. Why the redistribution matters: jemalloc holds memory longer in bin caches, so mmap/munmap storms drop and VMA-lock contention disappears, but RSS climbs and the kernel's reclaim path runs more often. tcmalloc releases via madvise(MADV_DONTNEED) instead of munmap, keeping the VMA tree small but telling the kernel "you can drop these pages" — and the next access faults them back in, paying first-touch again. mimalloc's segments hit the kernel less often than either but are sensitive to access pattern: random access across the segment range thrashes the TLB worse than jemalloc's arenas. Each allocator is a different distribution of the same total cost across kernel, TLB, and cache. The benchmark that decides which is right for your service must measure all three layers simultaneously, which is exactly what the diagnostic ladder above is for.

When MALLOC_ARENA_MAX is the right knob, and when it isn't

Glibc's ptmalloc creates arenas on demand as threads contend for existing ones, up to a default cap of 8× the core count on 64-bit hosts. On a 64-core host that means up to 512 arenas, each with its own bin caches, each contending for mmap_lock when it needs to grow. Setting MALLOC_ARENA_MAX=2 collapses the contention by forcing all threads through 2 shared arenas — at the cost of higher per-arena lock contention inside ptmalloc itself. This is the classic "trade contention at one layer for contention at another" knob, and the right value depends on the ratio of mmap rate to allocation rate. Why the right value isn't always 2: services with very high allocation rates and very few mmap calls (a steady-state JSON parser whose arena is already sized) are better off with the default — more arenas means less in-allocator contention. Services with bursty mmap patterns (anything that periodically grows or shrinks its working set, like a cache eviction storm) benefit from MALLOC_ARENA_MAX=2 because the kernel-level mmap_lock is a bigger bottleneck than the per-arena lock. The decision is workload-dependent and changes with hardware; what worked on 16 cores often fails on 64. The Razorpay incident above is the bursty-mmap case; a steady-state Hotstar HLS parser with a stable working set is the opposite case. Reading the MALLOC_ARENA_MAX recommendation in a blog post and applying it without measuring is how you turn a working service into a slower one.
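
The measurement is mechanical to automate. A sketch of the A/B loop, with a hypothetical workload command — point it at your own multi-threaded load generator; the single-threaded demo script earlier in the chapter will show no difference, because arenas only matter under thread contention:

# arena_ab.py — run the same workload under several MALLOC_ARENA_MAX settings.
# WORKLOAD is a hypothetical placeholder — substitute your own load generator.
import os, subprocess, time

WORKLOAD = ["./run_load_test.sh"]       # hypothetical command — replace
SETTINGS = [None, "2", "8", "64"]       # None = glibc's default cap

for arenas in SETTINGS:
    env = dict(os.environ)
    env.pop("MALLOC_ARENA_MAX", None)
    if arenas is not None:
        env["MALLOC_ARENA_MAX"] = arenas
    t0 = time.perf_counter()
    subprocess.run(WORKLOAD, env=env, check=True)
    print(f"MALLOC_ARENA_MAX={arenas or 'default':<8} "
          f"wall-clock {time.perf_counter() - t0:6.1f} s")
# Wall-clock alone is crude; wrap the run in perf stat (as the demo script does)
# to see whether minor faults and context switches moved with the knob.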

A related knob worth naming is io_uring's batched-submission path. The page fault itself is raised synchronously by hardware and cannot be batched, but some of the adjacent syscall costs can be: madvise has had an io_uring opcode (IORING_OP_MADVISE) since Linux 5.6, while mmap and mprotect still take the ordinary syscall path. For services that issue many advisory calls while setting up large buffers in batches at startup (Hotstar's per-stream segment buffers, Zerodha's per-symbol order-book pages), batched submission trims the per-call syscall overhead out of the boot-time allocation phase. The first-touch zeroing still happens on first access — io_uring doesn't change physics — and the kernel still takes the same locks to do the work; what shrinks is the user–kernel transition cost per operation. This is part of a broader pattern in modern kernel APIs: the individual syscall is no longer the only place to look for invisible overhead; the batched-submission queue is. Combine MALLOC_ARENA_MAX (fewer arenas → less mmap traffic) with batching where it applies, and the kernel-side cost on the same workload can drop meaningfully, no allocator change required.

Reproduce this on your laptop

sudo apt install linux-tools-common linux-tools-generic numactl
python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip

# Show the cold/warm gap with perf-stat counters
python3 invisible_overhead_demo.py

# Look at the page-fault rate of any service you have running
perf stat -e minor-faults,major-faults,dTLB-load-misses -p $(pidof <your-service>) sleep 10

# Inspect NUMA placement of an existing process
numastat -p $(pidof <your-service>)

You should see the cold run come in 5–15× slower than the warm run, and the minor-faults counter difference will explain almost all of it. On a laptop with only one NUMA node the NUMA column doesn't matter, but the TLB and first-touch costs are identical to production. The numbers vary by hardware (Apple Silicon zeroes pages faster because of dedicated streaming-store paths) but the ordering is invariant.

Where this leads next

This wall closes Part 11 — the allocator chapters. The cost lines that the allocator pushes downstream — TLB misses, page faults, syscalls, context switches — are the entire subject of Part 12. The transition is direct: every overhead this chapter named as "invisible" gets a chapter of its own next.

The progression from Part 11 to Part 12 mirrors the diagnostic ladder this chapter sketched: Part 11 explained what malloc does, what its arenas look like, why jemalloc and tcmalloc exist, and how regions and pools sidestep parts of the cost. Part 12 takes the invisible layer that Part 11 deliberately left aside — the kernel and hardware costs the allocator pushes downstream — and gives each one its own chapter. By the end of Part 12 the reader should be able to look at a flamegraph dominated by kernel symbols and name not just what the symbol is but which allocator decision created the work, what the typical magnitude is, and which of the four fix categories (allocator, kernel path, workload, hardware) applies.

The reader who finishes both parts has the full vocabulary to argue about memory in production. They can say "this service is paying for first-touch zeroing on the hot path; we should MAP_POPULATE at startup" instead of "the allocator is slow"; they can say "the VMA lock is contended because we have 64 threads all hitting mmap; we need MALLOC_ARENA_MAX=2" instead of "we need a faster allocator". The vocabulary is the difference between a fix that works and a fix that hopes.

References