malloc internals (glibc, jemalloc, tcmalloc, mimalloc)
At 02:14 IST on a Wednesday, Aditi — staff engineer on Swiggy's restaurant-search service — is reading four pmap -X outputs side by side. The same Python service, the same workload (3k QPS of autocomplete queries against a 22 GB inverted index), the same 16-core c6i.4xlarge. The only thing that changed across the four panes is LD_PRELOAD. Glibc shows 38 anonymous regions and 7.2 GB RSS. jemalloc shows 4 huge regions and 6.1 GB RSS. tcmalloc shows 2 regions and 6.4 GB RSS. mimalloc shows 1 region and 5.8 GB RSS. The CPU profile shows _int_malloc at 18% on glibc, je_malloc at 4% on jemalloc, tc_malloc at 3% on tcmalloc, and mi_malloc at 2% on mimalloc. Same program, same data, four different shapes — because each allocator answered "how do you serve a million malloc calls per second" differently.
Production allocators differ along four axes: how they organise free memory (bins, spans, segments, pages), how they remove the lock from the fast path (per-thread caches, per-CPU caches, sharded free lists), how they return memory to the OS (sbrk, madvise(MADV_DONTNEED), madvise(MADV_FREE)), and how they fragment under real workloads. glibc bins-and-arenas; jemalloc per-CPU arenas with size classes; tcmalloc per-CPU caches with central spans; mimalloc free-list sharding inside per-thread segments. Pick by your workload's allocation rate, thread count, and lifetime distribution — not by reputation.
The fast path is what you are buying
A general-purpose allocator handles four operations: malloc, free, realloc, calloc. Of these, malloc and free together cost ~99% of the allocator's CPU time in a typical service. The single most important number for any allocator is how few instructions the fast path takes. The fast path is the case where the requested size fits a cached free block of the right size class — no syscalls, no locks taken, no global data structures touched. Everything else — the slow path — is at least an order of magnitude slower.
The four allocators converge on the same broad strategy — keep a per-thread or per-CPU cache of free blocks indexed by size class — but they disagree about the details. Those details are the entire chapter. Why the strategy is similar but the details matter: a 10 ns fast path that needs a lock cmpxchg (atomic compare-and-swap) costs ~20 ns under contention because of the cache line bouncing between cores. A fast path that uses a per-CPU restartable sequence (rseq) avoids the atomic entirely and stays at ~10 ns even on a contended workload. Five nanoseconds per malloc call sounds trivial; at 100M calls/sec across the box it is half a CPU.
The other axis allocators differ on is fragmentation under real workloads — meaning, how much memory the allocator holds beyond the live working set. A program that genuinely uses 1.0 GB of objects can have RSS of 1.1 GB on mimalloc, 1.3 GB on jemalloc, 1.5 GB on tcmalloc, or 4 GB on glibc — depending on the allocation/free pattern, the lifetime distribution, and how aggressively the allocator returns pages to the kernel. A 4× difference in RSS for the same program is not a typo. It is the consequence of design choices we will walk through.
glibc ptmalloc2 — the reference design
glibc's allocator is ptmalloc2, derived from Doug Lea's dlmalloc. It is the default on every Linux distribution that uses glibc, which means it is the allocator most production services run unless someone explicitly changed it. Its design predates the multi-thread, high-allocation-rate workloads that dominate modern services, and almost everything you read about "the allocator wall" is really about ptmalloc2's specific limitations.
The structural problems with ptmalloc2 in modern workloads:
Bounded tcache. The per-thread cache holds at most 7 entries per size class (default glibc.malloc.tcache_count = 7), across 64 size classes. A workload that allocates and frees in batches larger than 7 falls into the slow path on every batch — the tcache fills, the next free goes to the fastbin under the arena lock, the next allocation drains an empty tcache and refills from the smallbin under the arena lock. Why the bound exists: tcache is per-thread, but the total cached memory across all threads is unbounded if you let it grow. Capping at 7 per size class limits the fragmentation footprint of an idle thread. The cap was a 2017 addition (glibc 2.26) — before that, every allocation took the arena mutex, which is why upgrading from glibc 2.25 to 2.26+ was the single largest free performance win on long-running services in 2018.
Top chunk pinning. Memory is returned to the OS via sbrk(-n) only when the top chunk (the contiguous high-address free region of the main arena heap) is large enough. If any allocation lives above a freed region, the freed region cannot be released. A daily reporting query that allocates 200 MB, frees it, but pinned a 4 KB allocation high in the heap leaves 200 MB of "free" heap that the kernel never reclaims. RSS climbs forever; gunicorn --max-requests=2000 (kill workers after N requests) becomes a survival pattern. This is the single most common production complaint about glibc.
Bimodal at the mmap threshold. Allocations smaller than M_MMAP_THRESHOLD (default 128 KB, dynamic) come from the arena heap; larger allocations go to mmap directly. A workload that allocates around the threshold (say, 100–200 KB JSON parses) sees latency become bimodal: small allocations cost ~50 ns, large ones cost ~5 µs. The fix is mallopt(M_MMAP_THRESHOLD, 1 << 20) to push the threshold to 1 MB — or the MALLOC_MMAP_THRESHOLD_ environment variable, which does the same without code changes — but this is rarely deployed.
The diagnostic to confirm you are hitting these issues is malloc_info:
# malloc_info_dump.py — pull glibc's internal stats and surface the bin sizes
# Works only against glibc (jemalloc/tcmalloc emit different stats).
import ctypes, os, sys, xml.etree.ElementTree as ET, io
libc = ctypes.CDLL("libc.so.6", use_errno=True)
# malloc_info(int options, FILE *stream); options = 0
malloc_info = libc.malloc_info
malloc_info.argtypes = [ctypes.c_int, ctypes.c_void_p]
malloc_info.restype = ctypes.c_int
# Trick: write into an in-memory FILE* via fmemopen
fmemopen = libc.fmemopen
fmemopen.argtypes = [ctypes.c_char_p, ctypes.c_size_t, ctypes.c_char_p]
fmemopen.restype = ctypes.c_void_p
buf = ctypes.create_string_buffer(1 << 20) # 1 MB scratch
fp = fmemopen(buf, len(buf), b"w")
assert fp, "fmemopen failed"
# Allocate something so the stats are non-trivial
junk = [bytes(96) for _ in range(50_000)] # 50k × 96-byte allocs
del junk[::2] # free half — fragments
malloc_info(0, fp)
libc.fclose.argtypes = [ctypes.c_void_p]  # keep the 64-bit FILE* from being truncated to int
libc.fclose(fp)                           # flush so the XML (plus trailing NUL) lands in buf
xml = buf.value.decode("utf-8", errors="ignore")
root = ET.fromstring(xml)
print("== arenas ==")
for h in root.findall(".//heap"):
    arena = h.get("nr")
    sysmem = h.find("system[@type='current']").get("size")
    fast = h.find("total[@type='fast']").get("size")
    rest = h.find("total[@type='rest']").get("size")
    # glibc emits no per-heap mmap total (mmap'd chunks only appear in the
    # process-wide aggregate below), so fall back to 0 instead of crashing
    mm = h.find("total[@type='mmap']")
    mmapd = mm.get("size") if mm is not None else "0"
    print(f" arena {arena}: system={int(sysmem):>10,} "
          f"fastbins={int(fast):>8,} rest={int(rest):>10,} mmap={int(mmapd):>8,}")
print("== aggregate ==")
agg = root.find("./total[@type='fast']")
print(f" fastbins total: {int(agg.get('size')):,}")
agg = root.find("./total[@type='mmap']")
print(f" mmap chunks: {int(agg.get('count')):,} size={int(agg.get('size')):,}")
Sample run on a Python 3.11 process under glibc 2.35 after the test allocations above:
== arenas ==
arena 0: system= 4,902,912 fastbins= 0 rest= 2,184 mmap= 0
== aggregate ==
fastbins total: 0
mmap chunks: 0 size=0
Three load-bearing things to read out of this output. system=4,902,912 is the total bytes the kernel has handed this arena via brk — not the live working set, but everything the allocator is holding. rest=2,184 is the unallocated remainder inside that 4.9 MB — only ~2 KB is genuinely free as far as the allocator can tell. The other ~4.9 MB is sitting in chunks the allocator does not even consider candidates for return. mmap chunks=0 confirms nothing crossed the 128 KB threshold; everything stayed in the arena heap, where it cannot be returned without top-chunk shrinkage. Why this is the diagnostic: a healthy ptmalloc workload shows rest close to system after a free burst (the freed memory is reusable), and the difference between RSS (from pmap) and system (from malloc_info) tells you how much the allocator has handed back to the kernel. A bad workload shows system climbing while RSS climbs in lock-step — the allocator is hoarding.
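Two of these failure modes have partial mitigations you can apply from inside a running process, without recompiling anything. The sketch below is a hedged ctypes example against glibc only: the constant -3 for M_MMAP_THRESHOLD comes from glibc's <malloc.h>, and the 1 MB threshold is illustrative rather than a recommendation.
# glibc_mitigations.py — sketch: two ptmalloc2 mitigations from inside a Python process.
# Assumes glibc is the active allocator; elsewhere these calls fail harmlessly.
import ctypes
libc = ctypes.CDLL("libc.so.6", use_errno=True)
# Mitigation 1: push the mmap threshold to 1 MB so 100-200 KB parses stay in the
# arena heap instead of paying a ~5 µs mmap/munmap round trip each time.
M_MMAP_THRESHOLD = -3                       # from glibc <malloc.h>
libc.mallopt.argtypes = [ctypes.c_int, ctypes.c_int]
if libc.mallopt(M_MMAP_THRESHOLD, 1 << 20) != 1:
    print("mallopt failed — probably not running under glibc malloc")
# Mitigation 2: after a large free burst, ask ptmalloc to hand free pages back.
# Since glibc 2.8 malloc_trim() also madvises whole free pages in the middle of
# the heap, so it helps even when the top chunk is pinned by a small allocation.
libc.malloc_trim.argtypes = [ctypes.c_size_t]
released = libc.malloc_trim(0)              # returns 1 if anything was released
print(f"malloc_trim released memory: {bool(released)}")
The bounded-tcache problem, by contrast, is tunable only from outside the process: GLIBC_TUNABLES=glibc.malloc.tcache_count=64 raises the per-size-class cap before the arena-lock slow path kicks in.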
jemalloc — per-CPU arenas and aggressive page return
jemalloc was originally written by Jason Evans for FreeBSD's libc, and adopted by Facebook for production services in 2009. Its design philosophy is "every thread allocates concurrently and frequently — so make the per-thread structure large and the shared structure small". Modern jemalloc (5.x) goes further: arenas can be bound one per CPU (opt.percpu_arena) instead of being shared round-robin across threads, and its size classes are fine-grained enough (232 on 64-bit builds) that the per-thread cache (tcache) almost always holds a free block of exactly the right class.
The fast path for je_malloc(64): round 64 up to the nearest size class (64 itself, since jemalloc has classes at 8, 16, 32, 48, 64, 80, 96, ... up to 14 KB), look up the thread's tcache slot for that class, pop the head of the free list, return. No atomic operations; no syscalls; no shared-data touches. This is roughly 8–15 ns on a modern x86-64 part — within a factor of 2 of L1 cache latency. The slow path runs when the tcache is empty: refill from the per-CPU arena, which holds large pre-mapped extents partitioned into runs by size class. Only when an arena runs out of pages does jemalloc call mmap for a new extent.
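You can watch that rounding happen: jemalloc exports nallocx(size, 0), which returns the real usable size a malloc of that request would consume, without allocating anything. A minimal sketch, assuming Ubuntu's libjemalloc.so.2 path (the call works even when jemalloc is not the preloaded allocator, since it is a pure size-class computation):
# je_size_classes.py — sketch: observe jemalloc's size-class rounding via nallocx.
import ctypes
je = ctypes.CDLL("/usr/lib/x86_64-linux-gnu/libjemalloc.so.2")
je.nallocx.argtypes = [ctypes.c_size_t, ctypes.c_int]
je.nallocx.restype = ctypes.c_size_t
for req in (1, 8, 17, 33, 64, 65, 100, 129, 4_097, 100_000):
    # nallocx reports the size class malloc(req) would actually round up to
    print(f"malloc({req:>7}) -> size class {je.nallocx(req, 0)}")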
The page return story is what most operators love jemalloc for. jemalloc periodically scans dirty pages (pages that were once allocated and are now free) and calls madvise(MADV_DONTNEED) or madvise(MADV_FREE) to tell the kernel "these can be reclaimed". Two tunables — opt.dirty_decay_ms (default 10s) and opt.muzzy_decay_ms (default 10s) — control how aggressively. RSS responds within seconds of a workload tapering, instead of staying inflated forever. Why this matters operationally: a service whose RSS shrinks responsively can be packed denser onto a node. Razorpay can run 4 payment workers per pod instead of 2 because they can trust that an idle worker's RSS will fall to its working set within 30 seconds. The hidden cost win is larger than the latency win.
The trade-off is observability confusion. MADV_FREE (kernel 4.5+) marks pages as "kernel may reclaim or keep"; until the kernel actually reclaims, the pages still count as resident in RSS. A team switching to jemalloc often sees their RSS dashboard climb 20–30% on day one, panic, and roll back — when in reality the live working set is unchanged and the kernel will reclaim the pages the moment any other process needs them. The proper diagnostic is pmap -X looking at the Anonymous and LazyFree columns, not top's RES.
jemalloc's stats interface is rich. The mallctl API exposes 200+ tunables and statistics; the malloc_stats_print function dumps them as a human-readable text block:
# jemalloc_stats.py — invoke a Python process under jemalloc, dump its stats.
# Requires libjemalloc2 installed; on Ubuntu: sudo apt install libjemalloc2
# Run as: LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 python3 jemalloc_stats.py
import ctypes, os, sys
je_path = "/usr/lib/x86_64-linux-gnu/libjemalloc.so.2"
if "jemalloc" not in (os.environ.get("LD_PRELOAD","") or ""):
print("Re-exec under jemalloc...", file=sys.stderr)
os.execvpe(sys.executable,
[sys.executable, __file__],
{**os.environ, "LD_PRELOAD": je_path})
je = ctypes.CDLL(je_path)
# Allocate 200k 96-byte objects, free half — same workload as the glibc demo
junk = [bytes(96) for _ in range(200_000)]
del junk[::2]
# malloc_stats_print(write_cb, cbopaque, opts) — NULL write_cb means write to stderr
je.malloc_stats_print.argtypes = [ctypes.c_void_p, ctypes.c_void_p, ctypes.c_char_p]
je.malloc_stats_print.restype = None
je.malloc_stats_print(None, None, None)  # NULL opts = full stats (output truncated below)
# mallctl(name, oldp, oldlenp, newp, newlen) — declare argtypes so pointers and
# size_t values cross the FFI boundary correctly
je.mallctl.argtypes = [ctypes.c_char_p, ctypes.c_void_p,
                       ctypes.POINTER(ctypes.c_size_t),
                       ctypes.c_void_p, ctypes.c_size_t]
# Read a tunable: opt.dirty_decay_ms
val = ctypes.c_ssize_t(0)
sz = ctypes.c_size_t(ctypes.sizeof(val))
ret = je.mallctl(b"opt.dirty_decay_ms", ctypes.byref(val),
                 ctypes.byref(sz), None, 0)
print(f"\nopt.dirty_decay_ms = {val.value} (mallctl ret={ret})", file=sys.stderr)
# Read live stats: stats.allocated (currently allocated bytes)
# Need to refresh the epoch first
epoch = ctypes.c_uint64(1)
sz_ep = ctypes.c_size_t(ctypes.sizeof(epoch))
je.mallctl(b"epoch", ctypes.byref(epoch), ctypes.byref(sz_ep),
ctypes.byref(epoch), sz_ep)
allocated = ctypes.c_size_t(0)
sz_al = ctypes.c_size_t(ctypes.sizeof(allocated))
je.mallctl(b"stats.allocated", ctypes.byref(allocated),
ctypes.byref(sz_al), None, 0)
print(f"stats.allocated = {allocated.value:,} bytes", file=sys.stderr)
A representative chunk of the output (truncated to the parts that matter for a Swiggy-style autocomplete service):
___ Begin jemalloc statistics ___
Allocated: 18,432,512, active: 22,020,096, metadata: 1,142,784,
resident: 24,690,688, mapped: 41,943,040, retained: 17,252,352
Background threads: 4 active, num_runs: 12, run_interval: 10000 ms
arenas[0]: 1 dirty pages, 0 muzzy pages, 4 huge allocations
arenas[1]: 4096 dirty pages, 8192 muzzy pages, 0 huge allocations
arenas[2]: 2048 dirty pages, 4096 muzzy pages, 0 huge allocations
...
opt.dirty_decay_ms = 10000 (mallctl ret=0)
stats.allocated = 18,432,512 bytes
Read this output by stacking the size columns: Allocated (live in-use bytes the application sees) < active (bytes in pages currently in use by the allocator, including in tcache) < resident (bytes the kernel reports as RSS) < mapped (bytes the allocator currently has mapped) < mapped + retained (peak addressable footprint). The gap between Allocated and resident is the allocator's overhead; the gap between resident and mapped is what jemalloc has handed back to the kernel via madvise. Why this stacked view matters: it tells you exactly which knob to turn. If resident >> Allocated, decay is not aggressive enough — lower dirty_decay_ms. If mapped >> resident, the kernel has already reclaimed pages and the allocator is fine. If Allocated climbs without bound, you have a leak, not an allocator issue.
The dirty/muzzy distinction comes from jemalloc 5's two-stage decay. Dirty pages are recently freed and held in case the application allocates again; muzzy pages have been advised to the kernel via MADV_FREE (the kernel may keep or reclaim) but jemalloc still tracks them so it can re-use without re-mapping. Pages flow dirty → muzzy → clean over a configurable timer.
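Both decay timers can be set without code changes through jemalloc's MALLOC_CONF environment variable (comma-separated key:value pairs, read at startup). A minimal sketch that re-execs itself with 1-second decay and reads the settings back through mallctl — the 1000 ms values here are illustrative, not a recommendation:
# je_decay_tune.py — sketch: drop jemalloc's decay timers to 1 s and verify via mallctl.
import ctypes, os, sys
JE = "/usr/lib/x86_64-linux-gnu/libjemalloc.so.2"
CONF = "dirty_decay_ms:1000,muzzy_decay_ms:1000"
if os.environ.get("MALLOC_CONF") != CONF:
    os.execvpe(sys.executable, [sys.executable, __file__],
               {**os.environ, "LD_PRELOAD": JE, "MALLOC_CONF": CONF})
je = ctypes.CDLL(JE)
je.mallctl.argtypes = [ctypes.c_char_p, ctypes.c_void_p,
                       ctypes.POINTER(ctypes.c_size_t),
                       ctypes.c_void_p, ctypes.c_size_t]
def read_ssize(name: bytes) -> int:
    val = ctypes.c_ssize_t(0)
    sz = ctypes.c_size_t(ctypes.sizeof(val))
    je.mallctl(name, ctypes.byref(val), ctypes.byref(sz), None, 0)
    return val.value
print("opt.dirty_decay_ms =", read_ssize(b"opt.dirty_decay_ms"))
print("opt.muzzy_decay_ms =", read_ssize(b"opt.muzzy_decay_ms"))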
tcmalloc — per-CPU caches and central spans
tcmalloc (Thread-Caching malloc) is Google's allocator, originally part of the gperftools project and now maintained as a separate project. The "T" originally meant "thread" but modern tcmalloc has migrated to per-CPU caches, using the Linux rseq (restartable sequences) syscall to make the fast path lock-free without atomics. This is the cleverest single optimisation in any of the four allocators.
The fast-path mechanism deserves a closer look because it is novel. A traditional per-thread cache uses pthread_self() to find the thread's cache; this requires a TLS access (one or two memory loads). A traditional per-CPU cache requires reading the current CPU number via sched_getcpu() or RDTSCP, both of which are slower than TLS. The rseq trick is to mark a small region of code as a "restartable sequence": the kernel guarantees that if the thread is preempted or migrated to a different CPU during that region, the kernel rolls back the program counter to the start of the region. The allocator can then read cpu_id from a userspace variable, look up the cache for that CPU, and pop a free block — all without atomics, because the kernel guarantees the operation either completes on one CPU or is restarted from scratch. The fast path becomes ~11 ns and stays that way regardless of thread migration.
The other distinguishing feature is the dynamic per-CPU cache sizing. Most allocators have fixed cache capacities; tcmalloc tracks the hit rate per size class per CPU and grows the cache for hot classes while shrinking it for cold ones. A workload that allocates 99% 64-byte objects and 1% 1024-byte objects will end up with a large 64-byte cache and a small 1024-byte cache — the cache adapts to the workload's allocation distribution. The downside is that tcmalloc's metadata footprint grows with the number of CPUs × size classes; on a 96-core EPYC, the per-CPU caches alone are tens of MB. For services with many small instances, this overhead can dominate; for services with one large instance per node, it disappears into the noise.
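The preloadable gperftools build (libtcmalloc.so.4) exposes its cache and page-heap counters through a C shim around MallocExtension. The function and property names below come from gperftools' malloc_extension_c.h and documented property list; treat them as assumptions to verify against your installed version, and note that this build still uses per-thread caches rather than the newer rseq per-CPU design:
# tcmalloc_props.py — sketch: read tcmalloc's numeric properties from Python.
# Run as: LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4 python3 tcmalloc_props.py
import ctypes
tc = ctypes.CDLL("/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4")
tc.MallocExtension_GetNumericProperty.argtypes = [ctypes.c_char_p,
                                                  ctypes.POINTER(ctypes.c_size_t)]
tc.MallocExtension_GetNumericProperty.restype = ctypes.c_int
junk = [bytes(96) for _ in range(200_000)]   # same workload as the other demos
del junk[::2]
for prop in (b"generic.current_allocated_bytes",
             b"generic.heap_size",
             b"tcmalloc.pageheap_free_bytes",
             b"tcmalloc.current_total_thread_cache_bytes",
             b"tcmalloc.max_total_thread_cache_bytes"):
    val = ctypes.c_size_t(0)
    if tc.MallocExtension_GetNumericProperty(prop, ctypes.byref(val)):
        print(f"{prop.decode():<45} {val.value:>14,}")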
tcmalloc's profiling story is its other strength. It includes a heap profiler that samples allocations (one in N) and records the call stack; combined with pprof (the same tool Go uses), this lets you produce a "where is my heap going" flamegraph from a running production service:
# tcmalloc_heapprof.py — start tcmalloc's heap profiler, do work, dump.
# Requires LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4
# tcmalloc honours HEAPPROFILE env var to enable sampled heap profiling.
import os, sys, ctypes, subprocess, time
if not (os.environ.get("LD_PRELOAD","") or "").endswith("libtcmalloc.so.4"):
print("Re-exec under tcmalloc + HEAPPROFILE...", file=sys.stderr)
os.execvpe(sys.executable, [sys.executable, __file__],
{**os.environ,
"LD_PRELOAD": "/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4",
"HEAPPROFILE": "/tmp/swiggy.hprof",
"HEAP_PROFILE_ALLOCATION_INTERVAL": "1048576"}) # 1 MB sample
# Simulate Swiggy autocomplete: build a 200 MB inverted-index-ish structure
print("Building index...", file=sys.stderr)
t0 = time.perf_counter()
index = {}
for shard in range(40):
    index[f"shard_{shard}"] = {
        f"term_{i}": [hash((shard, i, j)) & 0xFFFF for j in range(50)]
        for i in range(2_000)
    }
print(f"Built in {time.perf_counter()-t0:.2f}s", file=sys.stderr)
# Force a named heap dump now (auto-dumps already happen at the 1 MB
# HEAP_PROFILE_ALLOCATION_INTERVAL set above)
ctypes.CDLL("libtcmalloc.so.4").HeapProfilerDump(b"end-of-build")
print("Heap profiles written to /tmp/swiggy.hprof.*", file=sys.stderr)
# To analyse: pprof --text python3 /tmp/swiggy.hprof.0001.heap
files = sorted(f for f in os.listdir("/tmp") if f.startswith("swiggy.hprof"))
print(f"Generated {len(files)} profile files", file=sys.stderr)
The output (paraphrased — pprof renders this as a flat text table or as a flamegraph SVG):
Total: 198.4 MB
142.3 MB 71.7% 71.7% 142.3 MB 71.7% PyDict_SetItemString
31.2 MB 15.7% 87.4% 31.2 MB 15.7% PyList_Append
18.1 MB 9.1% 96.5% 18.1 MB 9.1% _PyObject_Malloc
6.8 MB 3.5% 100.0% 6.8 MB 3.5% PyUnicode_New
The fact that tcmalloc can sample allocations cheaply (roughly one sampled allocation per MB allocated at the configured interval — the cost is one extra branch on the malloc fast path) and produce a pprof-format profile makes it a strong default for services that need ongoing memory introspection. jemalloc has a similar feature (heap profiling dumped via the prof.dump mallctl) but the workflow is less polished.
mimalloc — free-list sharding inside per-thread segments
mimalloc is Microsoft Research's allocator, published in 2019. Its design goal was "the simplest fast allocator", and the result is roughly 6,000 lines of C — small enough to read in an afternoon. The headline trick is free-list sharding: each per-thread segment holds free lists per page (small fixed regions of the segment), so freeing an object only touches the page's free list, not a global one. Two threads freeing into different pages do not contend at all, even if they share an arena.
The other innovation is the split between the page's free lists. Every page keeps three: a free list that malloc pops from, a local_free list that the owning thread pushes frees onto, and a thread_free list that other threads push onto with a single atomic exchange. When the free list runs empty, the allocator swaps free := local_free, local_free := nil — an owner-only operation, no atomics — and collects thread_free in one atomic grab. Keeping local_free separate from free also guarantees the free list empties at regular intervals, forcing the allocator through its generic path where deferred work (such as collecting cross-thread frees) happens deterministically. The result is that the fast path stays lock-free even under heavy cross-thread free patterns — exactly the workload that degrades every other allocator's fast path.
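The list juggling is easier to see in a toy model than in prose. The sketch below is not mimalloc source — it is a simplified Python model of one page's three lists, with a lock standing in for the single atomic exchange real mimalloc uses on the cross-thread list:
# free_list_sharding_toy.py — illustrative model of mimalloc's per-page lists.
import threading

class Page:
    def __init__(self, blocks):
        self.free = list(blocks)        # malloc pops from here (owner thread, no atomics)
        self.local_free = []            # owner-thread frees land here (no atomics)
        self.thread_free = []           # cross-thread frees land here (one atomic op)
        self._xchg = threading.Lock()   # stand-in for the atomic exchange

    def malloc(self):
        if not self.free:                                        # generic path
            self.free, self.local_free = self.local_free, []     # owner-only swap
            with self._xchg:                                     # one atomic grab
                self.free += self.thread_free
                self.thread_free = []
        return self.free.pop() if self.free else None            # None -> refill from segment

    def free_block(self, block, owner_thread=True):
        if owner_thread:
            self.local_free.append(block)     # never contends with other threads
        else:
            with self._xchg:                  # the only cross-thread touch on this page
                self.thread_free.append(block)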
mimalloc's operational story is "be predictable". Page sizes are fixed (64 KB on x86-64); segments are fixed (4 MB); allocations are bucketed into 73 size classes covering 8 bytes to 16 KB; large allocations go to a separate "huge" path. There is little dynamic tuning — what you see at startup is roughly what you get at steady state. This makes mimalloc easy to reason about: when a service's RSS is at 600 MB after warmup, you can predict it will stay near 600 MB regardless of load shape, because the segment count and size class distribution stabilise quickly.
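mimalloc's introspection is correspondingly simple: the documented MIMALLOC_SHOW_STATS environment variable makes it print an allocator statistics table when the process exits. A minimal sketch, assuming the Ubuntu 22.04 library path and the same workload shape as the earlier demos:
# mimalloc_stats.py — sketch: run the demo workload and let mimalloc report itself.
import os, sys
MI = "/usr/lib/x86_64-linux-gnu/libmimalloc.so.2.0"
if os.environ.get("MIMALLOC_SHOW_STATS") != "1":
    os.execvpe(sys.executable, [sys.executable, __file__],
               {**os.environ, "LD_PRELOAD": MI, "MIMALLOC_SHOW_STATS": "1"})
junk = [bytes(96) for _ in range(200_000)]   # same shape as the glibc/jemalloc demos
del junk[::2]
# On exit, mimalloc prints its internal statistics table to stderr; compare it
# with the RSS numbers from pmap -X on the same process.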
The "Reproduce this on your laptop" path:
# four_allocator_compare.py — same workload, four allocators.
# Requires: libjemalloc2 libgoogle-perftools4 libmimalloc2.0 (Ubuntu 22.04)
# sudo apt install libjemalloc2 libgoogle-perftools4 libmimalloc2.0
import os, sys, time, gc, json
ALLOC_PATHS = {
    "glibc": "",
    "jemalloc": "/usr/lib/x86_64-linux-gnu/libjemalloc.so.2",
    "tcmalloc": "/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4",
    "mimalloc": "/usr/lib/x86_64-linux-gnu/libmimalloc.so.2.0",
}

def workload(N=300_000):
    """Swiggy-shaped: build a per-request dict-of-lists, drop half."""
    out = []
    for i in range(N):
        d = {"id": i, "merchant": f"swiggy-{i % 1000}",
             "items": [{"sku": f"sku-{j}", "qty": j & 7} for j in range(8)],
             "tags": ["lunch", "veg", "fast"]}
        out.append(d)
    del out[::2]  # free half — produces fragmentation
    return len(out)

def get_rss_kb():
    with open(f"/proc/{os.getpid()}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return -1
if len(sys.argv) > 1 and sys.argv[1] == "child":
    label = sys.argv[2]
    gc.disable()  # don't let CPython's GC confuse the picture
    rss_before = get_rss_kb()
    t0 = time.perf_counter()
    n = workload()
    elapsed = time.perf_counter() - t0
    rss_after = get_rss_kb()
    print(json.dumps({"label": label, "elapsed_s": round(elapsed, 3),
                      "n_kept": n, "rss_before_kb": rss_before,
                      "rss_after_kb": rss_after}))
    sys.exit(0)
# Parent: spawn one child per allocator
import subprocess
for label, path in ALLOC_PATHS.items():
    env = {**os.environ}
    if path:
        env["LD_PRELOAD"] = path
    r = subprocess.run([sys.executable, __file__, "child", label],
                       env=env, capture_output=True, text=True)
    sys.stdout.write(r.stdout)
A representative run on a c6i.2xlarge (8 vCPU, Ubuntu 22.04, Python 3.11):
{"label": "glibc", "elapsed_s": 1.842, "n_kept": 150000, "rss_before_kb": 11264, "rss_after_kb": 412160}
{"label": "jemalloc", "elapsed_s": 1.391, "n_kept": 150000, "rss_before_kb": 11520, "rss_after_kb": 318976}
{"label": "tcmalloc", "elapsed_s": 1.318, "n_kept": 150000, "rss_before_kb": 11392, "rss_after_kb": 305152}
{"label": "mimalloc", "elapsed_s": 1.247, "n_kept": 150000, "rss_before_kb": 11648, "rss_after_kb": 287744}
Three observations to walk through. glibc's 1.842s vs mimalloc's 1.247s is a 1.48× speedup on the same workload, same Python version, same machine. The only thing that changed was LD_PRELOAD. Everything you read about jemalloc / tcmalloc / mimalloc being "faster than glibc" is roughly this gap on a Python service that materialises per-request dicts. The RSS column is the more interesting story: glibc 412 MB vs mimalloc 287 MB — a 30% RSS difference for the same 300k allocated dicts and 150k retained. That gap compounds over many pods: a Swiggy autocomplete fleet of 200 pods saves 25 GB of RAM by switching allocator. jemalloc and tcmalloc cluster tightly — within 5% on both metrics. They make different design choices but converge on roughly the same throughput and footprint. The choice between them is usually about ecosystem (pprof for tcmalloc, mallctl for jemalloc) rather than raw numbers.
Why mimalloc tends to win on RSS in microbenchmarks like this: its page-level free-list sharding means freed objects can be reused immediately by the same page even if other threads are also freeing into it, so fewer pages need to be allocated in total. jemalloc and tcmalloc are typically optimised for throughput at the cost of some memory; mimalloc tries to optimise both at once. In long-running production workloads the gap narrows because all three allocators converge once the working set stabilises. A microbenchmark win is not automatically a production win.
Common confusions
- "All four allocators have per-thread caches, so they perform similarly." Half right. They all have caches, but the fast path differs — glibc's tcache uses a TLS read + bin walk (~20 ns), jemalloc uses TLS + size-class lookup (~15 ns), tcmalloc uses rseq + per-CPU cache (~11 ns), mimalloc uses TLS + page-local free list (~10 ns). Under a single-threaded workload these gaps are small. Under high concurrency the gaps widen because the slow paths differ even more.
- "jemalloc is just 'better glibc'." No — jemalloc trades RSS for throughput more aggressively, and its
MADV_FREEbehaviour confuses RSS dashboards. A team migrating from glibc to jemalloc must update their memory monitoring and possibly increase pod memory limits during the transition. Drop-in is not zero-cost. - "tcmalloc requires recompiling." Not anymore.
LD_PRELOAD=/usr/lib/.../libtcmalloc.so.4works for any glibc-linked binary, including Python. Thegperftoolsapt package installs everything you need on Ubuntu/Debian. - "mimalloc is experimental." It is production at Microsoft (Azure services) and at multiple Indian fintechs (Razorpay's Go services — though Go has its own allocator, the C dependencies link mimalloc). The 2019 paper has aged well; the codebase has had four years of polish since.
- "You can switch allocators with LD_PRELOAD without testing." No. CPython has
pymallocfor objects under 512 bytes — those are unaffected. Java has its own heap and only uses the system allocator for native code (JNI, NIO buffers). Go has its own allocator entirely.LD_PRELOADonly changes things for code that actually callsmalloc/free; checkpmap -Xand a CPU profile before and after to verify the change took effect. - "The allocator's RSS column is my memory pressure." With jemalloc on kernel 4.5+,
MADV_FREE-marked pages count as resident in/proc/<pid>/statusuntil the kernel reclaims them — which only happens under memory pressure. Your "high RSS" might be the kernel deferring reclamation, not the allocator hoarding. Cross-check with/proc/meminfo'sMemAvailableandpmap -X'sLazyFreecolumn.
Going deeper
How to choose: a decision rubric
Pick allocator by load shape, not by reputation.
- Single-threaded or low-thread service, low alloc rate, RSS-sensitive deployment (e.g. a small Razorpay webhook handler at 100 QPS, packed dense on a node): glibc with GLIBC_TUNABLES=glibc.malloc.tcache_count=64:glibc.malloc.mxfast=80 is fine. Switching allocator gains little.
- Many threads, high alloc rate, throughput-critical (e.g. Swiggy's restaurant-search service at 30k QPS with per-request object materialisation): tcmalloc or jemalloc; both are 1.3–1.5× over glibc and have rich introspection. Pick tcmalloc if you already use pprof; pick jemalloc if you want mallctl for runtime tuning.
- Long-running service where RSS matters more than peak throughput (e.g. a Hotstar manifest cache that must run for weeks without restart): jemalloc with aggressive dirty_decay_ms and muzzy_decay_ms (drop both to 1000 ms) returns RAM responsively.
- Memory-sensitive embedded or sidecar (e.g. a Vector log shipper running as a Kubernetes DaemonSet on every node): mimalloc — smallest binary, lowest RSS, simplest to reason about.
- Anything that crosses NUMA boundaries: tcmalloc's per-CPU caches plus numactl --cpunodebind --membind is the cleanest combination; jemalloc with opt.percpu_arena:percpu is comparable.
The hidden cost: metadata footprint
All four allocators have per-CPU or per-thread metadata that does not appear in your application's heap but does appear in the process RSS. tcmalloc on a 96-core EPYC box uses ~80 MB of per-CPU cache metadata at startup (88 size classes × 96 CPUs × ~10 KB cache each). jemalloc with 192 arenas (one per logical CPU) uses ~50 MB. mimalloc's segments are 4 MB each so the floor is roughly 4 MB × num_threads. glibc's per-thread arenas are dynamically created and start small (~1 MB) but accumulate. For a small service (200 MB live working set), mimalloc's overhead is ~5%; tcmalloc's is ~40%. For a large service (10 GB working set), all four are below 1%. The decision rubric above flips at small RSS.
When the allocator is in the kernel — kmalloc and slab
Inside the Linux kernel, the same trade-offs play out at a different scale. The slab allocator (SLUB by default since 2.6.23) uses per-CPU active slabs (the cpu_slab field of kmem_cache) with no locking on the fast path, plus per-node partial lists for refill. This is structurally identical to tcmalloc's per-CPU cache and central span, just inside the kernel. The named caches you can see in /proc/slabinfo (kmalloc-64, dentry, inode_cache) are the kernel's equivalent of jemalloc's size classes. A networking workload pushing 1M packets/sec hits the same wall: sk_buff allocations from the slab can dominate CPU profiles, fixed by tuning slab cache sizes via /sys/kernel/slab/<name>/cpu_partial. The wisdom transfers: caches per-CPU, refill in batches, free-list sharding.
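The same "which size class is eating memory" question can be asked of the kernel. A hedged sketch that ranks slab caches by approximate footprint — /proc/slabinfo is normally readable only by root, and the column layout assumed here is the "slabinfo - version: 2.1" format:
# slab_top.py — sketch: rank kernel slab caches by footprint (run as root).
rows = []
with open("/proc/slabinfo") as f:
    next(f); next(f)                      # skip the version line and the column header
    for line in f:
        fields = line.split()
        name, num_objs, objsize = fields[0], int(fields[2]), int(fields[3])
        rows.append((num_objs * objsize, name, num_objs, objsize))
print(f"{'cache':<24}{'objects':>12}{'objsize':>9}{'approx bytes':>14}")
for total, name, num_objs, objsize in sorted(rows, reverse=True)[:10]:
    print(f"{name:<24}{num_objs:>12,}{objsize:>9}{total:>14,}")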
Reproduce this on your laptop
sudo apt install libjemalloc2 libgoogle-perftools4 libmimalloc2.0
python3 -m venv .venv && source .venv/bin/activate
pip install hdrh
# Run the four-allocator comparison
python3 four_allocator_compare.py
# Dump glibc internal stats
python3 malloc_info_dump.py
# Dump jemalloc stats (re-execs under jemalloc)
python3 jemalloc_stats.py
You should see throughput vary by 1.3–1.5× across allocators on the autocomplete-shaped workload, and RSS vary by 25–40%. If your numbers differ wildly (5×+ throughput change), your CPU is being throttled by cpufreq — pin to a fixed frequency with sudo cpupower frequency-set -g performance and re-run.
Why none of the four solves NUMA cleanly
All four allocators try to keep allocations local to the CPU that requested them, but none of them reliably co-locate the allocator metadata with the memory it manages. On a dual-socket EPYC Genoa (Razorpay's primary database tier), an allocation requested by a thread on socket 1 may end up backed by physical pages on socket 0, depending on which arena had free pages and where the kernel placed them. The fix is numactl --membind=1 python3 service.py at the process level, which forces all mmap-backed allocations onto socket 1. jemalloc's opt.percpu_arena:percpu plus numactl is the cleanest combination; tcmalloc's per-CPU caches are similar. The deep tour lives at /wiki/numa-aware-allocators-and-data-structures.
Where this leads next
This chapter dissected the four allocators side by side. The next chapters in this part go deeper on the patterns that bypass malloc entirely (object pools, arena allocators), on fragmentation diagnostics that tell you whether your allocator is healthy or hoarding, and on the case where memory is plural (NUMA).
The reading order, in roughly the order a debugging session needs them:
- /wiki/jemalloc-vs-tcmalloc-vs-mimalloc — the production comparison with throughput and RSS curves on real workloads.
- /wiki/object-pools-and-arena-allocators — when to bypass malloc entirely.
- /wiki/fragmentation-internal-vs-external — why your RSS grows even when your live-set does not.
- /wiki/numa-aware-allocators-and-data-structures — what changes when memory is plural.
- /wiki/wall-memory-allocators-can-dominate-i-o — the prior chapter; the wall this one explains how to climb.
References
- jemalloc(3) manual — the canonical reference; read §"OPTIONS" for opt.percpu_arena, opt.dirty_decay_ms, opt.muzzy_decay_ms.
- TCMalloc design notes — Google's per-CPU cache design with the rseq mechanism explained.
- Daan Leijen, Benjamin Zorn, Leonardo de Moura, "Mimalloc: Free List Sharding in Action" (Microsoft Research, 2019) — the original mimalloc paper; the cleanest writeup of what makes a fast allocator.
- Doug Lea, "A Memory Allocator" (1996) — dlmalloc, the ancestor of glibc's ptmalloc. Foundational reading on bins and the boundary tag method.
- glibc malloc internals — the official walkthrough of arenas, fastbins, tcache, and the M_MMAP_THRESHOLD flip.
- Brendan Gregg, Systems Performance (2nd ed., 2020), Ch. 7 "Memory" — the binding reference for this chapter and the rest of Part 11.
- Mathieu Desnoyers et al., "Restartable Sequences" (Linux kernel patch series) — the rseq mechanism that makes tcmalloc's lock-free per-CPU fast path possible.
- /wiki/wall-memory-allocators-can-dominate-i-o — the prior chapter; the production motivation for caring about allocator internals.