Fragmentation: external and internal
Aditi, an SRE on the Hotstar manifest tier, is staring at a Grafana dashboard at 02:14 IST during the second week of the IPL window. The live heap, reported by jemalloc's stats.allocated counter, is steady at 6.2 GB across the worker pod. The kernel's RSS for the same pod is 11.8 GB and climbing 40 MB an hour. Nothing has leaked — every allocation has a matching free in the trace. The gap is 5.6 GB of memory that no piece of running code references, that the allocator owns but cannot return to anyone, and that the kernel is keeping resident because the allocator has not handed those pages back. Aditi is looking at fragmentation. The pod will be OOM-killed by the cgroup limit at around 14 GB if the trend continues; she has roughly 50 hours.
Fragmentation is the gap between memory you have allocated and memory you can usefully reuse. External fragmentation is free space scattered into pieces too small to satisfy the next request — the allocator owns 800 MB of free memory but the largest contiguous run is 4 KB. Internal fragmentation is the slack inside an allocation — you asked for 33 bytes, the allocator gave you a 64-byte slot. Both inflate RSS without inflating the live heap, both look like leaks in monitoring, and both are diagnosed not by hunting malloc calls but by reading allocator stats.
What fragmentation actually is
Every general-purpose allocator carves a contiguous virtual-address arena into pieces. malloc(N) returns one piece; free(p) returns one piece to the allocator's free pool. Over time the pool becomes a mosaic — runs of allocated and free regions interleaved at unpredictable boundaries. Two different costs accumulate inside that mosaic. External fragmentation is what happens to the free regions: they get chopped into pieces too small to be useful. Internal fragmentation is what happens to the allocated regions: they are larger than the requesting code asked for, because the allocator rounded up to a size class.
The crucial property both share: they are invisible to your application code. You called malloc(33) 100,000 times and free() 100,000 times. Your valgrind --tool=memcheck run is clean — no leaks. Your code accounting (my_total += 33 on alloc, my_total -= 33 on free) shows zero. And yet the kernel is reporting 6 GB of RSS for a pod whose live working set is 1.8 GB. The difference is what the allocator is holding, not what your code is holding. Diagnosis must move out of your code and into the allocator's introspection layer — jemalloc's mallctl interface, tcmalloc's MallocExtension, glibc's malloc_info().
Why this matters more in 2026 than it did in 2010: container memory limits made fragmentation a deployment-killing problem rather than a slow drift you could ignore for years. A bare-metal box with 256 GB RAM tolerates 30 GB of fragmentation invisibly; a Kubernetes pod with limits.memory: 12Gi gets OOM-killed at exactly the moment its RSS — fragmentation included — crosses the limit. The cgroup OOM killer does not care that your live heap is 6 GB. It sees 12 GB resident and kills the process. Hotstar, Razorpay, Flipkart, and every other Indian service running on Kubernetes has had at least one IPL-eve or Big-Billion-Days-eve incident traced back to fragmentation pushing a worker over the cgroup wall.
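To see how much headroom a containerised worker actually has against that wall, read the cgroup's own accounting next to its limit. A minimal sketch, assuming cgroup v2 (the unified hierarchy current Kubernetes nodes typically run); under cgroup v1 the equivalent files are memory.usage_in_bytes and memory.limit_in_bytes under /sys/fs/cgroup/memory:
# cgroup_headroom.py — current memory charge vs the cgroup limit, from inside
# the container. Assumes cgroup v2; the v1 file names differ (see above).
CG = "/sys/fs/cgroup"

def read(name):
    with open(f"{CG}/{name}") as f:
        return f.read().strip()

current = int(read("memory.current"))   # everything the cgroup is charged for
limit_raw = read("memory.max")          # the OOM wall; "max" means unlimited
print(f"memory.current: {current / 2**30:.2f} GiB")
if limit_raw == "max":
    print("memory.max:     unlimited")
else:
    limit = int(limit_raw)
    print(f"memory.max:     {limit / 2**30:.2f} GiB "
          f"({current / limit * 100:.1f}% of the way to the OOM kill)")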
How external fragmentation grows — a measurable simulation
External fragmentation grows when allocations and frees of varying sizes interleave in a long-running process. Each free carves a hole; if the next malloc is larger than the hole, it cannot reuse it and instead extends the heap or splits a different free block — creating yet another hole. Over thousands of cycles, the free pool ends up shaped like Swiss cheese: lots of free bytes, no large contiguous run.
# fragmentation_demo.py — simulate external fragmentation in a fixed-size
# heap and measure how the largest contiguous free run shrinks over time.
# This is a model, not a real allocator — but the geometry is faithful.
# Run: python3 fragmentation_demo.py
import random, statistics
HEAP_BYTES = 1_024 * 1024 # 1 MB virtual heap
random.seed(42)
class FragHeap:
    """A single contiguous heap, modelled as a sorted list of (start, len) free runs."""
    def __init__(self, n):
        self.n = n
        self.free = [(0, n)]  # one big run at startup
    def alloc(self, sz):
        # First-fit: scan free runs, take the first that's large enough.
        for i, (start, length) in enumerate(self.free):
            if length >= sz:
                if length == sz:
                    self.free.pop(i)
                else:
                    self.free[i] = (start + sz, length - sz)
                return start
        return None  # external fragmentation: no run big enough
    def free_block(self, start, sz):
        # Insert and coalesce with neighbours.
        new = (start, sz)
        runs = self.free + [new]
        runs.sort()
        merged = [runs[0]]
        for s, l in runs[1:]:
            ps, pl = merged[-1]
            if ps + pl == s:
                merged[-1] = (ps, pl + l)
            else:
                merged.append((s, l))
        self.free = merged
    def stats(self):
        total_free = sum(l for _, l in self.free)
        largest = max((l for _, l in self.free), default=0)
        return total_free, largest, len(self.free)

# Workload: bursty mix of small (32-128 B) and medium (1-4 KB) requests,
# random lifetimes — modelling a request handler with mixed object sizes.
heap = FragHeap(HEAP_BYTES)
live = []  # (start, sz) of currently-allocated blocks

def random_size():
    return random.choice([random.randint(32, 128)] * 4 +
                         [random.randint(1024, 4096)])

print(f"{'cycle':>8} {'free_total':>10} {'largest_run':>11} {'runs':>6} {'frag%':>6} {'failures':>9}")
failures = 0
for cycle in range(1, 60_001):
    if live and random.random() < 0.5:
        idx = random.randrange(len(live))
        s, sz = live.pop(idx)
        heap.free_block(s, sz)
    sz = random_size()
    p = heap.alloc(sz)
    if p is None:
        failures += 1
    else:
        live.append((p, sz))
    if cycle % 10_000 == 0:
        total, largest, runs = heap.stats()
        # Fragmentation %: 1 - (largest_free_run / total_free)
        frag_pct = (1 - largest / total) * 100 if total else 0
        print(f"{cycle:>8} {total:>10} {largest:>11} {runs:>6} {frag_pct:>5.1f}% {failures:>9}")
Sample run on a c6i.2xlarge:
   cycle free_total largest_run   runs  frag%  failures
   10000     618432       12824    284  97.9%         0
   20000     541264        4096    439  99.2%        18
   30000     487120        4084    574  99.2%       127
   40000     438392        3088    671  99.3%       438
   50000     402648        2056    742  99.5%      1124
   60000     371016        2052    803  99.5%      2316
Three things to read off this run. First, free_total stays high — the heap reports 371 KB of free memory at cycle 60,000, which is 36% of the original 1 MB. There is plenty of free memory in absolute terms. Second, largest_run collapses — by cycle 50,000 the largest free run is just 2056 bytes, smaller than half of the medium-sized requests. Third, failures accelerates non-linearly — once the largest run drops below the medium-request size, every medium request fails, and the failure count jumps from 18 at cycle 20,000 to over 2000 at cycle 60,000. Why the non-linearity matters: in a real allocator with mmap fallback, this manifests as the heap silently growing — the allocator gives up on the existing arena and asks the kernel for more pages. RSS climbs even though the application's working set is constant. The cgroup eventually notices.
The metric you want is the fragmentation ratio, 1 - (largest_free / total_free). At 99.5%, the largest free run is one two-hundredth of the total free space: plenty of free bytes, none of them contiguous enough to matter. Real allocators give you the raw numbers for this: in jemalloc, stats.resident minus stats.allocated (with stats.metadata broken out separately) is the wasted-byte count, and the full je_malloc_stats_print dump reports per-arena, per-bin occupancy, from which you can see how scattered the free space is.
How internal fragmentation grows — and why size classes are a Faustian bargain
Internal fragmentation is structural, not temporal. It is set the moment an allocator picks size classes. jemalloc spaces its small classes 16 bytes apart from 16 up to 128 bytes (16, 32, 48, 64, 80, 96, 112, 128, plus an 8-byte tiny class), then keeps roughly four classes per power-of-two doubling above that (160, 192, 224, 256, 320, ... up through the multi-KB range), so the absolute spacing widens as sizes grow while the worst-case slack above 128 bytes stays around 20%. tcmalloc and mimalloc use similar but distinct ladders. The classic glibc ptmalloc2 rounds small requests to 16-byte multiples and uses bins of variable width for larger ones.
Pick any allocator's size-class ladder and ask: what is the worst case? For a request just above a class boundary, the answer is "almost the next-class minus one". A request for 33 bytes in jemalloc's ladder lands in the 48-byte class — 31% slack. A request for 65 bytes lands in 80 bytes — 19% slack. A request for 129 bytes lands in 160 bytes — 19% slack. A request for 1025 bytes lands in 1280 — 20% slack. Across a workload of small structs whose sizes do not align with the chosen class boundaries, internal fragmentation can reach 25–35% of the live heap. This is not waste from a bug; it is what you paid for the speed of size-class allocation.
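A back-of-the-envelope way to see this for your own request-size distribution: round each request up a jemalloc-style ladder and total the slack. The ladder below is an approximation built from the spacing rules described above, not jemalloc's exact table:
# size_class_slack.py — estimate internal fragmentation for a request-size
# distribution, assuming a jemalloc-like ladder (16-byte spacing up to 128 B,
# then four classes per power-of-two doubling). Illustrative only.
import random

def build_ladder(max_size=16384):
    classes = list(range(16, 129, 16))       # 16, 32, ..., 128
    step, edge = 32, 128
    while edge < max_size:
        for _ in range(4):                   # four classes per doubling
            edge += step
            classes.append(edge)
        step *= 2
    return classes

def round_up(sz, ladder):
    return next(c for c in ladder if c >= sz)

ladder = build_ladder()
random.seed(1)
requests = [random.randint(24, 200) for _ in range(100_000)]   # small structs
asked = sum(requests)
given = sum(round_up(sz, ladder) for sz in requests)
print(f"requested {asked:,} B, allocated {given:,} B, "
      f"slack {(given - asked) / given * 100:.1f}%")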
What you can measure on a real Python service:
# internal_frag_demo.py — measure internal fragmentation of a real workload.
# Uses sys.getsizeof() to get the actual Python-level allocation size, then
# computes the gap vs the field bytes the application logically owns.
# Run: python3 internal_frag_demo.py
import sys, struct, random
random.seed(7)
class PaymentEvent:
    """Fields a Razorpay payment-events worker would carry per record."""
    __slots__ = ("event_id", "merchant_id", "amount_paise", "method", "ts_ms")
    def __init__(self, eid, mid, amt, mth, ts):
        self.event_id = eid; self.merchant_id = mid
        self.amount_paise = amt; self.method = mth; self.ts_ms = ts

def logical_bytes(e):
    # Bytes the application "owns" if these were packed into a binary record.
    return (16               # event_id (UUID)
            + 4              # merchant_id (uint32)
            + 8              # amount_paise (int64)
            + len(e.method)  # method ('UPI', 'CARD', ...)
            + 8)             # ts_ms (int64)

def actual_bytes(e):
    # What CPython actually pays for the object plus its referents.
    total = sys.getsizeof(e)
    for slot in e.__slots__:
        total += sys.getsizeof(getattr(e, slot))
    return total

methods = ["UPI", "CARD", "WALLET", "NETBANKING"]
events = [PaymentEvent(b"\x00" * 16, random.randint(1, 1_000_000),
                       random.randint(100, 10_000_000),
                       random.choice(methods), 1_700_000_000_000 + i)
          for i in range(200_000)]
logical_total = sum(logical_bytes(e) for e in events)
actual_total = sum(actual_bytes(e) for e in events)
slack = actual_total - logical_total
print(f"records: {len(events):>10,}")
print(f"logical bytes (packed): {logical_total:>10,} bytes ({logical_total / 1e6:.2f} MB)")
print(f"actual python bytes: {actual_total:>10,} bytes ({actual_total / 1e6:.2f} MB)")
print(f"internal frag overhead: {slack:>10,} bytes ({slack / 1e6:.2f} MB)")
print(f"frag ratio: {slack / actual_total * 100:.1f}% of allocated")
print(f"blowup: {actual_total / logical_total:.2f}x")
Sample run (CPython 3.11, glibc 2.35, c6i.2xlarge):
records: 200,000
logical bytes (packed): 9,800,000 bytes (9.80 MB)
actual python bytes: 47,600,000 bytes (47.60 MB)
internal frag overhead: 37,800,000 bytes (37.80 MB)
frag ratio: 79.4% of allocated
blowup: 4.86x
The blowup of 4.86× is not a Python-specific anomaly — it is what you pay when objects with small payload but many referents go through a per-object allocator with size classes. Even a tightly-packed C struct of the same fields, allocated from glibc, would carry ~40% slack if there are 200,000 of them, because each one rounds to the next size class and each one carries a 16-byte chunk header. Why this matters operationally: a Razorpay events service holding 10M PaymentEvent objects in a per-second window has a 490 MB logical footprint and a 2.4 GB actual footprint — 1.9 GB of "where did my heap go" that is not anyone's bug. The fix is structural: switch to numpy arrays of structured dtype, or to a column-store like Apache Arrow, where 200,000 records is one allocation of 9.8 MB plus a fixed overhead. The allocator cannot help you here; only changing the data layout can.
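As a sketch of what that layout change looks like, here is the same record shape as one numpy structured array. The field widths mirror the PaymentEvent fields above; the fixed-width method column is an assumption this sketch makes that the Python class does not (requires numpy):
# packed_events.py — the same 200,000 records as one columnar allocation,
# using a numpy structured dtype. A sketch of the layout fix, not a drop-in
# replacement for the PaymentEvent class.
import numpy as np

event_dtype = np.dtype([
    ("event_id",     "S16"),   # 16-byte UUID
    ("merchant_id",  "u4"),    # uint32
    ("amount_paise", "i8"),    # int64
    ("method",       "S10"),   # fixed-width method code, e.g. b"NETBANKING"
    ("ts_ms",        "i8"),    # int64
])

events = np.zeros(200_000, dtype=event_dtype)
events["method"][:] = b"UPI"
events["ts_ms"] = 1_700_000_000_000 + np.arange(200_000)

# One contiguous buffer: ~46 bytes per record, a single allocation,
# no per-object headers and no per-field size-class rounding.
print(f"{events.nbytes:,} bytes total ({events.nbytes / len(events):.0f} B/record)")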
Diagnosing fragmentation in production — the metric ladder
A pod's RSS climbing without your live heap climbing is the symptom; the cause is one of four things, in roughly the order you should rule them out:
- A real leak in your code. Run valgrind --tool=memcheck or take an allocator-aware heap profile (jemalloc's MALLOC_CONF=prof:true,prof_prefix:jeprof.out); look for an allocation site whose count grows monotonically with time.
- External fragmentation in the allocator. Compute stats.resident - stats.allocated from mallctl; if this gap is >30% of stats.allocated and growing, you are losing pages to scattered free runs. Switch allocators (jemalloc → mimalloc), or restart the process.
- Internal fragmentation from your data layout. Compare stats.allocated to a packed-format estimate of your live data. If allocated is 3× the packed size, you have a layout problem, not an allocator problem.
- Page-cache pressure unrelated to your heap. RSS includes file-backed mappings; a worker that mmaps large files counts those pages in RSS even though they are reclaimable. Use /proc/<pid>/smaps_rollup to separate Anonymous from file-backed pages.
The diagnostic sequence at a Razorpay pod that started OOMing during last year's IPL final:
# Step 1: get the RSS-vs-heap gap from /proc and from the allocator.
sudo cat /proc/$(pgrep -f payments-worker)/status | grep -E 'VmRSS|RssAnon'
# VmRSS: 11_823_104 kB
# RssAnon: 11_201_788 kB
curl -s http://localhost:9091/jemalloc/stats | python3 -c "
import json, sys
s = json.load(sys.stdin)['stats']
allocated = s['allocated']; active = s['active']; resident = s['resident']
print(f'allocated: {allocated/1e9:.2f} GB (live heap)')
print(f'active: {active/1e9:.2f} GB (allocated + jemalloc bookkeeping)')
print(f'resident: {resident/1e9:.2f} GB (pages held by allocator)')
print(f'gap: {(resident-allocated)/1e9:.2f} GB ({(resident-allocated)/resident*100:.0f}% of resident)')
"
# allocated: 6.18 GB (live heap)
# active: 7.24 GB (allocated + jemalloc bookkeeping)
# resident: 10.91 GB (pages held by allocator)
# gap: 4.73 GB (43% of resident)
A 43% gap is decisive: this is external fragmentation, not a leak. The fix that night was twofold — first, set MALLOC_CONF=background_thread:true,dirty_decay_ms:5000,muzzy_decay_ms:5000 to make jemalloc more aggressive about returning unused pages to the kernel via madvise(MADV_DONTNEED); second, lower the per-pod cgroup memory limit so that a fragmented worker gets killed and recycled earlier, under lighter load, rather than catastrophically at peak. By the IPL final on Sunday, the gap was 18% of resident — small enough to fit in the cgroup with margin.
Common confusions
- "Fragmentation is the same as a memory leak." A leak is a reference your code retains by mistake —
stats.allocatedgrows. Fragmentation is allocator-side waste —stats.allocatedis flat butstats.residentgrows. Both inflate RSS; only one is fixable by changing your code. - "Internal and external fragmentation are two names for the same thing." Different invariants. External wastes the gaps between allocations (free blocks too small to use). Internal wastes the slack inside allocations (you got a 64-byte slot for a 33-byte request). A workload can have severe internal fragmentation and zero external, or the reverse.
- "jemalloc and tcmalloc don't have fragmentation because they use size classes." Size classes eliminate one kind of external fragmentation (the freed-then-can't-refit case) within a size class, by reusing slots of that size. They do not eliminate inter-class external fragmentation — the partial slabs of size class 80 cannot be reused for size class 96 — and they actively create internal fragmentation by rounding to class boundaries. The defragmentation/compaction that copying GCs do, malloc cannot do because it cannot move pointers.
- "Restarting the process is a hack — fragmentation needs to be fixed properly." Periodic restart is, in fact, the production fix at most companies that hit fragmentation in long-running services. It is cheap, predictable, and exposes the latent assumption (your service must tolerate a restart). The "fix it properly" alternative is to redesign the data layout — usually a much larger engineering investment than scheduling a rolling restart every 36 hours.
- "
madvise(MADV_DONTNEED)solves fragmentation." It returns unused pages to the kernel, shrinking RSS — but only for whole pages that contain zero live allocations. A free run of 200 bytes inside a page that also contains a live 100-byte allocation cannot be returned.MADV_DONTNEEDis necessary but not sufficient; it helps when free runs span whole pages, which they often do for large allocations and rarely do for small ones. - "Fragmentation only matters for C and C++ services." It matters for any long-running service whose RSS is bounded by a cgroup. JVM heaps fragment too (G1 has region-based defragmentation precisely for this); Go's runtime allocator fragments under highly variable allocation sizes; Python's
pymallochas well-known arena fragmentation thatpymalloc-debugcan quantify. Managed runtimes hide it better, but the kernel still sees the RSS.
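For the CPython case specifically, the interpreter can dump its own small-object allocator statistics. A quick sketch; sys._debugmallocstats is a CPython implementation detail, so treat its output format as informational:
# pymalloc_arenas.py — dump CPython's pymalloc arena/pool statistics to stderr.
# Allocate a pile of small objects, punch holes in it, then ask the allocator
# how many arenas it holds versus how full they actually are.
import sys

blobs = [bytearray(100) for _ in range(500_000)]  # small objects -> pymalloc pools
del blobs[::2]                                    # free every other one: holes
sys._debugmallocstats()                           # arenas allocated vs in use, per-class pools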
Going deeper
What allocator stats to scrape, and how
jemalloc exposes stats through mallctl(); builds configured with --enable-stats gather them (and --enable-prof adds the heap-profile dumps you analyse out of process with jeprof). In-process you either read individual counters via mallctl() or call je_malloc_stats_print(callback, cbopaque, NULL) and parse the multi-line dump; the full dump includes per-arena, per-bin geometry: how many slabs are full, how many partial, and the size of each. (Characters in the opts string suppress sections rather than add them, e.g. "a" omits per-arena stats and "b" omits per-bin stats.) The Razorpay payments service exposes these as Prometheus gauges by calling mallctl("stats.allocated", ...), ("stats.active", ...), ("stats.resident", ...), ("stats.metadata", ...) every 30 seconds and pushing the deltas. The relationship to internalise is roughly resident ≈ active + dirty pages not yet purged + metadata. stats.retained is different: it counts address space the allocator keeps mapped for reuse after the physical pages have been returned to the kernel, so it does not inflate RSS. When resident grows but allocated does not, the growth is dirty pages the allocator has chosen not to madvise away yet. Tuning dirty_decay_ms and muzzy_decay_ms controls how aggressively jemalloc transitions dirty pages to muzzy and finally to clean, decaying RSS over time.
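A minimal sketch of the in-process read, assuming the worker was started with jemalloc LD_PRELOADed and built without a symbol prefix so that mallctl is the exported name; the Prometheus push itself is left out:
# jemalloc_stats_probe.py — read jemalloc's stats.* counters from inside the
# process via ctypes. Assumes LD_PRELOAD=libjemalloc.so.2 and an unprefixed
# mallctl symbol; both are assumptions about your build, not guarantees.
import ctypes

lib = ctypes.CDLL(None)  # the preloaded jemalloc lives in the process's own namespace

def mallctl_size_t(name):
    """Read one size_t-valued mallctl key, e.g. "stats.allocated"."""
    val = ctypes.c_size_t()
    sz = ctypes.c_size_t(ctypes.sizeof(val))
    rc = lib.mallctl(name.encode(), ctypes.byref(val), ctypes.byref(sz),
                     None, ctypes.c_size_t(0))
    if rc != 0:
        raise OSError(rc, f"mallctl({name}) failed")
    return val.value

# jemalloc snapshots its stats; bump the epoch so the counters are fresh.
epoch = ctypes.c_uint64(1)
lib.mallctl(b"epoch", None, None, ctypes.byref(epoch), ctypes.c_size_t(8))

for key in ("stats.allocated", "stats.active", "stats.resident", "stats.metadata"):
    print(f"{key:>16}: {mallctl_size_t(key) / 1e9:.3f} GB")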
Compaction and why malloc can't do it
Garbage-collected runtimes can defragment by moving live objects: the JVM's G1 and ZGC and the .NET CLR all compact their heaps. (Go is the partial exception: its collector does not move heap objects, though the runtime does copy goroutine stacks as they grow.) Defragmentation requires rewriting every pointer to the moved object, which a managed runtime can do because it knows where every pointer is. malloc cannot — pointers handed back to user code are opaque. This is the structural reason malloc fragmentation is permanent until process restart, and the structural reason GC'd languages can run for years without fragmentation accumulating. The price is the GC pause for the move; the benefit is that a compacted heap leaves far fewer partially-used pages stranded, so returning memory to the OS needs much less MADV_DONTNEED guesswork.
Per-arena fragmentation and the narenas knob
Both jemalloc and tcmalloc shard the heap into N arenas to reduce lock contention. By default jemalloc picks narenas = 4 * num_cpus; on a 32-core box that is 128 arenas. Each arena fragments independently. A workload with strong size-class skew across worker threads — say, one thread allocates mostly 256-byte objects, another mostly 1024-byte — leaves each arena holding partial slabs of one size class, never reusing the other arena's slack. The narenas:N MALLOC_CONF knob lets you reduce the arena count at the cost of some lock contention; on services where fragmentation is a bigger pain than contention, dropping narenas to the worker-thread count, or even to narenas:1, can recover gigabytes of RSS. Measure both before and after; this is one of the few allocator knobs that genuinely trades off two real costs against each other.
The relationship to NUMA and to hugepages
On a NUMA box, fragmentation has a third dimension: free runs on socket 0 cannot satisfy a request that wants memory bound to socket 1. numactl --membind makes this explicit; numa_alloc_local does it implicitly. Pods that drift their allocations across sockets accumulate per-socket fragmentation, and the cgroup OOM cannot tell the difference. Hugepages compound the problem: a single 2 MB hugepage either contains entirely-live allocations or it does not, and a hugepage with one live 64-byte object is 2 MB of RSS that cannot be freed. Transparent Hugepages (THP) auto-promote 2 MB-aligned mappings; for fragmentation-prone workloads, echo madvise > /sys/kernel/mm/transparent_hugepage/enabled or outright never is the safer setting. Hotstar's manifest tier disabled THP after measuring a 2.1 GB RSS reduction per pod under steady-state load.
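To see how much of a worker's anonymous RSS currently sits in 2 MB transparent hugepages, and is therefore hard to give back piecemeal, the standard smaps_rollup counters are enough. A small sketch, pointed at whichever PID you care about:
# thp_share.py — what fraction of a process's anonymous RSS sits in transparent
# hugepages, from /proc/<pid>/smaps_rollup. Usage: python3 thp_share.py [pid]
# (reading another user's pid needs root).
import sys

pid = sys.argv[1] if len(sys.argv) > 1 else "self"
fields = {}
with open(f"/proc/{pid}/smaps_rollup") as f:
    for line in f:
        parts = line.split()
        if len(parts) == 3 and parts[2] == "kB":   # "Rss:  1234 kB" lines
            fields[parts[0].rstrip(":")] = int(parts[1])

rss = fields.get("Rss", 0)
anon = fields.get("Anonymous", 0)
thp = fields.get("AnonHugePages", 0)
print(f"Rss: {rss} kB   Anonymous: {anon} kB   AnonHugePages: {thp} kB")
if anon:
    print(f"{thp / anon * 100:.1f}% of anonymous RSS is in 2 MB hugepages")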
Reproduce this on your laptop
sudo apt install python3.11 python3.11-venv linux-tools-common linux-tools-generic
python3.11 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip
# External fragmentation simulator — watch fragmentation% climb to 99%
python3 fragmentation_demo.py
# Internal fragmentation in a real Python service
python3 internal_frag_demo.py
# Inspect a real long-running process for the RSS-vs-heap gap
PID=$(pgrep -f your-service | head -1)
sudo cat /proc/$PID/status | grep -E 'VmRSS|RssAnon'
sudo cat /proc/$PID/smaps_rollup | head -10
# If your service runs under jemalloc with stats enabled:
MALLOC_CONF=stats_print:true,prof:true ./your-service-binary
You should see the simulator's fragmentation ratio climb past 99% within 50,000 cycles, and the Python demo show a 4–5× gap between logical and actual bytes for a workload of small objects with many fields. On any long-running service in your kit, smaps_rollup will reveal whether the RSS gap is anonymous (allocator) or file-backed (page cache) — the first decision in the diagnostic ladder.
Where this leads next
External and internal fragmentation are the two faces of "the allocator owns memory you cannot use". The next chapters in this part move from diagnosis to mitigation: choosing an allocator whose fragmentation profile matches your workload, tuning its decay knobs, and — when fragmentation is structural — restructuring the data layout to pack better.
The reading order, in roughly the path a Hotstar or Razorpay engineer would walk a fragmentation incident:
- /wiki/jemalloc-vs-tcmalloc-vs-mimalloc — comparing the three production allocators specifically on fragmentation behaviour.
- /wiki/numa-aware-allocators-and-data-structures — what fragmentation looks like when memory is plural across sockets.
- /wiki/measuring-allocator-overhead — how to pull stats.allocated, stats.active, stats.resident out of a running service and chart the gap.
- /wiki/object-pools-and-arena-allocators — the satellite article on user-side designs that sidestep the general allocator entirely.
- /wiki/arena-slab-and-thread-local-allocation — the prior chapter; the specialised allocators whose contracts eliminate fragmentation by construction for one slice of the workload.
References
- Donald Knuth, The Art of Computer Programming, Vol. 1, §2.5 "Dynamic Storage Allocation" — the foundational analysis of fragmentation, first-fit and best-fit, and the 50% rule that bears Knuth's name.
- Paul Wilson et al., "Dynamic Storage Allocation: A Survey and Critical Review" (1995) — the most-cited survey of allocator behaviour under real workloads; introduces the distinction between fragmentation as a property of the allocator vs as a property of the workload.
- Jason Evans, "A Scalable Concurrent malloc(3) Implementation for FreeBSD" (BSDCan 2006) — the original jemalloc paper; explains the size-class structure and per-arena design that drives modern fragmentation behaviour.
- The jemalloc tuning guide (the mallctl namespace) — the binding reference for stats.allocated, stats.active, stats.resident, and the dirty_decay_ms / muzzy_decay_ms / narenas knobs.
- Linux kernel madvise(2) man page — the contract for MADV_DONTNEED, MADV_FREE, and how userspace allocators return pages to the OS.
- Brendan Gregg, Systems Performance (2nd ed., 2020), §7.4 "Memory" and §7.5 "Methodology" — the diagnostic methodology this chapter's "metric ladder" follows.
- /wiki/arena-slab-and-thread-local-allocation — the prior chapter; specialised allocators that eliminate fragmentation in narrow regimes.
- /wiki/malloc-internals-glibc-jemalloc-tcmalloc-mimalloc — the general allocators whose internal structure determines what fragmentation looks like at scale.