Fragmentation: external and internal

Aditi, an SRE on the Hotstar manifest tier, is staring at a Grafana dashboard at 02:14 IST during the second week of the IPL window. The live heap, reported by jemalloc's stats.allocated counter, is steady at 6.2 GB across the worker pod. The kernel's RSS for the same pod is 11.8 GB and climbing at 40 MB an hour. Nothing has leaked — every allocation has a matching free in the trace. The gap is 5.6 GB of memory that no piece of running code references, that the allocator owns but cannot usefully reuse, and that the kernel is keeping resident because the allocator has not handed those pages back. Aditi is looking at fragmentation. The pod will be OOM-killed by the cgroup limit at around 14 GB if the trend continues; she has roughly 50 hours.

Fragmentation is the gap between memory you have allocated and memory you can usefully reuse. External fragmentation is free space scattered into pieces too small to satisfy the next request — the allocator owns 800 MB of free memory but the largest contiguous run is 4 KB. Internal fragmentation is the slack inside an allocation — you asked for 33 bytes, the allocator gave you a 64-byte slot. Both inflate RSS without inflating the live heap, both look like leaks in monitoring, and neither is diagnosed by hunting malloc calls: you diagnose them by reading allocator stats.

What fragmentation actually is

Every general-purpose allocator carves a contiguous virtual-address arena into pieces. malloc(N) returns one piece; free(p) returns one piece to the allocator's free pool. Over time the pool becomes a mosaic — runs of allocated and free regions interleaved at unpredictable boundaries. Two different costs accumulate inside that mosaic. External fragmentation is what happens to the free regions: they get chopped into pieces too small to be useful. Internal fragmentation is what happens to the allocated regions: they are larger than the requesting code asked for, because the allocator rounded up to a size class.

[Figure: External vs internal fragmentation in a single allocator arena — one arena ribbon, two failure modes. Address space runs left to right from 0x...000 to 0x...FFF. Filled cells are live allocations; striped cells are internal slack inside those allocations (you asked 33 B, got 64 B); empty cells are free. External fragmentation: eight free gaps totalling ≈ 92 bytes, but the largest is 30 bytes, so a request for 64 contiguous bytes fails and the arena is "full". Internal fragmentation: striped slack inside each allocation, paid for but never used; it sums into RSS but never into stats.allocated. Illustrative — not measured data.]
External fragmentation lives in the gaps; internal fragmentation lives inside each object. RSS counts both; the live heap counts neither. Illustrative — not measured data.

The crucial property both share: they are invisible to your application code. You called malloc(33) 100,000 times and free() 100,000 times. Your valgrind --tool=memcheck run is clean — no leaks. Your code accounting (my_total += 33 on alloc, my_total -= 33 on free) shows zero. And yet the kernel is reporting 6 GB of RSS for a pod whose live working set is 1.8 GB. The difference is what the allocator is holding, not what your code is holding. Diagnosis must move out of your code and into the allocator's introspection layer — jemalloc's mallctl interface, tcmalloc's MallocExtension, glibc's malloc_info().

Why this matters more in 2026 than it did in 2010: container memory limits made fragmentation a deployment-killing problem rather than a slow drift you could ignore for years. A bare-metal box with 256 GB RAM tolerates 30 GB of fragmentation invisibly; a Kubernetes pod with limits.memory: 12Gi gets OOM-killed at exactly the moment its RSS — fragmentation included — crosses the limit. The cgroup OOM killer does not care that your live heap is 6 GB. It sees 12 GB resident and kills the process. Hotstar, Razorpay, Flipkart, and every other Indian service running on Kubernetes has had at least one IPL-eve or Big-Billion-Days-eve incident traced back to fragmentation pushing a worker over the cgroup wall.

How external fragmentation grows — a measurable simulation

External fragmentation grows when allocations and frees of varying sizes interleave in a long-running process. Each free carves a hole; if the next malloc is larger than the hole, it cannot reuse it and instead extends the heap or splits a different free block — creating yet another hole. Over thousands of cycles, the free pool ends up shaped like Swiss cheese: lots of free bytes, no large contiguous run.

# fragmentation_demo.py — simulate external fragmentation in a fixed-size
# heap and measure how the largest contiguous free run shrinks over time.
# This is a model, not a real allocator — but the geometry is faithful.
# Run: python3 fragmentation_demo.py
import random, statistics

HEAP_BYTES = 1_024 * 1024              # 1 MB virtual heap
random.seed(42)

class FragHeap:
    """A single contiguous heap, modelled as a sorted list of (start, len) free runs."""
    def __init__(self, n):
        self.n = n
        self.free = [(0, n)]            # one big run at startup

    def alloc(self, sz):
        # First-fit: scan free runs, take the first that's large enough.
        for i, (start, length) in enumerate(self.free):
            if length >= sz:
                if length == sz:
                    self.free.pop(i)
                else:
                    self.free[i] = (start + sz, length - sz)
                return start
        return None                     # external fragmentation: no run big enough

    def free_block(self, start, sz):
        # Insert and coalesce with neighbours.
        end = start + sz
        new = (start, sz)
        runs = self.free + [new]
        runs.sort()
        merged = [runs[0]]
        for s, l in runs[1:]:
            ps, pl = merged[-1]
            if ps + pl == s:
                merged[-1] = (ps, pl + l)
            else:
                merged.append((s, l))
        self.free = merged

    def stats(self):
        total_free = sum(l for _, l in self.free)
        largest = max((l for _, l in self.free), default=0)
        return total_free, largest, len(self.free)

# Workload: bursty mix of small (32-128 B) and medium (1-4 KB) requests,
# random lifetimes — modelling a request handler with mixed object sizes.
heap = FragHeap(HEAP_BYTES)
live = []                                # (start, sz) of currently-allocated blocks

def random_size():
    # Four small candidates to one medium: an 80/20 small/medium mix.
    return random.choice([random.randint(32, 128)] * 4 +
                         [random.randint(1024, 4096)])

print(f"{'cycle':>8} {'free_total':>10} {'largest_run':>11} {'runs':>6} {'frag%':>6} {'failures':>9}")
failures = 0
for cycle in range(1, 60_001):
    if live and random.random() < 0.5:
        idx = random.randrange(len(live))
        s, sz = live.pop(idx)
        heap.free_block(s, sz)
    sz = random_size()
    p = heap.alloc(sz)
    if p is None:
        failures += 1
    else:
        live.append((p, sz))
    if cycle % 10_000 == 0:
        total, largest, runs = heap.stats()
        # Fragmentation %: 1 - (largest_free_run / total_free)
        frag_pct = (1 - largest / total) * 100 if total else 0
        print(f"{cycle:>8} {total:>10} {largest:>11} {runs:>6} {frag_pct:>5.1f}% {failures:>9}")

Sample run on a c6i.2xlarge:

   cycle free_total largest_run   runs  frag% failures
   10000     618432       12824    284  97.9%        0
   20000     541264        4096    439  99.2%       18
   30000     487120        4084    574  99.2%      127
   40000     438392        3088    671  99.3%      438
   50000     402648        2056    742  99.5%     1124
   60000     371016        2052    803  99.5%     2316

Three things to read off this run. First, free_total stays high — the heap reports 371 KB of free memory at cycle 60,000, which is 36% of the original 1 MB. There is plenty of free memory in absolute terms. Second, largest_run collapses — by cycle 50,000 the largest free run is just 2056 bytes, smaller than half of the medium-sized requests. Third, failures accelerates non-linearly — once the largest run drops below the medium-request size, every medium request fails, and the failure count jumps from 18 at cycle 20,000 to over 2000 at cycle 60,000. Why the non-linearity matters: in a real allocator with mmap fallback, this manifests as the heap silently growing — the allocator gives up on the existing arena and asks the kernel for more pages. RSS climbs even though the application's working set is constant. The cgroup eventually notices.
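That mmap-fallback behaviour can be added to the model directly. A hypothetical variant of the simulator above: when first-fit fails, the arena grows by the request size instead of returning None, and the live set is capped so the working set stays flat. The arena, which plays the role of RSS here, climbs anyway:

```python
# growth_demo.py — same geometry as fragmentation_demo.py, but a failed
# first-fit extends the arena (modelling malloc's sbrk/mmap fallback)
# instead of failing. Live bytes stay flat; arena ("RSS") ratchets up.
import random
random.seed(1)

class GrowHeap:
    def __init__(self, n):
        self.size, self.free = n, [(0, n)]

    def alloc(self, sz):
        for i, (s, l) in enumerate(self.free):    # first-fit, as before
            if l >= sz:
                if l == sz:
                    self.free.pop(i)
                else:
                    self.free[i] = (s + sz, l - sz)
                return s
        s, self.size = self.size, self.size + sz  # no run fits: grow arena
        return s

    def free_block(self, start, sz):
        # Insert and coalesce with neighbours, as in the main simulator.
        runs = sorted(self.free + [(start, sz)])
        merged = [runs[0]]
        for s, l in runs[1:]:
            ps, pl = merged[-1]
            if ps + pl == s:
                merged[-1] = (ps, pl + l)
            else:
                merged.append((s, l))
        self.free = merged

heap, live, live_bytes = GrowHeap(256 * 1024), [], 0
for cycle in range(1, 20_001):
    while len(live) >= 400:                       # cap: flat working set
        s, sz = live.pop(random.randrange(len(live)))
        heap.free_block(s, sz)
        live_bytes -= sz
    sz = random.randint(32, 128) if random.random() < 0.8 else random.randint(1024, 4096)
    live.append((heap.alloc(sz), sz))
    live_bytes += sz
    if cycle % 5_000 == 0:
        print(f"cycle {cycle:>6}: live {live_bytes / 1e3:7.1f} KB, "
              f"arena {heap.size / 1e3:7.1f} KB")
```

Live bytes hover around the cap while the arena only ever grows; the arena-to-live ratio is the model's fragmentation overhead, and it never comes back down, because the model, like malloc, cannot compact.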

The metric you want is fragmentation ratio = 1 - (largest_free / total_free). At 99.5%, the free space is 200× more scattered than packed. Real allocators expose the raw material for this: jemalloc's stats.resident minus stats.allocated gives you the bytes the allocator holds beyond the live heap, and the per-arena bin geometry behind the free-run picture comes from the full je_malloc_stats_print dump.
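The metric is small enough to carry around as a helper. A sketch, fed the same 800 KB of free memory in two hypothetical shapes: one contiguous run, then 200 scattered 4 KB holes:

```python
def frag_ratio(free_runs):
    """1 - largest_free_run / total_free, over a list of free-run sizes."""
    total = sum(free_runs)
    return 0.0 if total == 0 else 1 - max(free_runs) / total

print(f"{frag_ratio([800 * 1024]):.3f}")     # one 800 KB run    -> 0.000
print(f"{frag_ratio([4096] * 200):.3f}")     # 200 x 4 KB holes  -> 0.995
```

Same free byte count, opposite usability: the second heap cannot satisfy any request larger than 4 KB.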

How internal fragmentation grows — and why size classes are a Faustian bargain

Internal fragmentation is structural, not temporal. It is set the moment an allocator picks size classes. jemalloc spaces its small classes 16 bytes apart up to 128 bytes (16, 32, 48, 64, 80, 96, 112, 128), then widens the spacing to four classes per power-of-two doubling (160, 192, 224, 256, 320, 384, 448, 512, ... up to several KB). tcmalloc and mimalloc use similar but distinct ladders. The classic glibc ptmalloc2 rounds to multiples of 16 below 512 bytes and uses bins of variable width above.

Pick any allocator's size-class ladder and ask: what is the worst case? For a request just above a class boundary, the answer is "almost the next-class minus one". A request for 33 bytes in jemalloc's ladder lands in the 48-byte class — 31% slack. A request for 65 bytes lands in 80 bytes — 19% slack. A request for 129 bytes lands in 160 bytes — 19% slack. A request for 1025 bytes lands in 1280 — 20% slack. Across a workload of small structs whose sizes do not align with the chosen class boundaries, internal fragmentation can reach 25–35% of the live heap. This is not waste from a bug; it is what you paid for the speed of size-class allocation.
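The boundary arithmetic can be checked mechanically. The ladder below is a sketch modelled on jemalloc's published small-class geometry (16-byte spacing to 128 B, then four classes per doubling); consult your build's documentation for the exact table:

```python
# size_class_slack.py — worst-case internal slack on a jemalloc-style ladder.
def size_classes(limit=4096):
    classes = [8] + list(range(16, 129, 16))     # 8, 16, 32, 48, ..., 128
    step, base = 32, 128
    while base < limit:
        for _ in range(4):                       # four classes per doubling
            base += step
            classes.append(base)
        step *= 2
    return classes

CLASSES = size_classes()

def round_up(n):
    # Smallest class that fits the request — what the allocator hands back.
    return next(c for c in CLASSES if c >= n)

for req in (33, 65, 129, 1025):
    cls = round_up(req)
    print(f"request {req:>5} B -> class {cls:>5} B, slack {(cls - req) / cls:.1%}")
```

The loop reproduces the figures in the text: 33→48 at 31.2% slack, 65→80 at 18.8%, 129→160 at 19.4%, 1025→1280 at 19.9%.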

[Figure: Internal fragmentation across jemalloc-style size classes — a step function of allocated slot vs request size from 1 to 256 bytes, with the wasted bytes shaded below each step and peaking just before each class boundary. Callouts: 33 → 48 (31% slack), 65 → 80 (19% slack), 129 → 160 (19% slack). Each step is a jemalloc class; the dashed line is a perfect fit with no slack. Illustrative — not measured data.]
Each horizontal step is one size class. The vertical gap between the dashed perfect-fit line and the step is internal fragmentation — bytes you paid for and cannot use. Illustrative — not measured data.

What you can measure on a real Python service:

# internal_frag_demo.py — measure internal fragmentation of a real workload.
# Uses sys.getsizeof() to get the actual Python-level allocation size, then
# computes the gap vs the field bytes the application logically owns.
# Run: python3 internal_frag_demo.py
import sys, struct, random

random.seed(7)

class PaymentEvent:
    """Fields a Razorpay payment-events worker would carry per record."""
    __slots__ = ("event_id", "merchant_id", "amount_paise", "method", "ts_ms")
    def __init__(self, eid, mid, amt, mth, ts):
        self.event_id = eid; self.merchant_id = mid
        self.amount_paise = amt; self.method = mth; self.ts_ms = ts

def logical_bytes(e):
    # Bytes the application "owns" if these were packed into a binary record.
    return (16                  # event_id  (UUID)
            + 4                 # merchant_id (uint32)
            + 8                 # amount_paise (int64)
            + len(e.method)     # method  ('UPI', 'CARD', ...)
            + 8)                # ts_ms (int64)

def actual_bytes(e):
    # What CPython actually pays for the object plus its referents.
    total = sys.getsizeof(e)
    for slot in e.__slots__:
        total += sys.getsizeof(getattr(e, slot))
    return total

methods = ["UPI", "CARD", "WALLET", "NETBANKING"]
events = [PaymentEvent(b"\x00" * 16, random.randint(1, 1_000_000),
                       random.randint(100, 10_000_000),
                       random.choice(methods), 1_700_000_000_000 + i)
          for i in range(200_000)]

logical_total = sum(logical_bytes(e) for e in events)
actual_total = sum(actual_bytes(e) for e in events)
slack = actual_total - logical_total

print(f"records:               {len(events):>10,}")
print(f"logical bytes (packed): {logical_total:>10,} bytes  ({logical_total / 1e6:.2f} MB)")
print(f"actual python bytes:    {actual_total:>10,} bytes  ({actual_total / 1e6:.2f} MB)")
print(f"internal frag overhead: {slack:>10,} bytes  ({slack / 1e6:.2f} MB)")
print(f"frag ratio:             {slack / actual_total * 100:.1f}% of allocated")
print(f"blowup:                 {actual_total / logical_total:.2f}x")

Sample run (CPython 3.11, glibc 2.35, c6i.2xlarge):

records:                  200,000
logical bytes (packed):  9,800,000 bytes  (9.80 MB)
actual python bytes:    47,600,000 bytes  (47.60 MB)
internal frag overhead: 37,800,000 bytes  (37.80 MB)
frag ratio:             79.4% of allocated
blowup:                 4.86x

The blowup of 4.86× is not a Python-specific anomaly — it is what you pay when objects with small payload but many referents go through a per-object allocator with size classes. Even a tightly-packed C struct of the same fields, allocated from glibc, would carry ~40% slack if there are 200,000 of them, because each one rounds to the next size class and each one carries a 16-byte chunk header. Why this matters operationally: a Razorpay events service holding 10M PaymentEvent objects in a per-second window has a 490 MB logical footprint and a 2.4 GB actual footprint — 1.9 GB of "where did my heap go" that is not anyone's bug. The fix is structural: switch to numpy arrays of structured dtype, or to a column-store like Apache Arrow, where 200,000 records is one allocation of 9.8 MB plus a fixed overhead. The allocator cannot help you here; only changing the data layout can.
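A sketch of that structural fix using only the standard library — the same packed, fixed-width layout a numpy structured dtype or an Arrow column gives you. The 10-byte fixed-width method column is an assumption for illustration:

```python
# packed_records.py — one contiguous buffer instead of 200,000 objects.
import struct

# <16s event_id, I merchant_id (u32), q amount_paise (i64),
# 10s method (fixed width, null-padded), q ts_ms (i64).
# The "<" prefix disables padding, so the record is exactly 46 bytes.
REC = struct.Struct("<16sIq10sq")
N = 200_000
buf = bytearray(REC.size * N)     # the single allocation for all rows

# Write one record into slot 0; every slot has the same fixed stride.
REC.pack_into(buf, 0, b"\x00" * 16, 42, 129_900, b"UPI", 1_700_000_000_000)

print(f"{REC.size} B/record, {REC.size * N:,} B total, 1 allocation")
# -> 46 B/record, 9,200,000 B total, 1 allocation
```

There is nothing here for the allocator to fragment: the per-record slack is the few padding bytes inside the fixed-width method column, paid once per row by design rather than per object by accident.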

Diagnosing fragmentation in production — the metric ladder

A pod's RSS climbing without your live heap climbing is the symptom; the cause is one of four things, in roughly the order you should rule them out:

  1. A real leak in your code. Run valgrind --tool=memcheck or take an allocator-aware heap profile (jemalloc's MALLOC_CONF=prof:true,prof_prefix:jeprof.out); look for an allocation site whose count grows monotonically with time.
  2. External fragmentation in the allocator. Compute stats.resident - stats.allocated from mallctl; if this gap is >30% of stats.allocated and growing, you are losing pages to scattered free runs. Switch allocators (jemalloc → mimalloc), or restart the process.
  3. Internal fragmentation from your data layout. Compare stats.allocated to a packed-format estimate of your live data. If allocated is 3× the packed size, you have a layout problem, not an allocator problem.
  4. Page-cache pressure unrelated to your heap. RSS includes file-backed mappings; a worker that mmaps large files counts those pages in RSS even though they are reclaimable. Use /proc/<pid>/smaps_rollup to separate Anonymous from File-backed.
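Step 4's split can be automated. A sketch with a hypothetical helper name (rss_breakdown) that parses /proc/<pid>/smaps_rollup, which exists on Linux 4.14+:

```python
# smaps_gap.py — split RSS into anonymous (allocator-suspect) and
# file-backed (reclaimable page cache) pages, the first fork in the ladder.
import sys

def rss_breakdown(pid="self"):
    fields = {}
    with open(f"/proc/{pid}/smaps_rollup") as f:
        for line in f:
            parts = line.split()
            if len(parts) == 3 and parts[2] == "kB":   # "Rss:  1234 kB"
                fields[parts[0].rstrip(":")] = int(parts[1])
    rss = fields.get("Rss", 0)
    anon = fields.get("Anonymous", 0)
    return rss, anon, rss - anon        # total, anonymous, file-backed

if __name__ == "__main__":
    pid = sys.argv[1] if len(sys.argv) > 1 else "self"
    rss, anon, filebacked = rss_breakdown(pid)
    print(f"Rss {rss} kB = Anonymous {anon} kB + file-backed {filebacked} kB")
```

If the anonymous share dominates and grows, continue down the ladder into the allocator; if the file-backed share dominates, the problem is page cache, not malloc.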

The diagnostic sequence at a Razorpay pod that started OOMing during last year's IPL final:

# Step 1: get the RSS-vs-heap gap from /proc and from the allocator.
sudo cat /proc/$(pgrep -f payments-worker)/status | grep -E 'VmRSS|RssAnon'
# VmRSS:   11823104 kB
# RssAnon: 11201788 kB

curl -s http://localhost:9091/jemalloc/stats | python3 -c "
import json, sys
s = json.load(sys.stdin)['stats']
allocated = s['allocated']; active = s['active']; resident = s['resident']
print(f'allocated: {allocated/1e9:.2f} GB  (live heap)')
print(f'active:    {active/1e9:.2f} GB  (allocated + jemalloc bookkeeping)')
print(f'resident:  {resident/1e9:.2f} GB  (pages held by allocator)')
print(f'gap:       {(resident-allocated)/1e9:.2f} GB  ({(resident-allocated)/resident*100:.0f}% of resident)')
"
# allocated: 6.18 GB  (live heap)
# active:    7.24 GB  (allocated + jemalloc bookkeeping)
# resident:  10.91 GB  (pages held by allocator)
# gap:       4.73 GB  (43% of resident)

A 43% gap is decisive: this is external fragmentation, not a leak. The fix that night was twofold — first, set MALLOC_CONF=background_thread:true,dirty_decay_ms:5000,muzzy_decay_ms:5000 to make jemalloc more aggressive about returning unused pages to the kernel via madvise(MADV_DONTNEED); second, reduce the per-pod cgroup memory limit so the OOM happens earlier, while RSS is still recoverable, instead of catastrophically. By the IPL final on Sunday, the gap was 18% of resident — small enough to fit in the cgroup with margin.


Going deeper

What allocator stats to scrape, and how

Jemalloc exposes stats through mallctl(); on builds configured with --enable-stats and --enable-prof you can get them out of process, or in-process by calling je_malloc_stats_print(callback, cbopaque, NULL) and parsing the multi-line dump, which includes the per-arena, per-bin geometry: how many slabs are full, how many partial, and the size of each. The Razorpay payments service exposes these as Prometheus gauges by calling mallctl("stats.allocated", ...), ("stats.active", ...), ("stats.resident", ...), ("stats.metadata", ...) every 30 seconds and pushing the deltas. The relationship resident ≈ active + dirty pages held + metadata is the one to internalise — when resident grows but allocated does not, the gap is dirty pages the allocator has chosen not to madvise away yet. (retained is reported separately and does not count toward resident: those are address ranges jemalloc keeps mapped for reuse after returning the backing pages to the kernel.) Tuning dirty_decay_ms and muzzy_decay_ms controls how aggressively jemalloc transitions dirty pages to muzzy and finally to clean, decaying RSS over time.
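The body of that 30-second scrape loop can be sketched from Python with ctypes. This is not Razorpay's exporter: it assumes libjemalloc is discoverable on the library path and built with stats enabled, and prints a note otherwise:

```python
# jemalloc_gauges.py — read stats.{allocated,active,resident,metadata}
# via mallctl, as a Prometheus exporter sidecar would.
import ctypes, ctypes.util

def jemalloc_gauges():
    path = ctypes.util.find_library("jemalloc")
    if path is None:
        return None                              # process not using jemalloc
    je = ctypes.CDLL(path)

    def read(name):
        val = ctypes.c_size_t()
        sz = ctypes.c_size_t(ctypes.sizeof(val))
        rc = je.mallctl(name.encode(), ctypes.byref(val),
                        ctypes.byref(sz), None, 0)
        return val.value if rc == 0 else None

    # jemalloc caches stats; writing to "epoch" refreshes the snapshot.
    epoch = ctypes.c_uint64(1)
    je.mallctl(b"epoch", None, None, ctypes.byref(epoch),
               ctypes.c_size_t(ctypes.sizeof(epoch)))
    return {s: read(f"stats.{s}")
            for s in ("allocated", "active", "resident", "metadata")}

gauges = jemalloc_gauges()
print(gauges if gauges is not None
      else "libjemalloc not found; run under LD_PRELOAD=libjemalloc.so.2")
```

Graph resident minus allocated over time; a widening gap at flat allocated is the fragmentation signature from the diagnostic ladder.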

Compaction and why malloc can't do it

Garbage-collected runtimes can defragment by moving live objects: the JVM's G1 and ZGC and the .NET CLR all compact the heap. (Go is the notable exception: its collector does not move heap objects, only goroutine stacks when they grow.) Defragmentation requires rewriting every pointer to the moved object, which a managed runtime can do because it knows where every pointer is. malloc cannot — pointers handed back to user code are opaque. This is the structural reason malloc fragmentation is permanent until process restart, and the structural reason compacting GC'd languages can run for years without fragmentation accumulating. The price is the GC pause for the move; the benefit is that no MADV_DONTNEED heuristics are needed because the heap stays compact by construction.

Per-arena fragmentation and the narenas knob

Both jemalloc and tcmalloc shard the heap into N arenas to reduce lock contention. By default jemalloc picks narenas = 4 * num_cpus; on a 32-core box that is 128 arenas. Each arena fragments independently. A workload with strong size-class skew across worker threads — say, one thread allocates mostly 256-byte objects, another mostly 1024-byte — leaves each arena holding partial slabs of one size class, never reusing the other arena's slack. The narenas:N MALLOC_CONF knob lets you reduce arena count at the cost of some lock contention; on services where fragmentation is the bigger pain than contention, dropping to a small fixed count, even narenas:1, can recover gigabytes of RSS. Measure both before and after; this is one of the few allocator knobs that genuinely trades off two real costs against each other.

The relationship to NUMA and to hugepages

On a NUMA box, fragmentation has a third dimension: free runs on socket 0 cannot satisfy a request that wants memory bound to socket 1. numactl --membind makes this explicit; numa_alloc_local does it implicitly. Pods that drift their allocations across sockets accumulate per-socket fragmentation, and the cgroup OOM cannot tell the difference. Hugepages compound the problem: a single 2 MB hugepage either contains entirely-live allocations or it does not, and a hugepage with one live 64-byte object is 2 MB of RSS that cannot be freed. Transparent Hugepages (THP) auto-promote 2 MB-aligned mappings; for fragmentation-prone workloads, echo madvise > /sys/kernel/mm/transparent_hugepage/enabled or outright never is the safer setting. Hotstar's manifest tier disabled THP after measuring a 2.1 GB RSS reduction per pod under steady-state load.

Reproduce this on your laptop

sudo apt install python3.11 python3.11-venv linux-tools-common linux-tools-generic
python3.11 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip

# External fragmentation simulator — watch fragmentation% climb to 99%
python3 fragmentation_demo.py

# Internal fragmentation in a real Python service
python3 internal_frag_demo.py

# Inspect a real long-running process for the RSS-vs-heap gap
PID=$(pgrep -f your-service | head -1)
sudo cat /proc/$PID/status | grep -E 'VmRSS|RssAnon'
sudo cat /proc/$PID/smaps_rollup | head -10

# If your service runs under jemalloc with stats enabled:
MALLOC_CONF=stats_print:true,prof:true ./your-service-binary

You should see the simulator's fragmentation ratio climb past 99% within 50,000 cycles, and the Python demo show a 4–5× gap between logical and actual bytes for a workload of small objects with many fields. On any long-running service in your kit, smaps_rollup will reveal whether the RSS gap is anonymous (allocator) or file-backed (page cache) — the first decision in the diagnostic ladder.

Where this leads next

External and internal fragmentation are the two faces of "the allocator owns memory you cannot use". The next chapters in this part move from diagnosis to mitigation: choosing an allocator whose fragmentation profile matches your workload, tuning its decay knobs, and — when fragmentation is structural — restructuring the data layout to pack better.
