Data layout for cache friendliness
Aditi is profiling Zerodha Kite's order-book risk-check at the 09:15 IST market open. The hot loop scans an array of 2 million pending orders and computes the rupee-weighted exposure per client. The loop body is six instructions: load client_id, load qty, load price, multiply, accumulate into a small hashmap, advance pointer. On her laptop with a 50,000-order test set, the loop runs at 4.8 ns per order. In production with 2 million orders, it runs at 31 ns per order — six and a half times slower per order, on hardware that is faster than her laptop. Nothing in the algorithm changed. Nothing in the code changed. What changed is that the Order struct in production has 92 fields — tag, expiry, instrument_isin, client_remarks_v2, twenty-three reserved bytes for compliance — and at 384 bytes per struct, each order sprawls across six cache lines. The CPU is fetching 384 bytes from DRAM to read the 24 bytes of client_id, qty, price she actually needs. Ninety-four per cent of every cache line she pulls in goes unread before the line is evicted to make room for the next struct. The algorithm has not slowed down. The cache has.
Cache-friendly layout is the discipline of arranging data so that the bytes the CPU is about to read are already next to the bytes it just read. Two layouts of the same logical data — array-of-structs vs struct-of-arrays, hot/cold splitting, alignment to line and page boundaries — produce dramatically different hit rates and dramatically different runtime, even when the algorithm is identical. The rule is one sentence: every cache line you pull from DRAM must carry useful payload, not metadata you ignore.
The two shapes of the same data — AoS vs SoA
There are two ways to lay out a collection of records in memory, and they are not equivalent to the cache. Array-of-structs (AoS) is what every textbook draws and what every ORM produces: one contiguous run of Order structs, each struct holding all of its fields together. orders[i].price and orders[i].qty sit a few bytes apart; orders[i].price and orders[i+1].price sit 384 bytes apart on Aditi's struct. Struct-of-arrays (SoA) transposes the matrix: instead of one array of records, you keep parallel arrays — prices[], qtys[], client_ids[], expiries[], ... — one array per field. prices[i] and prices[i+1] sit 8 bytes apart; prices[i] and qtys[i] sit megabytes apart. Same data, same total bytes, completely different access patterns.
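The stride difference is directly visible in numpy: a structured dtype gives an AoS layout whose per-field views stride over whole records, while plain arrays stride over single elements. A minimal sketch — the field names and the 40-byte pad are illustrative, not Kite's actual schema:

```python
import numpy as np

# AoS: one structured record per order; the 'price' view strides over records.
order_t = np.dtype([('client_id', 'i8'), ('qty', 'i8'),
                    ('price', 'f8'), ('pad', 'S40')])  # 64 bytes per record
aos = np.zeros(1000, dtype=order_t)
print(aos['price'].strides)   # (64,) — 64 bytes between consecutive prices

# SoA: one plain array per field; consecutive prices are adjacent.
prices = np.zeros(1000, dtype=np.float64)
print(prices.strides)         # (8,) — 8 bytes between consecutive prices
```

The stride is exactly the number of bytes the cache must skip over between the values the loop actually wants.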
The naming hints at the structural difference. AoS makes the record the unit of locality — fetch a record, you get all its fields together. SoA makes the field the unit of locality — fetch a field, you get its values across many records. The cache's question is "which unit of locality matches the access pattern?" If the loop reads many fields per record, AoS aligns; if the loop reads few fields per record, SoA aligns. Most analytical workloads — risk computation, financial aggregation, ad-attribution scoring, fraud-feature extraction — read few fields per record, which is why high-performance backend code at Razorpay, Stripe, Cloudflare, and the major HFT firms in Mumbai converges on SoA-style layouts even when the canonical OO design would suggest AoS.
Which one the cache prefers depends entirely on what your loop reads. If your loop touches every field of every record (rare — usually a serialiser), AoS wins because each record's fields cluster on adjacent lines. If your loop touches a handful of fields across every record (common — every analytical aggregation, every risk computation, every column-store query), SoA wins because the fields you want are tightly packed and the fields you ignore never cross the L1 boundary. The exposure-sum loop reads qty and price from every order; with AoS the CPU is forced to fetch all 384 bytes of every struct to read the 16 it cares about. With SoA, the CPU fetches only the qty and price arrays — 16 bytes per order, the entire fetch is payload.
Why this matters in cache cycles, not abstractly: a 64-byte cache line carrying 24 useful bytes has an effective bandwidth of (24/64) × DRAM-bandwidth = 37.5% of theoretical. A 64-byte line carrying 64 useful bytes is 100%. The L1d on a Sapphire Rapids core has ~1 TB/s of bandwidth; if your effective fraction drops to 37.5%, you have just turned a 1 TB/s pipe into a 375 GB/s pipe — and there is no profiler frame, no compiler flag, no algorithm tweak that will hide that loss. The only fix is to put the bytes you read next to each other.
This is the same reason analytical databases — DuckDB, ClickHouse, the Apache Arrow ecosystem, Snowflake's micro-partitions — all store data column-major rather than row-major. They are bandwidth-bound on every aggregation, and the scan loop reads only a handful of columns per query. Row-major would force them to drag every column of every row through the cache for every SELECT SUM(amount) FROM tx WHERE region='ap-south-1'. Column-major makes the scan exactly as expensive as the columns it touches and not a byte more. AoS-vs-SoA is the same trade-off at the in-memory data structure level that row-vs-column is at the storage level; the hardware reasons are identical.
Game engines hit the same wall from a different angle. Unity's ECS (Entity Component System) and Unreal's Mass framework abandoned the AoS object-oriented model for SoA because the per-frame physics step iterates over thousands of entities reading only position and velocity, never material or audio_emitter. The transition from class Entity { Position p; Velocity v; Material m; AudioEmitter ae; ... } arrays to struct ComponentArrays { Position[] positions; Velocity[] velocities; ... } doubled frame rates on the same hardware. Unity ships the framework as DOTS (Data-Oriented Technology Stack) — the name itself is the insight. Whether you call it SoA, columnar storage, or data-oriented design, the underlying physics is one cache line carrying useful payload instead of dragged-along metadata.
A short historical note worth carrying: the early 2000s C++ community spent a decade arguing that "object-oriented design" was the right way to structure performance-critical code. The Mike Acton talk at CppCon 2014 ("Data-Oriented Design and C++") was the inflection point — Acton, then engine architect at Insomniac Games, walked through layout-driven 5–10x speedups on real game-engine workloads and argued that thinking of memory as bytes-on-pages rather than as objects-with-fields was the only model that scaled. Within five years that view had moved into mainstream high-performance C++, then into Rust's idiomatic patterns, then into the Linux kernel's per-CPU and per-core data structures. The systems-performance community now treats layout-first thinking as the baseline; OO-first thinking is the legacy pattern that keeps producing slow code in 2026.
Measuring the layout cost — a numpy stride benchmark
The cleanest way to feel the layout penalty is to write the same arithmetic over the same logical data twice — once with AoS, once with SoA — and watch the per-element cost diverge. numpy makes this direct: a structured dtype gives you an AoS-shaped array in memory; parallel typed arrays give you SoA. The arithmetic is identical, the data is bit-for-bit equivalent, and the only difference is how the bytes are arranged. The L1/L2/LLC counters from perf stat will tell you exactly where the cost is being paid.
# layout_bench.py
# Compare AoS vs SoA scans on the same 8M-record logical table.
# Layout 1: structured dtype (AoS) — every record is contiguous,
#   with `qty` and `price` interleaved among 6 unread fields.
# Layout 2: parallel typed arrays (SoA) — qty[] and price[] are
#   separate arrays; the unread fields live elsewhere.
# Workload: compute exposure = sum(qty * price) over all records.
import numpy as np
import time

N = 8_000_000  # ~8 million orders, ~1.1 GB AoS (matches Zerodha test fixture)

# --- AoS layout: structured dtype with 8 fields per record ---
aos_dtype = np.dtype([
    ('tag',        'S16'),  # 16 bytes — compliance tag
    ('client_id',  'i8'),   #  8 bytes — read by hashmap path, not here
    ('qty',        'i8'),   #  8 bytes — READ
    ('price',      'f8'),   #  8 bytes — READ
    ('expiry',     'i8'),   #  8 bytes — unread
    ('instrument', 'S24'),  # 24 bytes — unread
    ('remarks',    'S40'),  # 40 bytes — unread
    ('reserved',   'S24'),  # 24 bytes — unread
])  # 136 bytes per record; each record spans just over two cache lines
aos = np.zeros(N, dtype=aos_dtype)
aos['qty'] = np.random.randint(1, 1000, N, dtype=np.int64)
aos['price'] = np.random.uniform(50, 5000, N)

# --- SoA layout: just the two columns we read, side by side ---
soa_qty = aos['qty'].copy()      # contiguous int64 array
soa_price = aos['price'].copy()  # contiguous float64 array

def scan_aos():
    t0 = time.perf_counter_ns()
    # numpy gathers 'qty' and 'price' fields with a 136-byte stride.
    total = (aos['qty'].astype(np.float64) * aos['price']).sum()
    return total, (time.perf_counter_ns() - t0) / 1e6

def scan_soa():
    t0 = time.perf_counter_ns()
    total = (soa_qty.astype(np.float64) * soa_price).sum()
    return total, (time.perf_counter_ns() - t0) / 1e6

if __name__ == "__main__":
    # Warm up the caches and let the prefetcher learn the stride pattern.
    for _ in range(3):
        scan_aos()
        scan_soa()
    aos_total, aos_ms = scan_aos()
    soa_total, soa_ms = scan_soa()
    print(f"AoS scan: {aos_ms:7.1f} ms ({aos_ms*1e6/N:5.1f} ns/record)")
    print(f"SoA scan: {soa_ms:7.1f} ms ({soa_ms*1e6/N:5.1f} ns/record)")
    print(f"speedup : {aos_ms/soa_ms:5.2f}x")
    assert abs(aos_total - soa_total) < 1.0, "totals must match"
Sample run on a c6i.4xlarge (Ice Lake, 8 cores, 48 KB L1d, 1.25 MB L2, 30 MB L3, DDR4-3200):
AoS scan: 612.3 ms ( 76.5 ns/record)
SoA scan: 98.7 ms ( 12.3 ns/record)
speedup : 6.20x
Six times faster, same arithmetic, same hardware, same totals. The wall-clock difference comes entirely from how many cache lines per record the scan has to drag through L1. The arithmetic is so cheap — one multiply, one accumulate per element — that on the SoA layout the loop is bounded by memory traffic, while on the AoS layout it is bounded by wasted memory traffic. Why the SoA version lands at ~12 ns/record: at 16 bytes payload per record (an int64 plus a float64), each 64-byte line holds 4 values of each column, so the scan's total DRAM payload is only 128 MB — about 5 ms of pure streaming time on a DDR4-3200 channel that delivers ~25 GB/s. The measured 12.3 ns/record is therefore dominated by L1→register traffic, the multiply, and numpy's temporaries (the astype copy and the product array are each written out and read back), not by DRAM latency. The AoS version drags one full 136-byte record through the hierarchy per element and pulls roughly 1.1 GB through the cache hierarchy instead of 128 MB — about 8.5x more data motion, which matches the 6.2x slowdown after subtracting the constant ALU-and-temporary cost.
A second confirmation comes from varying the record size. If you re-run the AoS benchmark after stripping the cold fields out of the dtype — leaving only qty and price so each record is 16 bytes — the AoS time drops to within 5% of SoA. The two layouts converge once each record fits the bytes-needed budget. This is the cleanest possible proof that the slowdown was layout, not algorithm: the same Python, the same numpy, the same for-equivalent vectorised op, but with smaller records the gap disappears.
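A standalone sketch of that control experiment — the same scan over a stripped, hot-fields-only dtype, re-creating the data rather than reusing layout_bench.py's arrays:

```python
import time
import numpy as np

N = 8_000_000
# Hot-only dtype: 16 bytes per record, every fetched byte is payload.
hot_t = np.dtype([('qty', 'i8'), ('price', 'f8')])
hot = np.zeros(N, dtype=hot_t)
hot['qty'] = np.random.randint(1, 1000, N, dtype=np.int64)
hot['price'] = np.random.uniform(50, 5000, N)

t0 = time.perf_counter_ns()
total = (hot['qty'].astype(np.float64) * hot['price']).sum()
ms = (time.perf_counter_ns() - t0) / 1e6
# With 16-byte records the AoS stride equals the payload; the time should
# land within a few per cent of the SoA scan on the same machine.
print(f"16-byte AoS scan: {ms:.1f} ms ({ms*1e6/N:.1f} ns/record)")
```

Same structured-dtype access path as the full benchmark, but with no cold bytes riding along on each line.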
To confirm the diagnosis isn't algorithm-level, run the script under perf stat and look at the cache counters:
sudo perf stat -e cycles,instructions,L1-dcache-load-misses,LLC-load-misses \
python3 layout_bench.py
The AoS run shows ~5x more L1 misses and ~3x more LLC misses than SoA. Instructions retired is roughly the same — both versions execute the same multiply-accumulate sequence. Cycles spent stalled on memory is where the 6.2x slowdown lives. If you only had a wall-clock measurement, you might suspect the multiplication. The PMU counters say "no — the multiplications are free; the cache misses are the cost".
A practical habit: every time you write a loop that scans more than 100,000 records and runs in a hot path, run it once under perf stat -d -d -d (the triple-detailed mode emits L1, L2, L3, and TLB counters together). The five seconds it takes to run the command answers a question that profilers usually don't — what is this loop actually waiting for? Most production performance investigations on hot-loop code paths converge on one of three diagnoses: front-end stalls (branch mispredict, see /wiki/branch-prediction-and-why-it-matters), back-end memory stalls (the AoS-style problem of this chapter), or back-end execution stalls (a long-latency divide or sqrt sitting on the critical path). The PMU tells you which one in roughly 10 seconds; the flamegraph never tells you. This is why senior engineers on performance-critical teams reach for perf stat before they reach for perf record.
Hot/cold splitting — the layout pattern that does not require column-major
AoS-vs-SoA is the textbook framing, but in production code you rarely get to redesign your structs into pure column-major arrays. What you can almost always do is hot/cold splitting: take your existing struct, identify the 1–3 fields that the hot loop reads on every iteration, and move only those fields onto a separate cache-line-aligned slice. The rarely-read fields ("cold" fields) stay in the original struct, accessible by index. The hot loop scans the small, dense hot slice; the rare path that needs cold fields takes one extra indirection.
Razorpay's payment-routing service uses this pattern for its Transaction records. The full struct is 312 bytes — vpa, merchant fields, KYC pointers, audit log, retry history, fraud scoring features. The fast routing decision reads only three fields per transaction: amount_paise, payer_bank_id, merchant_category_code — 24 bytes. Twenty-four hot bytes against 312 bytes loaded per transaction meant roughly 92% of every fetch was waste. After hot/cold splitting, the hot slice is 24 bytes per transaction packed contiguously; the cold slice is 288 bytes per transaction in a separate array, addressed by the same index. The router scans the hot slice; the rare audit query takes the index and goes to the cold slice. P99 routing latency dropped from 18 ms to 2.4 ms on the same 64-core EPYC pod, with no algorithmic change.
The cost of the indirection from hot-array index back to cold-array record is one extra cache miss per cold-path access, roughly 70 ns on a typical server. If the cold path runs at 1 Hz (an audit query per second), 70 ns is invisible. If it runs at 10 kHz, it costs 0.7 ms/sec of CPU — still negligible. Only when the cold path approaches the hot path's frequency does the indirection matter, and at that point the "cold" name is a misnomer and the field belonged in the hot slice anyway.
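In numpy terms, a hot/cold split is two parallel arrays sharing an index. A sketch with invented field names and sizes chosen to match the 24-byte-hot / 288-byte-cold shape described above, not Razorpay's actual schema:

```python
import numpy as np

N = 1_000_000
# Hot slice: the three fields the router reads on every decision — 24 B/record.
hot_t = np.dtype([('amount_paise', 'i8'),
                  ('payer_bank_id', 'i8'),
                  ('merchant_category_code', 'i8')])
# Cold slice: everything else — 288 B/record, touched only on the audit path.
cold_t = np.dtype([('vpa', 'S48'), ('kyc_ref', 'i8'), ('retry_history', 'S232')])

hot = np.zeros(N, dtype=hot_t)
cold = np.zeros(N, dtype=cold_t)

def route_exposure():
    # Hot path: streams only the dense 24-byte slice.
    return hot['amount_paise'].sum()

def audit(i):
    # Cold path: the shared index reaches the full record, one extra miss.
    return hot[i], cold[i]
```

The hot loop never pulls a byte of the cold slice through the cache; the cold path pays one indirection per access.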
Why hot/cold splitting beats inserting __attribute__((cold)) annotations: compiler hints like __builtin_expect or [[unlikely]] only affect branch prediction and instruction layout — they do not change where data sits in memory. The cache problem is data-side, not instruction-side. The compiler cannot rewrite your struct definition; the only entity that can is you. A struct with cold fields interleaved among hot fields will always pay the bandwidth tax, no matter how aggressively the compiler reorders the code that touches it. Layout is a human-side decision that no compiler in the next decade will make automatically — Rust, Go, Java, and C++ have all considered it; none have shipped it.
The pattern generalises: in any data-intensive loop where you read few fields per record, hot/cold splitting is a near-free optimisation. The cost is a small refactor (separate the hot fields, update the constructor, fix the audit-log path); the gain is whatever fraction of your cache line was previously waste. The asymmetry is what makes it worth doing — your cache line was almost always >50% waste, so the gain is almost always >2x. Hot/cold splitting is what production codebases at Stripe, Cloudflare, and Razorpay reach for after they outgrow their initial schemas; it is also what the Linux kernel does with task_struct (the scheduling fields are kept on a hot cache line, the bookkeeping fields are scattered). When you next find yourself benchmarking a hot loop and seeing a long pole on mem_load_retired.l1_miss, the first thing to ask is "how many bytes of each line am I actually reading?" If the answer is less than half, hot/cold splitting will move the needle.
Why this beats "just convert everything to SoA": a full SoA conversion of a 12-field struct touches every read site in the codebase — every audit log, every test fixture, every JSON serialiser, every gRPC stub. Hot/cold splitting touches only the hot path. The ROI per line of code changed is enormous: you change ~50 lines (a new struct, a hot constructor, the hot loop) and get the same cache-bandwidth gain that a full SoA refactor would deliver across thousands of lines. In any production codebase the refactor cost is the binding constraint, not the cache-friendliness ceiling.
A practical economic frame helps decide whether a hot/cold split is worth the engineering time. Estimate four numbers: the call frequency of the hot loop (Hz), the number of records per call, the bytes-fetched-per-record before the split, and the bytes-fetched-per-record after. The CPU time saved per second is freq × records × (before − after) / DRAM_bandwidth_per_core. On the Razorpay router, that worked out to: 12,000 Hz × 200 records × (312−24) bytes / 12 GB/s ≈ 58 ms of CPU saved per second per pod, multiplied by 64 pods = 3.7 CPU-cores recovered fleet-wide. At AWS pricing for the c6i family, that translates to roughly ₹38,000/month, which paid back the two-week refactor in eight weeks. The same arithmetic gives you a quick sanity check on whether to even start the work; if the saved CPU is below 0.1 cores fleet-wide, the layout fix is not the right place to invest engineering time.
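That back-of-envelope fits in a helper function. The inputs below are the Razorpay figures from this section, taking the router at roughly 200 records per call so the arithmetic lands on the quoted per-pod saving:

```python
def split_roi(call_hz, records_per_call, bytes_before, bytes_after,
              dram_bw_per_core=12e9):
    """CPU-seconds saved per wall-clock second by a hot/cold split."""
    saved_bytes_per_sec = call_hz * records_per_call * (bytes_before - bytes_after)
    return saved_bytes_per_sec / dram_bw_per_core

# Razorpay router: 12 kHz, ~200 records/call, 312 -> 24 bytes per record.
per_pod = split_roi(12_000, 200, 312, 24)
print(f"{per_pod * 1e3:.0f} ms CPU saved/sec/pod; "
      f"{per_pod * 64:.1f} cores fleet-wide")   # 58 ms; 3.7 cores
```

Below roughly 0.1 recovered cores fleet-wide, the function is telling you to spend the engineering time elsewhere.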
Alignment, padding, and the page boundary
Layout decisions that are correct at the cache-line level can still bleed throughput at the next-coarser unit — the page, which is 4 KB on x86 by default and 2 MB / 1 GB with hugepages. The TLB caches virtual-to-physical translations at page granularity. If your hot data is scattered across many small pages, you blow out the TLB exactly as you'd blow out L1 if your data were scattered across many cache lines. The fix is the same shape: keep hot data dense, on as few pages as possible.
A 4 KB page holds 64 cache lines. The dTLB on Sapphire Rapids has 64 entries for 4 KB pages — so 64 × 64 = 4096 cache lines, or 256 KB of data, can be addressed without a TLB miss. Sounds large, but a 1 GB hot SoA array is 262,144 pages against a TLB capacity of 64; every page change after the first 64 is a translation miss that costs a page-table walk (~100 ns per walk on cold L3). Hugepages collapse the math: a single 2 MB hugepage covers 32,768 cache lines; the same 1 GB array becomes 512 hugepages, comfortably within the second-level TLB's reach. The translation cost vanishes.
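The TLB-reach arithmetic above, written out so it can be re-run for other page sizes and entry counts:

```python
def tlb_reach_bytes(entries, page_size):
    """Bytes addressable without a TLB miss, at one translation per page."""
    return entries * page_size

KB, MB, GB = 1024, 1024**2, 1024**3
print(tlb_reach_bytes(64, 4 * KB) // KB, "KB reach with 4 KB pages")   # 256
print(tlb_reach_bytes(64, 2 * MB) // MB, "MB reach with 2 MB pages")   # 128
print(GB // (4 * KB), "small pages for a 1 GB array")                  # 262144
print(GB // (2 * MB), "hugepages for the same array")                  # 512
```

The ratio between array-pages and TLB-entries is the miss pressure; hugepages shrink it by a factor of 512.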
A practical pattern: when you allocate a hot slice, allocate it from a single contiguous block that aligns to the page. numpy.zeros(N, dtype=...) calls the system allocator; on Linux this almost always returns page-aligned memory for sufficiently large N. For smaller N or when you want guarantees, numpy.empty with posix_memalign underneath, or mmap with MAP_HUGETLB, gives you 2 MB hugepages that cover 512 cache lines per TLB entry. For the Aditi exposure-loop scenario, switching the hot SoA arrays to hugepages dropped TLB misses by 12x and shaved another 18% off the SoA scan time — the L1 cache was already happy, the TLB was the next bottleneck.
The pragmatic recipe for hugepage adoption looks like this: identify the hot SoA arrays in your service (usually one or two), allocate them with mmap(MAP_HUGETLB) or via a numpy harness that calls posix_memalign to a 2 MB boundary, and verify the migration with cat /proc/<pid>/smaps | grep -i huge. The kernel will report which mappings are backed by hugepages. On a service with 8–16 GB of resident hot data, the move from 4 KB pages to 2 MB hugepages typically reduces TLB miss rate from 4–8% of memory accesses down to under 0.1%, often shaving 5–15% off scan times that were already SoA-tight.
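A minimal sketch of a hugepage-backed hot column from Python, assuming Linux; MAP_HUGETLB requires hugepages to be reserved (vm.nr_hugepages > 0), so the sketch falls back to normal 4 KB pages when the kernel refuses. 0x40000 is the Linux x86-64 value of MAP_HUGETLB, used only where the mmap module does not export the constant:

```python
import mmap
import numpy as np

N = 1_000_000
HPAGE = 2 * 1024 * 1024
size = -(-(N * 8) // HPAGE) * HPAGE   # MAP_HUGETLB needs a 2 MB multiple

MAP_HUGETLB = getattr(mmap, "MAP_HUGETLB", 0x40000)  # Linux x86-64 flag value
flags = mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS
try:
    buf = mmap.mmap(-1, size, flags=flags | MAP_HUGETLB)
    backing = "2 MB hugepages"
except OSError:
    buf = mmap.mmap(-1, size, flags=flags)   # no hugepages reserved
    backing = "4 KB pages"

# Hot float64 column living directly on the mapping, no extra copy.
prices = np.frombuffer(buf, dtype=np.float64, count=N)
prices[:] = 0.0
print(f"hot column backed by {backing}")
```

Verify which path you actually got with cat /proc/<pid>/smaps, as described above.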
A second pattern is alignment of the start of each hot field array to a cache-line boundary. If your hot SoA has three arrays — qty[], price[], client_id[] — and they happen to start at addresses 0x...040, 0x...0E0, 0x...130, then the first elements of price and client_id straddle cache lines that may already be partly evicted by the time qty[] finishes streaming. Cache-line-aligning the start of each array means the prefetcher gets a clean run on each. numpy does this for you when arrays come from np.zeros/np.empty directly; if you carve subarrays out of a buffer with frombuffer, you have to align the offsets yourself.
Transparent Hugepages (THP) is the kernel feature that promotes 4 KB pages to 2 MB pages opportunistically. It is enabled by default on most Linux distributions but has known pathologies — fragmentation under churning allocations causes the kernel to spend CPU on background defragmentation, occasionally producing latency spikes. Production services with strict tail-latency budgets often run THP in madvise mode rather than always, and explicitly opt their hot allocations in via madvise(MADV_HUGEPAGE). The numactl --interleave flag also interacts with THP — interleaved allocations may not be eligible for THP on some kernel versions, so test before assuming.
A third, less-discussed pattern is the order in which fields appear inside a struct. C and C++ insert padding between fields to align each to its natural boundary, and pad the struct's tail so that arrays of it stay aligned. A struct declared bool b; int64 a; int32 c consumes 24 bytes (1 + 7 padding + 8 + 4 + 4 trailing padding); rearranged largest-to-smallest as int64 a; int32 c; bool b it consumes 16 bytes (8 + 4 + 1 + 3 trailing padding). On a 1-million-record array that saves 8 MB. The rule is "declare fields largest-first when the layout is for size; declare hot-fields-first when the layout is for cache locality". The two rules conflict, which is why hot/cold splitting eventually wins: it lets you have both — the hot slice is hot-fields-first and densely packed; the cold slice is largest-first and size-optimised.
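ctypes follows the C layout rules, so the padding arithmetic is directly checkable without leaving Python (field names are illustrative):

```python
import ctypes

class Padded(ctypes.Structure):
    # Declaration order bool-first: 1 B + 7 B padding + 8 B + 4 B + 4 B tail.
    _fields_ = [('b', ctypes.c_bool),
                ('a', ctypes.c_int64),
                ('c', ctypes.c_int32)]

class Packed(ctypes.Structure):
    # Largest-first: 8 B + 4 B + 1 B + 3 B tail padding.
    _fields_ = [('a', ctypes.c_int64),
                ('c', ctypes.c_int32),
                ('b', ctypes.c_bool)]

print(ctypes.sizeof(Padded), ctypes.sizeof(Packed))  # 24 16
```

Same three fields, a third of the array gone just by declaration order.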
The largest hugepage size on x86-64 is 1 GB. A single 1 GB page covers 16 million cache lines, an entire mid-sized hot SoA in one TLB entry. The trade is allocator inflexibility — you cannot allocate 700 MB inside a 1 GB hugepage and use the rest for something else; the page is allocated whole. Production services with steady multi-GB hot data and predictable lifetimes (caches, in-memory indices, ML model weights) reach for 1 GB pages; transient workloads with churning allocations stay on 2 MB.
The sub-microsecond cost of misalignment is small per access but real, and on extremely hot loops (>10^9 accesses/sec) it adds up. The diagnostic is perf stat -e dtlb_load_misses.miss_causes_a_walk,dtlb_load_misses.walk_pending; values above ~0.5% of loads on a hot loop are a signal to consider hugepages or larger contiguous allocations. Aditi's production loop sits at 0.08% TLB miss rate after the SoA + hugepage refactor; before it, the same loop sat at 9.4% and was paying ~12 ns per access for page-walk completions on top of the cache-miss tax.
A cleaner way to think about this trade is in terms of bandwidth efficiency — the ratio of useful payload bytes to bytes pulled through the cache. The Aditi exposure-loop on the original 384-byte AoS struct ran at 24/384 = 6.25% efficiency. After hot/cold splitting where the hot slice was 24 bytes per record packed contiguously, the same loop ran at 24/24 = 100% efficiency. The 16x ratio in efficiency closely tracks the observed 6x speedup, with the remaining 2.5x coming from compute being part of the cost (you cannot drop below the compute-only baseline no matter how efficient memory becomes). The arithmetic is universal: bandwidth efficiency is the ceiling on memory-bound speedup; you can never run faster than the ratio of payload to fetched bytes allows.
When the layout choice fails — five edge cases
The AoS-to-SoA migration is not always a free win. Five real production patterns push back, and they are worth recognising before a refactor lands and the regression shows up two weeks later.
The first is the mixed-purpose hot loop. Many production loops do not read just qty and price; they read those and compute a per-record decision based on expiry, client_id, instrument_isin. If your hot loop touches 10 of 12 fields, AoS is genuinely better — every line you fetch is mostly payload, and the inter-field reads happen at L1 hit speed because the fields are in the same line. Splitting an AoS into 12 separate SoA arrays in this case turns 10 reads from one cache line into 10 reads from 10 different cache lines, multiplying L1 pressure tenfold. The fix is to count fields read before refactoring; if the count is more than half, leave AoS alone.
The second is the insert-mostly path. SoA layouts are fast to scan but expensive to grow when records are appended one at a time — every insert has to extend N parallel arrays, each potentially triggering a realloc-and-copy on a different schedule. AoS appends one record per insert, one allocation per growth. Trading platforms like Zerodha that ingest thousands of new orders per second often keep the write path on AoS and convert to SoA in a separate "frozen" view that the read path scans. The conversion happens once per batch (every few hundred ms), not once per insert.
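The write-AoS / read-SoA pattern is a few lines in numpy — a sketch, assuming the frozen view is rebuilt once per batch rather than per insert:

```python
import numpy as np

order_t = np.dtype([('qty', 'i8'), ('price', 'f8'), ('meta', 'S48')])
book = np.zeros(0, dtype=order_t)   # AoS write side: one append per batch

def ingest(batch):
    """Append a batch of new orders to the AoS write side."""
    global book
    book = np.concatenate([book, batch])

def freeze():
    """Materialise dense SoA columns for the scan path (runs per batch)."""
    return np.ascontiguousarray(book['qty']), np.ascontiguousarray(book['price'])

batch = np.zeros(1000, dtype=order_t)
batch['qty'] = 2
batch['price'] = 10.0
ingest(batch)
qty, price = freeze()                       # dense, contiguous columns
exposure = float((qty * price).sum())       # scan runs at SoA speed
```

np.ascontiguousarray copies the strided field view into a packed column, which is exactly the once-per-batch transpose cost described above.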
The third is the cache-miss-bound vs branch-bound distinction. SoA helps cache-miss-bound loops. Branch-bound loops — where every record produces a different control-flow path through if ladders — are not memory-bound at all, and SoA may not move the needle. The diagnostic is the front-end vs back-end stall ratio in perf stat -e topdown-fetch-bubbles,topdown-slots-issued; if the loop is front-end-bound (branch-mispredict-driven), reorganising memory layout will not help. The fix in that case is branchless code (CMOVs, predicated SIMD), not layout.
The fourth is JIT-language overhead drowning the layout signal. A pure Python loop iterating with for o in orders: total += o.qty * o.price is bounded by interpreter dispatch — hundreds of ns per iteration regardless of layout, because the bytecode interpreter is doing far more work than the cache miss it would otherwise have. SoA helps numpy, NumPy-Numba, NumPy-Cython, and any compiled loop. It does not help interpreted CPython loops because the interpreter cost dominates by 30–100x. PyPy partially closes this gap; Numba closes it fully. If your hot loop is CPython, the layout discussion is premature; the first move is to vectorise via numpy or compile via Numba, then worry about layout.
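The interpreter-dominance claim is easy to verify directly — the same multiply-accumulate, once through CPython bytecode and once vectorised, with a small N so it runs quickly:

```python
import time
import numpy as np

N = 200_000
qty = np.random.randint(1, 1000, N).astype(np.float64)
price = np.random.uniform(50, 5000, N)

t0 = time.perf_counter_ns()
total_py = 0.0
for i in range(N):                       # one interpreter dispatch per element
    total_py += qty[i] * price[i]
py_ns = (time.perf_counter_ns() - t0) / N

t0 = time.perf_counter_ns()
total_np = float((qty * price).sum())    # one compiled pass over both columns
np_ns = (time.perf_counter_ns() - t0) / N

print(f"CPython loop: {py_ns:6.1f} ns/record")
print(f"numpy       : {np_ns:6.1f} ns/record")
```

Whatever the layout, the interpreted loop sits orders of magnitude above the cache-miss cost it would otherwise expose.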
The fifth is the cold-data-frequently-read mismatch. Hot/cold splitting assumes the cold path is rare — say, less than 1% of accesses. If the cold path turns out to be 30% (an audit query that every customer-support ticket triggers, fired hundreds of times per minute), the indirection becomes the dominant cost, and the original AoS layout was actually correct for the real workload. The fix is to measure access frequencies before splitting, not assume them. Profilers like py-spy and pprof answer this directly — count the calls into the cold-path code path and compare to hot-path calls.
Why measuring before refactoring matters more in layout work than in algorithmic work: an O(n²)-to-O(n log n) refactor is right at any access frequency — the gain is asymptotic and dominates eventually. A layout refactor is right only at certain frequency ratios; the gain is a constant factor and can be eaten back by a wrongly-classified field. The break-even point is roughly hot:cold ≈ 100:1; below that ratio, the indirection cost catches up with the densification gain. Production telemetry — service-level metrics that count the QPS of each endpoint that touches the struct — is the data you need before approving the refactor. Engineering instinct alone is wrong about this often enough that the discipline of measuring is what separates layout fixes that ship from layout fixes that get reverted.
Common confusions
- "SoA is always better than AoS." Only when the hot loop reads few fields per record. A serialiser that writes every field of every record over the wire prefers AoS — it can memcpy one contiguous record at a time and the cache fetches it as a unit. A graphics engine that processes per-vertex normal+colour+texcoord together prefers AoS for the same reason. The rule is read-pattern-driven: count the fields your hot loop reads, divide by the total fields, and if the fraction is below ~30%, SoA or hot/cold splitting will help.
- "Smaller structs are always faster." Not when smaller means losing alignment. A 56-byte struct (instead of 64) saves 12% memory but causes some records to straddle cache lines, doubling the cache lines touched on every access for those records. Padding the struct up to 64 bytes — exactly one cache line — is often faster than the unpadded version, even though it uses more memory. Memory is GB-cheap; cache misses are ns-expensive.
- "The compiler will reorder my struct fields for cache friendliness." It will not. C and C++ guarantee that field order matches the source; the compiler can pad between fields for alignment but cannot move fields around — the layout is part of the ABI. If you want hot fields at the front, you write them at the front of the struct definition. Rust's #[repr(C)] gives the same guarantee; the default #[repr(Rust)] may reorder fields to shrink padding, but not for cache locality. Java's JVM may likewise reorder fields for alignment; Go keeps source order, so minimising padding there is on you. No mainstream compiler reorders fields for cache locality.
- "Cache friendliness only matters for hot loops." Hot loops are where the gains are biggest, but a one-time scan of a 100 GB dataset is also a hot loop while it runs. The Zerodha end-of-day reconciliation reads every order from the day exactly once; AoS-to-SoA savings turn a 24-second scan into a 4-second scan, which is the difference between making the 19:00 SEBI submission deadline and missing it. "Hot" is about access frequency per byte, not how often the code runs.
- "AoS-vs-SoA is the same trade-off as row-store vs column-store." Mostly the same idea, with different mechanics. Row-store vs column-store is about disk layout and decompression boundaries — Parquet's column chunks are independently compressible. AoS-vs-SoA is about in-memory cache-line packing — there is no compression involved. A column-store database can use AoS in memory after loading a column block; an in-memory analytical engine like DuckDB usually keeps SoA all the way through. They share the underlying intuition (read what you need, nothing more) but the implementations are independent.
- "My data is small enough to fit in L2; layout doesn't matter." An L2 hit costs 12–14 cycles on Sapphire Rapids — about 4 ns; an L1 hit costs 4–5 cycles, about 1.3 ns. A loop that runs at 1 cycle/iteration out of L1 can slow to 3 cycles/iteration out of L2 — a 3x slowdown invisible to the programmer because both are "fast". Layout still matters at the L1 boundary even when working sets fit in L2. The only working-set size where layout truly stops mattering is one that fits in L1d (32–48 KB), and most production records exceed that immediately.
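The straddle penalty behind the "smaller structs are always faster" confusion can be counted exactly — for a packed array of a given record size, what fraction of records cross a cache-line boundary:

```python
from math import gcd

def straddle_fraction(record_bytes, line_bytes=64):
    """Fraction of records in a packed array that cross a cache-line boundary."""
    period = line_bytes // gcd(record_bytes, line_bytes)  # records per repeat
    straddles = sum(1 for i in range(period)
                    if (i * record_bytes) % line_bytes + record_bytes > line_bytes)
    return straddles / period

print(straddle_fraction(56))  # 0.75 — three of every four 56-byte records straddle
print(straddle_fraction(64))  # 0.0  — line-sized records never straddle
```

Three out of four 56-byte records touch two lines instead of one, which is the hidden cost of the 12% "saving".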
Going deeper
When the prefetcher saves AoS
Modern hardware prefetchers — Intel's L2 spatial prefetcher, AMD's L1/L2 stride detectors — can hide a surprising amount of AoS cost when the access pattern is purely sequential and the stride is regular. If your loop walks orders[i].qty for i = 0..N, the prefetcher detects the constant 384-byte stride after a few accesses and starts speculatively fetching orders[i+8].qty's line ahead of time. The L1 misses still happen, but they overlap with computation; the apparent latency drops from 200 cycles to 5–10 cycles per miss.
The catch is that prefetcher hit rate depends on stride regularity. The moment your loop becomes data-dependent — orders[idx[i]].qty where idx[] is random — the prefetcher gives up, and AoS reverts to its full cost. Hash-table probes, graph traversals, and tree walks are all AoS-hostile for this reason. SoA is more forgiving: a random access into qty[] still pulls a full 64-byte line, but that line holds eight qtys rather than one-sixth of a stranger's struct, and the cache footprint of "all the qtys" is small enough to fit in L2 for moderate-sized tables. Production code at Razorpay uses AoS for sequentially-scanned tables (the daily reconciliation feed, ordered by timestamp) and SoA for index-driven random access (the per-merchant dashboard, jumping around by merchant_id).
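The AoS-vs-SoA footprint difference under random access can be sketched with stdlib-only Python (the record shape and sizes here are hypothetical; CPython overheads dominate absolute timings, so this illustrates the access pattern, not raw hardware cost):

```python
import array
import random

N = 200_000
random.seed(42)

# AoS stand-in: one heap object per record, fields bundled together.
# Each access chases a pointer to a record scattered on the heap.
aos = [(i, i % 100, 99.5 + (i % 7)) for i in range(N)]

# SoA: one contiguous stdlib array per field.
qty = array.array('q', (i % 100 for i in range(N)))
price = array.array('d', (99.5 + (i % 7) for i in range(N)))

# Data-dependent indices: the stride prefetcher cannot help either
# layout, but SoA only drags the qty column through the cache.
idx = [random.randrange(N) for _ in range(50_000)]

total_aos = sum(aos[i][1] for i in idx)   # touches whole records
total_soa = sum(qty[i] for i in idx)      # touches only the qty column
assert total_aos == total_soa
```

In a compiled language the same transposition is where the 3–5x random-access wins come from; here it demonstrates that the two layouts carry identical information with very different cache footprints.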
For Python-heavy data work, the same prefetcher question maps to "use pandas / Polars / numpy versus a list of dataclass instances". A df['amount'].sum() on a pandas Series runs at numpy speed because the column is one contiguous array — the prefetcher reads it linearly, hit-rate near 100%. The same aggregation over a List[Transaction] is 50–100x slower because every record dispatch goes through the CPython object model and every record is a heap pointer with no spatial locality. Polars, the Rust-backed DataFrame library popular in Indian fintech for backtesting, uses Apache Arrow as its in-memory format, so columns are already in the layout LLVM-vectorised kernels can stream over directly. When the choice is "pandas vs list-of-objects" for analytical work, the column-major library wins on layout grounds before any other consideration enters.
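A stdlib-only stand-in for the pandas-vs-objects comparison (the Transaction class and field values are hypothetical; the stdlib array module plays the role of the contiguous column, and the gap widens further with numpy or Polars because they also skip per-element boxing):

```python
import time
from array import array
from dataclasses import dataclass

@dataclass
class Transaction:
    merchant_id: int
    amount: float

N = 200_000
# "List of objects": every record is a separate heap allocation.
txns = [Transaction(i % 500, float(i % 1000)) for i in range(N)]
# "Column": one contiguous buffer of doubles.
amounts = array('d', (float(i % 1000) for i in range(N)))

t0 = time.perf_counter()
s_objects = sum(t.amount for t in txns)   # pointer chase + attribute lookup per record
t1 = time.perf_counter()
s_column = sum(amounts)                   # linear walk over one buffer
t2 = time.perf_counter()

assert s_objects == s_column
print(f"objects: {(t1 - t0) * 1e3:.1f} ms, column: {(t2 - t1) * 1e3:.1f} ms")
```

Both sums visit the same values in the same order, so the results are identical; only the layout differs.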
Padding traps and the diminishing returns of cache-line alignment
It is tempting to round every record up to a multiple of 64 bytes "just in case". For some structs this helps (false-sharing prevention, see /wiki/false-sharing-the-silent-killer). For most read-mostly structs it hurts: padding a 40-byte struct to 64 bytes wastes 24 bytes per record, which on a 100M-record array costs 2.4 GB of cache and DRAM bandwidth. The right rule is "round up to a cache line only when there is a coherence reason to do so" — write contention from multiple cores, or hot/cold separation. For purely read-side records that fit naturally inside a line, leave them alone.
A subtler trap is the "I'll add a _ field at the front for alignment" pattern. Some codebases pad in front of a hot struct to ensure the hot field starts at offset 0. This works exactly once: the moment someone adds a new field at the top of the struct, the padding is wrong and the layout silently de-aligns. The maintainable pattern is alignas(64) on the struct or array, not manual byte padding. The compiler tracks the alignment; reviewers don't have to.
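The 40-byte-to-64-byte arithmetic can be checked directly with ctypes, which lays structs out with real C alignment rules (the field names here are hypothetical; the sizes are what the text describes):

```python
import ctypes

class OrderHot(ctypes.Structure):
    """Hypothetical 40-byte record: six naturally-aligned fields."""
    _fields_ = [("ts", ctypes.c_int64),
                ("qty", ctypes.c_int64),
                ("price", ctypes.c_double),
                ("client", ctypes.c_int32),
                ("region", ctypes.c_int32),
                ("flags", ctypes.c_int64)]

class OrderLinePadded(ctypes.Structure):
    """Same fields rounded up to a full 64-byte line 'just in case'."""
    _fields_ = OrderHot._fields_ + [("_pad", ctypes.c_char * 24)]

assert ctypes.sizeof(OrderHot) == 40
assert ctypes.sizeof(OrderLinePadded) == 64

# On a 100M-record array, that "just in case" padding costs 2.4 GB
# of DRAM and cache bandwidth for records nobody writes concurrently.
waste = 100_000_000 * (ctypes.sizeof(OrderLinePadded) - ctypes.sizeof(OrderHot))
print(f"{waste / 1e9:.1f} GB wasted")
```

This is the read-mostly case where rounding up hurts; the false-sharing chapter covers the write-contended case where it helps.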
The Hotstar IPL viewer-counter mismatch
In May 2024 Hotstar's per-region viewer-counter service started showing 70 ms p99 instead of its 6 ms target during the IPL playoffs. The hot loop scanned a []ViewerSession with 1.4M sessions and computed region -> active_count. The struct was 192 bytes; the loop read region (4 bytes) and last_seen_ts (8 bytes) per session. Twelve useful bytes out of 192. The fix was a one-day refactor: parallel regionByIdx []uint16 and lastSeenByIdx []int64 arrays alongside the original []ViewerSession. The hot scan switched to the new arrays; cold paths (kicking idle sessions, audit logs) kept using the original struct.
P99 dropped from 70 ms to 7 ms after the refactor. The total memory used grew by 14 MB (the two new arrays). The codebase changed in three places — the constructor, the hot scan, and a smoke test. This is the canonical economic profile of a layout fix: enormous performance gain for a tiny code change, because the original code was spending roughly 94% of its memory traffic (180 of every 192 bytes per record) on data the loop never read. When you next see a Go service that spends most of its CPU time in a struct-scanning loop, the first hypothesis to test is "what fraction of the bytes per record does this loop actually use?" If the answer is below 30%, hot/cold splitting will pay for itself within the day.
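The Hotstar fix was Go; the same hot/cold split can be sketched in this chapter's Python (the session fields and counts below are hypothetical stand-ins for the 192-byte struct):

```python
from array import array

class ViewerSession:
    """Cold record: the full session, including everything the hot scan never reads."""
    def __init__(self, region: int, last_seen_ts: int, payload: str):
        self.region = region
        self.last_seen_ts = last_seen_ts
        self.payload = payload  # audit fields, device info, etc.

BASE = 1_700_000_000
sessions = [ViewerSession(i % 30, BASE + i, "x" * 160) for i in range(200_000)]

# Hot/cold split: two parallel hot arrays, kept in sync with the cold list.
region_by_idx = array('H', (s.region for s in sessions))          # uint16
last_seen_by_idx = array('q', (s.last_seen_ts for s in sessions))  # int64

# The hot scan touches only the two dense arrays — 10 bytes per session
# instead of the whole record.
cutoff = BASE + 100_000
active = [0] * 30
for i in range(len(region_by_idx)):
    if last_seen_by_idx[i] >= cutoff:
        active[region_by_idx[i]] += 1

assert sum(active) == 100_000  # sessions with last_seen >= cutoff
```

The cold paths keep using the sessions list unchanged, which is why the refactor only touches the constructor and the hot scan.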
The Hotstar postmortem documents one detail worth carrying forward. The team had previously assumed the bottleneck was lock contention on the per-region map, because flamegraphs showed time in sync.Map.Load and sync.Map.Store. They added shard-locking, lock-free maps, atomic counters — three weeks of refactors, none of them moved the needle, because the real cost was the cache lines being dragged through DRAM in between the locked sections. The lock was waiting for memory, not for other locks. Only when an SRE ran perf stat and saw an 18% L1d miss rate on an arithmetic-only loop did the team look at the struct definition. The lesson generalises: when the flamegraph and the PMU disagree about where time is being spent, believe the PMU.
Polymorphism, virtual tables, and the layout it forces
Object-oriented codebases that rely on inheritance hit a layout problem that AoS-vs-SoA does not directly address. Every virtual method call in C++, every interface method dispatch in Java, every dyn Trait call in Rust costs an extra indirection through a vtable. In C++ the vtable pointer lives inside the object, adding 8 bytes per record; Rust keeps it in the fat pointer instead — but the worse cost is common to all three: polymorphic objects must be addressed by pointer rather than held by value, which means an array of polymorphic objects is actually an array of pointers, with the objects themselves scattered across the heap. A scan of one million Shape* pointers loads one cache line per pointer just to follow the pointer, then another cache line per object to read its fields. Two cache misses per record, no spatial locality on the objects themselves.
The cache-friendly answer is the type-tagged union (or enum in Rust, oneof in protobuf), where each record carries a small discriminator and the variants are stored inline. The scan loop branches on the tag and dispatches statically; no vtable, no pointer chase. Game engines, regex engines, JIT compilers, and graph processors all converge on this pattern when performance matters. The trade is open-ended extensibility for cache locality — the set of variants is closed at compile time — and the gain is often 3–5x on hot scans because every record is now in-place rather than pointer-chased. This is the layout reason languages with strong sum-type support (Rust, OCaml, Swift, Kotlin) have an edge on classical-OO languages (Java, C#) for performance-critical work — not because the compilers are better, but because the idiomatic data structures fit cache better.
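The two layouts can be contrasted in miniature (a CPython sketch of the shape of the transformation — the real wins come in compiled languages, and the Circle/Square types are hypothetical):

```python
import math
from array import array

# Polymorphic layout: a heterogeneous list of objects, one pointer
# chase and one dynamic dispatch per record.
class Circle:
    def __init__(self, r): self.r = r
    def area(self): return math.pi * self.r ** 2

class Square:
    def __init__(self, s): self.s = s
    def area(self): return self.s ** 2

shapes = [Circle(1.0), Square(2.0), Circle(3.0)]
total_oo = sum(s.area() for s in shapes)

# Tagged-union layout: a dense tag column plus an inline payload
# column. The loop branches on the tag — no vtable, no pointer chase.
CIRCLE, SQUARE = 0, 1
tags = array('b', [CIRCLE, SQUARE, CIRCLE])
dims = array('d', [1.0, 2.0, 3.0])

total_tagged = 0.0
for t, d in zip(tags, dims):
    if t == CIRCLE:
        total_tagged += math.pi * d * d
    else:
        total_tagged += d * d

assert abs(total_oo - total_tagged) < 1e-9
```

Variants with differently-sized payloads need either a union-width payload column or one column per variant; the discriminator-plus-inline-data principle is the same.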
Reproduce this on your laptop
```shell
# Linux/macOS, Python 3.11+
python3 -m venv .venv && source .venv/bin/activate
pip install numpy

# AoS vs SoA scan benchmark (8M records, ~1 GB AoS):
python3 layout_bench.py

# Confirm the cache story with perf (Linux only):
sudo apt install linux-tools-common linux-tools-generic
sudo perf stat -e cycles,instructions,L1-dcache-load-misses,LLC-load-misses \
    python3 layout_bench.py
```
Expect 4–8x SoA speedup on any modern x86 server (2018+); ARM laptops show 3–5x because their LLC is closer to DRAM bandwidth. The shape — AoS pulls 8–10x more data through the cache, SoA pulls payload-only — is universal.
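The chapter invokes layout_bench.py without listing it; a minimal sketch of what such a script might contain is below (the record shape, field names, and the 1M record count — scaled down from the chapter's 8M for a quick run — are assumptions):

```python
import time
import numpy as np

N = 1_000_000  # scaled down from the chapter's 8M records

# AoS: one structured array, 24 hot bytes inside a 128-byte record.
# The 'cold' void field stands in for the never-read fields.
aos_dtype = np.dtype([('client_id', '<i8'), ('qty', '<i8'),
                      ('price', '<f8'), ('cold', 'V104')])
aos = np.zeros(N, dtype=aos_dtype)
aos['qty'] = np.arange(N) % 100
aos['price'] = 99.5

# SoA: contiguous copies of the two hot columns.
qty = np.ascontiguousarray(aos['qty'])
price = np.ascontiguousarray(aos['price'])

t0 = time.perf_counter()
exposure_aos = (aos['qty'] * aos['price']).sum()   # strided 128-byte reads
t1 = time.perf_counter()
exposure_soa = (qty * price).sum()                 # dense 8-byte reads
t2 = time.perf_counter()

assert abs(float(exposure_aos) - float(exposure_soa)) < 1e-6
print(f"AoS {t1 - t0:.3f}s  SoA {t2 - t1:.3f}s")
```

The AoS pass drags the 104 cold bytes of every record through the cache; the SoA pass reads payload only, which is the whole effect the perf counters above make visible.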
Where this leads next
A single lesson runs through this chapter: the cache fetches 64 bytes whether you read 8 of them or 64. Every layout decision that ignores this fact is paying a tax — the AoS tax of dragged-along unread fields, the misalignment tax of straddled lines, the TLB tax of scattered pages. Once you carry the model, the right shape for any hot loop becomes calculable: count the bytes you read, divide by the bytes you fetch, and any answer below 1.0 (most are far below) is wasted bandwidth that a layout change can recover.
The mental shift is from thinking about records to thinking about cache lines. A senior engineer reading a flamegraph sees not "the order-loop is hot" but "the order-loop is bandwidth-bound, and at 24 useful bytes out of every 384-byte struct it can never run faster than 6.25% of the bandwidth ceiling". That arithmetic — done in their head, in 30 seconds — converges on the right hypothesis before any profiling tool is opened. The hypothesis is testable in another 30 seconds with perf stat -d. The fix is testable in another hour with a parallel SoA array. Three cheap rounds — hypothesis, measurement, fix — decide whether the loop runs at full speed or at 16% of full speed; the discipline is what makes the difference, not hours of effort.
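The 30-second arithmetic is literally one division (the helper below assumes line-aligned records; unaligned records can straddle one extra line, making real utilization slightly worse):

```python
import math

def line_utilization(hot_bytes: int, record_bytes: int, line: int = 64) -> float:
    """Fraction of the bytes pulled through the cache that the loop reads.

    Assumes records start line-aligned; a record occupies
    ceil(record_bytes / line) full lines per fetch.
    """
    lines_per_record = math.ceil(record_bytes / line)
    return hot_bytes / (lines_per_record * line)

# Aditi's order loop: 24 hot bytes in a 384-byte struct.
print(f"{line_utilization(24, 384):.4f}")
# Hotstar's session loop: 12 hot bytes in a 192-byte struct.
print(f"{line_utilization(12, 192):.4f}")
```

Both of this chapter's case studies land at 6.25% utilization, which is why both saw roughly an order of magnitude of recoverable bandwidth.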
The next chapters extend this from in-cache layout to the memory hierarchy below it:
- /wiki/cache-lines-and-why-64-bytes-rules-everything — the unit of motion that makes layout matter at all.
- /wiki/hardware-prefetchers-and-when-they-help — when the prefetcher can hide AoS cost, and when it cannot.
- /wiki/tlb-and-address-translation-costs — the page-level analogue of the cache-line story, and where hugepages enter.
- /wiki/numa-topology-and-page-placement — when the layout question becomes "which socket's DRAM holds these bytes".
- /wiki/columnar-storage-and-why-analytical-engines-prefer-it — the on-disk analogue, where SoA becomes Parquet and ORC.
- /wiki/cache-coherence-mesi-moesi — the protocol that makes layout in multi-core code more than just a single-core question.
- /wiki/simd-and-vector-instructions-sse-avx-512 — SIMD requires SoA layout to vectorise efficiently; AoS forces gather/scatter at much higher cost.
A working backend engineer at Razorpay or Zerodha or Hotstar does not redesign their structs every week. But they do read flamegraphs every week, and the answer to "this loop is slower than it should be" is layout often enough — perhaps a quarter of the time — that the discipline of asking "how many bytes per record does this loop actually read?" pays off as a permanent habit. The bytes you don't read are the bytes you pay the most for.
One final calibration: the 6x speedup measured on the c6i.4xlarge in this chapter is the floor, not the ceiling. On hardware where DRAM is more constrained (small EC2 instance types with throttled memory bandwidth, ARM-based mobile-class servers), the ratio approaches 10x. On 4-socket servers where AoS records straddle NUMA nodes, the ratio crosses 15x once cross-socket coherence enters. The trend goes one way: as cores get faster relative to DRAM, the cost of wasted bandwidth grows. A loop that pays a 2x layout tax today will pay a 4x tax on the same code in five years' time, because DRAM is not getting faster, but cores are. Cache-friendly layout is a discipline that compounds: every refactor that pulls payload-only data into the hot loop is an investment that pays off for the lifetime of the code.
References
- Ulrich Drepper, "What Every Programmer Should Know About Memory" (2007) — §6.2 (data layout), still the single best treatment of cache-friendly layout 18 years on.
- Brendan Gregg, Systems Performance (2nd ed., 2020) — §6.4 (cache analysis), §6.6 (PMU counters for layout-driven misses).
- Intel® 64 and IA-32 Architectures Optimization Reference Manual — §3.6 (data layout for cache locality), §B.4 (relevant uncore PMU events).
- Daniel Lemire, "Data structures benchmarks: AoS vs SoA" — careful microbenchmarks showing where the layout effects appear and where they don't.
- Mark Raasveldt and Hannes Mühleisen, "Don't Hold My Data Hostage — A Case for Client Protocol Redesign" (VLDB 2017) — DuckDB's argument for column-major all the way through, with measurements.
- Apache Arrow columnar format spec — the production answer to "what does SoA look like as a serialisation format".
- /wiki/cache-lines-and-why-64-bytes-rules-everything — the line-level substrate that makes layout decisions matter.
- /wiki/false-sharing-the-silent-killer — the coherence-driven reason some structs need padding even when the hot loop is single-threaded-friendly.
The unifying message for any reader who works on hot-path Indian-scale services: cache-friendly layout is not exotic optimisation. It is the lowest-effort, highest-ROI lever in performance engineering after fixing algorithmic mistakes. A senior engineer at Razorpay or Hotstar will tell you the same thing — most of the wins they ship in any given quarter come from layout work, because the surrounding codebase rarely had layout in mind when it was written. The opportunity is sitting in plain sight in nearly every backend service over a thousand lines.