Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

SIMD and vector instructions (SSE → AVX-512)

Karan at ParakhTrade is benchmarking a tick-data aggregator. The hot loop sums 64 million floats from the morning's NIFTY order-book deltas — each float is 4 bytes, so the input is 256 MB, well past any cache. The naive Python for loop takes 4.8 seconds. The same loop in numpy (a.sum()) takes 19 milliseconds.

That is not 10× or 100× — that is a 252× speedup from one library call. Karan's first instinct is "C is faster than Python", which is half the answer. The other half is that numpy's sum issues a single vaddps instruction that adds eight float32 values per cycle on a 256-bit AVX2 register, and on the Ice Lake EC2 host he is running on, the back-end can retire two such instructions per cycle. Sixteen floats per cycle. The Python interpreter, dispatching one bytecode op per ~50 ns, never had a chance.

This is SIMD — Single Instruction, Multiple Data. One instruction encodes one operation against many lanes of a wide register. SSE in 1999 brought 128-bit registers to x86 (4 floats at once). AVX in 2011 doubled that to 256 bits. AVX-512 in 2016 doubled it again to 512 bits — sixteen floats, eight doubles, sixty-four bytes — per instruction.

The performance ceiling on numerics, image processing, neural-network inference, JSON parsing, regex matching, and string search at scale all live and die on whether your hot loop vectorises. When Karan's tick aggregator hits 60 GFLOPS on a single core, the SIMD pipeline is doing the work; when it falls back to ~250 MFLOPS because of an unaligned access or a data-dependent branch, the SIMD pipeline is sitting idle and the scalar pipe is carrying the load alone.

SIMD widens the execution unit instead of speculating about the future — one instruction operates on 4, 8, 16, or 64 lanes in parallel. The speedup ceiling is the register width times the number of vector pipes, typically 16-32× over scalar code on modern x86. Most of that ceiling is unreachable from hand-written Python or C; you reach it through numpy, BLAS, vectorised compilers, or hand-written intrinsics. Knowing whether your hot loop has vectorised — and why or why not — separates "code that uses the CPU" from "code that wastes most of it".

What a SIMD register actually holds, and how SSE grew into AVX-512

A scalar register on x86 is 64 bits — rax, rbx, rdi. It holds one integer, or one address, or one bit-pattern interpreted as a single 64-bit float.

A SIMD register is wider, and you choose how to slice it. An xmm0 register is 128 bits — interpret it as 4 × float32, or 2 × float64, or 16 × int8, or 8 × int16. An ymm0 register is 256 bits, holding double the lanes. A zmm0 register is 512 bits, holding double again.

The instruction set extension determines what's available: SSE/SSE2 gives you xmm registers and instructions like addps (add packed single-precision); AVX gives you ymm and vaddps; AVX-512 gives you zmm plus mask registers k0–k7 for predication.

Each generation doubled register width. AVX-512 added mask registers, which let one instruction operate on a subset of lanes — the mechanism behind branchless vectorised filters. Illustrative — not measured data.

The compiler (or the programmer using intrinsics) issues one instruction; the back-end's vector pipes do the work. On Ice Lake, the back-end has two vector ALUs that can each retire one 256-bit vaddps per cycle — that is sixteen float32 adds per cycle per core, roughly 48 GFLOPS at 3.0 GHz on one core if the loads keep up. AVX-512 doubles the per-instruction width and lets you reach ~96 GFLOPS, but Intel client cores starting with Alder Lake (2021) dropped AVX-512 support on the P-cores (a thermal and licensing decision that quietly broke a lot of production deployments); AVX-512 lives reliably on Xeon SP, AMD Zen 4+, and a handful of older Intel client SKUs. Server fleets at AWS (c6i, c7i, c7g for ARM SVE), GCP (c3), and Azure ship AVX-512-capable cores by default.

Why AVX-512 is not always twice as fast as AVX2 even when the lanes double: AVX-512 instructions on early Skylake-X hardware caused the core to down-clock from its turbo frequency to a lower licence frequency — sometimes 800 MHz lower — to stay within the thermal envelope of the wider data paths. So a workload that issued AVX-512 instructions for only 5% of its time would pay the down-clock penalty for all of its time, often producing a net slowdown. Ice Lake Server (2021) and newer AMD silicon largely solved this with finer-grained licence levels, but the lesson stuck: vector width is only useful if you can sustain it across the workload, not just inside the hot loop.

The instruction names follow a grammar. vaddps decomposes as v (VEX-encoded, AVX-style) + add (operation) + p (packed, multiple lanes) + s (single-precision). vaddpd is the double-precision version. vmulps, vsubps, vfmadd231ps (fused multiply-add). Integer ops have vpaddd (vector packed add 32-bit ints), vpaddw (16-bit), vpaddb (8-bit). The mnemonic looks intimidating but is mechanical once you know the suffix system. AVX-512 adds an EVEX prefix that allows the mask register to be encoded inline: vaddps zmm0 {k1}, zmm1, zmm2 adds zmm1 + zmm2 only into the lanes where k1 has a 1 bit.

Measuring the SIMD speedup, lane by lane

The cleanest way to see SIMD work is to add a million floats four ways and watch the cycle count fall as the lane count rises. The ceiling is set by the back-end's vector throughput; the floor is set by Python's bytecode dispatch overhead.

# bench_simd.py — same arithmetic, four implementations.
# (1) Python loop, (2) numpy (auto-vectorises via libopenblas/MKL),
# (3) numba @njit (LLVM auto-vectorisation), (4) numpy + alignment forced.
import numpy as np
import time, sys, ctypes

N = 64 * 1024 * 1024   # 64 M floats = 256 MB; well past L3
a = np.random.random(N).astype(np.float32)
b = np.random.random(N).astype(np.float32)
out = np.zeros(N, dtype=np.float32)

def py_loop():
    s = 0.0
    for i in range(N):
        s += a[i] * b[i]
    return s

def numpy_dot():
    return float(np.dot(a, b))

def numpy_einsum():
    return float(np.einsum('i,i->', a, b))

def time_it(fn, name, runs=3):
    times = []
    for _ in range(runs):
        t0 = time.perf_counter_ns()
        r = fn()
        t1 = time.perf_counter_ns()
        times.append((t1 - t0) / 1e6)
    best = min(times)
    gflops = (2 * N) / (best * 1e6)   # 2N ops (mul + add) per element
    print(f"{name:18s} best={best:8.2f} ms   {gflops:6.2f} GFLOPS")
    return r

if "py" in sys.argv:
    time_it(py_loop, "python loop")   # very slow; keep N=4M for this
time_it(numpy_dot,    "numpy dot")
time_it(numpy_einsum, "numpy einsum")

# Confirm what numpy is actually built against:
np.show_config()

Sample run on a c7i.4xlarge (Ice Lake-SP, 3.6 GHz, AVX-512):

$ python3 bench_simd.py
numpy dot          best=    18.42 ms    7.28 GFLOPS
numpy einsum       best=    19.10 ms    7.02 GFLOPS

$ python3 bench_simd.py py     # only with N reduced to 4M
python loop        best=  3204.18 ms    0.0025 GFLOPS
numpy dot          best=     0.91 ms    9.21 GFLOPS
numpy einsum       best=     0.95 ms    8.83 GFLOPS

$ perf stat -e cycles,instructions,fp_arith_inst_retired.512b_packed_single \
        python3 bench_simd.py
   86,213,442,118  cycles
   17,891,322,401  instructions
    8,388,608,000  fp_arith_inst_retired.512b_packed_single
                   = 16 lanes × 524 M instructions = 8.4G fp ops

Walking the meaningful lines:

a = np.random.random(N).astype(np.float32) — float32 not float64 is deliberate. AVX-512 does 16 single-precision lanes vs 8 double-precision; halving the precision doubles your throughput and doubles the L2/L3 cache fit, so the bandwidth-bound regime is more forgiving.
np.dot(a, b) — under the hood this calls cblas_sdot from OpenBLAS (or MKL on Intel installs). Both are hand-tuned with AVX-512 intrinsics for Intel and NEON/SVE for ARM. The Python for loop's 1.3 million× slowdown over np.dot is a near-pure measurement of bytecode-dispatch overhead — the actual arithmetic is the same in both runs.
fp_arith_inst_retired.512b_packed_single — this perf event counts retired AVX-512 single-precision instructions. The 8.4G figure equals 16 (lanes per inst) × 524 M instructions. Multiply by 2 (FMA does mul+add per inst) and you get ~16.8 GFLOPS theoretical; measured 7.3 GFLOPS reflects the 256 MB working set being DRAM-bandwidth-bound at ~30 GB/s effective. Why measured GFLOPS is half the theoretical: the working set is 256 MB, far past LLC; every element must be fetched from DRAM. At 30 GB/s effective bandwidth and 2 × 4 bytes per multiply-add, the DRAM ceiling is 30e9 / 8 ≈ 3.75 G FMAs/s = 7.5 GFLOPS. The CPU's vector pipe is sitting idle ~half the time waiting for cache lines. Reducing N to fit in L3 (~30 MB) jumps measured GFLOPS to ~60 — at that point the back-end's pipes are the bottleneck, not memory.
numpy_einsum matches numpy_dot — both end in the same BLAS call. einsum is a generalisation that, for the simple inner-product case, dispatches identically. Where einsum diverges is multi-axis contractions; for those it can sometimes pick a worse path than np.dot.

The interesting question is not "why is numpy faster than Python", which is obvious, but "what is numpy doing that cannot be matched by hand-written C without intrinsics".

The answer is that gcc -O2 of a naive C loop will auto-vectorise into SSE2 (128-bit), giving a 4× speedup; gcc -O3 -march=native might auto-vectorise into AVX2 or AVX-512 but is conservative about alignment and aliasing.

To reliably hit the AVX-512 ceiling, you write either intrinsics (_mm512_loadu_ps, _mm512_fmadd_ps) by hand, or you use a library like OpenBLAS/MKL/Intel MKL whose sdot was hand-tuned for exactly this case. numpy gets the speedup for free because np.dot is a pre-built call into that hand-tuned library — and that library was, in turn, written by people who spent careers learning the cycle counts of every microarchitecture.

The SIMD speedup ladder. Each generation roughly doubles the achievable single-core throughput, but only on workloads that stay in cache and avoid the four obstacles below. Illustrative — not measured data.

When SIMD does not help: the four obstacles

The vector pipe's ceiling is high, but four things keep most code from getting near it. These are the patterns to recognise first when you read someone else's flamegraph and the hot function is not vectorised.

Data-dependent branches. A vector instruction operates on all lanes simultaneously; if your loop body has an if whose direction depends on the lane data, the compiler cannot vectorise without lane masking.

Pre-AVX-512, the only option was branchless rewrites with cmov or bitwise blending (andps / orps). AVX-512's mask registers (k1 through k7) and vmaskmovps make this trivial: vaddps zmm0 {k1}, zmm1, zmm2 performs the add only in masked lanes. But your compiler must know to use them, and many third-party libraries built without -mavx512f simply skip vectorisation when they see a per-lane conditional.

Stride and alignment. vmovaps (aligned move) requires the source to be 64-byte-aligned; vmovups (unaligned) does not. Most modern hardware makes them equivalent in throughput when the address actually is aligned, but a misaligned access that crosses a cache-line boundary costs an extra cycle.

More damaging: a non-unit stride (e.g. accessing every 4th float) forces the compiler to issue gather instructions (vgatherdps), which have ~4-8× lower throughput than dense loads. Column-major access on row-major arrays, transposed matrices accessed naively, and structure-of-arrays-vs-array-of-structures all live here.

Reductions across lanes. Computing sum(a) requires combining all lanes of a vector register into a single scalar, which the hardware does via a tree of vhaddps (horizontal add) instructions. Each vhaddps is ~3 cycles latency on Ice Lake.

For a 16-lane AVX-512 register, the reduction takes 4 horizontal adds = ~12 cycles, dominating the inner loop on small arrays. The pattern is: vector pipes excel at map (lane-parallel transforms), are mediocre at reduce (cross-lane aggregation), and require care for scan (prefix-sum style ops).

Loop-carried dependencies. If iteration i+1 of the loop depends on the result of iteration i, you cannot parallelise across lanes — there is no parallelism to exploit. The compiler will refuse to vectorise.

Common offender: s += a[i] * b[i] where naive code chains every multiply through s. The fix is partial sums — accumulate into a vector of running totals, one per lane, then reduce at the end. Compilers do this automatically for + and * (associative-mathematics-ignoring fast-math mode), but not for floating-point unless you opt in to -ffast-math or -fassociative-math because strict IEEE-754 makes summation order-dependent in the last bit.

A real PaisaBridge UPI fraud-scoring incident in 2023 hit obstacle #1 directly. The hot scoring loop had an if (txn.amount > whitelist[merchant_id]) { score += 0.5; } per transaction. The compiler refused to vectorise because of the merchant-specific lookup.

The team rewrote it as score += 0.5 * (txn.amount > whitelist[merchant_id]) — converting the branch into a boolean multiplied into a constant — and gcc -O3 -mavx2 then issued a vectorised vcmpgtps + vandps + vaddps chain. Throughput on the scoring loop went from 1.2 M txns/s to 9.8 M txns/s on a single core, an 8× win that meant they could shrink the scoring fleet from 24 hosts to 4 — saving roughly ₹6 lakh per month in EC2 spend.

The diagnostic loop that found this was: perf stat --topdown showed Retiring < 30% (vector pipe under-utilised), objdump -d on the hot function showed mulss (scalar single, the SSE 128-bit-but-1-lane variant) instead of vmulps, and the Compiler Explorer (godbolt.org) showed the same C source compiled into vectorised assembly only after the branchless rewrite. Three steps, one afternoon, four hosts retired.

Going further: gather, scatter, and the SIMD trap

The instructions that broaden SIMD's reach are the ones with sharp edges. vgatherdps loads 8 (or 16 with AVX-512) float32 values from arbitrary addresses indexed by a vector register; vscatterdps writes them. These let you vectorise hash-table probes, sparse linear-algebra ops, and pointer-chasing structures that were previously scalar-only.

The trap: gather/scatter on most hardware is much slower than the naïve count of "8 loads in one instruction" suggests. Skylake-X's vgatherdps on cache-resident data takes ~16 cycles (basically 2 cycles per lane, only marginally better than 8 scalar loads); on cold data it serialises. AMD Zen 2/3 does even worse. Modern Intel (Sapphire Rapids, 2023+) and Zen 4+ have improved gather throughput to ~5 cycles, finally making the instruction worth using broadly.

The lesson: a vector instruction is not automatically vector-throughput. The instruction set is the contract; the microarchitecture decides whether the contract is honoured at full speed. Always measure. perf stat -e fp_arith_inst_retired.256b_packed_single,fp_arith_inst_retired.512b_packed_single,uops_dispatched_port.port_5 (port 5 is the vector ALU on most Intel cores) tells you whether the vector unit is fed and busy or stalled.

The same warning applies to vpermd (cross-lane permutes), vpgather (integer gather), and the masked-store family. The specs say "one instruction"; the cycle counts say otherwise. The Agner Fog tables — which list latency and throughput per instruction per microarchitecture — are the canonical reference. When a hot kernel surprises you, look it up there before assuming the code generator did something wrong.

Common confusions

"SIMD and multi-threading are the same kind of parallelism." They are different axes. SIMD parallelises within one instruction on one core (lane-level data parallelism). Multi-threading parallelises across cores (instruction-stream parallelism). A 16-core Xeon with AVX-512 can do 16 cores × 16 lanes × 2 vector pipes × 3.5 GHz ≈ 1.8 TFLOPS if you exploit both axes. Most production code exploits one and leaves the other on the table.
"AVX-512 is always faster than AVX2." Not on Skylake-X family — the down-clock penalty often makes AVX-512 slower in mixed workloads. On Ice Lake-SP, Sapphire Rapids, and Zen 4, AVX-512 reliably wins for sustained-vector code. The rule of thumb: if your hot loop is >5% of total CPU time and the loop is purely vector-arithmetic, AVX-512 wins. If it's 0.5% of CPU time, you probably don't want the down-clock.
"My code uses numpy so it is automatically vectorised." Only the BLAS-backed operations are (dot, matmul, add, multiply, sum, mean, etc.). Element-wise Python lambda passed to np.vectorize is not vectorised in the SIMD sense — np.vectorize is a Python loop with C-level dispatch, no SIMD. Same for apply_along_axis. Profile before assuming.
"Auto-vectorisation by the compiler is reliable." It is reliable for simple loops with no aliasing, no branches, and unit-stride access. It is fragile for everything else. __restrict__ (C) or noalias (Rust) hints help; so do #pragma omp simd, #pragma ivdep, and explicit alignment annotations. The Compiler Explorer (godbolt.org) is the truth-teller: paste your code, watch the assembly, look for vaddps/vfmaddps/vmovaps in the inner loop. If you see addss/mulss, it did not vectorise.
"Vectorising my code will make it 8× faster." The ceiling on a 256-bit AVX2 register operating on float32 is 8×, but you rarely hit it. Bandwidth-bound loops cap at the DRAM throughput limit (typically ~30-50% of theoretical SIMD speedup for sequential streams over working sets > LLC). Latency-bound loops with reductions cap at ~4-5×. Realistic gains from a careful vectorisation are 3-6× on numeric code, 1.5-2× on string/parsing code with branches, and >10× only on the most arithmetic-dense kernels (matmul, FFT, convolution).
"SIMD is only useful for floating-point." Integer SIMD is huge. JSON parsing (simdjson), string search (Hyperscan, ripgrep), regex matching, base64 encoding, CRC32, image compression — all of these are integer-SIMD-bound at scale. simdjson parses JSON at ~3 GB/s using AVX2, vs ~250 MB/s for the fastest scalar parser; that 12× ratio is the difference between a single-core JSON ingest at AutoFly's analytics pipeline keeping up with the firehose and not.

Going deeper

How auto-vectorisation actually decides

A modern compiler (LLVM, GCC) tries to vectorise innermost loops via the LoopVectorize pass. The pass's questions in order:

Is the loop count statically known or computable as a single induction variable?
Are all memory accesses unit-stride or affine in the induction?
Is there any aliasing between read and written arrays?
Are all operations associative-vectorisable (most arithmetic is; floating-point sum is only with -ffast-math)?
Does the cost model say the vectorised version is faster?

Failing any of these aborts vectorisation. The compiler's diagnostic flags reveal why: gcc -fopt-info-vec-missed prints "could not vectorise: function call in loop body" or "could not vectorise: data ref analysis failed". Read those messages — they are the most specific feedback the compiler will give you about where its model broke down.

ARM SVE and the future of variable-width SIMD

ARM's Scalable Vector Extension (SVE), shipping on Graviton 3 (c7g) and Apple silicon's NEON-AME mode, takes a different approach. Instead of fixed register widths (128 / 256 / 512), SVE registers are implementation-defined width — anywhere from 128 to 2048 bits. The same binary runs on a 128-bit SVE core and a 512-bit SVE core, with the lane count read at runtime from the cntw instruction. The model removes the AVX-256-vs-AVX-512 fork that has fragmented x86 codebases for a decade. AWS Graviton 3 ships with 256-bit SVE and is roughly 20-30% cheaper per instance for numeric workloads than equivalent Intel c6i. Indian fintech firms running on AWS — PaisaBridge, DigiPaisa, Acko — have migrated significant numeric workloads to Graviton over 2023-2024 specifically for the SVE economics. Why "scalable" matters more than "wider": the practical pain of AVX-512 on x86 is that you ship two binaries (one with -mavx512f, one without) or you cpuid-dispatch at runtime. SVE removes both: one binary, the kernel-loop adapts to whatever width the hardware exposes. The pedagogical implication: write loops in terms of "process the next K elements where K is the vector length" — the same idiom that makes SVE work also keeps your code portable when AVX-1024 eventually appears.

SetuStream's IPL ad-decisioning loop and SIMD shape

SetuStream's IPL ad-server picks one of ~12,000 ad creatives per request, scoring each via a feature dot product against the user's embedding (200 floats per creative, 200 floats per user). At 25 M concurrent viewers in the IPL final, the request rate is ~50 K decisions/second/datacentre. Each decision needs 12,000 × 200 = 2.4 M FMAs.

The naive Python implementation took 18 ms per decision and required ~900 cores at peak. The team migrated the dot-product loop to numpy with 32-bit floats and ensured the creative-embedding matrix was contiguous and 64-byte-aligned (np.ascontiguousarray(emb).astype(np.float32)).

On c7i.8xlarge with AVX-512, the same decision dropped to 0.8 ms — a 22× speedup. The fleet shrank to ~110 cores, saving roughly ₹85 lakh in compute over the 60-match IPL season.

The lesson: in a Python service, the highest-leverage performance fix is often "make the hot kernel one numpy/BLAS call" before reaching for Cython, ctypes, or rewriting in Go. The path through BLAS is well-trodden, packaged, and almost always within 2× of hand-written intrinsics — and the path through hand-written intrinsics costs you a senior engineer-month to get right and forever to maintain.

Mask registers and the branchless filter pattern

AVX-512's most important addition for general-purpose code is the mask register family (k0-k7). They let one vector instruction operate on a subset of lanes selected by a bitmask. The classic use: filtering an array where a[i] > threshold.

Without masks, you either branch (kills vectorisation) or use the andps / orps blend pattern (works but generates extra instructions). With masks the sequence is direct: vcmpgtps k1, zmm0, zmm1 (set k1 where a > threshold), then vmovaps zmm2 {k1}{z}, [rsi] (load only those lanes, zeroing the rest), then vcompressps [rdi], zmm2 {k1} (write only the masked lanes contiguously).

The compress instruction is the magic — in one instruction it gives you the equivalent of a vectorised filter operation, packing matched elements into a contiguous output. Modern simdjson, parquet readers, and column-store query engines lean heavily on mask + compress patterns; they are the reason AVX-512-capable hosts can outperform AVX2 hosts by 2-3× on parsing-heavy workloads even when raw arithmetic throughput is the same.

FMA and why fused multiply-add changed the GFLOPS landscape

Fused multiply-add (vfmadd231ps and friends, introduced in AVX2 / FMA3 in 2013) computes a × b + c as one instruction with one rounding step instead of two. The semantic detail — one rounding instead of two — is what made the operation acceptable to numerical-analysis purists; it is more accurate than separate multiply-then-add, not less. The throughput win is structural.

A modern x86 core can retire two FMAs per cycle, each FMA performing 16 single-precision operations (8 multiplies + 8 adds packed into one 256-bit instruction). That is 32 floating-point operations per cycle per core, or 96 GFLOPS at 3.0 GHz on one core for AVX2-FMA, 192 GFLOPS if AVX-512-FMA is available and sustainable. The peak GFLOPS numbers Intel and AMD quote on data sheets are computed assuming every cycle retires two FMAs; real code reaches 30-60% of that ceiling on well-tuned BLAS kernels, 5-15% on typical numpy code, and under 1% on naive scalar loops.

The pedagogical takeaway: when you see "this server has 1.5 TFLOPS per socket" in a cloud spec sheet, the implicit assumption is that your code is using FMA-style instructions on full-width vectors. If your hot loop is a Python for running c = a * b + d, you are seeing 0.0001% of that ceiling — six orders of magnitude on the table.

The portability problem: building binaries for hosts that may or may not have AVX-512

A binary compiled with -mavx512f will crash with SIGILL (illegal instruction) on a host without AVX-512. Production deployments handle this in three patterns. Static dispatch ships separate binaries per host class (the kernel approach: linux-image-generic vs linux-image-generic-hwe); simple but doubles your storage and complicates deploys.

Function multi-versioning (gcc's __attribute__((target_clones("default","avx2","avx512f")))) compiles three copies of the function and picks one at load time via a CPUID check; one binary, modest binary-size cost, automatic.

Dynamic dispatch via library (numpy's strategy) — the library detects CPU features at import time and dispatches every operation to the appropriate kernel. This is why a numpy wheel installed via pip works identically on a Mac M1 (NEON), a Skylake laptop (AVX2), and a Sapphire Rapids server (AVX-512); the wheel ships kernels for all three and picks at runtime. The cost is that your pip install numpy is downloading megabytes of kernels you will never run; the benefit is that the application above never has to think about it.

For your own SIMD code, function multi-versioning is usually the right answer for first-party services; library dispatch is the right answer if you are publishing reusable code. Skipping the question and shipping -march=native from the build host is the classic mistake — it produces a binary that runs fast on the build server and crashes on a different generation of EC2 instance.

Reproduce this on your laptop

# Linux
sudo apt install linux-tools-common linux-tools-generic
python3 -m venv .venv && source .venv/bin/activate
pip install numpy
python3 bench_simd.py
# To see whether your CPU has AVX-512:
grep -o 'avx512[a-z]*' /proc/cpuinfo | sort -u
# To see what numpy was built against (look for "blas_info"):
python3 -c "import numpy as np; np.show_config()"
# To see the actual instructions in the inner numpy loop:
sudo perf record -e fp_arith_inst_retired.256b_packed_single \
    -- python3 bench_simd.py
sudo perf report --stdio | head -30

If your CPU lacks AVX-512 (most consumer Intel chips post-2021), you will see only 256b_packed_single events; numpy still hits ~3-4 GFLOPS through AVX2. On Apple silicon, the equivalent counters live in the xnu perf framework — sudo powermetrics --samplers cpu_power -i 100 shows per-cluster power, which surrogate-tracks vector unit activity.

Where this leads next

SIMD is the third axis of CPU parallelism, alongside pipelining (Chapter 1) and out-of-order execution (Chapter 2). Once you have a feel for all three, the rest of Part 1 is about how the compiler and the front-end deliver work to the back-end — and where the back-end stalls.

Pipelines: fetch, decode, issue, execute, retire — chapter 1, the stage diagram SIMD instructions flow through.
Out-of-order execution and reorder buffers — chapter 2, the structure that issues vector µops alongside scalar ones.
Speculative execution — the blessing and the curse — chapter 4, the orthogonal way the CPU widens execution.
Front-end vs back-end bound: reading top-down — the diagnostic that tells you whether your vector pipe is starved or working.
Cache lines and how the L1 talks to the back-end — Part 2, the bandwidth source that decides whether your vectorised loop is compute-bound or memory-bound.

Part 2 picks up directly where SIMD leaves off: a vector loop reads one cache line per AVX-512 instruction, so the cache-bandwidth ceiling becomes the SIMD-throughput ceiling almost immediately. Most "why isn't my vectorised code 16× faster" questions resolve into "because your working set is bigger than L2 and the prefetcher cannot keep up". That is the conversation Part 2 has in detail.

A practical takeaway for the on-call engineer: when a hot Python or Java function should be SIMD-amenable and isn't, the diagnostic ladder is:

Check perf stat --topdown, look for Retiring < 50%.
objdump -d on the function or its compiled-binary equivalent, look for addss/mulss (scalar) vs vaddps/vfmaddps (vector).
Drop the inner kernel into Compiler Explorer and toggle -O3 -march=native -ffast-math, watch the assembly change.

Most missed vectorisations are fixed by a 5-line rewrite that removes a branch, a stride, or an aliasing hint. The 22× SetuStream win was one such rewrite; so are most fintech and ad-tech vector wins of the last decade.

References

Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 1, Chapter 14 — the official reference for SSE / AVX / AVX-512 semantics.
Agner Fog, "Optimizing software in C++" — chapter 12 covers vectorisation and intrinsics with measured timings per microarchitecture.
Intel Intrinsics Guide — the searchable reference for every _mm_, _mm256_, _mm512_ intrinsic and its latency/throughput.
Langdale & Lemire, "Parsing Gigabytes of JSON per Second" (VLDB 2019) — the simdjson paper; canonical example of SIMD on integer workloads.
ARM Scalable Vector Extension (SVE) overview — the variable-width SIMD model, foundational for Graviton 3 and beyond.
Travis Downs, "AVX-512 downclocking" (2020) — the definitive empirical study of when AVX-512 helps vs hurts on Intel client/server silicon.
Daniel Lemire's blog on SIMD-accelerated string/parsing kernels — running case studies of SIMD applied to non-numeric problems.
Out-of-order execution and reorder buffers — chapter 2 of this curriculum, the substrate vector µops execute on.