The 30-year arc of systems performance

In 1995, the Pentium Pro shipped at 200 MHz, and the bottleneck of every interesting program was the CPU. Aditi's father, then a fresh ECE graduate at C-DOT in Bengaluru, ran an FFT benchmark overnight on a Pentium and watched the percent-CPU pinned at 100. The fix his team reached for was the obvious one: a faster CPU next year. Thirty years later Aditi is an SRE at Razorpay, her c6i.4xlarge runs at 3.5 GHz on each of 16 vCPUs, the percent-CPU on her flamegraph reads 38, the p99 of the payment API is 240 ms against an SLO of 200, and the bottleneck is none of the things her father had vocabulary for. It is queueing past the knee, an LLC miss every 17 instructions, a noisy neighbour on the same NUMA node, and a backup-request fan-out that doubled the load on the very replica it was meant to bypass. The frequency-scaled future her father expected stalled around 2005. Everything since has been the arc of finding bottlenecks one layer deeper than the last one — a slow stack-walk inwards from CPU to cache to memory bus to NUMA to scheduler to allocator to runtime to network to queue to architecture itself. This chapter is the closing synthesis: the 30-year shape of that walk, why each layer became visible only after the layer above was solved, and what the next layer down looks like from where we stand in 2025.

Systems-performance has moved from "make the CPU faster" (1995) to "extract IPC from the existing CPU" (2005) to "stop missing the cache" (2010) to "stop crossing NUMA nodes" (2015) to "bound the tail" (2018) to "contain the blast radius" (2022). Each generation's bottleneck became the next generation's solved problem, and the new bottleneck was always one layer deeper than the tools of the previous era could see. Every chapter of this curriculum lives somewhere on that arc, and the next decade's bottleneck — already visible in 2025 — is the coordination cost across cells, accelerators, and shared memory pools that pretend to be local but are not.

The five eras and what each one made visible

The arc divides cleanly into five eras, each defined by what the dominant performance limit was, what tool made the new bottleneck visible, and what mental model engineers had to learn. The fact that the eras divide cleanly is suspicious — history never really divides cleanly — but the reason it works here is that each era's transition was forced by a hardware or workload change that was not itself smooth. Frequency scaling stalled in 2005 because of a thermal wall, not because engineers got bored. Multi-socket NUMA became dominant because single-core gains flattened in the early 2010s and Intel and AMD scaled out instead, shipping 32-core, 64-core, and eventually 128-core packages with non-uniform memory. Each era ended because the physics changed, not because the engineers did.

The five eras, in chronological order:

Era | Years | Dominant bottleneck | Tool that made it visible | Mental model
1. Frequency | 1995–2005 | CPU clock speed | time, top | "faster chip = faster program"
2. IPC / pipeline | 2005–2010 | Branch mispredicts, pipeline stalls | perf stat, VTune | "an instruction is not a clock tick"
3. Memory hierarchy | 2010–2015 | L1/L2/L3/DRAM latency, TLB | perf record + flamegraphs | "the cache is the working set"
4. NUMA & multi-socket | 2015–2020 | Remote memory, coherence traffic | numactl, numastat, eBPF | "memory is plural, not singular"
5. Tail & blast-radius | 2020–present | p99.9 spikes, noisy neighbours, blast radius | HdrHistogram, wrk2, cellular dashboards | "the mean is a lie; live in the tail; bound what one fault can take down"

A 22-year-old engineer at Zerodha in 2025 inherits all five eras at once. The Kite order-matching engine runs on 64-core EPYC nodes (era 4), with perf record flamegraphs in production (era 3), HdrHistogram-based SLO tracking (era 5), per-symbol cellular partitions (era 5), and underneath it all the same C++ inner loop that has to hit IPC > 2 (era 2) on an L1-resident hot path (era 3). The mental models do not replace each other; they stack. A senior performance engineer in 2025 reasons across all five layers in the same incident debrief, because a single p99 spike can bottom out at any of them.

[Figure: The five eras of systems-performance, 1995–2025 — a horizontal timeline showing the five eras (frequency, IPC/pipeline, memory hierarchy, NUMA/multi-socket, tail/blast-radius) with the dominant bottleneck and the tool that revealed it under each, and a vertical arrow showing the bottleneck moving one layer deeper per era, from CPU down through cache, DRAM, and socket to architecture. A 2025 incident can bottom out at any layer; a senior engineer reasons across all five at once.]
Illustrative — the era boundaries are sharper in retrospect than they felt to engineers living through them. The 2005 frequency wall and the 2013 multi-socket inflection were the two most abrupt transitions; the others felt gradual until tooling caught up.

A second framing that pays its way: each era's tools were invented in response to the previous era's bottleneck becoming inadequate. perf stat (era 2's signature tool) did not exist when era 1 was happening — it shipped in the Linux kernel in 2009, exactly when the IPC era was demanding visibility. Flamegraphs (era 3) were Brendan Gregg's 2011 invention, exactly when memory-hierarchy effects became dominant. eBPF (era 4) entered the mainline kernel in 3.18 (2014) and was production-ready for tracing by around 4.9 (2016), exactly when NUMA effects became the dominant hidden cost. HdrHistogram (era 5) was Gil Tene's 2013 work but did not become an industry default until around 2019 — exactly when p99.9 SLOs became the standard contract. The tool always lags the bottleneck by 2–3 years, and engineers who notice the lag get to work on the problem before the rest of the industry has the vocabulary for it. That lag is why engineering teams who are willing to read primary sources — the kernel perf source, the eBPF verifier code, AWS's cellular architecture talks before they become widely cited — get a 2–3 year head start on every era.

A measurement that walks down the stack — Aditi's perf ladder

The cleanest demonstration that all five eras coexist in a single 2025 incident is to walk down the perf and eBPF stack on a real-looking workload. The Python driver below (a) generates a representative load on a synthetic in-memory KV service, (b) runs perf stat with one set of counters per era, and (c) prints the era-by-era diagnosis. The point of the script is not the synthetic workload — it is the structure of the diagnostic ladder. A senior engineer reaches for these counters, in this order, on every p99 incident in 2025. The ladder is the shape of how era-5 incidents get root-caused into one of the lower eras.

# era_ladder.py
# Walk the systems-performance diagnostic ladder for a synthetic
# KV-lookup workload. Each era's signature counter is captured by
# perf stat, parsed in Python, and interpreted in the printout.
#
# Run: python3 -m venv .venv && source .venv/bin/activate
#      pip install hdrhistogram     # PyPI package name; imports as "hdrh"
#      sudo apt install linux-tools-common linux-tools-generic
#      python3 era_ladder.py 30000000

import subprocess, sys, time, random
from hdrh.histogram import HdrHistogram

N_OPS = int(sys.argv[1]) if len(sys.argv) > 1 else 30_000_000
KV_SIZE = 4 * 1024 * 1024     # 4M int entries; ~64MB working set forces L3+
KEY_RANGE = KV_SIZE * 4        # folded back onto KV_SIZE below -> uniform keys

def workload():
    """Synthetic KV lookup loop — exercises pipeline, cache, NUMA, tail."""
    kv = [i * 31 + 7 for i in range(KV_SIZE)]   # int values
    h  = HdrHistogram(1, 1_000_000_000, 3)
    rng = random.Random(42)
    for _ in range(N_OPS):
        t0 = time.perf_counter_ns()
        k  = rng.randrange(KEY_RANGE) % KV_SIZE
        _  = kv[k] ^ kv[(k * 1664525 + 1013904223) % KV_SIZE]
        h.record_value(time.perf_counter_ns() - t0)
    return h

def perf_stat(events, cmd):
    """Invoke perf stat -e <events> -- <cmd>, return parsed counters."""
    full = ["perf", "stat", "-x,", "-e", events, "--"] + cmd
    p = subprocess.run(full, capture_output=True, text=True)
    counters = {}
    for line in p.stderr.splitlines():
        parts = line.split(",")
        # -x, CSV columns: value,unit,event-name,... ; skip "<not counted>"
        if len(parts) >= 3 and parts[2] and not parts[0].startswith("<"):
            counters[parts[2].strip()] = parts[0].strip()
    return counters

def diagnose(c):
    """Era-by-era interpretation of perf counters."""
    print("\n--- Diagnostic ladder (Aditi's 2025 incident playbook) ---")
    cyc, ins = float(c.get("cycles", 0)), float(c.get("instructions", 0))
    ipc = ins / cyc if cyc else 0
    bm  = float(c.get("branch-misses", 0)) / max(float(c.get("branches", 1)), 1)
    l1m = float(c.get("L1-dcache-load-misses", 0)) / max(float(c.get("L1-dcache-loads", 1)), 1)
    llm = float(c.get("LLC-load-misses", 0)) / max(float(c.get("LLC-loads", 1)), 1)
    print(f"  Era 1 (freq):    cycles={cyc:.2e}  ins={ins:.2e}")
    print(f"  Era 2 (IPC):     IPC={ipc:.2f}      branch-miss-rate={bm*100:.2f}%")
    print(f"  Era 3 (mem):     L1d miss={l1m*100:.2f}%  LLC miss={llm*100:.2f}%")
    if ipc < 1.0 and llm > 0.20:
        print("  -> verdict: era-3 memory-bound; LLC-resident set exceeded")
    elif ipc < 1.5 and bm > 0.05:
        print("  -> verdict: era-2 frontend-bound; branch predictor blown")
    else:
        print("  -> verdict: era-1 throughput-OK; look at era-4/5 next")

if __name__ == "__main__":
    events = ("cycles,instructions,branches,branch-misses,"
              "L1-dcache-loads,L1-dcache-load-misses,"
              "LLC-loads,LLC-load-misses")
    print(f"Running {N_OPS:,} KV lookups under perf stat...")
    counters = perf_stat(events, [sys.executable, "-c",
                                  "import era_ladder; era_ladder.workload()",
                                  str(N_OPS)])   # child re-reads N_OPS from argv[1]
    diagnose(counters)

Sample run on a c6i.4xlarge (16 vCPUs, Ice Lake, ~32 MB of LLC visible to the guest, Linux 6.5):

$ python3 era_ladder.py 30000000
Running 30,000,000 KV lookups under perf stat...

--- Diagnostic ladder (Aditi's 2025 incident playbook) ---
  Era 1 (freq):    cycles=1.18e+11  ins=4.21e+10
  Era 2 (IPC):     IPC=0.36      branch-miss-rate=0.42%
  Era 3 (mem):     L1d miss=18.40%  LLC miss=24.70%
  -> verdict: era-3 memory-bound; LLC-resident set exceeded

The lines that carry the lesson:

  • Era 1 (freq): cycles=1.18e+11 — the era-1 view sees nothing wrong. Cycles are being spent, instructions are being retired, the chip is busy. An engineer in 1998 would close the ticket. Why this single line is the most important pedagogical moment in the entire curriculum: era-1 metrics (cycles, instructions, percent-CPU) have not been a useful diagnostic on their own since around 2008. A "100% CPU" reading in 2025 is the starting signal, not the diagnosis. Every senior engineer learns this; every junior engineer takes 18 months to learn it; the curriculum's 120 chapters exist largely to compress that 18 months.
  • IPC=0.36 branch-miss-rate=0.42% — the era-2 view is more informative. IPC = 0.36 means each cycle retires 0.36 of an instruction; the chip is not busy in any useful sense, it is stalled. But the branch-miss rate is fine, so the frontend is not the problem.
  • L1d miss=18.40% LLC miss=24.70% — the era-3 view is the diagnosis. Why these two numbers are the smoking gun: an 18% L1d miss rate combined with a 25% LLC miss rate means roughly 25% × 18% ≈ 4.5% of all loads go all the way to DRAM. At typical server-core latencies (4 cycles for L1, 12 for L2, ~40 for L3, ~250 for DRAM), and splitting the uncounted middle of the misses between L2 and L3, the average load latency is roughly 0.816 × 4 + 0.090 × 12 + 0.049 × 40 + 0.045 × 250 ≈ 17.6 cycles — more than four times the cost of an L1 hit, which alone explains the IPC of 0.36 versus the ideal 4.0 on a 4-wide superscalar. The CPU is not slow; the memory subsystem is making it wait. The KV_SIZE = 4 * 1024 * 1024 setting (~64 MB working set) forced this — it exceeds the 32 MB LLC, so every cold key falls through.
  • -> verdict: era-3 memory-bound — the diagnostic-ladder script crossed two layers of bottleneck (era-1: not useful, era-2: rule out frontend) before bottoming out at era-3. The verdict tells Aditi to look at data-layout fixes (struct-of-arrays, prefetching, working-set reduction) — not at adding cores or frequency.
  • What the script does NOT show, but the era-4/5 ladder would: a numastat invocation showing the per-NUMA-node memory hit/miss counts (era 4), and a wrk2/hdrh invocation showing the p99/p99.9 distribution under load (era 5). The full 2025 ladder has nine rungs; this script shows the first three. The chapter on /wiki/case-cpu-saturation-without-user-load walks the era-2/3/4 rungs explicitly; the chapter on /wiki/coordinated-omission-and-hdr-histograms walks the era-5 rung.
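The blended-latency arithmetic behind the era-3 verdict can be checked in a few lines. This is a back-of-envelope model, not a measurement: the per-level latencies are rough server-core figures, and the L2/L3 split of the misses is an assumption, since the perf run above did not count L2 hits separately.

```python
# Back-of-envelope average load latency from per-level hit fractions.
# Fractions follow the sample run (18.4% L1d miss, ~4.5% of loads
# reaching DRAM); the L2/L3 split of the remainder is assumed.
LATENCY = {"L1": 4, "L2": 12, "L3": 40, "DRAM": 250}   # cycles, rough

def avg_load_latency(frac):
    """frac: dict level -> share of all loads served at that level."""
    assert abs(sum(frac.values()) - 1.0) < 1e-6
    return sum(share * LATENCY[lvl] for lvl, share in frac.items())

frac = {"L1": 0.816, "L2": 0.090, "L3": 0.049, "DRAM": 0.045}
print(f"average load latency ~ {avg_load_latency(frac):.1f} cycles")  # ~17.6
```

Swapping in counter values from your own perf stat run is the fastest way to sanity-check whether a low IPC is plausibly memory-explained before reaching for deeper tools.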

The sixty-odd lines of Python that produced the verdict are doing the same work that, in 2005, would have taken a week of profiler runs in VTune followed by a manual cache-line analysis from Agner Fog's tables. The era-3 diagnosis happens in 90 seconds because the tools — perf stat, HdrHistogram, the diagnostic ladder pattern — were invented and standardised by the engineers of eras 3, 4, and 5 specifically so that future engineers would not have to invent them again.

A subtle property of the diagnostic ladder is that the cost of misdiagnosis grows with depth. An era-1 misdiagnosis ("we need a faster CPU") costs the team a hardware procurement cycle and is reversible in weeks. An era-3 misdiagnosis ("we need a bigger LLC") costs an instance-class migration and is reversible in days. An era-4 misdiagnosis ("we need to pin to NUMA node 0") encodes assumptions in production code that are reversible only with a rewrite. An era-5 misdiagnosis ("we need cellular architecture") costs 1-2 engineer-years of investment that is essentially irreversible — once a service is built cellular, it stays cellular. The asymmetry of misdiagnosis cost is why the ladder runs top-down: cheap, fast diagnostics first, expensive structural diagnoses only after the cheap ones are exhausted. Teams that skip rungs — jumping straight to "we need to cellularise" without first checking whether the bottleneck is era-2 IPC — pay the irreversibility cost without the benefit.
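The top-down ordering can be written down directly as data plus checks. A sketch only — the threshold values and cost figures are the illustrative numbers from this chapter, not calibrated constants:

```python
# The diagnostic ladder as data: cheap rungs first, stop at the first
# rung whose check implicates that layer. Thresholds and costs are the
# chapter's illustrative figures, not calibrated constants.
def walk_ladder(m):
    """m: dict of already-collected counters -> (rung, cost-in-weeks)."""
    rungs = [
        ("era-1 frequency", 1,  lambda m: m["ipc"] >= 2.0),   # genuinely compute-bound
        ("era-2 IPC",       3,  lambda m: m["ipc"] < 1.5 and m["branch_miss"] > 0.05),
        ("era-3 memory",    8,  lambda m: m["ipc"] < 1.0 and m["llc_miss"] > 0.20),
        ("era-4 NUMA",      20, lambda m: m["remote_dram_ratio"] > 0.30),
        ("era-5 tail",      50, lambda m: True),              # last resort: architectural
    ]
    for name, cost, implicated in rungs:
        if implicated(m):
            return name, cost

counters = {"ipc": 0.36, "branch_miss": 0.0042, "llc_miss": 0.247,
            "remote_dram_ratio": 0.05}   # the sample run's numbers
print(walk_ladder(counters))             # -> ('era-3 memory', 8)
```

Feeding it the sample run's counters stops at the era-3 rung; the expensive rungs below are never evaluated when a cheap rung already explains the incident, which is the asymmetry the paragraph above describes.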

[Figure: The era diagnostic ladder — a vertical ladder with five rungs, from era 1 (top, time, cycles) through era 2 (perf stat: IPC, branch-miss), era 3 (flamegraph + LLC counters), era 4 (numastat, eBPF), down to era 5 (HdrHistogram + cellular dashboards). A bar beside each rung shows the misdiagnosis cost growing from 1 unit at era 1 (hardware procurement) to 50+ at era 5 (architectural rewrite).]
Illustrative — the cost units are engineer-weeks of remediation if the diagnosis is wrong. The asymmetry is why the ladder runs top-down: cheap diagnoses first, expensive ones only after the cheap ones rule out their layer.

What changed in each era — the structural transitions

The era boundaries were not gentle. Each one was forced by a hardware or workload change that broke the previous era's mental model, and the engineering culture spent 2–3 years catching up before the new tools and vocabulary stabilised.

The 2005 frequency wall. Intel's 90nm Pentium 4 "Prescott" was the cliff. Clock speeds had been doubling every 24 months since 1995, and engineers expected the same trajectory to continue. Instead, Prescott hit 3.8 GHz at 130 W TDP, ran into thermal limits that no consumer-grade cooling could clear, and Intel cancelled the 4 GHz part. The Core 2 architecture (2006) shipped at 2.4 GHz — slower than the Prescott it replaced — and made up the deficit with wider pipelines, better branch prediction, and shared L2 cache. The mental-model transition: clock speed is not the figure of merit; instructions-per-cycle is. Engineers who had spent a decade optimising for "tighter loops on a faster chip" had to relearn for "wider issue width on the same chip", and perf stat became the standard tool because the visible metric (clock speed) had stopped being the relevant one. The 2005 wall is also when the multi-core era began in earnest — if a single core could not get faster, you needed two — and parallel programming entered the mainstream backend curriculum.

The 2010 memory-hierarchy reckoning. With IPC understood, the next visible bottleneck was that IPC was limited by memory access patterns. Drepper's 2007 paper "What Every Programmer Should Know About Memory" had warned of this, but most working engineers ignored it until 2010-ish, when web-scale workloads (Google's Bigtable, Amazon's Dynamo, Facebook's TAO) started running into the same wall: their hot loops fit in cache but their working sets did not, and the gap between L3 latency (~40 cycles) and DRAM latency (~250 cycles) was the cliff their p99 fell off. Brendan Gregg's flamegraph (2011) made this visible as a UI element; suddenly engineers could see the fat __memmove_avx_unaligned bar in their profiles and recognise "memory bandwidth bound" as a category. The cultural shift was that data layout became part of API design — struct-of-arrays vs array-of-structs, padding to avoid false sharing, NUMA-aware allocation — and library authors started thinking about cache footprint as a first-class API property.
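The "data layout became part of API design" shift is ultimately cache-line arithmetic, and the struct-of-arrays decision can be sketched in a few lines. The record shape is hypothetical — 16 eight-byte fields of which a scan reads 4 — and the sizes are assumptions, not measurements:

```python
# Cache-footprint arithmetic behind the struct-of-arrays decision.
# Hypothetical record: 16 eight-byte fields, of which a scan reads 4.
RECORD_FIELDS, HOT_FIELDS, FIELD_BYTES = 16, 4, 8
CACHE_LINE = 64
N_RECORDS = 1 << 20                # ~1M records

def bytes_touched_aos():
    # Array-of-structs: each 128-byte record spans two cache lines, and
    # a scan drags every line through the hierarchy, cold fields included.
    record_bytes = RECORD_FIELDS * FIELD_BYTES
    lines = -(-record_bytes // CACHE_LINE)       # ceiling division
    return N_RECORDS * lines * CACHE_LINE

def bytes_touched_soa():
    # Struct-of-arrays: only the four hot columns stream through the cache.
    return N_RECORDS * HOT_FIELDS * FIELD_BYTES

print(bytes_touched_aos() // 2**20, "MiB vs", bytes_touched_soa() // 2**20, "MiB")
# -> 128 MiB vs 32 MiB: the same scan touches 4x less memory in SoA form
```

The 4× reduction in bytes touched is exactly the kind of win that moved data layout from an implementation detail to a first-class API property in this era.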

The 2015 NUMA inflection. Dual-socket NUMA x86 servers date back to the Opteron era of the mid-2000s, but they were rare in production; most workloads ran on single-socket boxes where memory was uniform. By 2015, c4.8xlarge (Intel Haswell, 36 vCPUs across 2 sockets) and equivalent AMD parts were the default for serious workloads at AWS, GCP, and Azure. The hardware was now non-uniform by default: a memory access from socket 0 to socket 1's DRAM cost roughly 1.5–2× a local access. Workloads that had been written assuming uniform memory — Java apps with one big heap allocated on whichever socket the JVM happened to start on, Postgres with shared_buffers placed by the OS without NUMA hints — saw 30–50% throughput drops as soon as their working set spilled across the socket boundary. The mental model shift: memory is plural. The toolchain (numactl, numastat, lstopo from hwloc) had existed since the 2000s but became operationally mandatory around 2015. eBPF, which entered the mainline kernel in 3.18 (2014) and matured into a production tracing tool over the following two years, made the previously-invisible cross-socket traffic measurable in production with no instrumentation cost, and that turned NUMA from a folklore problem into a debuggable one.
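The "memory is plural" model shows up directly in the kernel's per-node allocation counters. A sketch of how a ladder script interprets them — the sample text is hand-written stand-in data, not a real capture; the field names follow the kernel's /sys/devices/system/node/node*/numastat convention:

```python
# Interpreting per-node allocation counters of the kind numastat prints.
# numa_miss on a node counts pages placed there that were intended for a
# different node -- i.e. spill. SAMPLE is invented stand-in data.
SAMPLE = """\
node0 numa_hit 81254329 numa_miss 123994 numa_foreign 5210023
node1 numa_hit 12039411 numa_miss 7120394 numa_foreign 210455
"""

def remote_ratio(text):
    """Per node: fraction of its pages that arrived as spill from elsewhere."""
    out = {}
    for line in text.splitlines():
        node, *kv = line.split()
        d = dict(zip(kv[::2], (int(v) for v in kv[1::2])))
        out[node] = d["numa_miss"] / (d["numa_hit"] + d["numa_miss"])
    return out

for node, r in remote_ratio(SAMPLE).items():
    print(f"{node}: {r*100:.1f}% of its pages arrived as spill from another node")
```

In the invented sample, node1 is absorbing heavy spill — the signature of a process whose preferred node filled up, exactly the "one big heap on whichever socket the JVM started on" failure mode described above.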

The 2018–2020 tail-latency awakening. The tail-at-scale paper (Dean & Barroso, 2013) had laid the groundwork five years earlier, but the industry adopted its mental model slowly. By 2018, every serious backend at every serious company had p99 SLOs, and by 2020, p99.9 SLOs. Gil Tene's "How NOT to Measure Latency" talk (2013) had already shown that the tools most teams used (closed-loop wrk runs, raw averages, pre-averaged percentiles) were systematically lying about tail behaviour — coordinated omission silently dropped the worst measurements, making the tail look 5–10× better than it actually was. By 2020, HdrHistogram had become the default measurement primitive, wrk2 had replaced wrk, and "tail latency" had moved from a Google Research curiosity to a Razorpay-incident vocabulary. The shift in mental model: the mean is a marketing number; the tail is where you live; and any SLO that does not specify a percentile is meaningless. The AWS-published cellular architecture talks (2020-2021) closed the loop by showing that the same numbers also predicted blast-radius behaviour during partial failures.
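Coordinated omission is easy to reproduce in simulation. The sketch below models a closed-loop client against a service that stalls once for two seconds; the "naive" series is what a wrk-style tool records, while the "corrected" series back-fills the requests the intended constant-rate schedule would have sent during the stall. All numbers are synthetic:

```python
# Coordinated omission in miniature: a closed-loop client only records
# the requests it actually sent, so a single stall hides the latency of
# every request the intended schedule would have issued meanwhile.
def percentile(xs, p):
    xs = sorted(xs)
    return xs[min(len(xs) - 1, int(p * len(xs)))]

INTERVAL_MS = 1.0           # intended open-loop rate: one request per ms
service = [1] * 10_000      # every request takes 1 ms...
service[5_000] = 2_000      # ...except one 2-second stall

naive, corrected = [], []
clock, next_send = 0.0, 0.0
for s in service:
    clock = max(clock, next_send) + s       # closed loop: wait, then serve
    naive.append(s)                         # what a wrk-style tool records
    corrected.append(clock - next_send)     # latency vs intended send time
    next_send += INTERVAL_MS
    while next_send < clock:                # back-fill the omitted sends
        corrected.append(clock - next_send)
        next_send += INTERVAL_MS

print("naive p99:    ", percentile(naive, 0.99), "ms")
print("corrected p99:", percentile(corrected, 0.99), "ms")
```

The naive p99 stays at 1 ms while the corrected p99 lands near two seconds — the 5–10× understatement the text describes becomes, in this deliberately extreme case, three orders of magnitude. HdrHistogram's record_corrected_value does this back-filling for real measurements.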

The 2022 cellular and blast-radius era. The 2017 S3 outage (covered in /wiki/amazon-why-cells-not-clusters) was the structural prompt, but it took five years for the industry to absorb it. By 2022, Razorpay had cellularised UPI, Hotstar had cellularised IPL video catalogues, Flipkart had cellularised Big Billion Days inventory, and the cellular pattern had moved from "AWS's clever architecture" to "the default approach for any service whose worst-case incident must not affect 100% of customers". The mental-model shift: failures are not just slow; they are blast-radius events, and the engineering investment is in bounding the blast radius before the next incident hits. The era-5 toolchain (per-cell observability, shuffle-sharded routing, per-cell deployment systems) is still being built; the curriculum closes here because era 5 is the present, not the past.

A pattern that holds across all four transitions: the workload preceded the toolchain, the toolchain preceded the vocabulary, and the vocabulary preceded the curriculum. Workload pressure forced engineers to look harder at production data; the looking produced ad-hoc tools; the ad-hoc tools became standardised; the standardised tools developed shared vocabulary; and only then did the educational layer (textbooks, courses, the curriculum you are reading) catch up. The lag from "first production observation" to "standard educational treatment" is roughly 8-12 years per era. This curriculum, finished in 2025, treats eras 1-5 thoroughly. Era 6 will get its educational treatment around 2032-2035 in some future curriculum, written by engineers who are right now living through the era-6 bottlenecks without yet having the vocabulary to name them cleanly.

An empirical detail worth marking about era transitions: the mental-model shift always preceded the tool by 18-24 months. Engineers at Google and Sun knew about NUMA effects in 2010-2012 — their internal post-mortems referenced cross-socket latency as a root cause years before numastat became a default-installed package on Ubuntu. Engineers at Amazon knew about coordinated-omission in their internal load-test tools in 2015, three years before wrk2 became the public default. The lesson for an engineer reading this curriculum in 2025 is that the next era's vocabulary is being established right now, in private post-mortem documents at hyperscalers and large fintechs, and the public toolchain will catch up in 18-24 months. Reading the published architecture talks of AWS, Cloudflare, and the Indian platforms' engineering blogs — even when the content seems aspirational — is the cheapest way to get a 2-year head start on the era 6 vocabulary.

A useful Indian-platform observation: each era arrived at Indian companies roughly 2-3 years after it arrived at AWS / Google. Razorpay shipped its first NUMA-aware payment processor around 2018 (era 4 caught up). Flipkart adopted HdrHistogram-based SLOs around 2020 (era 5 partial). Cellular adoption began around 2022. The lag is not because Indian engineers are slow; it is because the workloads needed to reach the scale where the bottleneck became dominant. A payment processor doing 100 tx/sec does not have a NUMA problem. The same processor at 10,000 tx/sec does. Each era arrives at each platform at roughly the moment its workload crosses the threshold where the previous era's optimisations stop being enough. This is also why the curriculum's eras are not a monoculture timeline — they are a workload-scale timeline, and the year a given team enters era N depends on when their workload makes era N visible.

What the next era looks like — the 2025 inflection points

The closing question of any historical synthesis is "what is the next bottleneck?", and the honest answer is that it is already visible in 2025 if you look at the right Brendan Gregg talk or the right AWS architecture review. Three signals:

Coherence cost across cells. Cellular architecture (era 5) bounds the blast radius, but it does not eliminate the need for cross-cell coordination — every cellular service has a thin global layer (routing, control plane, identity) that must be eventually-consistent across cells. The cost of that coordination scales with cell count, and at 1000+ cells it is starting to dominate. The next-era toolchain — per-cell-pair latency budgets, async coordination protocols, blast-radius-aware caching — is being prototyped at AWS, Cloudflare, and the larger Indian fintechs but does not yet have a standardised vocabulary. Expect the era-6 inflection around 2027-2028, when the cellular toolchain matures into a "post-cellular" generation that takes coordination cost as the primary bottleneck.

Accelerator memory hierarchies. GPU and TPU clusters in 2025 have memory hierarchies that look like 1995 CPUs — small, fast on-chip memory, large slow off-chip memory, cache-coherence-by-convention rather than by hardware — and the performance engineering for them is roughly where CPU performance engineering was in 2005. The flamegraph for a GPU kernel is barely a thing yet; the LLM-inference equivalent of perf stat is being invented in real time. Indian context: Sarvam AI and Krutrim, scaling LLM inference for enterprise workloads in 2024–2025, are running into the same "memory hierarchy is the cost" wall that web-scale CPU workloads hit in 2010. The accelerator era will recapitulate eras 2, 3, and 4 over roughly 2025-2030, compressed by the existence of the CPU-era playbook but still requiring real engineering investment.

Shared-memory pools that pretend to be local. AWS Nitro, CXL-attached memory, and disaggregated storage are blurring the boundary between "local memory" and "remote memory" — your process sees a flat virtual address space, but underneath, some pages live on a CXL pool 100 ns away and some live in local DRAM 70 ns away. The era-3 toolchain (cache-aware data layout) and era-4 toolchain (NUMA-aware allocation) both assume the kernel knows where memory is. CXL violates that assumption by an order of magnitude, and the page-placement decisions become as performance-critical as cache-line placement was in 2010. The era-6 mental model — "the address space is a fiction; the memory bus is a network" — is starting to surface in 2025 talks at hyperscaler conferences. Expect it to become mainstream around 2028.

The pattern across all three signals is the same one that defined eras 2-5: the new bottleneck is one layer deeper than the tools of the current era can see. The engineer in 2025 who learns to read CXL traffic in flamegraphs, GPU kernel launch tail latencies in HdrHistogram form, and cross-cell coordination cost in numastat-style summaries is doing the same work that an engineer in 2008 did when they learned to read perf stat output. The arc is not over; it has just gotten longer.

A fourth signal worth flagging: the energy era. By 2025, AWS, Google, and Microsoft all report that power consumption is the binding constraint on data-centre expansion in many regions — power, not silicon, is the limiter. Indian data centres in Hyderabad, Mumbai, and Chennai are facing the same constraint as power-grid capacity tightens against AI-workload demand. The next-era performance metric will likely include joules per query or carbon per request as a first-class number alongside p99 latency, and the toolchain to measure it (RAPL counters on x86, pcm-power, per-rack PDU telemetry) is at roughly the same maturity that perf stat was in 2008. Why energy is the next era and not just the current era's last symptom: at hyperscale, the throughput-per-watt of a workload determines fleet sizing more than the throughput-per-core does. A workload that is 20% slower but consumes 50% less power is a net win once the data centre is power-bound, regardless of how its p99 looks. The economic incentive for energy-aware optimisation is now visible in AWS's pricing of Graviton instances (which are sold partly on their power efficiency) and in the Indian government's ramped-up procurement of efficient compute for ONDC-scale workloads. The era-6 mental model — "joules are the budget; latency is the constraint" — is the inversion of the era-2-through-5 stance, and it will reshape what "fast" means. The engineering posture required is to add a second axis to every benchmark: not just "how fast" but also "at what energy cost".
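The "joules are the budget" inversion can be made concrete with fleet-sizing arithmetic. All numbers below are invented to match the text's 20%-slower / 50%-less-power scenario:

```python
# Fleet sizing when the rack row is power-bound, not core-bound.
# Numbers are invented to match the 20%-slower / 50%-less-power scenario.
POWER_BUDGET_KW = 200.0     # hypothetical row-level draw limit

def fleet_qps(qps_per_node, watts_per_node):
    """Total throughput the power budget allows, ignoring space and cost."""
    nodes = POWER_BUDGET_KW * 1000 / watts_per_node
    return nodes * qps_per_node

fast   = fleet_qps(qps_per_node=10_000, watts_per_node=400)  # fast, hungry
frugal = fleet_qps(qps_per_node=8_000,  watts_per_node=200)  # slower, efficient
print(f"fast build:   {fast:,.0f} qps at the budget ({400/10_000*1000:.0f} mJ/query)")
print(f"frugal build: {frugal:,.0f} qps at the budget ({200/8_000*1000:.0f} mJ/query)")
```

At a fixed power budget the slower-but-efficient build serves 60% more fleet-wide throughput, which is the sense in which joules, not cores, size the fleet once the data centre is power-bound.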

When the arc breaks down — edge cases and counter-eras

The five-era frame is a useful model, not a complete one. It misses three classes of system that did not move along the same arc, and a senior engineer should know where the model fails.

Embedded and real-time systems. A cardiac pacemaker controller in 2025 has roughly the same performance discipline as one in 1995: a single ARM Cortex core, no cache hierarchy worth optimising, deterministic interrupt latency as the figure of merit. The era-1 mental model (cycle-counting, frequency-as-figure-of-merit) is still primary because the workload requires worst-case timing rather than throughput. Real-time engineers reading this curriculum will recognise eras 2-5 as irrelevant to their work, and they are right — the arc is the server-side arc, and the embedded arc has been roughly flat since the 1990s. The lesson: the arc applies where workload diversity creates throughput pressure; it does not apply where the workload is one tight deterministic loop.

Database query engines. Postgres in 1995 and Postgres in 2025 are doing roughly the same work — parsing SQL, planning a query, executing a heap scan, returning rows — but the bottleneck shape evolved differently from the application-server arc. Era 2 (IPC) never became the dominant cost for query engines because they are I/O-bound under most workloads; era 3 (memory hierarchy) hit early (the buffer pool was always cache-aware); era 4 (NUMA) hit late and only for shared-nothing parallel query engines like ClickHouse and DuckDB. The lesson: the arc's order is not universal; database internals have their own ordering driven by the disk-and-memory ratio rather than the chip-and-memory ratio. This is also why the databases curriculum (a separate track in padho-wiki) has its own mental-model progression that does not map cleanly onto these five eras.

Networking dataplanes. A switch ASIC, a NIC offload engine, or a DPDK userspace dataplane lives in a parallel arc that mostly skipped eras 1-3. From the 2000s onward, networking hardware was already pipelined, already cache-aware, already NUMA-conscious because line-rate processing demanded it. The arc that mattered for dataplanes was the kernel-bypass transition (DPDK, XDP, AF_XDP) which is roughly an era-4-equivalent transition that hit around 2015 and is still completing. The lesson: parallel arcs exist; not every domain went through the same five-era progression. A backend engineer who learns one arc should know to ask "what arc does this domain follow?" before assuming the eras transfer.

A useful heuristic: the five-era arc applies when (a) the workload is general-purpose request-response, (b) the hardware is commodity x86 / ARM, (c) the scale forces diversity in workload patterns. Any system that violates one of those three conditions has its own arc, and a senior engineer's value comes from recognising when they are reasoning across arcs versus within one.

Common confusions

  • "The 30-year arc is just Moore's Law slowing down" Different. Moore's Law (transistor density) is one driver, but the eras are about which bottleneck dominates, not which transistor count fits. The 2005 wall was thermal, not transistor. The 2015 NUMA inflection was packaging, not transistor. The 2022 cellular era is architectural, not silicon at all. Pinning the arc to Moore's Law misses the 80% of it that lives in the kernel, the runtime, and the network.
  • "Era 1 (frequency) is dead" Not entirely. Frequency boost / turbo / DVFS still matter for thermally-bursty workloads — a single-threaded SQL query plan optimisation still benefits from a 4.5 GHz turbo on one core. Era 1 has become a floor consideration, not a primary lever. You still tune isolcpus and cpupower frequency-set for benchmarks because era-1 effects pollute era-2/3/4 measurements.
  • "Eras are sequential — once you fix era N, era N+1 takes over" Wrong. They stack. A 2025 incident root-causes to whichever era is currently the dominant cost for this workload at this scale. A small Python script on a laptop is era-1-bound (single-thread CPU). A web service at 100 tx/sec is era-3-bound (cache footprint). A web service at 100,000 tx/sec is era-4-bound (NUMA). A payment processor at 1M tx/sec is era-5-bound (tail and cellular). Same code, different era depending on scale.
  • "Each era's tools replaced the previous era's" No — they layered. Modern engineers still run top (era 1) when they SSH into a hot box. perf stat (era 2) is the second tool reached for. Flamegraph (era 3) is the third. numastat (era 4) is the fourth. HdrHistogram + cellular dashboards (era 5) are the fifth. Removing any layer of the toolchain leaves a gap; the layered diagnostic ladder is the synthesis the curriculum has been building toward.
  • "Indian platforms are behind on the arc" Slightly behind on adoption (2-3 years), but the lag is shrinking. The earlier eras (1-3) are universal — every CPU and every memory hierarchy has the same physics regardless of country. Eras 4-5 lag because they require workload scale, not because of engineering capability. Razorpay's 2024 cellular UPI deployment puts them inside era 5; Zerodha's per-symbol partitioning puts them inside era 5; the lag in 2025 is roughly 18 months and dropping.
  • "The next era will be predictable" Probably not — eras 2 through 5 each surprised the engineers living through the era before them. Frequency engineers in 2003 did not predict the IPC era; IPC engineers in 2009 did not predict the memory-hierarchy era; the 2017 S3 incident was a surprise even to the engineers who built S3. Expect the next era to look obvious in retrospect and surprising in real time. The engineering posture is to keep walking down the stack — read deeper than the current generation's tools can see — and trust that the next bottleneck will become visible when it does.
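The era-1 floor from the confusions above is checkable before any benchmark run. A minimal sketch, assuming the standard Linux cpufreq sysfs layout (the paths may be absent on VMs, containers, or non-Linux machines, in which case the function simply returns an empty dict):

```python
from pathlib import Path

def cpu_freq_khz(cpu: int = 0) -> dict:
    """Read current/min/max scaling frequencies (in kHz) for one core."""
    base = Path(f"/sys/devices/system/cpu/cpu{cpu}/cpufreq")
    readings = {}
    for name in ("scaling_cur_freq", "scaling_min_freq", "scaling_max_freq"):
        f = base / name
        if f.exists():
            readings[name] = int(f.read_text())
    return readings

# If scaling_cur_freq sits far below scaling_max_freq, turbo/DVFS will
# pollute era-2/3/4 measurements; pin the governor (e.g. with
# cpupower frequency-set) before taking numbers you intend to compare.
print(cpu_freq_khz(0))
```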

Going deeper

How to spot which era your incident lives in — the 5-minute rubric

A practical rubric for an on-call engineer at 03:00: run top, look at CPU%, then walk the ladder.

  • CPU% < 30% and p99 high: era 5 (queueing, tail, network). Go to HdrHistogram and wrk2-style measurement.
  • CPU% 60-100% and IPC low (perf stat -e cycles,instructions shows IPC < 1.0): era 2-3 (pipeline / memory bound). Go to a flamegraph and perf stat -e LLC-load-misses.
  • CPU% high, IPC healthy (> 2.0), and the workload uses multiple sockets: era 4 (NUMA). Go to numastat and check whether the other_node counter is climbing.
  • Everything fine on a single box while a fleet-wide degradation is happening: era 5 (cellular / blast radius). Go to per-cell dashboards.

Why this rubric works: each era's signature symptom is incompatible with the others. High p99 with low CPU is only explained by queueing or network; it cannot be CPU-bound by definition. Low IPC with high CPU is only explained by pipeline or memory stalls; the chip is busy but not productive. The era-symptom mapping is distinctive enough at the diagnostic level that a two-minute walk down the ladder is enough to localise to the right layer 80% of the time.
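The walk can be sketched as a pure function. This is a hypothetical `classify_era` helper, not a tool from the curriculum: its inputs are the numbers an engineer reads off top, perf stat, and the fleet dashboard, and its thresholds are the rubric's heuristics, not hard laws:

```python
def classify_era(cpu_pct: float, ipc: float, sockets: int,
                 p99_high: bool, fleet_wide: bool) -> str:
    """Return the era label the 03:00 rubric points at for one incident."""
    if fleet_wide:
        # single box healthy, fleet degrading: blast-radius territory
        return "era 5: cellular / blast radius -> per-cell dashboards"
    if cpu_pct < 30 and p99_high:
        # high tail latency with an idle chip cannot be CPU-bound
        return "era 5: queueing / tail / network -> HdrHistogram, wrk2"
    if cpu_pct >= 60 and ipc < 1.0:
        # busy but unproductive chip: pipeline or memory stalls
        return "era 2-3: pipeline / memory bound -> flamegraph, LLC misses"
    if cpu_pct >= 60 and ipc > 2.0 and sockets > 1:
        # healthy IPC across sockets: suspect cross-node traffic
        return "era 4: NUMA -> numastat, other_node counters"
    return "inconclusive: walk the ladder manually"

print(classify_era(cpu_pct=22, ipc=1.8, sockets=1,
                   p99_high=True, fleet_wide=False))
```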

The Brendan Gregg synthesis — USE method as era-spanning

Brendan Gregg's USE method (Utilisation, Saturation, Errors) was published in 2012, in the middle of era 3, and survives because it is era-agnostic. Every resource — CPU, memory, network, disk, queue — has a utilisation, a saturation point, and an error rate, and the USE method gives the engineer a checklist that works on era-1 hardware and era-5 cellular services alike. The reason the method ages well is that it was never about a specific tool; it was about a vocabulary (resource, utilisation, saturation) that abstracts over the hardware era. A 2025 engineer applying USE to a CXL-attached memory pool is doing the same work as a 2012 engineer applying it to a Sandy Bridge LLC, because the vocabulary holds. The lesson: invest in vocabulary before tools, because vocabulary survives era transitions and tools do not.
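As a sketch of why the vocabulary is era-agnostic, the three USE questions can be expressed as data plus one function that applies to any resource. The resource fields and the 0.9 utilisation threshold below are illustrative choices, not part of Gregg's method:

```python
from dataclasses import dataclass

@dataclass
class UseReading:
    resource: str       # e.g. "cpu", "nic queue", "cxl pool"
    utilisation: float  # fraction of capacity in use, 0..1
    saturation: float   # work queued beyond capacity (e.g. run-queue depth)
    errors: int         # error count since the last reading

def use_flags(r: UseReading) -> list:
    """Ask the same three questions of any resource, any era."""
    flags = []
    if r.utilisation > 0.9:          # illustrative threshold
        flags.append(f"{r.resource}: high utilisation")
    if r.saturation > 0:
        flags.append(f"{r.resource}: saturated")
    if r.errors > 0:
        flags.append(f"{r.resource}: errors present")
    return flags

print(use_flags(UseReading("cpu", 0.97, 3.0, 0)))
# -> ['cpu: high utilisation', 'cpu: saturated']
```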

The reproducibility crisis in performance engineering

A subtle problem the arc reveals: the closer you get to the present, the harder reproducibility becomes. An era-1 benchmark from 1998 (a fixed-loop integer microbench on a Pentium) reproduces today on any modern CPU within 5% of the published numbers. An era-5 benchmark from 2024 (a tail-latency measurement on a 16-core c6i.4xlarge under 50,000 RPS via wrk2) reproduces only on the same instance type, the same kernel version, the same noisy-neighbour conditions on the underlying physical host. The variability budget for era-5 measurements is 30-50% across runs even on identical hardware, because the dominant cost is workload-coupled (queueing, hedge fan-out, cell coordination) rather than hardware-coupled. This is why every era-5 chapter in the curriculum specifies the cloud instance type, the kernel version, and the load-generator settings down to the seed — the measurement is only reproducible to the precision of the configuration. Engineers who learned performance work in era 2 (where 5% reproducibility was normal) have to recalibrate when they move to era 5.
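The contrast can be made concrete with a relative-spread calculation (max minus min over mean, one common way of expressing a variability budget). The run values below are invented purely to illustrate the two regimes:

```python
from statistics import mean

def variability(runs: list) -> float:
    """Relative spread of repeated benchmark runs, as a fraction of the mean."""
    return (max(runs) - min(runs)) / mean(runs)

era2_runs = [101.0, 103.0, 102.0, 104.0]  # microbench ns/op: tight spread
era5_runs = [240.0, 310.0, 195.0, 280.0]  # p99 ms under load: wide spread

print(f"era-2 spread: {variability(era2_runs):.1%}")  # 2.9%
print(f"era-5 spread: {variability(era5_runs):.1%}")  # 44.9%
```

An era-2 engineer who sees a 45% spread assumes a broken harness; an era-5 engineer recognises it as the workload-coupled noise floor the text describes.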

The economic argument — why each era's investment paid off

Each era's tooling investment looks expensive in advance and obvious in retrospect. The flamegraph effort cost Brendan Gregg roughly two years of work; it has saved the industry hundreds of thousands of engineer-hours per year since 2013. eBPF cost the kernel community a decade of architecture work; it is now indispensable. HdrHistogram cost Gil Tene a few years of evangelism; it is now the default. Cellular architecture cost AWS roughly five years of post-S3-outage engineering; it is now the default for any team at scale. The pattern: era-defining tools and architectures take 5-10 engineer-years to build and pay back over 20-30 engineer-years across the industry. The economic case for investing in the next era's tooling — even speculatively — is strong, because the payback is asymmetric. Indian platforms that are willing to invest 1-2 engineer-years in CXL tooling, GPU-flamegraph tooling, or coordination-cost dashboards in 2025-2026 will have a 2-3 year head start on the era 6 transition.

Reproduce this on your laptop

# Reproduce the era-1/2/3 diagnostic ladder
sudo apt install linux-tools-common linux-tools-generic linux-tools-$(uname -r)
python3 -m venv .venv && source .venv/bin/activate
pip install hdrh
python3 era_ladder.py 30000000

# Read the ladder output
# Era 1: cycles + ins are non-zero -> the chip is busy
# Era 2: IPC < 1.0 -> chip stalled (pipeline or memory)
# Era 3: LLC-miss rate > 20% -> working set exceeded LLC

# Optional: check NUMA topology of your own box
numactl --hardware
# Optional: read coordinated-omission-correct latencies on a real workload
# (wrk2 is not packaged in apt; build Gil Tene's fork from source — the binary is named wrk)
git clone https://github.com/giltene/wrk2 && cd wrk2 && make
./wrk -t8 -c200 -d30s -R10000 --latency http://localhost:8080/health

The combined ladder — 90 seconds on the era-1/2/3 script, 30 seconds on numactl --hardware, 30 seconds on a wrk2 run — is the two-and-a-half-minute version of the entire curriculum's diagnostic stance. The curriculum's 120 chapters expand each rung of the ladder into the depth a working engineer needs; the ladder itself is the compressed form.
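As a sketch of what "coordinated-omission-correct" means in the wrk2 run above: a closed-loop load generator that stalls on one slow response silently skips the requests it would have sent during the stall, and the correction back-fills those missing samples. This is a hand-rolled, simplified version of the back-fill that HdrHistogram's expected-interval recording performs, with synthetic numbers (one 50 ms stall at 1 ms pacing):

```python
def corrected(latencies_us: list, interval_us: int) -> list:
    """Expand observed latencies with the samples a stalled
    closed-loop generator silently omitted."""
    out = []
    for lat in latencies_us:
        out.append(lat)
        # back-fill one synthetic sample per missed send slot
        missed = lat - interval_us
        while missed > interval_us:
            out.append(missed)
            missed -= interval_us
    return out

raw = [1000, 1000, 50_000, 1000]          # microseconds; one 50 ms stall
fixed = corrected(raw, interval_us=1000)

print(len(raw), len(fixed))               # the one stall hid ~48 samples
print(max(fixed), sorted(fixed)[len(fixed) // 2])  # corrected max and median
```

The raw median says 1 ms; the corrected median is tens of milliseconds, which is why uncorrected closed-loop numbers flatter the tail.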

Where this leads next

The arc closes here, but the work does not. Era 6 is already starting — coherence cost across cells, accelerator memory hierarchies, CXL-attached memory pools — and the chapters in this curriculum are the foundation for understanding it, not a substitute. A reader who has worked through Parts 1–16 has the vocabulary, the tools, and the diagnostic ladder to recognise era-6 bottlenecks when they appear, even though those chapters do not yet exist.

Two closing exercises:

A useful exercise for any engineer who has read the full curriculum: pick the most performance-sensitive service your team owns, walk it down the diagnostic ladder, and note which era its current bottleneck lives in. Then ask the harder question: which era will be its bottleneck in two years, given the projected scale? The answer to that second question — almost always one era deeper than the present bottleneck — is where the engineering investment should be going. Teams that invest in their next era's tooling, before the workload forces it, are the teams that handle their next big incident with the right vocabulary instead of inventing it under pressure at 03:00. The curriculum's purpose was never to teach the past; it was to compress the past so that the next era is recognisable in real time.

A second exercise, equally valuable, is to walk the curriculum backwards: pick any one chapter and trace its dependency chain to the eras that produced its mental model. The chapter on coordinated omission depends on era-5 tail-latency vocabulary; the chapter on flamegraphs depends on era-3 memory-hierarchy thinking; the chapter on Amdahl's Law sits across eras 1 through 5 because the serial-fraction argument was relevant in every era but became operationally binding only when era-2's IPC ceiling was hit. Mapping a chapter to its era exposes the assumption layers it inherits from — and those assumption layers are exactly what break first when the next era arrives. An engineer who can articulate "this technique works because we are in era N; here is what stops working when we enter era N+1" has the kind of historical-dependency awareness that survives the next inflection point. That awareness is what the curriculum has been building toward across 120 chapters: not a fixed set of tools but a stance that recognises which era a problem lives in, why, and where it will move next.

A final observation to close the curriculum: the engineers who wrote the foundational papers of each era — Drepper on memory, Gregg on flamegraphs, Tene on coordinated omission, Dean and Barroso on tail latency, MacCárthaigh on cellular — were not the senior architects of their day. They were mid-career engineers who had spent years in production at the layer where the next bottleneck was already visible, and who took the time to write down what they were seeing in a way the rest of the industry could absorb. The next era's papers will be written by engineers reading this curriculum today, working at platforms where the era-6 bottleneck is already starting to bite. The discipline is to read deeply, measure honestly, write clearly when you see something the rest of the industry has not yet named, and walk down the stack with patience.

The 30-year arc bends inward — toward smaller layers, finer measurements, deeper visibility. It does not bend toward "everything is fast now". A senior engineer in 2055 will read this chapter, smile at the eras we thought were complete, and walk down a diagnostic ladder we cannot yet imagine. The work continues; the layers go deeper; the discipline is to keep walking.

References