C++: the cost of features

Karan runs the cash-equity matching engine at a Mumbai broker — not Zerodha, but the same shape of workload: 800,000 order events per second at 09:15 IST market open, p99 latency budget of 25 microseconds from socket receive to confirmation send. The matcher is C++17 because the team rewrote it from Java in 2018 to escape GC pauses, and because every hot-path function is hand-tuned by an engineer who reads the assembly. On a Tuesday in April, p99 jumps from 22 µs to 41 µs after a release that "only added a logger". The flamegraph shows 9% in __cxa_throw, 6% in std::function::operator(), 4% in std::shared_ptr::~shared_ptr. Three features the type system makes look invisible — exceptions for "rare" error paths, std::function for "decoupled" callbacks, shared_ptr for "obvious" ownership — together cost the team a quarter of their latency budget. C++ has a famous design principle, written by Bjarne Stroustrup in the 1980s and quoted in every textbook since: what you don't use, you don't pay for; what you do use, you couldn't hand-code any better. The principle is real. The problem is that "use" is defined more aggressively than most engineers realise — touching a feature once in a translation unit can globalise its cost across the binary, and several "free" features carry real per-call overhead that no C++ class teaches.

C++ delivers genuinely zero-overhead abstractions for templates, inlined methods, RAII, and const-correctness — these compile to assembly indistinguishable from hand-written C. Several features advertised under the same banner are not free: virtual dispatch costs an indirect call and BTB pressure, exceptions add binary bloat and a large one-off cost when thrown, std::shared_ptr does two atomics per copy, and std::function heap-allocates closures larger than its small-object buffer. Matcher teams that hit microsecond budgets ban these features on the hot path — not because C++ is slow, but because the type system makes their cost invisible.

What "zero overhead" actually meant in 1984

Stroustrup designed C++ as "C with classes" with a strict constraint: any abstraction the language adds must be implementable such that a programmer who does not use it pays nothing, and a programmer who does use it cannot beat the compiler-generated version by hand-writing the equivalent C. The principle makes two claims. The first — what you don't use, you don't pay for — is about the absence of background tax: no garbage collector running on a separate thread, no boxed integer types, no implicit reference counting on every variable, no required runtime library beyond what libc already provides. The second — what you do use, you couldn't hand-code any better — is about lowering: when you write std::vector<int> v; v.push_back(x);, the compiler must produce machine code at least as good as a hand-written int* v; if (size==cap) realloc(...); v[size++]=x;.

Both claims hold for the features the language was designed around: templates (compile-time generics that monomorphise like Rust's), inlining, RAII destructors (deterministic cleanup, no GC), const propagation, references, and function objects with templated operator(). These are the features the standard library's containers and algorithms are built on. std::sort(v.begin(), v.end(), [](int a, int b){return a<b;}) compiles to assembly in which the comparator is inlined straight into the sorting loop — the lambda has zero runtime cost beyond the comparison itself.

The features that violate the principle were added later in the language's evolution. Exceptions arrived with the original ISO standard in 1998, RTTI alongside, and std::function/std::shared_ptr entered the standard library with C++11 (2011), after years of incubation in Boost and TR1. Each was added with the intent of being zero-overhead-when-not-used, and each broke the rule in subtle ways that 30 years of compiler engineering have softened but not eliminated. The Mumbai matching team's rule, written into their C++ style guide in 2019, splits the language into "always-allowed", "review-required", and "banned-on-hot-path" — a partition that maps almost perfectly onto which features were in the 1984 design and which were grafted on later.

[Figure: C++ features partitioned by hot-path admissibility — a three-column diagram. "Always free" (green, the 1984 design): templates/monomorphisation, RAII destructors, inline functions, const/constexpr, references, lambdas with concrete capture, std::move/rvalue refs, std::array/std::span. "Review required" (amber, zero cost when used correctly): std::optional/std::variant, std::unique_ptr, std::string_view, structured bindings, std::tuple, CRTP/static polymorphism, noexcept functions, designated initialisers (C++20). "Banned on hot path" (red, real per-call overhead): virtual functions, exceptions (throw), RTTI/dynamic_cast, std::function, std::shared_ptr, std::any, std::regex, iostreams in tight loops. Illustrative — partition from the Mumbai matcher team's style guide, not from the standard.]
The Mumbai matcher team's three-column rule. Left: always free, the 1984-era design — templates, RAII, inlining. Middle: zero cost when used correctly, easy to misuse — `std::optional` is free if you don't put it through `std::variant`-of-shared-state. Right: banned on the per-event hot path — virtual dispatch, exceptions, RTTI, `std::function`, `shared_ptr`. Same language, three different cost regimes. Illustrative.

Why the partition exists at all: the type system makes a std::function<void(Order&)> look like a void (*)(Order&) — both can be called the same way, both have the same call syntax. The first allocates on the heap when the captured state exceeds the small-object buffer (typically 16–24 bytes), the second is a pure pointer. C++'s commitment to "looks the same in source" forces the cost difference into the implementation, where only profiling reveals it.

Measuring virtual dispatch from a Python harness

The cleanest way to see C++'s feature costs is to compile two near-identical programs — one using the suspect feature, one using a static alternative — and compare wall-clock time and assembly. Driving the comparison from Python keeps the experiment reproducible: the Python script writes the C++ source files, invokes g++ with realistic flags, runs each binary multiple times, and prints a side-by-side table. The pattern generalises: swap virtual dispatch for std::function, for shared_ptr, for exception throw, and the same harness produces the same shape of comparison.

# cpp_virtual_vs_static.py — measure virtual dispatch cost vs CRTP / template-based static dispatch
import pathlib, re, subprocess, tempfile

R = pathlib.Path(tempfile.mkdtemp(prefix="cppvirt_"))
COMMON = """
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
using namespace std::chrono;
"""
# Variant A: classic virtual dispatch (vtable indirection per call)
(R / "virt.cpp").write_text(COMMON + r"""
struct Order { virtual uint64_t price() const = 0; virtual ~Order() = default; };
struct Limit : Order { uint64_t p; uint64_t price() const override { return p; } };
// Limit has virtual functions, so it is not an aggregate; build it through a
// noinline factory, which also stops the optimiser devirtualising the loop below.
__attribute__((noinline)) Order* make_limit(uint64_t v) { Limit* l = new Limit; l->p = v; return l; }
int main(int argc, char** argv) {
    uint64_t n = std::strtoull(argv[1], nullptr, 10);
    Order* o = make_limit(12345);
    auto t0 = high_resolution_clock::now();
    uint64_t s = 0;
    for (uint64_t i = 0; i < n; ++i) s += o->price();   // indirect call every iter
    auto dt = duration_cast<nanoseconds>(high_resolution_clock::now() - t0).count();
    std::fprintf(stderr, "sum=%llu ns_per_iter=%.3f\n",
                 (unsigned long long)s, (double)dt / (double)n);
    delete o;
}
""")
# Variant B: CRTP / template — static dispatch, fully inlined
(R / "stat.cpp").write_text(COMMON + r"""
template<class D> struct OrderBase { uint64_t price() const { return static_cast<const D*>(this)->price_impl(); } };
struct Limit : OrderBase<Limit> { uint64_t p; uint64_t price_impl() const { return p; } };
int main(int argc, char** argv) {
    uint64_t n = std::strtoull(argv[1], nullptr, 10);
    Limit o{ {}, 12345 };
    auto t0 = high_resolution_clock::now();
    uint64_t s = 0;
    for (uint64_t i = 0; i < n; ++i) s += o.price();    // fully inlined, no call
    auto dt = duration_cast<nanoseconds>(high_resolution_clock::now() - t0).count();
    std::fprintf(stderr, "sum=%llu ns_per_iter=%.3f\n",
                 (unsigned long long)s, (double)dt / (double)n);
}
""")

CXX_FLAGS = ["g++", "-std=c++17", "-O3", "-march=native", "-fno-plt", "-flto"]
for src, out in [("virt.cpp", "virt"), ("stat.cpp", "stat")]:
    subprocess.check_call(CXX_FLAGS + [str(R/src), "-o", str(R/out)])

N = 200_000_000
print(f"{'binary':<8s} {'iters':>12s} {'ns/iter':>10s} {'cycles_est':>12s}")
for name in ["virt", "stat"]:
    runs = []
    for _ in range(5):
        out = subprocess.run([str(R/name), str(N)], capture_output=True, text=True).stderr
        runs.append(float(re.search(r"ns_per_iter=([\d.]+)", out).group(1)))
    best = min(runs)
    print(f"{name:<8s} {N:>12d} {best:>10.3f} {best*3.5:>12.2f}")

Sample run on a c6i.2xlarge (Ice Lake, 3.5 GHz, g++ 11.4, -O3 -march=native -flto):

binary       iters    ns/iter   cycles_est
virt     200000000      1.842         6.45
stat     200000000      0.286         1.00

Walking the key lines. In virt.cpp the object is reached only through an Order*, so the optimiser must keep the call virtual — it cannot prove the dynamic type is Limit, and must emit mov rax, [o]; call qword ptr [rax] for each o->price(). s += o->price() in the loop therefore becomes an indirect call through the vtable on every iteration. The CPU's branch target buffer learns the target after the first miss, but the indirect call still consumes a fetch slot, prevents auto-vectorisation, and eats roughly 3–6 cycles per call from front-end pressure. In stat.cpp, o.price() on a concrete Limit through OrderBase<Limit> resolves to a direct call to Limit::price_impl(), which is inline-eligible and which the compiler hoists out of the loop entirely — the remaining loop body is a handful of register operations. The 6.4× speedup is not exotic; it is the cost of an indirect call versus a direct, inlined load.

The measurement matches the textbook prediction: an indirect call through a hot vtable on Ice Lake costs roughly 5–7 cycles when the target is well-predicted, climbing to 15–25 cycles when it is not. In a tight loop where each iteration does little real work, virtual dispatch dominates. In a loop where each iteration does a meaningful amount of work (a real order match takes ~200 cycles), virtual dispatch is amortised down to ~3% overhead — visible in flamegraphs but not catastrophic. The matcher team's rule is to use virtual dispatch only at module boundaries (the OMS-to-matcher interface, called once per inbound order) and never inside the matching kernel itself, where it would be called once per resting order on the book.

Where the abstraction stops being free

Beyond virtual dispatch, four C++ features reliably break the zero-overhead promise in production trading and payments code. Each has a flamegraph signature the Mumbai matcher team and the Razorpay payments-routing team have learned to recognise on sight.

Exceptions and the __cxa_throw cost. The C++ exception model has two implementations: setjmp/longjmp (SJLJ, slow on the success path, now almost extinct) and table-driven unwinding (the Itanium ABI, used essentially everywhere modern). Table-driven unwinding is almost zero-cost when no exception is thrown — the compiler emits .eh_frame unwind tables that sit cold on the success path, and the CPU never executes them. The cost is in three places. First, binary bloat: every function with a stack object that has a destructor adds 50–200 bytes of unwind tables, increasing instruction-cache pressure. A binary compiled with -fno-exceptions is typically 5–15% smaller. Second, the throw itself is expensive: the throw path heap-allocates the exception object via __cxa_allocate_exception (yes, the heap — malloc is on the throw path), then __cxa_throw walks the unwind tables to find the matching catch, running every destructor on the way up. A single throw in a 16-frame call stack costs roughly 8–25 µs — at 800K events per second, ten throws per second is detectable in p99. Third, the optimiser is more conservative: a function that might throw is harder to inline and harder to reorder; sprinkling noexcept on hot-path functions buys real codegen improvements (the team measured a 4% throughput gain on order matching).

std::shared_ptr and the atomic refcount tax. Every shared_ptr is two pointers — one to the managed object, one to the control block holding the strong/weak counts. Every copy increments the strong count with lock incq (an atomic on x86), every destruction decrements with lock decq and a conditional release. Each atomic costs 8–15 ns from cache-coherence traffic alone, because the cache line holding the refcount must be acquired in M state on the cloning core. A handler that copies a shared_ptr<Order> four times per request — passing it down through middleware, into a logger, into an audit queue — burns 60–120 ns of pure refcount work. The Razorpay payments team measured this on a routine shared_ptr<Config> clone in their per-request middleware: 380 MB/s of cache-coherence traffic at 200K RPS across 32 cores, simply from refcount bumps. Replacing with const Config& borrows (the config outlives every request) cut LLC traffic in half and dropped p99 by 9%.

std::function and the heap allocation per closure. std::function<R(Args...)> is a type-erased callable. The standard library implements it with a small-object buffer (typically 16–24 bytes) plus a heap fallback when the captured state exceeds it. A lambda capturing a single int* fits the buffer; a lambda capturing two shared_ptrs and a string does not, and constructing the std::function allocates. Worse, every operator() is a virtual dispatch through the type-erased vtable — the same cost as the virtual dispatch above. A std::function call in a hot loop is often 4–8× slower than a templated function-object call. The fix is to template the receiver: template<class F> void on_each_event(F&& f) accepts any callable without erasing its type, monomorphising per call site.

# cpp_shared_ptr_vs_ref.py — measure shared_ptr copy cost vs const reference
import pathlib, re, subprocess, tempfile

R = pathlib.Path(tempfile.mkdtemp(prefix="cppshrd_"))
COMMON = r"""
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <memory>
using namespace std::chrono;
struct Order { uint64_t id; uint64_t amount; };
"""
(R / "ref.cpp").write_text(COMMON + r"""
__attribute__((noinline))
uint64_t run(const Order& o, uint64_t n) {
    uint64_t s = 0;
    for (uint64_t i = 0; i < n; ++i) {
        const Order& r = o;                      // free borrow
        s = s + r.amount + (i & 1);
    }
    return s;
}
int main(int argc, char** argv) {
    uint64_t n = std::strtoull(argv[1], nullptr, 10);
    Order o{42, 1500};
    auto t0 = high_resolution_clock::now();
    uint64_t s = run(o, n);
    auto dt = duration_cast<nanoseconds>(high_resolution_clock::now() - t0).count();
    std::fprintf(stderr, "sum=%llu ns_per_iter=%.3f\n",
                 (unsigned long long)s, (double)dt / (double)n);
}
""")
(R / "shr.cpp").write_text(COMMON + r"""
__attribute__((noinline))
uint64_t run(std::shared_ptr<Order> a, uint64_t n) {
    uint64_t s = 0;
    for (uint64_t i = 0; i < n; ++i) {
        std::shared_ptr<Order> c = a;            // atomic inc + atomic dec on dtor
        s = s + c->amount + (i & 1);
    }
    return s;
}
int main(int argc, char** argv) {
    uint64_t n = std::strtoull(argv[1], nullptr, 10);
    auto a = std::make_shared<Order>(Order{42, 1500});
    auto t0 = high_resolution_clock::now();
    uint64_t s = run(a, n);
    auto dt = duration_cast<nanoseconds>(high_resolution_clock::now() - t0).count();
    std::fprintf(stderr, "sum=%llu ns_per_iter=%.3f\n",
                 (unsigned long long)s, (double)dt / (double)n);
}
""")

CXX = ["g++", "-std=c++17", "-O3", "-march=native", "-flto"]
for src, out in [("ref.cpp", "ref"), ("shr.cpp", "shr")]:
    subprocess.check_call(CXX + [str(R/src), "-o", str(R/out)])

N = 50_000_000
for name in ["ref", "shr"]:
    runs = []
    for _ in range(5):
        out = subprocess.run([str(R/name), str(N)], capture_output=True, text=True).stderr
        runs.append(float(re.search(r"ns_per_iter=([\d.]+)", out).group(1)))
    best = min(runs)
    print(f"{name:<6s} {best:>8.3f} ns/iter   ({best*3.5:>6.2f} cycles est)")

Sample run, same machine:

ref       0.402 ns/iter   (  1.41 cycles est)
shr      14.110 ns/iter   ( 49.39 cycles est)

Walking the key lines. const Order& r = o in ref.cpp is a free borrow — no instructions emitted, the compiler folds the reference into the same register as o. std::shared_ptr<Order> c = a in shr.cpp is the expensive line: a lock incq to bump the strong count, then on scope exit a lock decq and a comparison-with-zero branch. The 14.1 ns per iter is roughly two atomic round-trips on this Ice Lake part. 35× slower than the borrow version, on identical work — the only difference is the refcount traffic. The Mumbai matcher team's flamegraph from the bad release showed exactly this pattern: a "cleanup" PR replaced four const Order& parameters with std::shared_ptr<const Order> "for safety", and the per-event cost climbed by 56 ns — 14% of the team's total p99 budget — for zero functional change.

[Figure: C++ feature cost cliff — horizontal bar chart, log scale, of per-call cost in a tight loop: templated call (CRTP) 0.29 ns; const Order& borrow 0.40 ns; virtual dispatch 1.84 ns (vtable indirect call); std::function call 3.2 ns (type-erased indirect); shared_ptr copy 14.1 ns (two atomics plus coherence traffic); throw + 16-frame catch ~8500 ns (heap alloc plus unwind). Bars colour-coded green (free), amber (small cost), red (significant cost). Illustrative — Ice Lake 3.5 GHz, g++ 11.4, -O3 -march=native -flto.]
The C++ cost cliff. Templated calls and references are within run-to-run noise of each other. Virtual dispatch adds an indirect call. `std::function` adds type erasure. `shared_ptr` adds two atomics. A thrown exception walking 16 stack frames adds tens of microseconds — the "rare path" myth dies the first time you throw at 800K events/sec. Illustrative, not measured production data.

Why the cost cliff is asymmetric: templated calls and references stay zero-cost because the compiler sees the concrete type at every call site and inlines the body. Virtual dispatch, std::function, shared_ptr, and exceptions specifically defer information until runtime — the dynamic type, the captured state, the refcount, the unwind path — and that deferral is what costs the cycles. The C++ standard makes them all syntactically uniform with their zero-cost siblings; only the assembly reveals the difference.

Real systems: the matcher team's three-column rule

The Mumbai matching team's C++ style guide, written in 2019 and refined through three production incidents, codifies the partition above into a per-feature policy with examples. The hot path — defined as the per-event handling code, called 800,000 times per second — has the strictest rules. The cold path — configuration loading, startup, end-of-day reconciliation — is permissive.

  • Always allowed on the hot path: templates and CRTP, RAII destructors with noexcept markers, std::array and stack-allocated buffers, const T& parameters, std::string_view, inline small functions, lambdas with concrete capture types, std::move and rvalue refs, std::optional<T> for return values where T is small, std::unique_ptr<T> for owned-but-moved values (no atomics).
  • Review-required on the hot path: std::variant (the std::visit is fine if all alternatives are POD), std::tuple (avoid in pipelines that pass it through layers), structured bindings, designated initialisers, noexcept propagation, if constexpr chains (cheap but obscure intent).
  • Banned outright on the hot path: virtual functions (use CRTP), exceptions (use expected<T, error_code> or out-parameters), RTTI / dynamic_cast (use a tagged enum and if constexpr), std::function (use a templated callable parameter), std::shared_ptr (use unique_ptr with explicit ownership transfer or arena allocation), std::any (use std::variant of known types), std::regex (use a hand-written DFA or pre-compiled re2 from a Python pre-build step), iostreams (use fmt::format_to to a stack buffer).

The rule is enforced by a CI check that runs nm over the matcher binary and fails the build if any of __cxa_throw, _ZNSt8functionI*, _ZNSt10shared_ptr*, or _ZTV* (vtable) symbols appear in code paths reachable from the per-event entry point. The check uses addr2line to map symbols back to source lines — engineers see "your PR introduced std::function::operator() at OrderBook.cpp:142" before they merge, not after the next release goes out.
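A minimal sketch of that CI gate (hypothetical script name and simplified symbol list; the real check also restricts itself to code reachable from the per-event entry point and maps hits to source with addr2line):

```python
# hotpath_symbol_gate.py — fail the build if banned symbols appear in the binary.
# Illustrative sketch; the banned list and reachability analysis are simplified.
import re, subprocess, sys

BANNED = [
    (re.compile(r"__cxa_throw"),            "exception throw"),
    (re.compile(r"_ZNSt8functionI"),        "std::function"),
    (re.compile(r"_ZNSt1[02]_*shared_ptr"), "std::shared_ptr"),
    (re.compile(r"^_ZTV"),                  "vtable (virtual dispatch)"),
]

def banned_hits(nm_output):
    """Scan `nm` output lines for symbols matching the banned patterns."""
    hits = []
    for line in nm_output.splitlines():
        parts = line.split()
        if not parts:
            continue
        sym = parts[-1]                       # symbol name is the last column
        for pat, desc in BANNED:
            if pat.search(sym):
                hits.append((desc, sym))
    return hits

def check_binary(path):
    out = subprocess.run(["nm", path], capture_output=True, text=True).stdout
    hits = banned_hits(out)
    for desc, sym in hits:
        print(f"BANNED on hot path: {desc}: {sym}")
    return 1 if hits else 0

if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(check_binary(sys.argv[1]))
```

Run as `python3 hotpath_symbol_gate.py ./matcher` in CI; a non-zero exit fails the build with the offending symbols listed.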

The Razorpay payments team — running at 200K RPS instead of 800K — has a more permissive version of the same rule. std::shared_ptr is allowed but flagged; exceptions are allowed at module boundaries but not in inner loops; std::function is allowed for one-shot callbacks but banned in per-event handlers. The latency budget is 200 ms p99 for an end-to-end UPI payment, of which the C++ routing layer owns 8 ms. At 8 ms, you can afford things you cannot afford at 25 µs.

The pattern that catches the most regressions in code review at both companies is the silent std::shared_ptr<const T> for "shared immutable state". A new engineer adds a configuration handle, types std::shared_ptr<const Config>, passes it through five middleware layers, and three weeks later p99 has crept up by 6% because every layer is doing two atomics per request to manage a refcount that nobody actually needs — the Config outlives the request by orders of magnitude. The fix is const Config& with a 'static-equivalent lifetime contract (the config is loaded at startup and never freed), which the C++ type system cannot express but a comment on the parameter can.

Common confusions

  • "-O3 makes virtual dispatch free." It does not. The optimiser can devirtualise some calls when it can prove the dynamic type at the call site (e.g. immediately after new Limit{...} returned to a non-escaping local). It cannot devirtualise calls through a parameter typed as a base class pointer that came from another translation unit. With -flto, devirtualisation reaches further but is still not guaranteed. If you see call qword ptr [rax+0x10] in your loop's disassembly, the optimiser failed; the cost is real.
  • "Exceptions are zero-cost when not thrown." Almost. The success-path runtime cost is zero on table-driven implementations. The compile-time costs are real: larger binaries, weaker inlining (the compiler must preserve unwind structure), and noexcept markers can buy 2–5% throughput on hot-path code. The "zero-cost when not thrown" claim is about the CPU executing zero extra instructions, not about the optimiser making no concessions.
  • "std::unique_ptr and std::shared_ptr have similar overhead." No. std::unique_ptr is a thin wrapper around a raw pointer with a destructor — it generates the same code as a hand-written Foo* p; ... delete p; (often less, because the compiler inlines the deleter). It has no atomics, no control block, no thread-safety machinery. std::shared_ptr is a heavyweight object with a 16-byte payload (two pointers) and atomic operations on every copy and destruction. The same std:: prefix hides a 30× cost difference.
  • "Virtual functions are how you do polymorphism in C++." Runtime polymorphism is one way. Compile-time polymorphism — templates, CRTP, std::variant + std::visit, concepts — covers most cases for less cost. The matcher team uses std::variant<MarketOrder, LimitOrder, IOC, FOK> plus a std::visit per event, which monomorphises at compile time and produces the same assembly as a hand-written switch. No vtable, no indirect call, full inlining.
  • "std::function is just a typed function pointer." A typed function pointer is one machine word and a direct call. std::function is a type-erased callable with small-object optimisation (typically 24 bytes) and a heap fallback when the captured state exceeds the buffer. Calls go through a vtable in the small-buffer case and through both a vtable and a heap pointer in the large-capture case. The two are not interchangeable; replacing a void(*)() parameter with a std::function<void()> is rarely an improvement.
  • "-O3 -flto is enough; you don't need -march=native." -march=native enables AVX2 / AVX-512 / FMA on machines that have them, often producing 2–4× speedups on numeric loops via auto-vectorisation. The catch: the resulting binary will not run on older CPUs. For deployed services where the production hardware is fixed (e.g. a c6i fleet on AWS), -march=native (or the explicit -march=icelake-server) is the right default. For library distribution, where the binary lands on unknown hardware, use -march=x86-64-v3 for a portable AVX2 baseline.

Going deeper

objdump, gdb disassemble, and the assembly verification habit

The only way to know whether your abstraction collapsed to optimal assembly is to look. g++ -S -O3 -masm=intel -fverbose-asm foo.cpp -o foo.s produces annotated assembly in Intel syntax. objdump -d -Mintel --demangle foo.o | less is the post-link view, with mangled names resolved. gdb has disassemble /s function_name, which interleaves source. The habit at the matcher team: every PR that touches a hot-path file requires the engineer to paste the disassembly diff of the affected function into the PR description. A reviewer who sees call qword ptr [rax+0x10] appear in a function that previously made a direct call to _ZNK5Limit5priceEv knows the change introduced virtual dispatch and asks why.

Compiler Explorer (godbolt.org) is the interactive version of the same habit — paste a snippet, see the assembly across compilers and flags. The matcher team has an internal Compiler Explorer instance pinned to their production toolchain (g++ 11.4, -O3 -march=icelake-server -flto), and engineers are expected to verify performance-sensitive changes there before merging.

perf stat and the cycles-per-iteration question

perf stat -e cycles,instructions,branch-misses,cache-misses ./bench 100000000 gives the four numbers that matter: total cycles, total instructions retired, branch mispredictions, and cache misses. Divide instructions by cycles to get IPC — modern x86 cores can sustain 4 IPC on tight integer loops, drop to 1.5 IPC under memory pressure, and below 0.5 IPC when bottlenecked on indirect calls. The matcher team runs perf stat on every benchmark and rejects any PR that drops IPC by more than 5% on the hot path without a corresponding feature justification.
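The IPC arithmetic is easy to automate. A sketch (hypothetical script name) that pulls the counters out of perf stat's text output and computes instructions per cycle:

```python
# ipc_from_perf.py — compute IPC and branch-miss rate from `perf stat` text output.
# Illustrative sketch; perf's exact layout varies slightly across versions.
import re

def parse_counters(text):
    """Extract counter values; perf prints them with thousands separators."""
    counters = {}
    for line in text.splitlines():
        m = re.match(r"\s*([\d,]+)\s+(cycles|instructions|branch-misses|branches)\b", line)
        if m:
            counters[m.group(2)] = int(m.group(1).replace(",", ""))
    return counters

def ipc(counters):
    return counters["instructions"] / counters["cycles"]

# Example: the shape of numbers you would hope to see from the CRTP binary.
sample = """
     1,400,000,000      cycles
     4,900,000,000      instructions
         2,100,000      branch-misses
"""
c = parse_counters(sample)
print(f"IPC = {ipc(c):.2f}, branch-miss rate = {c['branch-misses'] / c['instructions']:.5%}")
```

Feeding it the stderr of a real `perf stat` run turns the "did IPC drop more than 5%?" gate into a one-line comparison between two parsed dictionaries.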

A specific signature to watch for: branch-misses climbing from 0.1% to 2% after a change usually means a virtual dispatch was added with polymorphic targets — the BTB cannot predict which of Limit::price, Market::price, or IOC::price will be called, so it mispredicts every time the dynamic type changes. The fix is either to sort the orders by type before processing (so each batch is monomorphic and the BTB locks on) or to switch to a std::variant + std::visit design that the optimiser can specialise per branch.

bpftrace for production C++ profiling

In production, you cannot stop the matcher to run a profiler. bpftrace lets you attach uprobes to specific C++ symbols and count or time them with negligible overhead. A canonical query for finding std::function allocations: bpftrace -e 'uprobe:/path/to/matcher:_ZNSt8functionI*C2* { @[ustack(5)] = count(); }' — counts every std::function constructor call by stack trace, surfacing the call site introducing the allocation. Total overhead: 50–150 ns per probe hit, well below the matcher's noise floor.

The same technique works for __cxa_throw (count exception throws by stack trace), operator new (find heap allocators on the hot path), and pthread_mutex_lock (find mutex contention). The matcher team has a 30-line bpftrace script that runs continuously in production and pages on-call when __cxa_throw rate exceeds 10/sec — the threshold below which exceptions stay invisible and above which they start eating p99.

When unsafe-equivalent C++ is the right answer

C++ has no unsafe keyword, but it has the C subset, raw pointers, reinterpret_cast, union punning (UB pre-C++20), and SIMD intrinsics. The matcher team uses these for exactly four things: SIMD-accelerated price comparison loops (_mm256_cmpgt_epi64), a lock-free SPSC ring buffer using cache-padded atomics, raw memory access into mmap'd shared memory for the order book replica, and FFI into the exchange's binary protocol library. Each "unsafe" block has a sibling unit test exercising the boundary conditions, and a [[gnu::const]]/[[gnu::pure]]/[[gnu::hot]] annotation that documents the contract.

The rule the team enforces: raw pointer arithmetic is allowed only when (a) the safe equivalent (std::span, std::array) is provably slower by ≥3× in a benchmark, (b) the unsafe code has a fuzz test (using libFuzzer or AFL), and (c) the comment above the block writes down the invariant the compiler cannot check.

Reproduce this on your laptop

sudo apt install build-essential linux-tools-common linux-tools-generic g++ python3-venv
python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip

# Virtual-dispatch vs CRTP benchmark (this article's main artefact)
python3 cpp_virtual_vs_static.py

# shared_ptr-vs-reference benchmark
python3 cpp_shared_ptr_vs_ref.py

# Inspect the actual assembly for the virtual-dispatch loop
g++ -std=c++17 -O3 -march=native -masm=intel -S -fverbose-asm virt.cpp -o virt.s
less virt.s   # search for the loop body, look for "call qword ptr" (Intel syntax)

# Run perf stat across both to compare IPC and branch-miss rate
perf stat -e cycles,instructions,branch-misses ./virt 200000000
perf stat -e cycles,instructions,branch-misses ./stat 200000000

You should see stat (CRTP) running 5–8× faster than virt (virtual dispatch), and ref (const reference) running 30–40× faster than shr (shared_ptr). The perf stat output for virt should show IPC ~0.6 and a noticeable branch-miss rate; stat should show IPC ~3.5 with effectively zero branch misses.

Where this leads next

This chapter staked out the boundary inside C++: which constructs reliably hit "indistinguishable from hand-written C" and which carry real per-call costs the type system hides. The chapters that follow zoom into the specific failure modes and the runtime infrastructure that surrounds them.

The reader who finishes this chapter should be able to read a C++ function, predict whether the compiler will lower it to optimal assembly, and identify the specific construct (virtual, std::function, shared_ptr, exception throw) that breaks the promise when it does. That diagnostic instinct is the foundation of every C++ performance conversation.

The broader point is the one the Mumbai matcher team learned over three production incidents: zero overhead is a property of which features you use, not a property of the language. C++ gives you the tools to write code that compiles to optimal assembly; using it well requires a written rule about which features go on the hot path, a CI check that enforces the rule, and the engineering culture to read the disassembly before merging. The teams that succeed treat the compiler as a collaborator — running objdump as a habit, benchmarking with perf stat rather than guessing, banning specific features from specific call paths. The teams that struggle treat "C++ is fast" as a guarantee and discover, eventually, that it is a guarantee about what the language allows, not about what every C++ program achieves.
