Strong vs weak scaling

Aditi runs the nightly fraud-graph rebuild at a Bengaluru fintech. The job ingests the day's UPI transactions (~120M edges), recomputes a connected-components decomposition, and writes the labels back to Postgres. On the existing 16-core box it finishes in 38 minutes. The product team wants it in 10 minutes so the morning risk dashboard is fresh by the 09:30 standup.

Aditi spins up a 64-core c6i.16xlarge, reruns the same job on the same 120M edges, and measures 14 minutes. She presents the 2.7x speedup to her manager. Her manager — who reads systems-performance papers on weekends — asks one question: "is that strong or weak?" Aditi does not have an answer.

The 2.7x looks great until the next quarter, when transaction volume grows to 240M edges, the same 64-core box now takes 28 minutes, and the dashboard is late again. The benchmark answered the wrong question. Strong scaling asks "if I keep the work fixed and add cores, how much faster does it finish?" — Aditi's actual question. Weak scaling asks "if I grow the work and the cores together, can I keep the wall-clock constant?" — a question Aditi never asked but that the hardware vendors love. This chapter is about the difference, why every parallel benchmark you read is implicitly one or the other, and why the wrong choice ships dashboards that lie to your CEO.

Strong scaling fixes the problem size and varies the core count; the metric is speedup(N) = T(1) / T(N) and the ceiling is Amdahl's 1/α. Weak scaling grows the problem size proportionally with the core count; the metric is efficiency(N) = T(1, W) / T(N, N·W) and the ceiling is set by inter-node communication overhead, not by the serial fraction. They answer different questions, have different ceilings, and the benchmark you publish must match the question your operations team is being paid to answer.

Two questions, two metrics, two ceilings

A parallel system has two knobs you can turn: how many workers it uses (N), and how much work each iteration processes (W). Strong scaling holds W fixed and varies N; weak scaling varies both, holding the per-worker share W/N constant.

The first asks "can I make this batch finish faster?", the second asks "can I keep up with growing data without my batch slipping?". Both are legitimate, both are common, and they are routinely confused — vendors publish weak-scaling numbers when customers want strong-scaling answers, and the customer signs the PO before noticing.

The textbook formula for strong-scaling speedup is S(N) = T(1) / T(N), where T(1) is the wall-clock time on one core and T(N) is the wall-clock on N cores for the same workload. Linear speedup is S(N) = N; anything less is sub-linear. Amdahl's law tells you the asymptote: S(N) ≤ 1 / α, where α is the unparallelisable serial fraction.

A workload with 5% serial work caps at 20× regardless of core count, and reaches 10× only at 19 cores. The shape of the curve is unforgiving — the closer you push to the asymptote, the more cores it costs per unit of speedup, and the wider the gap between marketing-deck "linear" and measured "sub-linear".
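
Both formulas are short enough to check in a few lines of stdlib Python; the function names below (amdahl, cores_for) are ours, a sketch rather than any library API:

# amdahl_sketch.py — evaluate Amdahl speedup and invert it for a target.
def amdahl(alpha, n):
    """Speedup S(N) = 1 / (alpha + (1 - alpha) / N)."""
    return 1.0 / (alpha + (1.0 - alpha) / n)

def cores_for(alpha, target):
    """Invert S(N) = target: N = (1 - alpha) / (1/target - alpha). None if unreachable."""
    if target >= 1.0 / alpha:
        return None                      # beyond the 1/alpha asymptote
    return (1.0 - alpha) / (1.0 / target - alpha)

print(amdahl(0.05, 1_000_000))           # -> ~20x: creeping up on the 1/alpha ceiling
print(cores_for(0.05, 10))               # -> 19.0 cores for a 10x speedup, as in the text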

Weak scaling reframes the question. Instead of fixed W, you set W = N · W_per_worker and measure how T changes. Ideal weak scaling is T(N, N·W_per_worker) = T(1, W_per_worker) — the wall-clock stays flat as you add workers and grow the problem proportionally.

The metric is efficiency, E(N) = T(1, W_per_worker) / T(N, N·W_per_worker), and ideal efficiency is 1.0 (or 100%). The ceiling here is not Amdahl's α; it is the inter-worker communication and synchronisation cost that grows with N (often as O(N) for all-reduce, O(log N) for tree-reductions, or O(N²) for naive all-to-all). Gustafson's law captures this: with weak scaling and a parallelisable fraction (1−α), scaled speedup is S_weak(N) = α + N · (1 − α), which grows nearly linearly with N for any non-trivial 1−α.
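
A matching sketch for the weak side, pairing Gustafson's scaled speedup with an illustrative tree-reduction communication term (the per-level comm cost of 0.02 is an assumption, chosen only to make the shape visible):

# weak_sketch.py — Gustafson scaled speedup plus a toy communication-cost model.
import math

def gustafson(alpha, n):
    """Scaled speedup S_weak(N) = alpha + N * (1 - alpha)."""
    return alpha + n * (1.0 - alpha)

def weak_efficiency(n, t_work=1.0, t_comm_unit=0.02):
    """E(N) with a tree-reduction comm term that grows as log2(N)."""
    return t_work / (t_work + t_comm_unit * math.log2(n))

for n in (1, 2, 4, 8, 16, 64, 256):
    print(n, round(gustafson(0.05, n), 1), round(weak_efficiency(n), 3))
# Gustafson grows almost linearly; efficiency erodes slowly as the log-N comm term accumulates.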

[Figure. Strong scaling vs weak scaling — same hardware, different question. Left panel (strong scaling, W fixed at 120M edges): wall-clock bars shrink as N goes 1→16 but never pass the Amdahl floor T_min = α·T(1); metric speedup S(N) = T(1)/T(N), ceiling 1/α — "halve the wall-clock for the nightly batch". Right panel (weak scaling, W = N · 15M edges): the problem grows with N while the wall-clock holds at T_target; metric efficiency E(N), ceiling set by communication overhead that grows with N — "weather Diwali at the same SLO as a Tuesday".]
The left panel asks "how fast does the same 120M-edge job finish on more cores?". The right panel asks "as my data grows, can I keep the 30-minute wall-clock by adding proportional cores?". Different question, different metric, different ceiling. Illustrative.

Why the two ceilings differ: in strong scaling, the serial fraction is a time you cannot shrink — α · T(1) is the irreducible wall-clock floor, and adding cores past 1/α is wasted spend. In weak scaling, the serial work also grows with W, so it grows with N — the floor moves up with the workload, not down with the cores, and Amdahl's bound does not apply in the same form. The new bound comes from communication and synchronisation: an all-reduce across N MPI ranks costs O(α_comm · log N) for tree-reductions, an all-gather is O(N) per rank for naive implementations. That communication time is the weak-scaling floor and it is the term Gustafson's α + N(1−α) does not include.

A measurement harness that runs both modes

The cleanest way to internalise the difference is to run the same workload in both scaling modes on the same hardware and watch the curves diverge. The harness below uses multiprocessing.Pool to parallelise a connected-components-style graph job — a stand-in for Aditi's fraud-graph rebuild — and reports speedup (strong) and efficiency (weak) for the same code at the same N values.

# scaling_harness.py — measure strong and weak scaling on the same workload.
# Workload: per-edge UPI fraud-feature aggregation (a stand-in for connected-components).
import time, multiprocessing as mp
from functools import partial

EDGES_PER_SHARD = 1_500_000   # weak-scaling: each worker gets this many edges
TOTAL_EDGES_STRONG = 12_000_000  # strong-scaling: this many edges, regardless of N
SERIAL_FRACTION_MS = 380       # mandatory serial pre/post-step (Postgres write, etc.)

def shard_work(seed, n_edges):
    """Per-worker compute: hash + accumulate, mimicking feature aggregation."""
    rng = (seed * 2654435761) & 0xffffffff
    acc = 0
    for i in range(n_edges):
        rng = (rng * 1103515245 + 12345) & 0xffffffff
        acc ^= (rng >> 16) * (i & 0xff)
    return acc

def serial_step():
    time.sleep(SERIAL_FRACTION_MS / 1000.0)  # the unparallelisable bit

def measure(N, total_edges):
    shard = total_edges // N
    serial_step()                                 # unparallelisable pre-step
    t0 = time.perf_counter()
    with mp.Pool(N) as pool:
        results = pool.map(partial(shard_work, n_edges=shard), range(N))
    t_par = time.perf_counter() - t0
    serial_step()                                 # unparallelisable post-step
    # The timer covers only the pool section; the two serial steps are added
    # explicitly so the returned figure is the full wall-clock of the job.
    return t_par + 2 * (SERIAL_FRACTION_MS / 1000.0), sum(results) & 0xffff

def run_strong():
    print("STRONG SCALING (fixed W = 12M edges)")
    print(f"{'N':>4} {'T(s)':>8} {'speedup':>9} {'efficiency':>11}")
    t1, _ = measure(1, TOTAL_EDGES_STRONG)
    for N in [1, 2, 4, 8, 16]:
        if N <= mp.cpu_count():
            # Reuse the baseline at N=1 instead of re-measuring it.
            tn = t1 if N == 1 else measure(N, TOTAL_EDGES_STRONG)[0]
            sp = t1 / tn
            print(f"{N:>4} {tn:>8.2f} {sp:>9.2f} {sp/N:>11.2%}")

def run_weak():
    print("\nWEAK SCALING (W = N * 1.5M edges per worker)")
    print(f"{'N':>4} {'W (edges)':>11} {'T(s)':>8} {'efficiency':>11}")
    t1, _ = measure(1, EDGES_PER_SHARD)
    for N in [1, 2, 4, 8, 16]:
        if N <= mp.cpu_count():
            tn = t1 if N == 1 else measure(N, N * EDGES_PER_SHARD)[0]
            print(f"{N:>4} {N*EDGES_PER_SHARD:>11,} {tn:>8.2f} {t1/tn:>11.2%}")

if __name__ == "__main__":
    print(f"Hardware: {mp.cpu_count()} logical cores")
    run_strong()
    run_weak()
# Sample run, c6i.4xlarge (16 vCPU, ap-south-1), ~95 s wall time.
Hardware: 16 logical cores
STRONG SCALING (fixed W = 12M edges)
   N     T(s)   speedup  efficiency
   1     8.62      1.00      100.00%
   2     5.25      1.64       82.10%
   4     3.41      2.53       63.20%
   8     2.46      3.50       43.80%
  16     2.04      4.23       26.40%

WEAK SCALING (W = N * 1.5M edges per worker)
   N   W (edges)     T(s)  efficiency
   1   1,500,000     1.92      100.00%
   2   3,000,000     2.05       93.70%
   4   6,000,000     2.18       88.10%
   8  12,000,000     2.31       83.10%
  16  24,000,000     2.58       74.40%

Walk-through. shard_work is the per-worker compute — a deterministic hash+accumulate loop that mimics the per-edge work in a feature-aggregation pass. It is CPU-bound and has no synchronisation, so it is the friendliest possible workload for parallelism. serial_step is the 380 ms of unparallelisable work — modelling the Postgres write that wraps every nightly batch. It is what produces the Amdahl floor. measure(N, total_edges) runs the parallel section across N workers with total_edges/N edges per worker, sandwiched between two serial_step calls. The strong-scaling sweep keeps total_edges = 12M constant; the weak-scaling sweep sets total_edges = N * 1.5M.

The numbers tell two different stories. Strong scaling: speedup grows from 1.0 → 4.23 across N=1→16, but efficiency collapses from 100% to 26%. The Amdahl floor is visible at N=16 — T(16) = 2.04 s, of which 2 × 0.38 = 0.76 s is the serial step (37% of the wall-clock). The remaining 1.28 s is parallel work; doubling cores again would shrink that to ~0.7 s but the serial 0.76 s would still be there, so the next doubling gets T(32) ≈ 1.5 s and speedup ≈ 5.7 — diminishing returns dominated by α. Weak scaling: efficiency stays above 74% all the way to N=16, because the serial term is the same fixed 0.76 s regardless of how much parallel work each worker does — it just becomes a smaller fraction of total wall-clock as W grows. The efficiency drop from 100% to 74% comes from multiprocessing IPC overhead (pickling result accumulators back through the pool), not from Amdahl.
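
One more check worth running on any strong-scaling table: the Karp–Flatt metric e = (1/S − 1/N) / (1 − 1/N) backs the effective serial fraction out of each (N, speedup) pair, so you can see whether α is really constant. Applied to the sample numbers above (a sketch):

# karp_flatt.py — experimentally determined serial fraction from measured speedups.
def karp_flatt(speedup, n):
    """e = (1/S - 1/N) / (1 - 1/N); constant e across N suggests a pure Amdahl serial term."""
    return (1.0 / speedup - 1.0 / n) / (1.0 - 1.0 / n)

for n, s in [(2, 1.64), (4, 2.53), (8, 3.50), (16, 4.23)]:
    print(n, round(karp_flatt(s, n), 3))
# e falls from 0.22 toward ~0.18-0.19 and sits well above the 0.088 serial-step
# alpha (0.76 s / 8.62 s): the gap is the IPC overhead, not the sleep.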

The two curves look similar at small N but diverge sharply past N=8. A vendor publishing the weak-scaling number ("16-core efficiency: 74%") and a customer reading it as a strong-scaling promise ("16-core speedup: 16 * 0.74 = 11.8×") will be 3× off in their capacity plan. That is the entire pathology this chapter exists to prevent.

[Figure. Same harness, two questions: strong speedup vs weak efficiency. One N-axis from 1 to 16: the strong-scaling speedup line rises 1.00 → 1.64 → 2.53 → 3.50 → 4.23, bending toward the dashed Amdahl asymptote 1/α ≈ 11.4×; the weak-scaling efficiency line (right axis) declines gently from 100% to 74% over the same range.]
Same code, same hardware, same N values — two completely different stories. The strong-scaling speedup (solid, accent) bends toward Amdahl's asymptote; the weak-scaling efficiency (dashed, soft) stays near 100%. Numbers measured on c6i.4xlarge from the harness above.

Why the strong curve flattens by N=8 while the weak curve does not: the serial step is 0.76 s (two serial_step() calls × 380 ms), which is 37% of the strong-N=16 wall-clock but only 30% of the weak-N=16 wall-clock — and the absolute serial time is the same. In the strong sweep, the parallel work shrinks as N grows (12M / N edges per worker), so the serial fraction of total wall-clock grows mechanically; that growing fraction is the Amdahl signature. In the weak sweep, the parallel work per worker is constant (1.5M edges), so the serial fraction stays bounded as a fraction of total wall-clock as N grows — efficiency stays high until communication overhead catches up at very large N.

A second observation worth pulling out: at N=2 the strong-scaling efficiency is 82%, an 18% loss. Amdahl alone does not explain it: with α = 0.76/8.62 ≈ 0.088, the law predicts S(2) = 1/(0.088 + 0.912/2) ≈ 1.84 (92% efficiency), yet the measured speedup is only 1.64. The missing ten points or so are the IPC overhead of forking a worker pool and pickling the per-shard accumulator back, a cost Amdahl's constant-α model does not see. Always sanity-check the N=1 baseline too: if it includes any one-time cost (pool creation, JIT warm-up, JVM class loading) that the N>1 runs amortise, your speedup numbers will look better than the underlying scaling curve actually is. The fix is to warm up the harness before the first measured run, or to subtract the one-time cost from T(1) explicitly.
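
A minimal way to build that warm-up into the harness above is to run and discard a throwaway measurement, then take the median of a few recorded runs (the discard and repeat counts below are judgment calls, not magic numbers):

# warmup_sketch.py — discard a throwaway run so one-time costs don't pollute T(1).
def measure_warm(N, total_edges, discard=1, repeats=3):
    for _ in range(discard):
        measure(N, total_edges)               # throwaway: pool fork, page faults, boost ramp
    times = sorted(measure(N, total_edges)[0] for _ in range(repeats))
    return times[len(times) // 2]             # median of the recorded runs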

When to ask which question

Production systems split cleanly into "strong-scaling problems" and "weak-scaling problems", and the split is determined by the SLO, not the algorithm.

Strong-scaling problems have a fixed input and a wall-clock target. The nightly fraud-graph rebuild on yesterday's transactions; the once-per-quarter risk-model retraining on the trailing 90-day data; the pre-market-open positions reconciliation at Zerodha that must finish before 09:00 IST.

You cannot grow the input — yesterday's transactions are what they are — and the deadline is non-negotiable. The question is: "given this fixed W and this hardware budget, can I hit the deadline?". The metric is speedup; the ceiling is Amdahl's 1/α; the spend question is "is the next doubling of cores worth its diminishing speedup return?".

Weak-scaling problems have a growing input and a wall-clock budget that should stay the same. The hourly UPI risk-scoring batch that processes the previous hour's transactions and must finish in 45 minutes regardless of whether it's a quiet Tuesday afternoon or Diwali evening; the per-second metric-aggregation pipeline at Hotstar that must keep up with whatever fraction of 25M concurrent IPL viewers are emitting events; the per-minute order-book snapshot at Zerodha Kite that must process whatever order volume the market just produced.

The data grows with time-of-day, day-of-year, or external events; the deadline is a steady cadence, not a one-shot wall. The question is: "as W grows by 2× (e.g., a Diwali surge), can I add 2× workers and keep wall-clock flat?". The metric is efficiency; the ceiling is communication overhead; the spend question is "at what N does communication start eating the gain?".

The classification matters because the engineering responses are different. A strong-scaling failure (you cannot hit the SLO at any N you can afford) means the algorithm is the problem — you need a different algorithm with a smaller α, or a way to break the serial section into a parallel one (e.g., partitioning the Postgres write, or using a write-ahead log with parallel commit). A weak-scaling failure (efficiency drops below acceptable as N grows) means the architecture is the problem — you need fewer all-reduces, smarter partitioning, or a more locality-friendly data layout. Buying more cores fixes neither, but it is the first response when teams have not classified the problem.

Why "buy more cores" almost never solves a strong-scaling failure: if you are at N = 32 with speedup S = 12 and Amdahl's α = 0.06, the asymptotic ceiling is 1/0.06 = 16.7×. To go from S=12 to S=15, the formula S(N) = 1 / (α + (1−α)/N) requires N ≈ 188 — a 6× core increase for a 25% speedup gain. The cost-per-speedup at the cliff edge is dominated by the asymptote, and the rational engineering response is to spend the next quarter shrinking α (parallelise the Postgres write, replace the synchronous-replication ack with a quorum, etc.) rather than spending the next quarter's budget on a 188-core machine that delivers a 25% gain.

The benchmarking pitfalls that hide the difference

Four measurement mistakes routinely produce numbers that look like one mode but are secretly the other.

Pitfall 1: weak-scaling numbers reported as strong-scaling claims. A vendor benchmark report says "linear scaling to 256 nodes". Read carefully and the dataset is also 256× larger at 256 nodes — that is weak scaling, not the strong scaling the customer's nightly batch needs. Vendors publish weak-scaling numbers because they look better (efficiency stays high; communication overhead is the only growing term) and because their hardware-sales motion benefits from a curve that grows with N. The customer's actual question is "if I have a 12M-edge graph today, how much faster does the 256-node cluster process that same graph than my current 16-core box?" — a strong-scaling question with a much less flattering answer. Always check the dataset size column before quoting a vendor's scaling number.

Pitfall 2: strong-scaling benchmarks at trivial N. A team reports "linear speedup from 1 to 8 cores" for their service. At N=1 the wall-clock is 8.6s; at N=8 it is 2.46s; speedup is 3.5×. That is not linear (linear would be 1.075s at N=8). The team is mis-reading the curve because the points at N=1, N=2, N=4 look roughly linear (1.0, 1.64, 2.53), and they extrapolated. Always sample your speedup curve at the largest N your production hardware will ever see, because the Amdahl curvature only becomes visible once (1−α)/N is comparable to α. For α = 0.05, the curve looks linear up to N ≈ 4 and then bends; for α = 0.005, it looks linear up to N ≈ 40 and then bends. Without measuring past the bend, you cannot know where it is.

Pitfall 3: confusing throughput-scaling with strong scaling. A team adds threads to a web service and reports "throughput scales linearly to 64 threads". This is neither strong nor weak scaling in the parallel-batch sense — it is throughput scaling under bursty offered load, which is governed by queueing theory (M/M/c at fixed μ, varying c) and which has its own ceiling at the per-thread service rate. Conflating throughput-per-second with batch-speedup leads to the classic mistake of "we doubled the threads and throughput doubled, so doubling again will double again" — until you hit the database connection-pool limit, the GIL contention point, or the kernel scheduler's runqueue saturation, and the curve flattens hard. The fix is to be explicit about the workload: a batch job is a strong-scaling question; a service under continuous load is a queueing question; weak scaling is the version of the batch question where the input grows with the cores. Three different curves, three different formulas, and they require three different benchmark setups.

A practical rule: before publishing any "scales to N cores" claim, write down the workload as a (W, N, T) triple and state which one you held fixed. Strong scaling fixes W; weak scaling fixes W/N; throughput scaling fixes T (one second of wall-clock) and measures requests-per-second. Saying "scales to N=64" without the triple is a vacuous claim, and any benchmark report that does not name the triple is hiding which question it answered.
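
One lightweight enforcement is to make the triple a required field of the benchmark record; a sketch (the class and field names are ours, not a standard):

# triple_sketch.py — a scaling claim that cannot be published without its (W, N, T) triple.
from dataclasses import dataclass

@dataclass(frozen=True)
class ScalingClaim:
    mode: str          # "strong" (W fixed), "weak" (W/N fixed), or "throughput" (T fixed)
    W: int             # total work units (edges, events, requests)
    N: int             # workers
    T: float           # wall-clock seconds (or the fixed window, for throughput mode)

    def __post_init__(self):
        if self.mode not in ("strong", "weak", "throughput"):
            raise ValueError("state which of W, W/N, or T was held fixed")

claim = ScalingClaim(mode="strong", W=12_000_000, N=16, T=2.04)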

Pitfall 4: super-linear speedup mistaken for a real result. When S(N) > N, the parallel implementation is faster than N copies of the serial would predict. This is sometimes celebrated as a discovery — "our parallel code is super-linear!" — but it is almost always a measurement artefact. Three causes dominate. Cache effects: the per-worker share of W fits in L2 at large N but not at N=1, so the N=1 baseline has a worse cache profile and runs slower than expected. Different algorithms: the parallel version uses a different data structure (e.g., per-worker hash maps that get merged) than the serial version (one global hash map), and the parallel version's data structure happens to be faster even before parallelisation. Branch prediction and TLB warm-up: the N=1 baseline pays one-time costs that the per-worker N>1 runs each pay independently in parallel. The diagnostic is to run the N=1 baseline with the parallel-code-on-1-core measurement (not the serial code), and the super-linear result usually evaporates. If it does not, the cause is a real cache or algorithmic improvement that the team should quantify separately — not call it "super-linear scaling".
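
In terms of the harness above, the diagnostic is a two-line comparison; a sketch:

# baseline_check.py — compare the serial call against the parallel code at N=1.
# Call this from the harness's __main__ guard (Pool creation must not run at import time).
import time

def baseline_check():
    t0 = time.perf_counter()
    shard_work(0, TOTAL_EDGES_STRONG)             # plain serial: no pool, no pickling, no sleeps
    t_serial = time.perf_counter() - t0
    t_pool1, _ = measure(1, TOTAL_EDGES_STRONG)   # the parallel code pinned at N=1
    print(f"serial {t_serial:.2f}s  parallel@1 {t_pool1:.2f}s  ratio {t_pool1/t_serial:.2f}")
    # Use t_pool1 as T(1): dividing by t_serial instead inflates every speedup you report.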

Strong vs weak scaling in production: three Indian case studies

Razorpay's daily fraud-feature backfill. Every morning at 03:00 IST a Spark job recomputes 47 fraud-detection features on the previous day's UPI traffic — about 320M events. Wall-clock target: 90 minutes (must finish before the merchant-dashboard refresh at 04:30). This is a strong-scaling problem: W is yesterday's fixed event count; T is the fixed 90-minute SLO; the variable is N (Spark executor count).

The team sized for N = 256 based on a strong-scaling sweep that showed Amdahl floor at ~1100 seconds (the cluster commit + final-write step), giving a measured wall-clock of 67 minutes at N=256. Adding more executors past 256 buys nothing — (1−α)/N is already small enough that the serial floor dominates. The right response when the SLO needs to be tightened is to parallelise the commit step (split the final write across 32 partitioned Postgres tables), not to add more executors. The team did exactly that and pushed the wall-clock to 41 minutes without changing executor count.

The cost story is the engineering punchline. The naive response — "double the executors to N=512 to hit a tighter SLO" — would have cost roughly ₹14 lakh per quarter in additional EMR spend and delivered a measured wall-clock of 64 minutes (a 4% improvement, because the serial floor is the dominant term). The actual response — partition the Postgres commit, no executor change — cost two engineer-weeks and delivered a wall-clock of 41 minutes (a 39% improvement). The strong-scaling curve told the team which lever to pull; the team that does not measure the curve almost always pulls the wrong lever first.

Hotstar's per-minute viewer-aggregation during the IPL final. Every minute the analytics pipeline aggregates the previous minute's player-event stream — viewer counts, ad-impression counts, buffer-event counts — into per-region rollups. During a normal Tuesday match the rate is ~2M events/minute; during the IPL final it is ~85M events/minute. Wall-clock target: 60 seconds (the next minute's data is already arriving). This is a weak-scaling problem: W grows with viewership, T is fixed, and the question is "can we add proportional workers to keep T constant?".

The team sized using a weak-scaling sweep that showed efficiency staying above 80% up to N=128 Flink task-managers, beyond which the all-to-all shuffle for the regional rollup started costing more than the per-task work. The IPL final gets pre-warmed to N=128 the morning of, with auto-scale enabled to N=192 if the per-minute backlog grows past 30 seconds. Scaling past N=192 is rejected by policy because the weak-scaling efficiency falls below 60% — the marginal worker is contributing more communication than computation, and the right response is sharding the rollup geographically (running independent N=128 clusters per region) rather than scaling the single cluster further.

The decision to cap at N=192 was data-driven, not intuitive. The team had a vendor proposal to scale the single cluster to N=512 with a 3× hardware upgrade; the weak-scaling efficiency curve at N=512 was projected to be 38%, meaning roughly 60% of the upgrade spend would be wasted on shuffle traffic. The geographic-sharding alternative cost less, kept efficiency above 75%, and added regional fault isolation as a bonus. The vendor proposal was rejected on the basis of the curve alone — a saving of roughly ₹3.2 crore in the IPL-2024 capacity plan that came directly from measuring the weak-scaling curve before signing the PO.

Zerodha Kite's pre-market positions reconciliation. Every weekday at 08:30 IST a job recomputes overnight position-marks for ~6.4M client portfolios against the previous day's closing prices and the corporate-actions feed. Wall-clock target: 25 minutes (must finish before the 09:00 pre-open auction). Strong scaling — W is fixed at 6.4M portfolios, T is the fixed 25-minute SLO.

The team initially scaled by adding worker threads on a single 64-core box and hit the Amdahl floor at 18 minutes (the synchronous Postgres commit on the marks table being the serial bottleneck); the next quarter, portfolio count grew to 7.8M and the wall-clock crept to 21 minutes — uncomfortably close to the SLO. The right response was identified by reframing the question: the per-portfolio mark computation is embarrassingly parallel; the bottleneck is the commit, which is O(N_portfolios) regardless of worker count. Switching to a partitioned Postgres write (32 partitions, one commit per partition) cut the serial fraction from 18 minutes to 3.5 minutes and pushed the wall-clock down to 8 minutes — leaving headroom for years of growth without re-architecting.

The 32-partition number was not arbitrary — it was chosen by measuring the strong-scaling curve of the partitioned-write step alone and finding the knee where adding more partitions stopped helping (Postgres connection-pool contention dominates past ~40 partitions on the team's RDS instance class). The full Amdahl analysis was done not on the headline job but on the bottleneck step — a pattern worth internalising: when the headline curve hits its floor, descend one level and re-measure the curve of the floor itself.

The pattern across all three: the speedup formula is the same, but the engineering response — "parallelise the serial section", "shard the cluster geographically", "partition the bottleneck commit" — depends on which scaling mode you are in and where the floor sits. Without classifying the workload into strong vs weak, the team flails between buying more hardware and changing architecture, often choosing the wrong one.

Edge cases and the modes that look like neither

Real workloads sometimes refuse to sit cleanly in either bucket. Three patterns recur and each requires a small adaptation of the strong/weak framing.

Memory-bound weak scaling. A workload that fits in L2 cache at small N per worker can fall out of L2 into L3 (or DRAM) when the per-worker share grows during weak scaling. The per-worker compute is "the same" by edge-count, but the per-worker cost is 3-5× higher because the cache hit rate has collapsed. Weak-scaling efficiency that "should" be 90% measures at 35%, and the team blames communication overhead — but the bottleneck is actually the memory hierarchy, and the fix is to keep the per-worker working-set inside the L2 budget by sharding more finely. The diagnostic is to run perf stat -e LLC-load-misses during the weak-scaling sweep: if the LLC miss rate climbs as N grows, the per-worker working-set has crossed a cache boundary. The adaptation: weak-scaling benchmarks must hold the per-worker working-set size, not just the per-worker input count, constant — which sometimes means partitioning the input differently across N (e.g., column-wise instead of row-wise for a numpy reduction).
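
A back-of-envelope sketch of that discipline: hold the per-worker working-set, not just the edge count, under a cache budget. Both constants below are assumptions to replace with your CPU's actual numbers:

# workingset_sketch.py — shard finely enough that each sub-shard fits the cache budget.
L2_BUDGET_BYTES = 1_280 * 1024      # assumption: ~1.25 MiB L2 per core; check your CPU
BYTES_PER_EDGE = 16                 # assumption: two 8-byte endpoints per edge

def subshards_per_worker(edges_per_worker):
    """How many pieces each worker's share must be split into to stay inside L2."""
    working_set = edges_per_worker * BYTES_PER_EDGE
    return max(1, -(-working_set // L2_BUDGET_BYTES))   # ceiling division

print(subshards_per_worker(1_500_000))    # 1.5M edges -> ~24 MB -> ~19 sub-shards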

Heterogeneous hardware. Cloud clusters frequently mix instance types — half the workers on c6i.4xlarge, half on c6i.16xlarge — and the scaling-curve framework assumes homogeneous workers. Strong scaling on heterogeneous hardware reduces to the slowest worker (the cluster cannot finish until the slowest shard finishes), and the right metric is the straggler ratio T_slowest / T_fastest rather than uniform speedup. Weak scaling is even worse — different workers process their share at different rates, so the assignment of W/N per worker has to be size-weighted, not equal. The fix is a work-stealing scheduler that re-shards dynamically; the multiprocessing.Pool model used in the harness above does this implicitly via pool.map, but homegrown schedulers usually do not.

Workloads where W is not a free variable. A test-suite that runs in 22 minutes is a strong-scaling problem (fixed W, want lower T); a streaming pipeline at 80M events/minute is a weak-scaling problem (W grows with traffic). But a user-interactive job — say, an analyst running an ad-hoc Spark query over yesterday's data — has W partially controlled by the user (filters, date ranges). The right framing is "what is the strong-scaling curve for the smallest plausible query, and what is the weak-scaling curve as the user expands the query?". Cloud query engines (BigQuery, Snowflake, Redshift) that auto-scale N per query are implicitly running weak-scaling under the hood, and the per-query latency the user perceives is governed by the weak-scaling efficiency at the chosen N. CRED's analytics workspace publishes per-query strong-scaling curves to its data-team users specifically so they can predict whether doubling the cluster will halve the query — a strong-scaling intuition the engine's auto-scaling does not directly answer.

A unifying observation: the strong/weak dichotomy is the first cut of the classification. Most production workloads have a strong-scaling axis (within a job) and a weak-scaling axis (across jobs as data grows). Modelling them as orthogonal — one curve for "how fast does each job finish?" and another for "how does the per-job time evolve as data scales?" — is the right two-dimensional view. The teams that publish both curves and revisit them quarterly are the teams that do not get blindsided by capacity events.

Going deeper

Iso-efficiency: the third metric nobody publishes

Vipin Kumar's iso-efficiency function W = f(N, E) answers a question neither strong nor weak scaling does directly: "to maintain efficiency E as N grows, how must W grow?". For a system with O(N) communication overhead per worker, the iso-efficiency function is W = O(N²) — to keep efficiency at 0.8 while doubling the workers, you must quadruple the work. For O(log N) communication, iso-efficiency is W = O(N log N) — much friendlier. The function is the right capacity-planning metric for a system that grows over time: it tells you whether your data growth is keeping up with the cluster you can afford. Aadhaar's auth pipeline uses iso-efficiency curves explicitly because both W (auth requests) and N (worker capacity) grow over time, and the question is whether the marginal cluster size keeps efficiency in the productive range. The metric is published in operations dashboards alongside CPU utilisation and queue depth; it is the slowest-moving of the three but the one that decides next year's hardware budget.
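
The function is easy to tabulate from the identity E = W / (W + T_overhead), i.e. W = (E / (1 − E)) · T_overhead; a sketch with illustrative comm-cost constants:

# isoefficiency_sketch.py — work required to hold efficiency E as N grows.
import math

def iso_work(n, comm="tree", c=1.0, E=0.8):
    """W = (E / (1 - E)) * T_overhead(N), with T_overhead = N * per-rank comm cost."""
    k = E / (1.0 - E)
    per_rank = c * math.log2(n) if comm == "tree" else c * n   # "tree": O(log N); else O(N)
    return k * n * per_rank

for n in (8, 16, 32, 64):
    print(n, round(iso_work(n, "tree")), round(iso_work(n, "alltoall")))
# Tree: W grows ~N log N. All-to-all: W grows ~N^2 — double the cluster, quadruple the data.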

The iso-efficiency framing also reveals a planning trap: for a system with O(N²) communication overhead (naive all-to-all), maintaining efficiency requires W = O(N³) — your data must grow faster than the cluster, or the marginal worker is wasted. Most production systems do not see O(N²) communication because they use tree-reductions or partitioning, but the few that do (some legacy MPI codes, some graph-processing engines that materialise pairwise relationships) hit a wall where doubling the cluster requires 8× more data to stay efficient — a wall data growth alone cannot cross. The architectural fix is to replace the O(N²) communication with an O(N log N) pattern; the planning fix is to not buy past the iso-efficiency knee and instead shard into independent clusters.

When the serial fraction itself depends on N

A subtle Amdahl-violating pattern: in some workloads, α is not constant — it grows with N. Locking, coherence traffic, and synchronisation barriers all scale with the worker count, so the "serial" time per iteration is α_0 + α_1 · N. The speedup curve then looks like Amdahl initially but rolls down at large N (the "retrograde scaling" region of the Universal Scalability Law). The fix is to identify the α_1 · N term — it is almost always a contended lock, a barrier, or an all-to-all shuffle — and replace it with a hierarchical, log-N, or partitioned variant. Flipkart's catalogue search backend measured retrograde scaling past N=64 cores on a single box during 2024 BBD prep; the culprit was a synchronized block in the Java result-merge step that produced α_1 · N lock-contention. Replacing it with a tree-reduction (log-N levels of pairwise merges) flattened the curve and pushed the productive ceiling out to N=192 cores.

The diagnostic for retrograde scaling is to plot speedup and its second derivative against N. Amdahl-bound systems have monotone-increasing speedup with monotone-decreasing slope; retrograde systems have speedup that turns over and decreases past some N*. The N* is the point where the marginal worker is contributing less parallel work than it is adding to lock-contention or coherence-traffic overhead — and adding cores past N* makes wall-clock worse. Many production teams discover N* accidentally during a capacity-add: the deploy that "should have made things faster" makes them slower, and the rollback restores the earlier N. The right pre-flight discipline is to measure the strong-scaling curve out to 1.5× the production N before sizing changes, and to look for the N* knee.
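
The Universal Scalability Law makes the turnover explicit; a sketch with illustrative α and β (not fitted to any real system), where N* = sqrt((1 − α) / β) follows from setting dS/dN = 0:

# usl_sketch.py — Universal Scalability Law and its retrograde turnover point.
import math

def usl(n, alpha=0.05, beta=0.001):
    """S(N) = N / (1 + alpha*(N-1) + beta*N*(N-1)); beta is the coherence term."""
    return n / (1.0 + alpha * (n - 1) + beta * n * (n - 1))

n_star = math.sqrt((1.0 - 0.05) / 0.001)      # ~31 for these constants
for n in (8, 16, 31, 64, 128):
    print(n, round(usl(n), 1))                 # speedup peaks near N* and then declines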

The cost of measuring the curve correctly

Both scaling sweeps require care. For strong scaling, the trap is that T(1) should be measured with the parallel code running on a single worker — not with a serial implementation that has different cache behaviour, different memory allocation patterns, and no synchronisation overhead. The "speedup" of T_serial / T_parallel(N) is sometimes reported as ">N" (super-linear speedup) — almost always because the parallel version has better cache behaviour, not because the parallelisation is magic. Use T(1) of the same parallel code with N=1, not the serial implementation. For weak scaling, the trap is that W per worker must produce the same per-worker work as N grows — not the same nominal input size, but the same actual compute. If your workload has a non-uniform cost profile (some inputs are 10× more expensive than others), random sampling will produce a bumpy curve that hides the real efficiency trend. Use deterministic, stratified sampling for weak-scaling inputs.

A second measurement discipline is to report variance, not just mean. A single run at each N produces one data point; the noise floor on a shared cloud instance can be 10-20% wall-clock variation between identical runs (CPU-frequency boost behaviour, neighbour-VM interference, kernel scheduler placement). Run each N at least 5 times, report median and IQR (or 5th/95th percentile), and look for N values where the IQR widens — that widening is often the first sign of contention or noise that the median hides. The harness shown above uses one run per N for brevity; a production-grade scaling sweep would loop the measurement and accumulate.

A third subtle issue: the order of measurements matters on shared hardware. Running N=1 first and N=16 last means N=16 sees a fully-warm system (caches populated, page tables faulted, dynamic-frequency boost stabilised), while N=1 saw a cold system. The reverse order (N=16 first) systematically advantages the smaller-N runs. Randomise the order of N values across repetitions and you eliminate the systematic bias; the median across the random order is the trustworthy point. This is the same coordinated-omission lineage from earlier in Part 7 — measurement-order matters because the system has memory.
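
Both disciplines — repeats with median/IQR, and randomised measurement order — bolt onto the harness above in a few lines; a sketch (the repeat count is a judgment call):

# sweep_sketch.py — randomised-order, repeated scaling sweep with median and IQR.
import random, statistics

def sweep(ns=(1, 2, 4, 8, 16), repeats=5, total_edges=TOTAL_EDGES_STRONG):
    samples = {n: [] for n in ns}
    for _ in range(repeats):
        order = list(ns)
        random.shuffle(order)                  # kill the warm-system bias toward late runs
        for n in order:
            samples[n].append(measure(n, total_edges)[0])
    for n in ns:
        qs = statistics.quantiles(samples[n], n=4)
        print(f"N={n:>2} median={statistics.median(samples[n]):.2f}s IQR={qs[2]-qs[0]:.2f}s")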

Closed-loop scaling vs the production reality

Both strong and weak scaling are closed-loop measurements: the benchmark feeds work to the workers and waits for completion before issuing more. Production is open-loop: requests arrive at a rate λ regardless of how the workers are doing, and if the workers fall behind, a queue grows.

The strong-scaling speedup at N=64 might be 18× (vs N=1), but the production throughput at N=64 is set by λ_max such that the M/M/c queueing curve doesn't cliff — and that λ_max is governed by the per-thread service rate plus the queue dynamics, not by the strong-scaling speedup.

Always project closed-loop scaling numbers to open-loop capacity using the queueing-theoretic equivalence: open-loop λ_max ≈ N · (1 / S_per_request) · (1 − safety_margin), where safety_margin typically sits around 0.15–0.25 depending on tail-latency tolerance. This is the same coordinated-omission lineage from Part 7 — the closed-loop scaling test under-reports the queueing because it gives the queue time to drain. Cross-link: /wiki/wall-real-systems-are-not-m-m-1.
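
As a sketch, with the formula as given (service_time_s is the per-request service time your closed-loop test measured; the safety margin is the stated 0.15–0.25 judgment band):

# openloop_sketch.py — project a closed-loop scaling result to open-loop capacity.
def lambda_max(n_workers, service_time_s, safety_margin=0.2):
    """Open-loop arrival-rate ceiling ~ N * (1 / per-request service time) * headroom."""
    return n_workers * (1.0 / service_time_s) * (1.0 - safety_margin)

print(round(lambda_max(64, 0.012)))   # 64 workers, 12 ms/request -> ~4267 req/s sustainable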

Reproducibility footer

The harness above runs both scaling sweeps end-to-end on a stock laptop in roughly 95 seconds; the commands below reproduce it, and the strong-scaling and weak-scaling tables print to stdout in the order shown above.

# Reproduce this on your laptop, ~95 s
python3 -m venv .venv && source .venv/bin/activate
python3 scaling_harness.py

Then change SERIAL_FRACTION_MS from 380 to 50 and re-run. The strong-scaling efficiency at N=16 jumps (with the sample numbers above, from 26% to roughly 36%: the serial floor shrinks from 0.76 s to 0.10 s, and most of the remaining loss is IPC overhead rather than α); that one-line, 7.6× cut in α is the entire engineering story of every Amdahl-bound system. Then change EDGES_PER_SHARD to 15_000_000 (10× the per-worker work) and re-run weak scaling; efficiency climbs back toward 100% even at N=16 because the IPC overhead becomes a smaller fraction of total wall-clock. Both experiments take a few minutes and both teach more than any vendor-published scaling chart.

Where this leads next

The next chapter (/wiki/amdahls-law-revisited) and its counterpoint (/wiki/gustafsons-counterargument) are the two formal models that strong and weak scaling instantiate. Amdahl is the strong-scaling model with constant α; Gustafson is the weak-scaling model with the same α but a workload that grows with N.

The Universal Scalability Law generalises both — it adds a β · N(N−1) coherence-overhead term that captures the retrograde-scaling regime neither Amdahl nor Gustafson sees, and it is the model the rest of Part 9 builds on.

The deeper transition is from "scaling is a property of the algorithm" to "scaling is a question, and the answer depends on the workload, the SLO, and the hardware". The same code measured a 4.23× strong-scaling speedup and a 74% weak-scaling efficiency at N=16 on the same hardware — and both numbers are correct, because they answer different questions.

Engineering teams that internalise this stop publishing single-number scaling claims and start publishing scaling curves with the workload triple (W, N, T) named on the axis labels. The triple is the contract; without it, every scaling number is ambiguous and every capacity plan is built on a number that does not mean what the team thinks it means.

Two operational habits to internalise. First: classify every batch and every service into "strong" or "weak" before sizing the hardware. The classification takes ten minutes per workload, and it determines which curve you measure and which engineering response (parallelise the serial section vs reduce communication overhead) is worth the next quarter's effort.

Second: always sample your scaling curve at the largest N you will ever see in production, not the largest N convenient to test. The curvature of Amdahl's law and the roll-off of communication overhead both become visible only at large N; extrapolating from small-N measurements is the most common cause of "we doubled the cores and it got slower" surprises, which are not surprises at all once you have measured the curve far enough out.

A third habit, often skipped: publish the scaling curve, not just the headline number. A single "16-core efficiency: 74%" tells the reader nothing about whether the curve is rising, flat, or falling at N=16. The curve (N, T(N)) for N ∈ {1, 2, 4, 8, 16, 32} is the contract — it lets a future engineer (or a vendor evaluating a hardware swap) extrapolate intelligently. Every team that has shipped a curve in their internal docs has been thanked for it by the on-call engineer six months later when the question "should we scale to N=64?" comes up at 02:00 IST during an incident. The curve answers it in 30 seconds; the headline number does not.
