The economic argument: scale and cost

Riya runs the payments stack at PaySetu on a single r6i.4xlarge — 16 vCPU, 128 GB RAM — and it has been the calmest part of the platform for two years. Last Diwali her CFO asked the obvious question: "if we're growing 3× a year, can we just buy a bigger box every year and stop pretending we need a distributed team of fifteen?" She ran the numbers and discovered that the next box up — r6i.32xlarge — costs 9.2× as much for 8× the cores, but a cluster of three r6i.4xlarge costs 3× as much for 3× the cores and survives a single-AZ outage that the giant box does not.

The CFO's question is the right question. The answer — when distribution starts paying for itself, and when it just adds engineering overhead — is the entire economic argument for everything in the next 138 chapters.

Vertical scaling is cheaper per unit until you hit one of three cliffs: a price-per-throughput knee in the cloud SKU ladder, a single-machine ceiling that you physically cannot exceed, or an availability target that no single machine can meet. Distribution is the answer to those three cliffs and is overhead before them. Knowing which cliff you are about to hit — and which one you are not — is the difference between a working platform and an over-engineered one.

The single-box baseline — what one machine actually buys you in 2026

The first move in any scaling argument is to be honest about how far one machine goes. The honest number surprises people who learned distributed systems from blog posts about microservices.

A 2026-era cloud VM at the top of the SKU ladder — 128 vCPU, 1024 GB of RAM, a 50 Gbps NIC, attached NVMe — costs roughly ₹4.15 lakh per month on-demand.

That is enough to serve 40,000 small JSON requests per second at p99 of 35 ms, host a single-tenant Postgres at 8000 transactions/sec sustained, or run an 800 GB in-memory cache with 1.2M ops/sec. KapitalKite's entire equity-trading order book — one country's worth of trades — fits comfortably on a single one of these instances during normal market hours and uses about 30% of the available cores.

Figure: What one large cloud VM gets you in 2026. One r6i.32xlarge (₹4.15L/month): 128 vCPU, 1024 GB RAM, 50 Gbps NIC, attached NVMe. Capacity bars: 40k JSON req/s at p99 35 ms uses ~70% of CPU; an 800 GB cache uses ~80% of RAM; 8k tx/s Postgres uses ~30% of NVMe. Workloads that fit on one box: KapitalKite order book (full trading day), PaySetu UPI router (8k TPS sustained), MealRush dispatch (3 cities, full traffic), YatriBook fare cache (800 GB hot set), PaisaCard rewards (full ledger, 50M cards).
Illustrative — not measured data. The single-box envelope as of 2026. Most Indian fintech and consumer workloads sit comfortably inside this envelope until they hit one of the three cliffs.

The takeaway is uncomfortable for anyone who has read too many "microservices migration" blog posts: a startling number of services that are running on Kubernetes with twenty pods would run more cheaply, more reliably, and with lower latency on one of these boxes, with a warm standby for failover. The distributed architecture is not buying them anything they could not get from systemd and a load balancer.

Why this matters before the rest of the chapter: the economic argument runs in both directions. Distribution can save money at scale, and it can also burn money at the scale you are at right now. Before you reach for a Raft cluster, the honest engineering move is to check whether the single-box envelope still fits your workload — because if it does, every distributed-systems primitive you add is overhead that is paid for in latency, complexity, and on-call pages.
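
That check can be mechanical. A minimal sketch against the illustrative envelope above; the dimensions, the numbers, and the 70% headroom factor are assumptions to adjust, not a standard:

# envelope_check.py — does the workload still fit one box?
# Thresholds from the illustrative envelope above, not measured data.
ENVELOPE = {"json_rps": 40_000, "pg_tps": 8_000, "cache_gb": 800}

def fits_single_box(json_rps, pg_tps, cache_gb, headroom=0.7):
    """True when every dimension sits below `headroom` of the envelope."""
    return (json_rps <= ENVELOPE["json_rps"] * headroom
            and pg_tps <= ENVELOPE["pg_tps"] * headroom
            and cache_gb <= ENVELOPE["cache_gb"] * headroom)

# A PaySetu-sized workload fits with room to spare
print(fits_single_box(json_rps=12_000, pg_tps=5_000, cache_gb=200))  # True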

The three cliffs — when one box stops being enough

You leave the single-box regime when you hit one of three cliffs. They are not interchangeable. Each one has a different fix, and reaching for the wrong fix is how teams build expensive distributed systems that do not solve their actual problem.

Cliff 1 — the price-per-throughput knee

Cloud VMs are priced linearly within a SKU family, but the family ladder is not linear at the top. From r6i.large (2 vCPU) through r6i.16xlarge (64 vCPU), the price per vCPU is essentially flat. Above that — r6i.24xlarge, r6i.32xlarge, the metal SKUs — the price per vCPU starts climbing because you are paying a scarcity premium for the largest hosts in the rack.

# vm_economics.py — when do three smaller boxes beat one big one?
SKUS = [
    # name,            vcpu, ram_gb, monthly_inr
    ("r6i.4xlarge",     16,  128,    45_000),
    ("r6i.8xlarge",     32,  256,    90_000),
    ("r6i.16xlarge",    64,  512,   180_000),
    ("r6i.24xlarge",    96,  768,   292_000),  # premium starts
    ("r6i.32xlarge",   128, 1024,   415_000),  # premium accelerates
    ("r6i.metal",      128, 1024,   468_000),  # peak premium
]

print(f"{'sku':18} {'vcpu':>5} {'₹/month':>10} {'₹/vcpu':>8} {'cluster of 3 r6i.4xlarge':>28}")
print("-" * 75)
baseline = 45_000  # r6i.4xlarge
for name, vcpu, _, price in SKUS:
    per_vcpu = price / vcpu
    # what would 3x the throughput of this SKU cost using r6i.4xlarge clusters?
    cluster_size = max(3, (vcpu * 3) // 16)
    cluster_price = cluster_size * baseline
    print(f"{name:18} {vcpu:>5} {price:>10,} {per_vcpu:>8,.0f} "
          f"{cluster_size}× r6i.4xlarge = ₹{cluster_price:>10,}")

Sample run:

sku                 vcpu    ₹/month   ₹/vcpu  3× capacity via r6i.4xlarge
---------------------------------------------------------------------------
r6i.4xlarge           16     45,000    2,812     3× r6i.4xlarge = ₹   135,000
r6i.8xlarge           32     90,000    2,812     6× r6i.4xlarge = ₹   270,000
r6i.16xlarge          64    180,000    2,812    12× r6i.4xlarge = ₹   540,000
r6i.24xlarge          96    292,000    3,042    18× r6i.4xlarge = ₹   810,000
r6i.32xlarge         128    415,000    3,242    24× r6i.4xlarge = ₹ 1,080,000
r6i.metal            128    468,000    3,656    24× r6i.4xlarge = ₹ 1,080,000

The price per vCPU is flat at ₹2,812 up to 64 vCPU, then climbs 15% by r6i.32xlarge and 30% by metal. per_vcpu = price / vcpu is the load-bearing line: it exposes the knee. cluster_size = max(3, (vcpu * 3) // 16) computes how many baseline boxes deliver 3× the headline SKU's throughput, because you need at least 3 for any meaningful availability story (1 leader + 2 followers, or 3-way replication). Above 64 vCPU the ladder inverts: two r6i.16xlarge instances cost ₹360,000, less than one r6i.32xlarge at ₹415,000, and survive an instance failure, while the single big box is one host reboot away from a full outage.

This is the first cliff: not "we cannot grow vertically", but "vertical growth has stopped being the cheapest unit of capacity." The case for going horizontal becomes decisive once the per-vCPU premium of the next size up outweighs the cost of operating three smaller boxes with a load balancer in front.

Why the cloud SKU ladder is not linear at the top: the largest VMs are physically constrained. There are only so many sockets per rack, and a metal SKU consumes the whole host. Cloud providers price the scarcity in. On-prem, the cliff is sharper — beyond a certain socket count there are no SKUs at all, just custom systems with quote-on-request pricing and 12-month lead times.

Cliff 2 — the single-machine ceiling

Even if money were no object, some workloads are physically larger than any single machine. CricStream, the OTT for cricket, peaks at 48 million concurrent viewers during an India–Australia World Cup final. Each viewer maintains an HLS player polling every 4 seconds. That is 12 million requests per second of metadata traffic alone, before any video bytes move. The largest single VM AWS sells you tops out around 100 Gbps of network throughput — about 1.2 million HLS-sized requests per second. You are off by 10×. There is no SKU that solves this. You are forced into distribution by the physics of network cards and the speed of light, not by a price-per-vCPU spreadsheet.
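
The arithmetic is worth keeping as a script. A sketch; the per-box request rate is this chapter's rough estimate, not a benchmark:

# cricstream_envelope.py — Cliff 2 is physics, not pricing
viewers = 48_000_000            # concurrent viewers at peak
poll_interval_s = 4             # each HLS player polls every 4 s
metadata_rps = viewers / poll_interval_s

per_box_rps = 1_200_000         # ~100 Gbps NIC at HLS-manifest request sizes
print(f"metadata: {metadata_rps:,.0f} req/s, {metadata_rps / per_box_rps:.0f}× the largest box")
# metadata: 12,000,000 req/s, 10× the largest box. No SKU closes that gap.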

The second cliff has three concrete shapes:

  1. A hot working set larger than the largest box's RAM — 1 TB in the 2026 envelope.
  2. Sustained network throughput beyond the largest NIC — CricStream's 12M req/s against a ~1.2M req/s single-box ceiling.
  3. Sustained disk IO beyond the attached-NVMe envelope of the largest SKU.

When you are physically off by 5× or more from the largest available box, distribution is not a choice. It is a fact about your problem. You are no longer optimising — you are surviving.

Cliff 3 — availability you cannot buy on a single machine

The third cliff is the most subtle, and the most commonly misunderstood. A single AWS r6i.32xlarge has roughly 99.5% availability — about 3.6 hours of downtime per month, mostly from host-maintenance reboots, hypervisor migrations, and the occasional hardware failure. That sounds high until you write it down next to a contract: PaySetu's UPI router has a regulatory uptime SLA of 99.99% — 4.3 minutes/month, which no single VM can deliver. Not because you do not have the budget, but because the cloud provider does not sell that SLA on a single instance. The host will, eventually, reboot.

The only way to clear 99.99% is to run multiple instances across failure domains, and the moment you do, you have a distributed system — even if it is just two boxes behind a load balancer. Whatever your replication strategy is (active-active, active-passive, leader-follower), you have crossed into the territory the next 138 chapters cover: failure detection, leader election, replication lag, split-brain prevention, partial-failure semantics.
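
The composition math is one line, and worth writing down. A sketch that assumes independent failure domains and instant failover, both idealisations the rest of the book spends chapters dismantling:

# availability_math.py — what 99.5% per box composes to across failure domains
MONTH_MIN = 30 * 24 * 60        # 43,200 minutes in a month

def downtime_min(avail):
    return MONTH_MIN * (1 - avail)

def fleet_availability(single, n):
    # P(at least one instance up), assuming independent failures
    return 1 - (1 - single) ** n

for n in (1, 2, 3):
    a = fleet_availability(0.995, n)
    print(f"{n} instance(s): {a:.6%} ≈ {downtime_min(a):.2f} min/month")

On paper two 99.5% boxes clear PaySetu's 4.3-minute budget with room to spare; in practice the failover path itself fails, which is why the primitives above get their own chapters.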

Figure: The three cliffs that force distribution. Three bars marking where the single-box regime stops being viable. Cliff 1, the price-per-vCPU knee: above 64 vCPU a cluster of 3 wins on ₹/throughput; needs a load balancer and stateless replicas. Cliff 2, the machine ceiling: the workload exceeds the NIC, RAM, or NVMe of the largest SKU; needs sharding and consistent hashing. Cliff 3, the availability target: the SLA demands ≥ 99.95% while a single VM caps at ~99.5%; needs replication and failure detection.
Each cliff demands a different primitive. Reaching for sharding when your problem is availability — or for replication when your problem is throughput — is how teams build the wrong distributed system.

The hidden cost of distribution — what the spreadsheet doesn't show

The cluster-of-three vs one-big-box comparison from the opening ran ₹135,000 against ₹415,000: for Riya's 3× growth target, distribution wins three-to-one on infrastructure (the giant box also carries capacity she does not yet need). The spreadsheet stops there. The actual bill does not.

When you go from one box to three, you are not paying 3× the engineering cost — you are paying somewhere between 5× and 15×, and the multiplier comes from concerns that did not exist on a single machine:

  1. Failure detection, failover, and split-brain prevention, instead of one process to restart.
  2. Replication, and the lag and consistency questions that arrive with it.
  3. Deployments orchestrated across nodes instead of one systemd unit.
  4. Observability that correlates requests across machines, the line item that jumps from ₹20,000 to ₹120,000 per month in the TCO table below.
  5. An on-call rotation sized for a system that can partially fail.

A reasonable rule of thumb, calibrated against several Bengaluru fintech and OTT teams: distribution starts paying for itself only when you would otherwise need at least the second-from-top SKU and you have a credible plan to use the redundancy for availability, not just for capacity. Below that threshold, the engineering and operational tax exceeds the infra savings, and you are paying for a distributed system to feel modern.

# total_cost_of_ownership.py — distributed vs single-box, fully loaded
def annual_tco(sku_cost_inr_month, n_replicas, ops_engineers, observability_inr_month):
    """All-in cost in ₹ crore for a service running for a year."""
    infra = sku_cost_inr_month * n_replicas * 12
    eng = ops_engineers * 50_00_000  # ₹50L fully loaded per Bengaluru SRE
    obs = observability_inr_month * 12
    return (infra + eng + obs) / 1_00_00_000  # in ₹ crore

scenarios = [
    # name,                      sku ₹/month, replicas, SREs, obs ₹/month
    ("single r6i.16xlarge",         180_000, 1, 0.5,  20_000),
    ("3× r6i.4xlarge cluster",       45_000, 3, 1.5, 120_000),
    ("3× r6i.16xlarge cluster",     180_000, 3, 2.0, 250_000),
    ("single r6i.32xlarge",         415_000, 1, 0.7,  30_000),
]
print(f"{'scenario':32} {'infra':>10} {'eng':>10} {'obs':>10} {'TCO ₹cr':>10}")
print("-" * 80)
for name, cost, n, eng, obs in scenarios:
    tco = annual_tco(cost, n, eng, obs)
    infra_cr = cost * n * 12 / 1_00_00_000
    print(f"{name:32} {infra_cr:>9.2f} {eng*0.5:>9.2f} {obs*12/1_00_00_000:>10.3f} {tco:>10.2f}")

Sample run:

scenario                              infra        eng        obs    TCO ₹cr
--------------------------------------------------------------------------------
single r6i.16xlarge                    0.22       0.25      0.024       0.49
3× r6i.4xlarge cluster                 0.16       0.75      0.144       1.06
3× r6i.16xlarge cluster                0.65       1.00      0.300       1.95
single r6i.32xlarge                    0.50       0.35      0.036       0.88

Read across the rows. The single r6i.16xlarge at ₹0.49 crore/year is the cheapest of the four — and would in fact serve PaySetu's current load with 30% headroom. The 3× r6i.4xlarge cluster has lower infrastructure cost (₹0.16 cr vs ₹0.22 cr) but more than double the total cost, because the on-call rotation jumped from 0.5 engineers to 1.5. Distribution lost on TCO at this scale, even though it won on infra. The 3-node cluster at ₹1.06 cr costs only about 20% more than the single r6i.32xlarge at ₹0.88 cr, which is the inflection point — at that level of infra spend the redundancy stops being an obvious loss. Above it (the 3× r6i.16xlarge row at ₹1.95 cr), distribution wins decisively, but only because the workload genuinely needs that throughput.

Why TCO is the right denominator for this decision: infrastructure cost shows up in finance dashboards and is easy to pattern-match against; engineering cost is invisible until the headcount conversation, and observability cost is hidden inside the platform-team budget. Teams that compare only the infra column under-count the real cost of distribution by ~70% at the 3-replica scale and ~40% at the 12-replica scale. The ratio improves with scale, which is exactly why distribution wins for big systems and loses for small ones.

The honest decision tree — should you distribute?

Boil the chapter down to a checklist that maps onto the three cliffs and the TCO math. Run it before any meeting where the words "we should microservice this" appear:

  1. Is your hot working set bigger than the largest single VM? If yes, you have hit Cliff 2 and you must shard. There is no room to argue.
  2. Is your sustained throughput within 5× of the largest single VM's NIC, NVMe, or vCPU envelope? If yes, Cliff 2 again — distribute now or you will run out of room within one growth cycle.
  3. Is your contractual or business-criticality uptime requirement above 99.9%? If yes, Cliff 3 — replicate across failure domains, even if a single instance has the throughput.
  4. Are you past the price knee — paying more for one big box than for a redundant cluster of the size below it (one r6i.32xlarge at ₹415,000 vs two r6i.16xlarge at ₹360,000)? If yes, Cliff 1 — go horizontal for cost. Verify the engineering bandwidth is there before pulling the trigger.
  5. None of the above? Stay on a single big box with a warm standby for failover. Spend your engineering budget on the application, not the platform. The single-box answer remains correct for far more services than current fashion suggests.

The decision is not "monolith bad, distributed good." It is "what cliff are you actually closest to, and does crossing it pay for the engineering it costs?"
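
The checklist compresses to a function. A sketch, not a policy: the envelope defaults reuse this chapter's illustrative numbers, and the price-knee check (item 4) is left to vm_economics.py above because it needs current SKU prices.

# should_we_distribute.py — the checklist above as code
def should_distribute(working_set_gb, peak_rps, sla_pct,
                      box_ram_gb=1024, box_rps=1_200_000):
    if working_set_gb > box_ram_gb:
        return "Cliff 2: shard, the hot set does not fit any single box"
    if peak_rps * 5 > box_rps:
        return "Cliff 2: distribute now, you are within 5× of the ceiling"
    if sla_pct > 99.9:
        return "Cliff 3: replicate across failure domains"
    return "Stay on one big box with a warm standby"

# PaySetu's shape: the load fits, but the 99.99% SLA forces replication
print(should_distribute(working_set_gb=200, peak_rps=20_000, sla_pct=99.99))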

Common confusions

  1. "We have a scaling problem" when the box is nowhere near its ceiling. That is usually a queueing or bottleneck problem, and more boxes will not fix it (see "Going deeper" below).
  2. Reaching for sharding when the problem is availability, or for replication when the problem is throughput. Each cliff demands a different primitive.
  3. Reading the infra line as the whole bill. At small replica counts the engineering and observability tax, not the instances, dominates the TCO.
  4. Treating "distributed" as a synonym for "modern". The single-box answer remains correct for far more services than current fashion suggests.

Going deeper

The Spanner / Dynamo dichotomy as an economic argument

Two of the most-cited distributed databases — Google's Spanner and Amazon's Dynamo — sit on opposite sides of an economic argument that this chapter sets up. Spanner spends engineering money (TrueTime hardware, atomic clocks, 2PC across regions) to give you strong consistency at global scale; Dynamo spends correctness money (eventual consistency, conflict resolution at the application layer) to give you availability and low write latency. Both are responses to the same observation: at sufficient scale, something has to give. Spanner gives up cost (custom hardware, expensive coordination); Dynamo gives up the easy mental model. The choice between them is a TCO calculation specific to your workload, not a religious one. Part 12 (consistency models) and Part 14 (distributed transactions) make this concrete.

Why most "scaling problems" are actually queueing problems

A surprisingly common pattern: a team hits "performance issues" at, say, 2000 RPS on a box that can do 10,000 RPS, and concludes they need to distribute. The actual cause is almost always one of: a single hot row in Postgres causing lock contention, a synchronous external API call holding a thread for 200 ms, or a thread pool sized too small for the connection pool. None of these are solved by adding more boxes — adding boxes to a queueing problem just gives you the same queue stretched across more machines. Build a proper load-test that drives the single box to its CPU/IO/network ceiling and then decide; the fix is usually a thread-pool tweak or a SQL-level fix, not a Kubernetes cluster. This is the "scalability is a property of the bottleneck, not the system" lesson that Brendan Gregg's USE method — covered in systems-performance: the 30-year arc — formalises.
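
Little's law makes that diagnosis quantitative. A sketch with invented numbers (a 200 ms synchronous call and a 200-thread pool, neither from a real system):

# queueing_not_scaling.py — a big box choking far below its ceiling
# Little's law: concurrency in flight = throughput × time per request.
external_call_s = 0.200     # synchronous downstream call holding a thread
pool_threads = 200          # worker-pool size (illustrative)

pool_cap_rps = pool_threads / external_call_s
print(f"thread pool caps throughput at {pool_cap_rps:,.0f} req/s")     # 1,000

threads_for_2k = 2_000 * external_call_s
print(f"threads needed to sustain 2,000 req/s: {threads_for_2k:.0f}")  # 400
# The CPU could do 10,000 req/s; the pool says 1,000. Adding boxes
# multiplies the pools, but one config change does the same for less.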

The CricStream economics — why availability sometimes overrides cost

CricStream's peak — 48M concurrent during a final — requires distribution for capacity. But the more interesting economic argument is that CricStream pays for redundant capacity that is unused 360 days of the year because the 5 days that matter justify the spend. The marginal cost of a missed match is brand-destroying; the marginal cost of two extra AZs sitting idle for 360 days is a rounding error in a sports-rights budget that runs into the thousands of crore. Capacity that is "wasted" 99% of the time can still be the correct economic choice when the cost of missing peak is non-linear. The framework for thinking about this is the cost-of-failure-times-probability-of-failure product, which dominates the static-utilisation argument once the failure cost gets large.
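
That product is worth writing out once, with every input flagged as an assumption; none of these numbers are CricStream's:

# peak_insurance.py — expected failure cost vs idle-capacity cost
idle_capacity_cr = 5.0    # two extra AZs idle ~360 days/year, ₹ crore (assumed)
p_miss_peak = 0.2         # chance of dropping a final without them (assumed)
miss_cost_cr = 500.0      # brand and rights damage of a dropped final (assumed)

expected_loss_cr = p_miss_peak * miss_cost_cr
print(f"expected loss ₹{expected_loss_cr:.0f} cr vs idle capacity ₹{idle_capacity_cr:.0f} cr")
# Redundancy wins whenever p(miss) × cost(miss) > cost(idle).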

Reproduce this on your laptop

Confirm the price-per-vCPU knee on whatever cloud you actually use — the SKU prices change, the shape does not. The script below reproduces the analysis with current AWS list prices.

# Reproduce the cliff analysis on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install requests
# Pull current AWS instance pricing for the r6i family in ap-south-1
python3 - <<'EOF'
import requests

# Regional EC2 price list for ap-south-1 — a large file, allow a slow download
url = "https://pricing.us-east-1.amazonaws.com/offers/v1.0/aws/AmazonEC2/current/ap-south-1/index.json"
data = requests.get(url, timeout=300).json()

rows = []
for sku, p in data["products"].items():
    a = p.get("attributes", {})
    if (a.get("instanceType", "").startswith("r6i.")
            and a.get("operatingSystem") == "Linux"
            and a.get("tenancy") == "Shared"
            and a.get("preInstalledSw") == "NA"
            and a.get("capacitystatus") == "Used"):
        # Walk the OnDemand terms for this SKU to find the hourly USD rate
        for term in data["terms"]["OnDemand"].get(sku, {}).values():
            for dim in term["priceDimensions"].values():
                usd = float(dim["pricePerUnit"]["USD"])
                if usd > 0:
                    rows.append((a["instanceType"], int(a["vcpu"]), usd))

for itype, vcpu, usd in sorted(rows, key=lambda r: r[1]):
    print(f"{itype:16} {vcpu:>4} vCPU  ${usd:.3f}/hr  ${usd / vcpu:.4f}/vCPU-hr")
EOF

For the latency-and-throughput half of the argument, run a single-box load test with wrk2 against a trivial Python service, then compare against a 3-node fleet behind a haproxy. The single-box result is almost always within 10% of the 3-node result on p99 latency until you actually saturate the single box's CPU. That measurement, on your own hardware, is the most useful thing you can do before any distributed-systems decision.
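
A sketch of that comparison, assuming a service on :8080 and an haproxy on :9000 fronting three replicas; ports and paths are placeholders, and the binary built from the wrk2 fork is named wrk (its -R flag fixes the offered rate, which is what makes the latency comparison honest):

# same offered rate against one box, then against the fleet
wrk -t8 -c256 -d60s -R20000 --latency http://127.0.0.1:8080/ping
wrk -t8 -c256 -d60s -R20000 --latency http://127.0.0.1:9000/ping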

Where this leads next

The next four chapters take the cliffs above and walk past them in order. Each chapter introduces the primitive that the cliff demands.

By the end of Part 1 (chapter 11) you will have a precise vocabulary for when a system needs to be distributed and what the cost of distribution is in that specific case. From Part 2 onwards the question shifts from "should I distribute?" to "how does distribution actually break, and what do I do about it?".

References

  1. Designing Data-Intensive Applications — Martin Kleppmann, O'Reilly 2017. Chapters 1 and 2 are the canonical introduction to scaling, reliability, and the trade-offs that motivate distribution.
  2. The Tail at Scale — Jeff Dean, Luiz André Barroso, CACM 2013. The paper that formalised why latency tails get worse as you fan out across more machines, and why "average" latency is a misleading number for fanout systems.
  3. Scalability! But at what COST? — Frank McSherry, Michael Isard, Derek G. Murray, HotOS 2015. The benchmark study that showed many distributed systems papers compared against artificially weak single-machine baselines and were not actually faster than a laptop.
  4. Stack Overflow: How We Do Deployment — 2016 Edition — Nick Craver. Documents the famously vertical Stack Overflow architecture: how to serve a top-100 web property from a small fleet of large servers.
  5. Spanner: Google's Globally-Distributed Database — James C. Corbett et al., OSDI 2012. The TrueTime-and-2PC argument for spending engineering money to keep strong consistency at global scale.
  6. Dynamo: Amazon's Highly Available Key-Value Store — Giuseppe DeCandia et al., SOSP 2007. The other side of the dichotomy — eventual consistency in exchange for availability and low write latency.
  7. The 30-year arc of systems performance — internal cross-link. The single-box performance envelope that this chapter takes as its baseline is unpacked era by era there.