Capacity at 99.99%

Kiran runs the SLO review for Razorpay's UPI payment-collect API. The product team wants to publish "99.99% availability" on the merchant dashboard. The current measured availability over the last 90 days is 99.96% — close, but not there. Kiran does the arithmetic on a notepad: 99.96% means 52 minutes of downtime in 90 days; 99.99% means 13 minutes. The gap is 39 minutes of downtime to remove. He pulls the incident log: 4 incidents in 90 days, average 13 minutes each. To go from three-and-a-half nines to four nines, he must either prevent every incident (impossible — the upstream NPCI switch had two of them) or cut every incident's mean-time-to-recover (MTTR) by 75%. And the merchant SLO is the user-perceived number — it must include the time when NPCI was down, when the bank was down, when the merchant's webhook endpoint refused the callback. Each of those dependencies is independently advertising 99.95% at best. The product team's "99.99%" is a number that the underlying physics does not allow you to ship.

The cost of reliability scales by orders of magnitude per nine. Going from 99% to 99.9% is mostly engineering discipline; from 99.9% to 99.99% requires removing single points of failure and automating failover so that recovery no longer waits on a human; from 99.99% to 99.999% requires multi-region active-active replication and chaos-tested degraded modes. The composition rule is brutal: a service depending on N independent components each at availability A has effective availability close to A^N — four nines on top of four 99.95% dependencies is mathematically impossible without redundancy.

What "four nines" buys you, and the budget you spend it from

Availability is conventionally written as a count of nines — 99%, 99.9%, 99.99%, 99.999%. Each additional nine cuts the allowed downtime by 10×, which is easier to internalise as a duration than as a percentage. The table is worth memorising because it is the unit you negotiate SLOs in:

Availability            Downtime / year   Downtime / quarter   Downtime / 30 days   Downtime / week
99%     (two nines)     3 days 15 h       21 h 36 m            7 h 12 m             1 h 41 m
99.9%   (three nines)   8 h 46 m          2 h 11 m             43 m 12 s            10 m 5 s
99.95%                  4 h 23 m          1 h 6 m              21 m 36 s            5 m 2 s
99.99%  (four nines)    52 m 36 s         13 m 9 s             4 m 19 s             1 m 0 s
99.999% (five nines)    5 m 16 s          1 m 19 s             26 s                 6 s

The budget framing is the one Google's SRE book made canonical. If your SLO is 99.99% over 30 days, you have 4 minutes 19 seconds of "error budget" to spend in that window — on planned maintenance, on bad deploys, on cosmic rays, on an upstream provider's incident, on the on-caller's hands shaking at 03:00. Spend the budget faster than it accrues and you have to stop shipping changes; under-spend it and you are over-engineering reliability that the user is not paying for. Why error budgets work organisationally: they convert "is reliability a problem?" from a subjective negotiation between SRE and product into an arithmetic question with one number per quarter. If the budget is 4 m 19 s and you have already burned 6 minutes on a bad migration, the deploy freeze is not an SRE political win — it is the math.
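
The bookkeeping is small enough to keep in a script. A minimal sketch, assuming a 30-day window and a freeze-while-negative policy (the 6-minute spend and the policy are illustrative, not a prescribed rule):

# error_budget.py - error-budget bookkeeping for a 30-day 99.99% SLO (illustrative).
# Run: python3 error_budget.py
WINDOW_MIN = 30 * 24 * 60        # 43,200 minutes in the 30-day window
SLO = 0.9999

budget_min = (1.0 - SLO) * WINDOW_MIN      # 4.32 min, i.e. the 4 m 19 s above
spent_min = 6.0                            # e.g. the bad migration above

print(f"budget    : {budget_min:.2f} min")
print(f"spent     : {spent_min:.2f} min")
print(f"remaining : {budget_min - spent_min:.2f} min")
# Assumed policy: feature deploys freeze while the remaining budget is negative,
# and resume once the incident rolls out of the 30-day window.
if spent_min > budget_min:
    print("budget exhausted -> freeze feature deploys")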

The Indian payment-systems context makes the numbers concrete. A UPI collect API handling roughly 1,900 payments a minute at peak (about 31 per second) loses all of them while it is down. A 4-minute outage at peak is ~7,500 failed payments — at an average ticket of ₹420, that is ₹31.5 lakh of merchant revenue stranded for the duration, plus the customer-support cost of explaining "your money is safe, the merchant just didn't get the callback yet". RBI's framework for payment-system operators (under the Payment and Settlement Systems Act) requires participants to publish operational availability and report breaches — for prepaid payment instruments and UPI participants the de-facto bar has converged on 99.95%–99.99%, and breaches above the published number trigger written explanation to the supervisor. The number is not a marketing claim; it is a regulatory contract.

[Figure: allowed downtime per year vs availability target, log scale, with the typical engineering cost multiplier per nine. 99% ≈ 3.6 d/yr (monitoring); 99.9% ≈ 8.7 h (retries, circuit breakers); 99.99% ≈ 52 m (multi-AZ, automated failover); 99.999% ≈ 5 m (multi-region active-active). Callout on the 99.99% jump: single AZ → multi-AZ active/standby, human MTTR → automated failover, RTO ≤ 60 s, RPO ≤ 5 s. Illustrative; costs vary by stack.]
The downtime budget collapses by 10× per nine, but the engineering investment per nine grows roughly 10× too — driven by the architectural change required (no SPOF, then automated failover, then geographic redundancy). Illustrative; the exact slope depends on your starting point.

The underrated insight is that not all nines are equal. A service that achieves 99.99% by being 100% available all year except one 53-minute outage fails its users very differently from one that achieves the same number through 106 brief 30-second blips. The first looks great in the SLO dashboard and is catastrophic for the user who hit it; the second is annoying but tolerable. The fix is to pair the availability SLO with a maximum incident duration SLO — Razorpay's internal target, for example, is "99.99% availability and no incident longer than 10 minutes". The second clause is what forces investment in automated remediation rather than fast on-callers.
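
A quick sketch of that distinction, with made-up incident lists that sum to the same 53 minutes of annual downtime:

# incident_shape.py - same availability number, very different incident shape (illustrative).
# Run: python3 incident_shape.py
YEAR_MIN = 525_600

profiles = [
    ("one 53-minute outage", [53.0]),
    ("106 thirty-second blips", [0.5] * 106),
]
for name, incidents in profiles:
    downtime = sum(incidents)                      # 53 minutes in both cases
    availability = 1.0 - downtime / YEAR_MIN
    print(f"{name:24s}: {availability*100:.4f}%  "
          f"(longest incident = {max(incidents):.1f} min)")
# Both print 99.9899%; only a maximum-incident-duration SLO tells them apart.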

The composition rule — why 99.99% on top of 99.95% does not exist

A service is rarely a single binary that fails or succeeds in isolation. It is a chain of components, each with its own availability number — your code, the database, the cache, the message queue, the upstream payment gateway, the merchant's webhook URL, the cloud provider's load balancer, the regional DNS resolver. If a single user request requires N components to be available, and each component has independent availability A_i, the end-to-end availability is the product:

A_total = A_1 × A_2 × A_3 × ... × A_N

A request path that touches your service (99.99%), your database (99.95%), your cache (99.9%), the upstream UPI switch (99.95%), and the bank's auth server (99.9%) has an effective availability of:

0.9999 × 0.9995 × 0.999 × 0.9995 × 0.999 = 0.9969

That is 99.69%, or about 27 hours of downtime per year. Your service can be operationally perfect and still be perceived as a sub-three-nines product, because the user does not distinguish "UPI was down" from "Razorpay was down" — they tap a button and either a payment confirms or it doesn't. Why this multiplication rule is the most expensive lesson in capacity planning: every team building a four-nines service eventually discovers that they are not the bottleneck. The math forces a different question — not "how do I make my service four nines?" but "how do I make the composed system four nines, given that some components are fundamentally three nines?". The answer is always redundancy: parallel paths whose unavailabilities multiply, so the combined chance of every path being down at once shrinks instead of growing.

# availability_compose.py — show how much serial composition costs vs parallel redundancy.
# Run: python3 availability_compose.py

def serial(*components):
    """Availability of N components in series (all must be up)."""
    p = 1.0
    for a in components:
        p *= a
    return p

def parallel(*components):
    """Availability of N components in parallel (any one up is enough)."""
    p_all_down = 1.0
    for a in components:
        p_all_down *= (1.0 - a)
    return 1.0 - p_all_down

def downtime_minutes_per_year(availability):
    return (1.0 - availability) * 365 * 24 * 60

# Razorpay UPI collect path — 5 serial dependencies
your_service   = 0.9999   # four nines on your code
your_db        = 0.9995
your_cache     = 0.999
upi_switch     = 0.9995   # NPCI's internal SLA for the UPI switch
bank_auth      = 0.999

end_to_end = serial(your_service, your_db, your_cache, upi_switch, bank_auth)
print(f"End-to-end (serial)   : {end_to_end*100:.4f}%  "
      f"({downtime_minutes_per_year(end_to_end):.0f} min/yr down)")

# Now make the cache redundant: 3 cache replicas in parallel, each 99.9%
cache_redundant = parallel(0.999, 0.999, 0.999)
better = serial(your_service, your_db, cache_redundant, upi_switch, bank_auth)
print(f"With redundant cache  : {better*100:.4f}%      "
      f"({downtime_minutes_per_year(better):.0f} min/yr down)")

# Now also replicate UPI switch path — Razorpay maintains parallel routes
# to two PSPs (e.g., Yes Bank + ICICI), each 99.95%.
upi_redundant = parallel(0.9995, 0.9995)
best = serial(your_service, your_db, cache_redundant, upi_redundant, bank_auth)
print(f"With redundant UPI    : {best*100:.4f}%      "
      f"({downtime_minutes_per_year(best):.0f} min/yr down)")

# Where is the bottleneck now?
contributors = [
    ("your_service", your_service),
    ("your_db",      your_db),
    ("cache_set",    cache_redundant),
    ("upi_set",      upi_redundant),
    ("bank_auth",    bank_auth),
]
print("\nDowntime contribution (min/yr) per component:")
for name, a in contributors:
    print(f"  {name:14s}: {downtime_minutes_per_year(a):7.1f}")
# Output:
End-to-end (serial)   : 99.6904%  (1627 min/yr down)
With redundant cache  : 99.7901%      (1103 min/yr down)
With redundant UPI    : 99.8400%      (841 min/yr down)

Downtime contribution (min/yr) per component:
  your_service  :    52.6
  your_db       :   262.8
  cache_set     :     0.0
  upi_set       :     0.1
  bank_auth     :   525.6

Walk through the lines that matter:

  • serial(...) line: confirms the brutal reality — five components averaging 99.94% individual availability compose to 99.69% end-to-end, short of even three nines, no matter how good your code is.
  • parallel(0.999, 0.999, 0.999) line: three independent cache replicas at 99.9% each give you 99.9999999% as a set — nine nines from triple redundancy on a three-nines component, if the failures are truly independent. This is a large part of why replicated stores like Redis Cluster, Cassandra, and Aerospike exist.
  • upi_redundant line: maintaining parallel routes to two PSPs is a real Razorpay engineering investment — it is what lets the merchant SLO survive a Yes Bank outage without a postmortem.
  • The contribution table: shows that after redundancy, the bottleneck is bank_auth at 525 minutes per year. The math is now telling you exactly where to spend the next quarter of engineering time — not on your own service.

The independence assumption is where this calculation lies to you most often. If your three cache replicas are in the same availability zone and the AZ goes down, they all go down together — the parallel formula's (1-A)^N term is fiction. Why correlated failures are the asymmetric risk in availability math: independence pushes the failure probability down to the product of the failure rates (very small), but correlation pushes it back up to the maximum of the failure rates (much larger). A 99.99% calculation that assumes independent failures and then encounters a correlated event — a Mumbai region outage that takes out all three replicas — sees its real availability collapse from "9 nines" to "3 nines" overnight. Every redundancy investment must be paired with a correlation audit: "what single fault would take out more than one of these?"
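
A sketch of what the independence assumption is worth, assuming a 99.95% availability for the shared zone itself (an illustrative figure, not a measured one):

# shared_fate.py - independent replicas vs replicas sharing one availability zone (illustrative).
# Run: python3 shared_fate.py
YEAR_MIN = 525_600
replica = 0.999          # each cache replica on its own
az      = 0.9995         # assumed availability of a single AZ

independent = 1.0 - (1.0 - replica) ** 3             # three replicas, three AZs
shared_fate = az * (1.0 - (1.0 - replica) ** 3)      # three replicas in one AZ: the set
                                                     # is down whenever the AZ is down

for name, a in [("3 replicas, 3 AZs", independent),
                ("3 replicas, 1 AZ ", shared_fate)]:
    print(f"{name}: {a*100:.7f}%  (~{(1.0 - a) * YEAR_MIN:.1f} min/yr down)")
# Independent: nine nines, ~0.0 min/yr. Shared fate: capped by the AZ itself at
# ~263 min/yr, no matter how many replicas sit inside it.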

Designing for the four-nines budget — RTO, RPO, and the failover economy

Once the composition math is honest, the engineering question is concrete: how do you shrink each individual component's downtime contribution? The classic SRE framing is two numbers per component:

  • RTO (Recovery Time Objective): the maximum acceptable time from "this component just failed" to "it is serving requests again". Four nines requires RTO ≤ 60 seconds for the components on the hot path.
  • RPO (Recovery Point Objective): the maximum acceptable amount of data loss measured in time. For a payments system, RPO = 0 (you cannot lose any committed transaction); for a clickstream analytics pipeline, RPO = 5 minutes is fine.

The 60-second RTO is what forces the architectural change between three nines and four. With a 60-second RTO budget, you cannot wait for a human to wake up, log in, diagnose, and act — the failover must be automated and tested. With a 600-second RTO budget you can. This is the single biggest cost driver in the three-to-four nines jump.

# rto_budget.py — how RTO and incident frequency combine to produce an availability number.
# Run: python3 rto_budget.py

MINUTES_PER_YEAR = 525_600

def availability_from(incidents_per_year, mean_rto_minutes):
    downtime = incidents_per_year * mean_rto_minutes
    return 1.0 - downtime / MINUTES_PER_YEAR

# Scenario A: human-paged failover, 4 incidents/year, MTTR = 25 minutes (page → ack
#   → diagnose → act → verify). Realistic for a well-run team without automation.
print("Human-paged failover:",
      f"{availability_from(4, 25)*100:.4f}% "
      f"(downtime: {4*25} min/yr)")

# Scenario B: automated failover, 4 incidents/year, MTTR = 60 seconds (detection
#   + DNS or LB swap + warm replica takes traffic). Requires runbook, health checks,
#   and chaos-tested standby.
print("Automated failover  :",
      f"{availability_from(4, 1)*100:.4f}% "
      f"(downtime: {4*1} min/yr)")

# Scenario C: active-active, 4 incidents/year, MTTR = 5 seconds (no failover at
#   all — the surviving region absorbs traffic immediately).
print("Active-active       :",
      f"{availability_from(4, 5/60)*100:.4f}% "
      f"(downtime: {4*5/60:.2f} min/yr)")

# What does the same RTO budget afford in incident frequency?
print("\nIf MTTR = 1 min, how many incidents fit in a 99.99% budget?")
budget_min = (1 - 0.9999) * MINUTES_PER_YEAR
print(f"  Budget = {budget_min:.1f} min/yr; allowed incidents = {budget_min/1:.0f}")
print("If MTTR = 25 min:")
print(f"  Allowed incidents = {budget_min/25:.1f}  "
      "(less than three — and you need spare budget for the long tail)")
# Output:
Human-paged failover: 99.9810% (downtime: 100 min/yr)
Automated failover  : 99.9992% (downtime: 4 min/yr)
Active-active       : 99.9999% (downtime: 0.33 min/yr)

If MTTR = 1 min, how many incidents fit in a 99.99% budget?
  Budget = 52.6 min/yr; allowed incidents = 53
If MTTR = 25 min:
  Allowed incidents = 2.1  (less than three — and you need spare budget for the long tail)

The pattern visible in the output is the heart of the chapter: at 99.99%, you have either a high-MTTR-low-incident-count world (two incidents a year, both resolved fast by hand) or a low-MTTR-high-incident-count world (fifty short blips a year, each absorbed by automation). The middle — twenty incidents a year of moderate length — does not fit the budget. The architectural choice between these two worlds is downstream of the failure modes you actually see in production. A service with mostly hardware-failure incidents (a node dies, traffic moves) wants fast automated failover. A service with mostly logic-bug incidents (a bad deploy degrades p99) wants fast human rollback. Most real services have both, which is why every four-nines architecture ends up with both automated failover and a one-click rollback button in the deploy tool.

[Figure: RTO and incident-frequency trade-off within the 99.99% downtime budget. Mean recovery time per incident (log x-axis) vs incidents per year (log y-axis); the 52 min/yr budget is a hyperbola, with the over-budget region above the curve. Marked operating points: human-paged failover (25 min MTTR, ~2 incidents/yr), automated failover (1 min MTTR, ~50/yr), active-active (5 s MTTR, ~600/yr). Illustrative, not measured data.]
Inside the 99.99% downtime budget, you trade MTTR for incident frequency along a hyperbola. The architectural pattern that wins depends on which axis your real failures cluster on. A service with frequent transient failures wants active-active; one with rare but disruptive failures wants automated failover. Illustrative.

Why the active-active corner is so seductive and so expensive: it lets you absorb hundreds of small failures per year without consuming the user-visible budget, because no individual failure causes a perceptible outage. The cost is a doubling (or more) of the always-on infrastructure spend, plus the engineering tax of writing every piece of stateful logic to be conflict-free across regions — which for payments often means picking eventual consistency where the business wants strong consistency, or paying the latency tax of a synchronous cross-region write. Hotstar runs active-active across two AWS regions for IPL streaming because the workload is read-heavy and tolerates eventual consistency on user-state. A core ledger service often cannot make that trade and therefore cannot get to five nines no matter what it spends.

Edge cases — where the four-nines model breaks

The arithmetic in the previous sections assumes a well-behaved world: failures are independent, the SLO window is fixed, the user's request path is known. Each of those assumptions has a production failure mode where the model produces a number that is technically correct and operationally misleading.

The single-incident-eats-the-quarter problem. A service that has been at 100% for 89 days takes a single 60-minute outage on day 90. The 90-day rolling availability number is (90 × 1440 - 60) / (90 × 1440) = 99.954% — a miss against the four-nines target. Now the team enters a deploy freeze for the next three weeks while the budget recovers, even though the underlying system is healthier than it has been all quarter. The fix is not to game the SLO window but to pair the rolling availability number with an event count SLO ("no more than 2 incidents > 10 minutes per quarter") so that one bad event does not stall the next quarter of work. Razorpay's internal practice is to publish three numbers — rolling availability, incident count, and the longest single incident — and gate deploys on whichever is most binding rather than on availability alone.
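
A sketch of the arithmetic, using the window and outage length from the example above; the incident-count gate in the final comment is the assumed complementary SLO:

# one_bad_day.py - a single 60-minute outage against a 90-day rolling 99.99% SLO (illustrative).
# Run: python3 one_bad_day.py
WINDOW_MIN = 90 * 24 * 60          # 129,600 minutes in the rolling window
TARGET = 0.9999
outage_min = 60

availability = (WINDOW_MIN - outage_min) / WINDOW_MIN
budget_min = (1.0 - TARGET) * WINDOW_MIN

print(f"rolling availability : {availability*100:.3f}%")           # 99.954%
print(f"budget               : {budget_min:.1f} min, spent {outage_min} min "
      f"(over by {outage_min - budget_min:.1f} min)")
# The single event stays inside the rolling window until it ages out, so an
# availability-only deploy gate would stay red long after the system is healthy;
# an incident-count SLO ("<= 2 incidents > 10 min per quarter") is still green.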

Cold dependencies that look hot. Your SLO calculation lists Postgres as a dependency at 99.99%. The Postgres cluster has not failed in 18 months. Your math says it contributes ~52 min/yr of expected downtime. Then a once-a-decade event happens — a region-wide power loss, a cascading kernel bug, a cosmic-ray-induced memory error in the leader — and the cluster is down for 6 hours. The "99.99%" number was calibrated on a sample size that was too small to see the long-tail event; the true availability of a single Postgres cluster over a 10-year window is closer to 99.9%. Four-nines design must explicitly account for the components whose failures are rare-but-catastrophic, by replicating across fault domains even when the historical data does not "justify" it. The historical data is the part of the distribution you have already measured; the engineering job is to defend against the part you have not.

The SLO that targets a metric the user cannot perceive. A team publishes "p50 latency < 200ms at 99.99%". The user's experience is dominated by the p99 in the long tail of the request distribution, not the median. The 99.99% bound on p50 can be perfectly met while the p99 doubles every Tatkal hour and the user calls support. SLOs must target the percentile the user actually feels — for an API behind a web client, that is usually p95 or p99; for a batch job, it is the wall-clock completion time; for a payment, it is the success rate of the transaction, not the latency of the API call. Pick the metric the user can describe in one sentence at a customer-support desk, and write the SLO against that.

Common confusions

  • "99.99% means the service is up 99.99% of the time" Only if you measure availability as time-up vs time-total. The meaningful definition for users is requests-succeeded vs requests-attempted, weighted by traffic. A 30-minute outage at 03:00 IST when traffic is 5% of peak is a different SLO event from a 30-minute outage at 10:00 during Tatkal — the request-weighted definition charges the second one ~20× more. Pick request-weighted SLOs for any system whose traffic is not flat.
  • "Adding a hot standby gives you four nines" A hot standby reduces the probability that the primary's failure causes user-visible downtime, but only if the failover itself is fast, tested, and does not introduce its own failures. Untested failovers fail at the rate of both the primary and the failover machinery — sometimes worse than no standby, because you also added a new failure mode (split-brain, stale-data takeover, partial-failover deadlock).
  • "Our SLO is 99.99% so we should never have downtime" The SLO is a budget, not a target. A team that never spends its error budget is over-investing in reliability and under-investing in features; a team that always exhausts it is under-investing in reliability. Healthy operation is to land somewhere between 30% and 80% budget consumption per quarter.
  • "99.99% on the API means 99.99% for the user" The user's experience is the composition of every component on their request path, not just yours. A merchant calling Razorpay's UPI collect API depends on Razorpay, NPCI's UPI switch, the payer's bank, the payee's bank, the merchant's own webhook receiver, and (usually) DNS, the CDN, and the LB. Compose the chain before you make the user-facing claim.
  • "Automated failover is always better than manual" Automated failover is faster but introduces its own bugs — a misconfigured health check that flaps causes more downtime than the failure it was meant to protect against. Many four-nines systems use gated automation: detect automatically, propose the action, but require a human ack within 5 minutes for cross-region failovers. This trades a few extra minutes of MTTR for protection against the runaway-automation incident.
  • "Five nines is just four nines with more effort" Going from four to five nines requires architectural changes that four nines does not — typically multi-region active-active with conflict-free replicated state. The cost is not 10× the engineering of four nines; it is closer to a different product, with different consistency guarantees and different latency profiles. Many products that need five nines on availability accept that the consistency contract weakens in exchange.

Going deeper

Burn-rate alerts and the multi-window SLO

Naive SLO alerting fires when the budget is exhausted — by which time the incident is over and the alert is useless. The pattern Google's SRE book popularised is burn-rate alerting: alert when the budget is being consumed at a rate that would exhaust the entire window's budget early. A service with a 30-day 99.99% SLO has 4 m 19 s of budget; a sustained burn rate of 14× would consume the whole month's budget in just over two days, which is a paging event right now. Implement two windows simultaneously — a fast window (5 minutes, alerts on a 14× burn rate) catches sharp incidents; a slow window (1 hour, alerts on a 6× burn rate) catches sustained degradations that the fast window misses because they sit just under its threshold. The combination of fast-and-slow windows is what makes burn-rate alerts both timely and not-too-noisy. The Razorpay/Hotstar reference implementations use 4 windows (5 m, 30 m, 2 h, 6 h) at different burn-rate thresholds — Google's "Multi-Window, Multi-Burn-Rate Alerts" doc is the canonical reference.
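
A sketch of the two-window check, with made-up request counts; the 14× and 6× thresholds are the ones from the paragraph above, not a library default:

# burn_rate.py - fast/slow burn-rate check for a 30-day 99.99% SLO (illustrative).
# Run: python3 burn_rate.py
ALLOWED_ERROR_RATE = 1.0 - 0.9999          # 0.01% of requests may fail

def burn_rate(bad, total):
    """How many times faster than the SLO allows the budget is being consumed."""
    return (bad / total) / ALLOWED_ERROR_RATE

# (window, bad requests, total requests, paging threshold)
windows = [
    ("5-minute fast window", 120, 60_000, 14.0),
    ("1-hour slow window  ", 300, 720_000, 6.0),
]
for name, bad, total, threshold in windows:
    rate = burn_rate(bad, total)
    verdict = "PAGE" if rate >= threshold else "ok"
    print(f"{name}: burn rate {rate:5.1f}x (threshold {threshold:.0f}x) -> {verdict}")
# Here a sharp spike trips the fast window (20.0x) while the slow window (4.2x)
# stays quiet; together the two windows page on spikes and on sustained burns
# without paging on every blip.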

Correlated failures and the "shared fate" audit

The composition formula parallel(A, A, A) only holds if the three components fail independently. In practice, "independence" is a property you have to engineer, not assume. The Indian fintech war story: a payments company ran three Postgres replicas across three AZs in ap-south-1. The official availability calculation said "we can lose any AZ and stay up." On 2022-06-22 a region-wide AWS networking event took out routing between all three AZs simultaneously — the replicas were "up" but unreachable from the API tier. The near-perfect availability the parallel math predicted became "four hours of downtime in one event" because the failures correlated through a shared network fabric. The fix was to add a fourth replica in ap-southeast-1 (Singapore), introducing 90 ms of cross-region latency on the failover path but breaking the correlation. The post-incident SLO model now lists every shared dependency (region, AZ, network fabric, IAM, DNS provider, even the Terraform pipeline that deploys the replicas) and demands written justification for any pair of replicas sharing one. The audit is annual.

The user-perceived availability vs the measured availability

Your monitoring tells you the API returned a 200 within 200 ms; the user tells you they couldn't pay. The gap is everything that lives outside your service: the merchant's webhook timing out, the user's mobile data dropping mid-confirmation, the bank's auth screen freezing for 8 seconds, the SMS OTP arriving after the session timed out. Real-user monitoring (RUM) — instrumenting the actual mobile app or web client to report success/failure events — is the only way to measure user-perceived availability. The number is typically a nine or more worse than the server-side number. Razorpay publishes a separate "merchant-perceived" availability that is computed from RUM-style webhook delivery success — and it is consistently 0.05–0.10% lower than the server-side API availability. The dashboard your CEO looks at should be the user-perceived one; the dashboard your on-caller looks at should be the server-side one. Conflating them produces the wrong investment priorities.

Why 99.999% is a different kind of engineering problem

The leap from four nines to five nines is qualitatively different from the leap from three to four. Four nines (52 minutes/year) can be reached by a single-region architecture with automated failover and disciplined operations. Five nines (5.26 minutes/year) cannot — there is no plausible single-region design where DNS propagation, BGP route convergence, and human-in-the-loop verification combined fit inside a five-minute annual budget. Five nines requires active-active across geographically separated regions with sub-second failover, which means: every write must be either commutative or sequenced through a global consensus layer; every read must be locally serviceable from a replica that lags the global state by some bounded amount; every deploy must be a rolling change that respects this consistency contract. The cost is a different product, with different consistency guarantees and different latency. Stripe's published architecture for its payments core is a reasonable public example — they document accepting eventual consistency on non-critical metadata to preserve five-nines availability on the core ledger writes. Indian payment-system equivalents (UPI participants reaching for sub-3-minute monthly downtime) face the same trade-off and resolve it the same way: strong consistency for the money path, eventual consistency for everything else.

Reproduce this on your laptop

# Reproduce the availability calculations from this chapter on any laptop.
# No external dependencies — both scripts are pure-stdlib Python.
python3 -m venv .venv && source .venv/bin/activate
# (no pip install needed)
python3 availability_compose.py
python3 rto_budget.py
# Expected output: serial composition of 5 components averaging 99.94%
# collapses to ~99.69%; parallel cache replicas push end-to-end to
# ~99.79%; automated 1-minute failover gives ~99.999% within the
# 4-incident model. Vary the inputs to find the bottleneck for your
# own architecture.

Where this leads next

The capacity-planning arc that started with /wiki/headroom-peak-and-degraded-modes and walked through /wiki/load-testing-wrk-k6-gatling, /wiki/load-shedding-strategies, and /wiki/autoscaling-metric-based-predictive closes here. Four-nines capacity is the combination of all of them: enough headroom that the queueing knee isn't your first failure, load-tested with coordinated-omission-aware tooling so the numbers are honest, sheddable when load exceeds capacity, autoscalable when the curve is predictable, and failover-redundant when a component dies. Drop any one of them and the math collapses — a service with no headroom cannot survive its first burst, a service that does not shed melts under sustained overload, a service without redundancy cannot escape its weakest dependency.

The hardest part to internalise is that the four-nines question is not a systems performance question in isolation; it is a systems composition question. Your service's individual availability matters less than the architecture's response to its dependencies' failures. The performance-engineering vocabulary developed in Parts 1–13 of this curriculum — flamegraphs, queueing knees, GC pauses, NUMA hops — is necessary scaffolding, but at the four-nines level the question becomes: given that any of these primitives can fail, what survives?

The next part of the curriculum (/wiki/wall-debugging-live-systems-is-its-own-skill) shifts from designing for failure to finding the bug while the system is on fire — the third-party perspective on the same numbers. Four-nines design tells you what to spend; production debugging tells you what to do when you have already spent it and the page is still ringing.

References