The availability argument
Aditi runs the merchant-acceptance gateway at PaySetu. The contract her CEO signed with the national UPI switch reads "99.99% monthly availability, measured at the gateway's south-bound endpoint, in 5-minute buckets, with a penalty of ₹4.5 lakh per breach minute." Her platform sits on one r6i.16xlarge that hums along at 28% CPU and serves the entire merchant fleet at p99 of 38 ms. The CFO loves the box. The auditor does not. Last quarter the host rebooted once for a hypervisor patch — 6 minutes of downtime — and the penalty alone wiped out two months of platform-team salary.
The problem is not Aditi's server. The problem is that no cloud provider sells 99.99% on a single instance. Not for any amount of money. Distribution is the only way to buy what her contract requires, and the moment she replicates that gateway, every problem in the next 137 chapters becomes hers.
Availability above ~99.5% cannot be bought on a single machine — the host will, eventually, reboot. To buy a "four nines" SLA you must run replicas across independent failure domains, and the moment you do, you have a distributed system whose availability is bounded by replica independence, the failover protocol, and correlated-failure modes the spreadsheet does not show. The arithmetic of redundancy is unforgiving and counter-intuitive.
What "availability" actually measures — and why one box tops out at 99.5%
Availability is a number on a contract before it is anything else, and the contract usually says "the fraction of 5-minute observation windows in a calendar month during which the service returned a non-error response within the latency budget". Translating that to engineering, you get the standard "nines" table — the one every SRE has tattooed on their laptop:
| Nines | Allowed downtime / month | Allowed / year | Translation |
|---|---|---|---|
| 99% | 7h 18m | 3d 15h | Acceptable for an internal dashboard |
| 99.5% | 3h 39m | 1d 19h | A bare cloud VM with no redundancy |
| 99.9% | 43m 48s | 8h 45m | Web app with rolling deploys |
| 99.95% | 21m 54s | 4h 22m | A typical SaaS contract |
| 99.99% | 4m 22s | 52m 35s | Payment gateway, financial messaging |
| 99.999% | 26.3s | 5m 15s | Telco class — phone networks |
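The downtime column is pure arithmetic, and it is worth being able to regenerate it when a contract arrives with a non-standard number (99.98%, say). A minimal sketch, assuming the ~30.44-day average month the table above appears to use (the scripts later in this chapter use a flat 30 days):
# nines_table.py: regenerate the downtime budget for any SLA number
MONTH_MIN = 30.44 * 24 * 60   # average calendar month, ~43,834 minutes

def budget_minutes(avail):
    """Allowed downtime per month at a given availability."""
    return (1 - avail) * MONTH_MIN

for nines in (0.99, 0.995, 0.999, 0.9995, 0.9999, 0.99999):
    m = budget_minutes(nines)
    print(f"{nines*100:>7.3f}%  ->  {int(m // 60)}h {m % 60:04.1f}m per month")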
The 99.5% row is not a coincidence; it is the single-instance ceiling on a major cloud, and it is set by causes that have nothing to do with your code. AWS publishes 99.5% as the single-instance SLA for an EC2 box. Azure and GCP publish similar numbers (Azure 99.9% for a single instance on premium storage, rising to 99.99% only for two or more instances across availability zones; GCP 99.5% for a single zonal instance). The downtime budget below 99.5% is filled by these recurring events:
- Host maintenance reboots. Cloud providers patch hypervisors roughly quarterly. Your VM gets a "scheduled-event" notification 7–14 days ahead, and a 2–6 minute reboot. This is the largest single cause.
- Live migration failures. When the provider tries to move your VM to a healthy host live, occasionally the migration aborts and the VM cold-boots elsewhere. ~20–60 second downtime per occurrence; happens 1–3× a year on busy fleets.
- Hardware failure. DIMM ECC errors, NVMe drive death, NIC partial failure. Cloud providers swap the underlying host on detection — 1–4 minutes of downtime, ~once every 3–4 years per VM.
- Noisy-neighbour effects. Not strictly downtime, but the latency excursion is large enough that automated monitors classify the buckets as failed. SLA-meaningful even though the VM is "up".
Why the 99.5% ceiling is hard, not soft: the cloud provider's host maintenance cadence is set by their security and patching obligations, not yours. They will reboot your VM whether your contract permits it or not. The provider's own 99.5% SLA explicitly excludes this maintenance from its measurement window — they are not breaching the contract when they reboot you. You, however, count that reboot against your downstream contract. The only way out is to have another instance ready to serve while one reboots.
The arithmetic of redundancy — independence is the load-bearing assumption
If one instance has availability p, then N perfectly-independent instances configured so that any one being up keeps the service up have availability 1 − (1−p)^N. This is the formula every redundancy slide deck shows, and it is almost always wrong when applied without checking the independence assumption. Let us write down what it does and does not tell you.
# redundancy_math.py — what do N copies actually buy you?

def availability(p_single, n, failure_correlation=0.0):
    """
    p_single: per-instance availability (e.g. 0.995 = 99.5%)
    n: number of replicas (any one up keeps service up)
    failure_correlation: 0.0 = perfectly independent failures
                         1.0 = all replicas fail together
    """
    p_fail_indep = (1 - p_single) ** n
    p_fail_corr = 1 - p_single
    p_fail_actual = (1 - failure_correlation) * p_fail_indep + \
                    failure_correlation * p_fail_corr
    return 1 - p_fail_actual

def downtime_minutes_per_month(avail):
    return (1 - avail) * 30 * 24 * 60

scenarios = [
    ("1 VM (single zone)",           0.995, 1, 0.0),
    ("2 VMs same zone",              0.995, 2, 0.30),   # share host fleet
    ("2 VMs different zones",        0.995, 2, 0.05),   # share region cabling
    ("3 VMs across 3 zones",         0.995, 3, 0.05),
    ("3 VMs across 2 regions",       0.995, 3, 0.005),  # share provider
    ("3 VMs across 2 cloud vendors", 0.995, 3, 0.001),
]

print(f"{'config':36} {'avail':>10} {'down/mo':>12}")
print("-" * 62)
for name, p, n, corr in scenarios:
    a = availability(p, n, corr)
    d = downtime_minutes_per_month(a)
    print(f"{name:36} {a*100:>9.4f}% {d:>9.2f} min")
Sample run:
config avail down/mo
--------------------------------------------------------------
1 VM (single zone) 99.5000% 216.00 min
2 VMs same zone 99.8483% 65.56 min
2 VMs different zones 99.9726% 11.83 min
3 VMs across 3 zones 99.9750% 10.80 min
3 VMs across 2 regions 99.9975% 1.08 min
3 VMs across 2 cloud vendors 99.9995% 0.22 min
The columns reward careful reading. Two VMs in the same zone — sharing a host fleet, a power feed, a top-of-rack switch — buy far less than the textbook promises: 99.5% rises only to 99.85%, because their failures correlate at ~30%. failure_correlation is the load-bearing parameter; setting it to 0 (the textbook assumption) gives you 99.9975% from 2 same-zone VMs, which is a fantasy — same-zone correlated failures dominate. Move the replicas to different AZs and the correlation drops to ~5%, and the 2-replica configuration jumps to 99.97% — the curve gets steep where independence improves. Across two cloud vendors, correlation drops to ~0.1% (only truly global events — DNS root failure, BGP route leaks, a synchronised certificate expiry — bring both down) and the math gets you to five nines. The lesson: redundancy buys availability only as fast as it buys independence.
Why correlated failures dominate the math: the formula (1 − p)^N assumes the events are independent like coin flips. Two replicas in the same AZ are not independent — they share a fate when the AZ loses power, when the cooling fails, when a fibre cut isolates the AZ from the regional spine, when the AZ-level orchestrator pushes a bad update. A 5% correlation in failures means about 1 in 20 of the "bad luck" events are a shared event. Across thousands of AZ-minutes per year, that 5% term completely overwhelms the (1 − 0.995)^2 = 0.000025 term in the independence formula. Reduce correlation, not just count.
A useful sanity check: pick the configuration your team is actually proposing, plug in a pessimistic correlation (15% if same AZ, 5% if multi-AZ same region, 1% if multi-region), and see if the resulting downtime budget fits your contract with a 2× safety factor. If it does not, you are buying redundancy without buying availability — and the wiser move is to find one more failure-domain layer to spread across, not to add another replica in the layer you are already in.
The 2× safety factor matters because the script's correlation parameters are themselves estimates from history, and history under-counts the rare correlated event by definition (you have not seen it yet). PaySetu's own architecture review used a 3× safety factor after a regional outage cost them ₹2.7 crore in penalties, which was the cheap version of the lesson.
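The check is mechanical enough to script. A minimal sketch reusing the correlated-failure model from redundancy_math.py above; the correlation guess, replica count, and contract budget are illustrative assumptions to replace with your own:
# sanity_check.py: does the proposed topology fit the contract with margin?
def availability(p, n, corr):
    # same correlated-failure model as redundancy_math.py
    return 1 - ((1 - corr) * (1 - p) ** n + corr * (1 - p))

CONTRACT_BUDGET_MIN = 4.38   # a 99.99% monthly contract
SAFETY_FACTOR = 2.0

a = availability(0.995, n=3, corr=0.05)    # pessimistic multi-AZ, same region
projected_min = (1 - a) * 30 * 24 * 60
fits = projected_min * SAFETY_FACTOR <= CONTRACT_BUDGET_MIN
print(f"projected {projected_min:.1f} min/mo against a {CONTRACT_BUDGET_MIN} min budget")
print("fits with margin" if fits else "does not fit: add a failure-domain layer, not a replica")
Run it and the 3-AZ topology projects ~10.8 minutes a month against a 4.38-minute budget: the configuration fails the check, which is exactly the find-another-failure-domain-layer conclusion above.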
The four sources of correlated failure that erase your nines
The redundancy spreadsheet looks great until reality wires the replicas together. Every correlated-failure source breaks the independence assumption that the math requires. The four big ones, in roughly the order they ruin people's afternoons:
1. Shared infrastructure
Two VMs in the same rack share a top-of-rack switch. Two VMs in the same AZ share regional fibre. Two VMs in the same region share a cloud-provider control plane. Two VMs running the same OS image share a kernel CVE that lands on patch Tuesday. The correlation parameter in the math above is shorthand for "everything below the abstraction line that you are not paying attention to". CricStream learned this the hard way during a cricket final when both their replicas — in different AZs of the same region — went down because the regional load-balancer control plane had a bad config push that stopped routing to either AZ. The application replicas were fine. Nobody could reach them.
2. Shared software
If both replicas run the same buggy version of the same JVM, a single garbage-collection edge case will pause both at roughly the same wall-clock time. Same Postgres minor version, same Linux kernel revision, same TLS library — these all correlate failures. KapitalKite once tripped a circuit breaker at the same moment on all five replicas of its order-router because they were all on the same glibc build, and a malformed UTF-8 sequence in a customer's display name triggered a __asan abort uniformly. Five replicas; one bug; zero availability.
3. Shared dependencies
Both replicas talk to the same database. Both depend on the same DNS resolver. Both authenticate against the same identity service. The replicas can be perfectly redundant, but their common dependency is now the availability ceiling. PaySetu's gateway might be 99.99% available across 5 replicas — but it depends on the national UPI switch, which has its own 99.95% SLA, and the product of the two is what the merchant actually experiences. Adding a sixth replica improves nothing; the dependency chain has set the floor (the arithmetic is a one-line product, sketched after this list).
4. Shared change
Both replicas accept your latest deploy at roughly the same time. If the deploy is bad, both go down together. CI/CD pipelines that "deploy to all replicas in parallel" are correlated-failure factories. The fix is rolling deploys with health-gated progression, plus blue/green for the schema-changing case — Part 19 of this curriculum unpacks the pattern. But the engineering point is that deploys are a correlation source as real as a power feed, and ignoring them is how 99.99% configurations achieve 99.5% in practice.
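For the shared-dependency source the arithmetic is a straight product: a request that needs the gateway and the switch in series is only as available as the two multiplied together. A sketch with PaySetu's numbers from above:
# serial dependencies multiply; the weakest link sets the ceiling
gateway = 0.9999      # the 5-replica gateway fleet
upi_switch = 0.9995   # the national switch's own SLA
print(f"end-to-end: {gateway * upi_switch:.4%}")   # ~99.94%, below either component
No count of gateway replicas moves this number; only an independent second path to the dependency would.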
The cost of buying availability — replication is cheap, fast failover is not
Adding a replica is the easy part. Detecting that one replica has died, without false-positives that flap traffic, and failing over within a window that the contract permits — that is the hard part, and it is the entire substance of Parts 7 (replication), 8 (consensus), 9 (leader election), and 10 (failure detection) of this curriculum. The cost of availability is not the second VM. It is the failure-detection-to-failover budget.
Work backwards from the contract. PaySetu's "99.99%, measured in 5-minute buckets, ≤ 1 breach bucket per month" effectively means any failure must be resolved within 5 minutes. Subtract each of the following; a sketch totting them up follows the list:
- Failure detection latency — typically 10–60 seconds with phi-accrual or aggressive health checks. Faster detection means more false-positives, which become availability events of their own (flapping traffic).
- Quorum-based failover — leader election runs in 200–800 ms on a healthy network, but during a real partition it can take seconds (Raft's randomised election timeout is 150–300 ms; Paxos coordination can be slower).
- Client-side rediscovery — DNS TTLs, connection-pool refresh, sticky session expiry. Even if the new leader is up in 1 second, if your TTL is 60 seconds, half your clients still hammer the dead node.
- State sync after failover — if the new leader's log was behind, it must catch up before serving writes. Replication lag at failover time is the worst replication lag.
- Downstream propagation — your dependents have caches, circuit breakers, and rate limiters that may have already marked you "unhealthy". Even after you are back, they need to re-probe and re-trust you, which adds another 10–30 seconds of partial unavailability from the user's point of view.
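Summing plausible mid-range values for each stage shows how quickly the bucket fills. A minimal sketch; every duration below is an illustrative assumption, not a measurement:
# failover_budget.py: does the end-to-end failover path fit one 5-minute bucket?
BUCKET_S = 300
stages = [
    ("failure detection",            20),
    ("leader election",               1),
    ("client rediscovery (DNS TTL)", 30),
    ("state sync / log catch-up",     5),
    ("downstream re-probe",          15),
]
spent = 0
for name, seconds in stages:
    spent += seconds
    print(f"{name:30} {seconds:>4}s   (cumulative {spent}s)")
print(f"total {spent}s of {BUCKET_S}s bucket; {BUCKET_S - spent}s of slack")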
Even with everything tuned, you are looking at 30–90 seconds end-to-end to recover from a single replica failure in a well-engineered service. That fits inside the 5-minute bucket but eats most of it. Two failures in the same month — one for a host reboot, one for a deploy gone bad — and you have spent your entire 4m 22s downtime budget for the month. There is no slack. This is why "99.99%" is operationally far more expensive than "99.9%" — it is not 10× the redundancy, it is 10× the precision in your failure-handling code, which is engineering hours, not VM hours.
# availability_decomposition.py — what does each second of MTTR cost?

def monthly_uptime(failures_per_month, mttr_seconds):
    """Given a rate of incidents and a mean recovery time, what's the SLA?"""
    total_seconds = 30 * 24 * 60 * 60
    downtime = failures_per_month * mttr_seconds
    return 1 - (downtime / total_seconds)

incidents = [
    ("1 host reboot/quarter",   1/3,    180),   # 3 min reboot
    ("1 deploy/week, 5% bad",   4*0.05, 300),   # 5 min rollback
    ("1 hardware failure/year", 1/12,   240),
    ("1 noisy neighbour/month", 1.0,     45),   # latency excursion
    ("1 zone outage/year",      1/12,  1500),   # 25 min until cross-zone
]

print(f"{'incident class':32} {'/mo':>6} {'mttr':>6} {'monthly hit':>14}")
print("-" * 64)
total_downtime = 0
for name, rate, mttr in incidents:
    a = monthly_uptime(rate, mttr)
    hit_seconds = rate * mttr
    total_downtime += hit_seconds
    print(f"{name:32} {rate:>6.2f} {mttr:>6}s {hit_seconds:>10.1f}s")

overall = 1 - total_downtime / (30*24*60*60)
print(f"\nCombined uptime: {overall*100:.4f}% ({(1-overall)*30*24*60:.1f} min/month)")
Sample run:
incident class /mo mttr monthly hit
----------------------------------------------------------------
1 host reboot/quarter 0.33 180s 60.0s
1 deploy/week, 5% bad 0.20 300s 60.0s
1 hardware failure/year 0.08 240s 20.0s
1 noisy neighbour/month 1.00 45s 45.0s
1 zone outage/year 0.08 1500s 125.0s
Combined uptime: 99.9880% (5.2 min/month)
This is what 99.99% actually requires you to engineer. The script's output says: even with a textbook-good architecture (3-zone replication, automated failover, rolling deploys), the realistic uptime is 99.9880% — slightly under four nines. The shortfall comes mostly from the zone-outage and the deploy lines, both of which need concentrated engineering investment to drive down. mttr_seconds is the line you control; failures_per_month is largely set by your environment. Halving your MTTR halves your downtime; halving your failure rate is far harder. Investing in fast, correct failover is the highest-leverage thing you can do for availability once basic redundancy is in place — Part 9 (leader election and leases) and Part 10 (failure detection) are where this engineering happens.
Why fast failover is harder than it looks: a heartbeat-based detector that fails over in 2 seconds is also a detector that fails over on a 1.9-second GC pause, which is a false positive — the old leader is alive and now you have two leaders accepting writes (split-brain). The real engineering problem is distinguishing slow from dead under uncertainty, which is what phi-accrual quantifies and what fencing tokens prevent. There is no setting of the timeout that is fast and correct — there is a Pareto frontier and you pick a point on it.
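You can see that frontier with a toy model. Assume, purely for illustration, that stop-the-world pauses hit a busy process a few hundred times a day with exponentially distributed lengths; a detector with timeout T then notices a real death in roughly T seconds, but also mistakes any pause longer than T for a death:
# timeout_frontier.py: the detect-fast vs. false-positive trade, toy numbers
import math

MEAN_PAUSE_S = 0.4      # assumed mean pause length
PAUSES_PER_DAY = 500    # assumed pause frequency

for timeout_s in (1, 2, 5, 10, 30):
    p_exceed = math.exp(-timeout_s / MEAN_PAUSE_S)   # P(pause > timeout) under the model
    print(f"timeout {timeout_s:>2}s: real death detected in ~{timeout_s}s, "
          f"~{PAUSES_PER_DAY * p_exceed:.3f} spurious failovers/day")
Every row is a different point on the frontier; none is both fast and safe, which is why fencing has to make the false positive harmless rather than impossible.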
The numbers above hide one more uncomfortable truth: the deploy-induced failure rate dominates the rest for any team that ships more often than weekly. Every continuous-deployment team's largest source of incidents is its own pipeline, not the cloud provider's hardware. MealRush did the post-mortem accounting on a year of incidents and found 73% of their downtime came from their own changes — bad config, a regressed schema migration, a feature flag that interacted badly with a downstream service — and only 11% from infrastructure causes the cloud provider would acknowledge. The remaining 16% was capacity exhaustion under unexpected load, which is half a deploy story too (because the team that owns the load curve also owns the deploy that exposed it). The implication: paying for more redundancy without investing in deploy safety is paying for a smaller fraction of the actual problem. The right ratio of "redundancy spend" to "deploy-pipeline spend" for a four-nines service is roughly 1:2, which surprises every CFO who sees the bill.
Common confusions
- "Adding more replicas always increases availability." Only if the replicas are independent and the failover protocol is correct. Two same-zone replicas share so much infrastructure that they fail together ~30% of the time; their availability gain over one replica is small. Five replicas behind a buggy load balancer are five replicas the load balancer can route incorrectly. Replication buys availability only through independence, not through server count. The graveyards of post-mortems are full of "we added a replica for HA and it caused our first ever outage when the failover logic went wrong".
- "99.99% is just one nine more than 99.9%." Numerically yes; in engineering terms, no. 99.9% allows 43 minutes of downtime per month — a single bad deploy fits inside the budget. 99.99% allows 4m 22s — one deploy gone bad, detected and rolled back in 5 minutes, breaches the contract. Each additional nine costs more engineering than all the previous ones combined; the curve is super-linear in operational cost.
- "My app's availability is the cloud provider's SLA." It is bounded above by the provider's SLA, but in practice you sit far below it because of your own deploys, dependencies, and bugs. PaySetu's gateway is bounded above by 99.99% (the regional EC2 SLA), but its real availability is more like 99.92% — most of the gap is self-inflicted, not the provider's fault.
- "Multi-cloud doubles availability." It can, but only if you architect for it from day one. A "multi-cloud" deployment that has its DNS, observability, and deploy pipeline all on one cloud is not multi-cloud — it is a single-cloud deployment with a backup data plane. Genuine multi-cloud requires every dependency to itself be multi-cloud, which is enormous engineering work; very few teams that claim multi-cloud actually have it.
- "Active-active is more available than active-passive." Active-active gives you better latency (clients hit the nearest replica) but not necessarily higher availability — it requires conflict resolution, which is its own correctness risk. Active-passive with a 30-second failover often beats poorly-designed active-active on real-world availability, because the failure modes are simpler. The correct choice is workload-specific.
- "Five nines is achievable for any service, and a green health check means the service is up." Five nines (26 seconds of downtime a month) is below the floor of human change-management — any service that requires a human in the loop for incident response cannot reach five nines, full stop. And a health check that returns 200 OK only proves the health-check endpoint is up, not that the actual workload path is — synthetic transactions that mirror real user flows are the only honest availability measure. Both fallacies have the same root: confusing instrumented liveness with what users actually experience.
Going deeper
Tail availability — the metric that actually matters
The "average monthly availability" is the metric on contracts; it is not the metric users feel. A service that is 100% available for 29 days and completely down for 1 day scores 96.7% — and the monthly average prints the same 96.7% whether that downtime was one brutal day-long outage or 48 minutes quietly shaved off every day. The metric users feel is tail availability: the worst N-minute window in the month, the longest single outage. Two services with the same monthly average but different worst-case windows are very different products. Bengaluru fintech regulators are increasingly writing maximum-single-outage-duration clauses into contracts (e.g. "no single outage may exceed 15 minutes") because the average alone is gameable. When you design for availability, design the tail; the average will follow.
Why availability and consistency are the same conversation
Part 12 of this curriculum (consistency models) will unpack CAP and PACELC formally, but the seed of that conversation is here. Strong consistency requires replicas to coordinate before a write is acknowledged (typically a quorum must confirm it), which means a partition that breaks the quorum blocks writes — the system trades availability for consistency. Eventual consistency lets each replica answer independently — the system trades consistency for availability. The four nines you are buying with replication are bought against a consistency choice. PaySetu's gateway is leader-write, follower-read with bounded staleness — a deliberate trade where the gateway prefers a 200 ms stale read over an unavailable read. KapitalKite's order book is the opposite — a stale read can cause a wrong-priced trade, so the system blocks reads during a partition rather than serve stale data. There is no "high availability" answer that is independent of the consistency choice.
The Spanner counter-argument — buying availability with engineering, not redundancy
Google's Spanner achieves five-nines availability across a globally-distributed deployment. It does so not by piling up more replicas — three to five replicas per Paxos group is plenty — but by engineering away every correlated failure source. TrueTime gives them clock independence (Part 3), Paxos gives them quorum-correct failover (Part 8), the deploy pipeline pushes to at most one replica per Paxos group at a time (correlated-deploy mitigation), and they spend on multi-region cabling that no individual customer could justify. Spanner's availability comes from concentrated engineering against the correlation sources, not from N. The lesson: once you have three replicas across three failure domains, the next nine is bought by reducing correlation, not by adding boxes.
The blast-radius argument — why availability is also a containment problem
Replication moves you from "can my service stay up?" to "what fraction of my users see an outage when something fails?" — the blast radius. Five replicas serving the same global traffic mean every failure hits 100% of users; five replicas serving sharded traffic mean a single replica failure hits 20% of users — same redundancy, very different user experience. BharatBazaar's Big Billion Day uses this directly: their checkout fleet is cell-isolated so that any single deploy bug or dependency outage hits at most one cell of merchants, not the whole platform. The math of "X% of users had a bad day" is not the same as the math of "Y minutes of downtime", and modern reliability engineering increasingly tracks both. A 99.95% service that has its bad 22 minutes hit 100% of users is worse than a 99.9% service whose 44 minutes are smeared across cells so no user ever sees more than 5 minutes. Cell-based architectures (the AWS internal pattern, Tesla's vehicle-fleet pattern, Cloudflare's per-PoP pattern) are availability arguments dressed as architecture choices.
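The cell arithmetic is short enough to sketch; the numbers are the ones from the paragraph above, not measurements:
# blast_radius.py: same outage minutes, very different user experience
downtime_min = 22   # the 99.95% service's monthly bad minutes
cells = 5           # cell-isolated fleet: a failure hits one cell

print(f"global fleet : one {downtime_min}-min incident hits 100% of users")
print(f"cell-sharded : each incident hits {100 // cells}% of users, "
      f"{downtime_min / cells:.1f} user-weighted minutes")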
Reproduce this on your laptop
The redundancy math runs locally; the failure-correlation parameter is the one you should explore.
# Reproduce the redundancy curves for your own configuration
python3 -m venv .venv && source .venv/bin/activate
pip install matplotlib numpy
python3 - <<'EOF'
import numpy as np
import matplotlib.pyplot as plt
p = 0.995 # single-VM availability
ns = np.arange(1, 8)
for corr in [0.0, 0.05, 0.15, 0.30]:
    avail = [1 - ((1-corr)*(1-p)**n + corr*(1-p)) for n in ns]
    plt.plot(ns, [(1-a)*30*24*60 for a in avail], marker='o',
             label=f'correlation={corr}')
plt.yscale('log')
plt.xlabel('number of replicas')
plt.ylabel('monthly downtime (minutes, log scale)')
plt.legend(); plt.grid(True); plt.savefig('/tmp/avail.png')
print('saved /tmp/avail.png — note how the curves flatten as correlation rises')
EOF
The plot makes the lesson visceral: at zero correlation the line drops vertically as you add replicas; at 30% correlation, the line goes flat after N=2 — every replica past the second adds nothing, because the failures pile up on the correlated event.
Where this leads next
The remaining chapters of Part 1 close out the "why distribute" arc — economic, physics, and now availability are the three arguments, and at least one of them must apply to your service before any of the next 137 chapters earn their keep. From Part 2 onwards the question shifts to "how do distributed systems break and how do we cope", and every primitive from there can be read as an answer to a specific failure mode that erodes the availability budget you just learned to measure:
- The fallacies of distributed computing — why your assumption that the network is reliable, that latency is zero, and that bandwidth is infinite are all wrong, and what each fallacy costs your availability budget.
- Single-machine ceilings — the physics floor — Cliff 2 from chapter 1 unpacked into the actual hardware limits.
- Failure detection — phi-accrual and the dead-or-slow problem — Part 10. The detector is the load-bearing component of every availability story above 99.9%.
- Leader election and fencing tokens — Part 9. The protocol that turns "I think I'm the leader" into "I am, provably, the leader" without sacrificing the failover budget.
By the end of Part 12 (consistency models), you will be able to read PaySetu's contract, look at its replication topology, and predict its actual availability from first principles — not from the marketing slide.
A final thought before moving on. Availability is the only one of the three arguments (economic, physics, availability) where you cannot fall off the cliff and recover with a refactor. If you misjudge Cliff 1, you overspend and refactor when the bill arrives. If you misjudge Cliff 2, you cap out on capacity and re-architect during the next growth quarter. Misjudging Cliff 3 — committing to a four-nines contract on a single VM — gets your CEO a phone call from a regulator and the platform team a six-figure penalty before anyone has time to refactor. The third argument is the one where the engineering decision precedes the contract, and that ordering is what makes it the load-bearing motivation for the rest of this curriculum.
References
- Designing Data-Intensive Applications — Martin Kleppmann, O'Reilly 2017. Chapter 1 covers reliability, availability, and the difference between fault and failure with the precision a contract demands.
- Site Reliability Engineering — Google SRE Book, especially the "embracing risk" chapter and the canonical nines-table appendix.
- Amazon EC2 Service Level Agreement — the actual contract that bounds single-instance availability at 99.5% and multi-AZ at 99.99%.
- Failure Trends in a Large Disk Drive Population — Pinheiro, Weber, Barroso, FAST 2007. The empirical study that quantified hardware-failure correlation in a real fleet.
- The Tail at Scale — Dean and Barroso, CACM 2013. Tail latency is tail availability when the contract is measured in 5-minute buckets.
- Spanner: Google's Globally-Distributed Database — Corbett et al., OSDI 2012. The engineering-out-correlation playbook for five-nines.
- The economic argument: scale and cost — internal cross-link. Cliff 3 was the seed of this chapter; the cost of buying availability is the natural follow-on to the cost of buying capacity.