Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Symptom-based alerts (the Google SRE book)

It is 23:42 IST on a Saturday when Aditi's phone goes off for the seventeenth time this week. The alert is MySQLReplicaLagHighPaymentsCluster — the read-replica behind the PaisaBridge ledger has fallen 38 seconds behind the primary. She opens the laptop, checks Grafana, sees that no one is reading from that replica right now (the application's connection pool routes payment-write traffic to the primary), and confirms that customer payment-capture latency is unaffected. She acks the page, files a runbook ticket, goes back to bed. Thirteen minutes later a different alert — CheckoutAPIErrorRate — fires for ninety seconds during a real outage affecting ₹4.2 crore of payments that started while she was busy looking at the lag dashboard. She missed the window. The replica-lag alert was correct in its measurement and useless in its consequences. The error-rate alert was the one that mattered, and the noise from the first one trained her to triage slower. The system is paging her on causes; the customers feel symptoms. The Google SRE book named this asymmetry in 2016, and most teams still get it backwards.

A symptom-based alert fires on what users see — error rates, latency, capture failures, page loads — and a cause-based alert fires on internal mechanism — CPU, queue depth, replica lag, GC pause. Symptom alerts are the contract with users; cause alerts are diagnostic hints for the engineer who already knows there is an incident. The Google SRE book's discipline is: page on symptoms, ticket on causes, never page on a cause whose user impact you have not first confirmed by a symptom.

What "symptom" and "cause" actually mean

A symptom is something a user can name without knowing your architecture. "I tried to pay and it failed." "The page took ten seconds." "My order isn't showing up." A user does not say "the consumer lag on payment-events is 12,000 messages" or "the MySQL replica is 38 seconds behind" or "the JVM heap is at 92% after-GC". Those are causes. They might be the reason the symptom exists, and they might not — a 38-second replica lag is sometimes a precursor to a customer-visible delay and sometimes invisible because nothing is reading from that replica during the window.

The distinction matters because the alert is a contract about user impact, not a contract about internal state. If you page on MySQLReplicaLagHighPaymentsCluster you are paging on internal state, and the on-call has to do a second piece of work — translate the internal state into a guess about whether users are affected — before they can decide whether the page deserves a war room or a snooze. That second piece of work is exactly what a tired engineer at 02:47 is worst at. The cause-based alert outsources the hardest cognitive step (impact assessment) to the human under the most cognitive load.

The Google SRE book makes this rule explicit. From Site Reliability Engineering chapter 6 ("Monitoring Distributed Systems") and chapter 4 ("Service Level Objectives"): "Pages should be triggered by symptoms, not causes." The book lists a cause-based alert as an anti-pattern even when the cause is correctly measured — because alerting on causes inflates the page count, blurs the priority signal, and trains the on-call to triage slowly. The discipline is to derive your page-worthy alerts from the SLO (Part 10 of this curriculum) and treat every other measured signal as diagnostic telemetry, surfaced in dashboards and tickets but not in the on-call's bedroom.

Figure: Symptom-based vs cause-based alerts — the asymmetry. A two-column comparison. SYMPTOMS (page-worthy, what the user feels): CheckoutErrorRate > 1% (5m, 14.4× burn), CapturePixelP99Latency > 800ms (1h), UPISuccessRate < 99% (NPCI-corrected), SearchAvailability < 99.9% (5m) — derived from the SLO, roughly one page per real incident. CAUSES (ticket-worthy, what the engineer infers): CPUUtilisation > 85% (5m), MySQLReplicaLag > 30s, KafkaConsumerLag > 10,000 messages, JVMHeapAfterGC > 90% — dashboards and tickets, never wake a human. One symptom can have many causes; one cause can produce many or zero symptoms. Bottom strip: the SLO defines the user contract → symptom alerts page → cause telemetry diagnoses after the page fires. Causes are upstream of symptoms; alerts must be downstream of user pain.
Illustrative — not measured data. Symptoms (left) describe what users feel and are derived from the SLO; causes (right) describe internal state and are diagnostic hints for the on-call after the page has already fired. Pages should be on symptoms; tickets and dashboards are where causes belong.

Why the rule is asymmetric: a single symptom (checkout error rate) can be produced by dozens of causes (DB connection pool exhausted, payment gateway timing out, deploy regression, cache stampede, NPCI rate-limiting, certificate expiry, network partition between AZs). If you alert on every cause separately, one underlying incident produces twenty pages — the alert storm Part 11 spends the rest of its chapters trying to suppress. If you alert on the symptom instead, that same incident produces one page; the on-call opens the dashboards and uses the cause telemetry as diagnostic hints, not as additional pages. Symptom alerts collapse the page-count fan-out that cause alerts create.

A subtler reason the asymmetry runs in this direction: a cause can fire without producing a symptom at all. A 38-second replica lag is a problem only if something is reading from the replica during the window — if the application has rerouted reads to the primary (a common degraded-mode pattern) the lag is real but invisible to users. A 92% heap is a problem only if it's still rising and the GC is going to OOM, which it might not. CPU at 85% is a problem only if it's the bottleneck for a user-visible operation; if it's a batch reconciliation worker that finishes in time anyway, it's not. Cause alerts have a false-positive rate that scales with the gap between internal state and user impact, and that gap is exactly what the on-call has to evaluate at 02:47 every time the alert fires. Symptom alerts close the gap by definition: if the symptom is firing, users are feeling it.

A measurement: the page-rate ratio of cause vs symptom alerts on a realistic workload

The strongest argument for the rule is the page-rate ratio. The script below simulates 30 days of telemetry from a hypothetical UPI-payments service running at ~1200 QPS, with realistic incident behaviour (one sharp error spike, one slow degradation, infrastructure noise that doesn't reach users), and counts how many pages a symptom-only ruleset produces vs a cause-heavy ruleset that mirrors what teams ship without thinking about the rule.

# alert_rate_simulation.py — count pages produced by symptom vs cause-based rules
# pip install numpy
import numpy as np

np.random.seed(7)
SECONDS_PER_DAY = 86400
DAYS = 30
QPS_BASELINE = 1200
SLO_TARGET = 0.999  # 99.9% capture success

# Generate per-second telemetry for 30 days
n = SECONDS_PER_DAY * DAYS
ts = np.arange(n)
qps = QPS_BASELINE + np.random.normal(0, 40, n)
error_rate = np.full(n, 0.0003)  # baseline 0.03% errors

# Inject 4 events of varied character
# 1. real incident: sharp 8% error spike for 14 minutes (day 6, 11:20 IST)
inc1 = 6 * SECONDS_PER_DAY + 11*3600 + 20*60
error_rate[inc1:inc1 + 14*60] = 0.08
# 2. real incident: slow degradation 0.5% errors for 4 hours (day 14)
inc2 = 14 * SECONDS_PER_DAY + 14*3600
error_rate[inc2:inc2 + 4*3600] = 0.005
# 3. cause-only: replica lag spike but no traffic on replica (day 9)
replica_lag = np.zeros(n); replica_lag[9*SECONDS_PER_DAY+3600 : 9*SECONDS_PER_DAY+5400] = 65
# 4. cause-only: kafka consumer lag during off-peak batch (day 21)
kafka_lag = np.zeros(n)
kafka_lag[21*SECONDS_PER_DAY+1800 : 21*SECONDS_PER_DAY+5400] = 18000

# CPU oscillates around 60-80%, plus ~50 background-work bursts (cron, batch
# reconciliation) that hold it near 90% for 10-20 minutes (no user impact)
cpu = 0.60 + 0.20*np.sin(ts/3600) + np.random.normal(0, 0.03, n)
for start in np.random.choice(n - 1200, 50, replace=False):
    dur = np.random.randint(600, 1200)
    cpu[start:start + dur] = 0.90 + np.random.normal(0, 0.015, dur)
cpu = np.clip(cpu, 0.30, 0.99)

# Symptom rule: error_rate > 1% sustained for 5 min (one page per event)
def count_pages(signal: np.ndarray, threshold: float, sustain_s: int) -> int:
    above = signal > threshold
    # find runs of length >= sustain_s
    pages, i = 0, 0
    while i < len(above):
        if above[i]:
            j = i
            while j < len(above) and above[j]: j += 1
            if j - i >= sustain_s: pages += 1
            i = j + 60  # 60s cooldown to avoid duplicate counting
        else:
            i += 1
    return pages

symptom_pages = count_pages(error_rate, 0.01, 5*60)
cause_pages_cpu = count_pages(cpu, 0.85, 5*60)
cause_pages_replica = count_pages(replica_lag, 30, 60)
cause_pages_kafka = count_pages(kafka_lag, 10000, 5*60)

print(f"Window: {DAYS} days at {QPS_BASELINE} QPS baseline")
print(f"Real incidents seeded: 2 (1 sharp, 1 slow)")
print(f"")
print(f"SYMPTOM RULESET — error_rate > 1% for 5m")
print(f"  pages: {symptom_pages}")
print(f"  caught real incidents: 1 (sharp; slow burn missed by threshold)")
print(f"")
print(f"CAUSE RULESET — replica lag, kafka lag, CPU saturation")
print(f"  CPU > 85% for 5m:        pages = {cause_pages_cpu}")
print(f"  ReplicaLag > 30s for 1m: pages = {cause_pages_replica}")
print(f"  KafkaLag > 10k for 5m:   pages = {cause_pages_kafka}")
print(f"  total cause pages:        {cause_pages_cpu + cause_pages_replica + cause_pages_kafka}")
print(f"  caught real incidents: 0 (none of these track user-visible symptoms)")
print(f"")
print(f"Ratio: cause-based pages produce "
      f"{(cause_pages_cpu + cause_pages_replica + cause_pages_kafka)/max(symptom_pages,1):.1f}× "
      f"more pages while catching 0 real incidents")

Sample run:

Window: 30 days at 1200 QPS baseline
Real incidents seeded: 2 (1 sharp, 1 slow)

SYMPTOM RULESET — error_rate > 1% for 5m
  pages: 1
  caught real incidents: 1 (sharp; slow burn missed by threshold)

CAUSE RULESET — replica lag, kafka lag, CPU saturation
  CPU > 85% for 5m:        pages = 47
  ReplicaLag > 30s for 1m: pages = 1
  KafkaLag > 10k for 5m:   pages = 1
  total cause pages:        49
  caught real incidents: 0 (none of these track user-visible symptoms)

Ratio: cause-based pages produce 49.0× more pages while catching 0 real incidents

error_rate[inc1:inc1 + 14*60] = 0.08 seeds a sharp incident — 14 minutes of 8% errors, the kind of failure a symptom alert is built for. replica_lag[...] = 65 seeds a cause excursion that has no user impact: the application is reading from the primary at the time. The cpu series combines a slow oscillation (0.60 + 0.20*np.sin(ts/3600)) with seeded background-work bursts near 90% — the realistic shape of CPU on a multi-tenant pod, frequently spiking above 85% during batch work and almost never correlating with user pain. count_pages(...) scans for sustained threshold crossings — the same arithmetic Prometheus's for: clause performs.

The 49:1 ratio is the smoking gun for the rule. The cause ruleset produces 49 pages over 30 days and catches zero of the seeded real incidents — the CPU oscillation pages every cycle, the replica lag pages once with no impact, the Kafka consumer lag pages during a batch with no impact. The symptom ruleset produces 1 page and catches the sharp incident. The slow burn (4 hours of 0.5% errors) is missed by the simple threshold, which is precisely why Part 10 introduced burn-rate alerts — but the burn-rate replacement is also a symptom alert, just a smarter one. The fix for missed slow burns is not to add more cause alerts; it is to upgrade the symptom alert. Why this matters when defending the rule to a sceptical team: engineers often add cause alerts believing they "catch incidents earlier" than symptom alerts. The 49:1 page count reveals that they do not — they produce ambient noise that is uncorrelated with user impact. The answer to "we need earlier detection" is multi-window burn-rate alerts (Part 10 ch. 65–66), not cause-based pages.

Where cause-based telemetry actually belongs

The discipline does not say "cause measurements are useless"; it says they do not page humans. Cause measurements are still produced, still ingested, still queried — they just live in different parts of the observability stack. The Google SRE book and the Site Reliability Workbook (chapter 5) place them in three buckets:

Diagnostic dashboards. When a symptom alert fires, the on-call opens a service dashboard that is dense with cause telemetry — CPU, memory, replica lag, queue depth, GC pause, connection pool utilisation, deploy markers. The dashboard exists to answer the question "now that I know there is a user-visible symptom, which cause is producing it?". This is exactly the workflow the SRE book recommends: page on symptom, dashboard on causes, ticket on the long tail. Karan at 02:47 does not need the replica-lag alert to wake him; he needs the replica-lag panel waiting for him on the same dashboard the symptom alert linked to in its annotations.

Capacity tickets. A 7-day p95 of CPU consistently above 80% does not page anyone, but it does open a Jira ticket with the capacity team. The discipline collapses what would otherwise be 47 pages over 30 days into a single ticket that lands during business hours and gets discussed in a sprint planning meeting. CPU saturation is a real predictor of future failure, and the right escalation path is "fix it next sprint", not "wake the on-call".

Pre-page warnings. Some causes do correlate well enough with an imminent symptom that they earn a tighter SLO and a separate paging tier — usually called "warning" in PagerDuty terminology, routed to a low-urgency channel during business hours and to nothing overnight. A queue depth that has been growing for 3 hours and is projected to exhaust the disk in 6 hours fits this category. The decision to promote a cause to a warning page is itself a measurement task: the team needs evidence that the cause leads to a symptom often enough to be worth waking on. The default is not warning-paging; it is the dashboard.

A fourth implicit destination worth naming: the post-incident timeline. After a symptom alert fires and the on-call resolves the incident, the cause telemetry is the raw material for the postmortem. The replica lag spike, the heap pressure curve, the connection-pool saturation marker — these go into the timeline as "what was happening internally at 02:11 IST when the symptom alert fired at 02:14". Post-incident review uses cause data the way medicine uses an autopsy: it does not page anyone, but it is essential for understanding causation. Teams that have not yet adopted the symptom rule sometimes object that "we'll lose the cause data we need for postmortems" — they will not, because the cause data is still being collected; it is just not paging.

The Razorpay-style payment-platform shape is a useful case study. A real Razorpay-pattern team running ~5000 QPS of UPI captures might have 8 symptom alerts (capture error rate, capture latency p99, NPCI-leg failure rate, refund processing rate, settlement-batch SLO, idempotency-collision rate, webhook delivery success, signed-callback latency) — the entire SLO contract with the merchants. They might have 80 cause measurements — every JVM heap, every Postgres replica lag, every Kafka consumer lag, every K8s pod restart count, every TLS handshake error, every connection pool — none of them paging. The 8 symptom alerts page about 4 times a quarter on average. The 80 cause measurements feed three dashboards (one per service domain) and one weekly capacity-review meeting. The on-call rotation is sustainable; the cause data is preserved; the engineering culture treats user impact as the only paging contract.

Figure: Where cause telemetry routes — three tiers, none of which wake the on-call. 1. Diagnostic dashboards: opened by the on-call after a symptom alert fires; cause panels (CPU, heap, replica lag, queue depth, GC pause, conn-pool utilisation, deploy markers, region health) link from the alert annotation → Grafana panels, no page. 2. Capacity tickets: long-window cause excursions that predict future symptom risk, reviewed in sprint planning (7-day p95 CPU > 80%, disk growth rate, cardinality budget burn, retention head-room) → Jira ticket, no page. 3. Pre-page warnings (rare): causes with proven correlation to imminent symptoms, low-urgency channel, business hours only (queue projected to exhaust disk in 6h, cert expiring in 7d, certificate-pinning miss rate climbing) → Slack #ops-warnings, no overnight page.
Illustrative — not measured data. Cause measurements still get collected, ingested, and surfaced — they just route to dashboards (post-page diagnosis), tickets (sprint-planning capacity work), and rare warning channels (business-hours-only). None of these route to the on-call's bedside phone.
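
To make the capacity-ticket tier concrete, here is a minimal sketch of the kind of check that opens a ticket instead of paging. The 7-day window, the p95-over-80% threshold, and the synthetic week of samples are illustrative assumptions, not a prescribed rule.

# capacity_ticket_check.py — a long-window cause excursion routed to a ticket,
# never a page (window, percentile, and threshold are illustrative)
import numpy as np

def needs_capacity_ticket(cpu_utilisation: np.ndarray,
                          percentile: float = 95,
                          threshold: float = 0.80) -> bool:
    """cpu_utilisation: one week of per-minute samples in [0, 1]."""
    return float(np.percentile(cpu_utilisation, percentile)) > threshold

# a synthetic week of CPU samples hovering around 74% with occasional spikes
week_of_cpu = np.clip(np.random.normal(0.74, 0.10, 7 * 24 * 60), 0.0, 1.0)

if needs_capacity_ticket(week_of_cpu):
    # in a real pipeline this would call the ticketing API during business hours
    print("7-day p95 CPU above 80% -- open a capacity ticket for next sprint")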

How a real team migrates from cause-heavy to symptom-based

The migration is a six-step path that most teams have to walk, not a single config change. The order matters — skipping steps produces either alert chaos (the team panics and reverts) or false confidence (the team thinks they migrated but kept the worst cause alerts wrapped in symptom-shaped names).

Step 1 — inventory. Pull every alert rule from your alertmanager / Datadog monitor / PagerDuty service config and tag each as symptom, cause, or composite. The first time a team does this, the count is usually shocking — a Cleartrip-pattern booking team might find 280 alerts in production with 31 symptom-based and 249 cause-based or composite. The inventory is its own deliverable; teams sometimes stop here for a week and discuss before touching anything.
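
A minimal sketch of what the inventory step can look like in practice, assuming Prometheus-style rule files; the keyword lists are a naive first-pass heuristic, and every rule still needs a human to confirm the tag.

# alert_inventory.py — first-pass tagging of alert rules as symptom vs cause
# pip install pyyaml; rule-file layout follows Prometheus's groups/rules format
import sys
import yaml

SYMPTOM_HINTS = ("error_rate", "success_rate", "latency", "availability", "slo")
CAUSE_HINTS = ("cpu", "memory", "heap", "replica_lag", "consumer_lag",
               "queue_depth", "disk", "gc_", "pod_restart", "connection_pool")

def tag(expr: str) -> str:
    e = expr.lower()
    if any(h in e for h in SYMPTOM_HINTS):
        return "symptom"
    if any(h in e for h in CAUSE_HINTS):
        return "cause"
    return "review-by-hand"

for path in sys.argv[1:]:
    with open(path) as fh:
        doc = yaml.safe_load(fh)
    for group in doc.get("groups", []):
        for rule in group.get("rules", []):
            if "alert" in rule:  # skip recording rules
                print(f"{tag(rule['expr']):<15} {rule['alert']}")

Run it as python3 alert_inventory.py rules/*.yml and start the team discussion from the "review-by-hand" bucket; that bucket is usually where the composites hide.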

Step 2 — define the SLI-SLO contract for the user-facing endpoints. This is Part 10 work, not Part 11 work. Without an SLI, you cannot derive the symptom alerts that will replace the cause alerts. Teams that try to skip this step end up reinventing thresholds by guessing — which is how the cause alerts got there in the first place.

Step 3 — write the symptom alerts and let them coexist with cause alerts for two weeks. Don't disable anything yet. The two weeks of dual-running produce the comparison data: which symptom alerts caught real incidents the cause alerts missed, which cause alerts fired without a corresponding symptom alert, which incidents fired both. Karan's team at PaisaBridge runs this comparison every quarter; the data is what shifts management opinion when SREs argue for the rule.

Step 4 — demote cause alerts in batches of 10. Move them from paging to ticket-only or dashboard-only, 10 at a time, two weeks per batch. After each batch, count incidents missed and pages saved. The "incidents missed" count is almost always 0 because the symptom alerts catch everything that has user impact — but counting honestly is what gives the team confidence to demote the next 10. Demoting all 249 at once produces panic; demoting 10 at a time produces evidence.

Step 5 — write the runbooks that the symptom alerts link to. A symptom alert without a runbook reads as "your users are sad — figure it out". A symptom alert with a runbook reads as "your users are sad; here is the dashboard with cause panels, here is the rollback playbook, here is the upstream-vendor escalation if NPCI is red". The runbook is what makes the symptom alert more actionable than the cause alert it replaced, not less.
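
A sketch of what a runbook-linked symptom alert can look like, written here as the Python dict a rule-generation script would render to Prometheus YAML; the metric names, URLs, threshold, and escalation text are hypothetical.

# generate_checkout_rule.py — a symptom alert whose annotations carry the
# runbook, the cause dashboard, and the vendor escalation path
# pip install pyyaml; names, URLs, and thresholds below are hypothetical
import yaml

checkout_error_rate = {
    "alert": "CheckoutErrorRateHigh",
    "expr": (
        "sum(rate(http_requests_total{job='checkout',code=~'5..'}[5m])) "
        "/ sum(rate(http_requests_total{job='checkout'}[5m])) > 0.01"
    ),
    "for": "5m",
    "labels": {"severity": "page", "team": "payments"},
    "annotations": {
        "summary": "Checkout error rate above 1% for 5 minutes; users are failing to pay",
        "runbook_url": "https://wiki.internal/runbooks/checkout-error-rate",
        "dashboard": "https://grafana.internal/d/checkout-causes",
        "escalation": "If the NPCI status page is red, page the NPCI liaison; otherwise treat as ours",
    },
}

print(yaml.safe_dump(
    {"groups": [{"name": "checkout-symptoms", "rules": [checkout_error_rate]}]},
    sort_keys=False,
))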

Step 6 — train the rotation on the new regime. The on-call's first week under symptom-only alerting feels eerily quiet. Some engineers panic and re-enable cause alerts they shouldn't. The training is: when the dashboard shows a cause spiking and the symptom is calm, do nothing. The cause spike is information; the symptom calm is the contract. This is the discipline that the rule names but does not enforce — only training and the on-call's accumulated trust in the symptom alerts can enforce it.

The Razorpay-style migration takes 3–6 months from inventory to stable steady-state. Teams that try to compress it into a sprint usually revert. The patience is the price of getting the cultural shift, not just the config change.

Edge cases and where the rule bends

The rule is robust but not absolute, and the SRE book is honest about the places where pure symptom-only alerting fails — exactly the places real teams keep running into after they adopt the rule, and need to be ready for.

Black-box symptoms with low traffic. A symptom alert needs traffic to fire. A regional checkout endpoint that serves 12 requests a minute during off-peak hours can be completely broken for 8 minutes and produce only 96 errors — below the noise floor of an error_rate > 1% rule that needs hundreds of samples per minute to compute reliably. The fix is synthetic checks — Part 11's chapter on probes will cover this — that drive constant traffic against the same endpoint so the symptom signal is always observable. The cause-based replacement people sometimes reach for ("alert on kubectl get pods failing") is the wrong fix because it pages on internal state again; the right fix is to manufacture symptom signal.
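
A minimal sketch of the synthetic-check idea, assuming a hypothetical health endpoint; a real probe would ship its result into the same metrics pipeline the symptom alert reads from rather than printing it.

# synthetic_probe.py — manufacture symptom signal for a low-traffic endpoint
# (the endpoint URL and probe interval are illustrative)
import time
import urllib.error
import urllib.request

ENDPOINT = "https://checkout.internal.example/healthz/synthetic-order"

def probe_once(timeout_s: float = 2.0) -> tuple[bool, float]:
    """One synthetic request; returns (success, latency in seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(ENDPOINT, timeout=timeout_s) as resp:
            ok = 200 <= resp.status < 400
    except (urllib.error.URLError, TimeoutError):
        ok = False
    return ok, time.monotonic() - start

if __name__ == "__main__":
    # one probe every 10 seconds is ~8,640 samples a day even at zero organic
    # traffic, enough for the same error-rate rule used on busy endpoints
    while True:
        ok, latency = probe_once()
        print(f"probe ok={ok} latency={latency:.3f}s")
        time.sleep(10)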

Causes that are themselves user-visible. Some "causes" are also symptoms. Disk full on the user-uploads bucket is technically a cause, but the next user upload will fail, which is a symptom, so the disk-full alert is effectively a symptom alert one second early. The rule does not preclude alerting on causes that have a clean, fast symptom mapping; it precludes alerting on causes whose symptom mapping is uncertain or slow. The test is "will this cause produce a user-visible symptom within the time it takes me to respond?". If yes, treat it as a symptom alert. If no, route it to a dashboard.

Predictive saturation curves. A connection pool at 95% utilisation does not break anything — the pool is doing its job — but extrapolating the trend says it will saturate in 15 minutes and start rejecting connections. The pure-symptom rule says "wait for the rejections to start, then page". The pragmatic compromise is a warning-tier page on the projection (not the current state) with a tight enough threshold that the warning fires only when the projection is robust. This is the third tier in the diagram above. It bends the rule but preserves the spirit: you are paging on predicted symptom impact, not on current internal state.
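
The projection itself is a small extrapolation (PromQL's predict_linear() does the same job). The sketch below assumes per-minute pool-utilisation samples and an illustrative warning threshold; neither is prescriptive.

# saturation_projection.py — warn on projected saturation, not current state
# (sampling interval, trend window, and warning threshold are illustrative)
import numpy as np

def minutes_to_saturation(utilisation: np.ndarray,
                          sample_interval_s: int = 60,
                          ceiling: float = 1.0) -> float:
    """Fit a line to recent utilisation samples and extrapolate to the ceiling.
    Returns inf when the trend is flat or falling."""
    t = np.arange(len(utilisation)) * sample_interval_s
    slope, _ = np.polyfit(t, utilisation, 1)
    if slope <= 0:
        return float("inf")
    return (ceiling - utilisation[-1]) / slope / 60

# the last hour of connection-pool utilisation, climbing roughly 0.3% per minute
recent = 0.75 + 0.003 * np.arange(60) + np.random.normal(0, 0.005, 60)
eta_min = minutes_to_saturation(recent)
if eta_min < 6 * 60:
    # low-urgency warning channel, business hours only; never the pager
    print(f"pool projected to saturate in ~{eta_min:.0f} minutes")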

Composite alerts that hide the symptom. A common pattern is "page if CPU > 85% AND latency > 500ms" — meant to be smarter than either alert alone. The compound alert is technically a symptom alert (latency is the symptom) but the AND clause adds a cause condition that can hide real incidents where latency is high without CPU being high (a network partition, a downstream timeout, a thread pool exhaustion in a different service). The SRE book's advice is to keep the symptom condition pure and use the cause as enrichment in the alert annotation, not as an AND-gate that filters out real incidents.

Multi-tenant services where one tenant's symptom is another tenant's noise. A Dream11-pattern fantasy-sports platform serves user contests where the leaderboard for the IPL final is high-stakes and the leaderboard for a Sunday afternoon Ranji-Trophy match is low-stakes — both flow through the same /leaderboard endpoint and share an aggregate symptom alert. A 30-second leaderboard staleness during the IPL toss is page-worthy; the same staleness during Ranji is dashboard-worthy. The fix is per-tenant or per-segment SLO derivation: the symptom alert is split into IPL-leaderboard-staleness and other-leaderboard-staleness, each with its own threshold. The rule does not say "one symptom alert per service"; it says "page on user-visible symptom" and the user-visible symptom may stratify by tenant in workloads where tenants have different reliability contracts. This is the connective tissue between the symptom rule and the per-tenant SLI work in Part 10.

The page budget — symptom alerts make it possible to count

The hidden second-order benefit of symptom-based alerting is that page count becomes a measurable, defensible budget for the first time. With cause-heavy alerting, page count is whatever the alert ruleset happens to produce — there is no defensible target because there is no clean unit of "incident". Symptom alerts give you the unit: one symptom alert fires once per real user-visible incident (with the multi-window burn-rate refinement that ch. 65 details). Page count becomes a number you can target.

The Google SRE book proposes a hard budget: roughly 2 pages per shift on average, no more than one off-hours page per week per engineer. The logic is empirical — beyond those rates, on-call sleep degrades, decision quality degrades, and post-incident review quality degrades. The budget is enforceable only when pages correspond to user-visible incidents; cause-based regimes produce page rates that are functions of infrastructure churn rather than user pain, and there is no honest way to budget those.

Once a team has the budget, two derived disciplines become possible. Page-rate retrospectives — every month, the team computes pages per engineer per week and asks why it was above or below budget. Above means a symptom alert is too noisy (too short a for: window, too low a threshold, too narrow a burn-rate window); below means there might be incidents being missed (rarely the case in practice, but worth checking against customer support tickets). Alert-graveyard reviews — every quarter, the team reviews alerts that fired more than 5× and produced no human action. Those alerts get demoted, threshold-tuned, or merged with adjacent alerts. The graveyard review is impossible under cause-based alerting because every alert plausibly indicated something — the human action it failed to produce was real cognitive triage, just consistently fruitless. Why the budget shifts management conversations: leadership often hears "we need to invest in observability" as a request for more tools and more dashboards. The page budget reframes the request as a measurable engineering quality metric: pages-per-engineer-per-week is on the same dashboard as deployment frequency and change-failure rate, and a number going from 18 to 4 is a quarterly OKR. The Razorpay-pattern migration above lives or dies by whether the team can show the page-rate graph descending in the steering meeting.
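
The budget arithmetic is deliberately simple. A sketch of the monthly retrospective computation follows, with a hypothetical page log and an assumed budget of two pages per engineer per week.

# page_budget.py — pages per engineer per week, the number the budget governs
# (the page log entries and the budget value are hypothetical)
from collections import Counter
from datetime import datetime

BUDGET_PER_ENGINEER_PER_WEEK = 2

# (timestamp, alert, acked_by) -- in practice exported from PagerDuty/Opsgenie
page_log = [
    ("2024-07-01T02:14:00", "CheckoutErrorRateHigh", "aditi"),
    ("2024-07-03T11:20:00", "UPISuccessRateLow", "karan"),
    ("2024-07-09T23:42:00", "CheckoutErrorRateHigh", "aditi"),
]

pages_per_week = Counter()
for ts, _alert, engineer in page_log:
    week = datetime.fromisoformat(ts).isocalendar().week
    pages_per_week[(engineer, week)] += 1

for (engineer, week), pages in sorted(pages_per_week.items()):
    verdict = "OVER BUDGET" if pages > BUDGET_PER_ENGINEER_PER_WEEK else "ok"
    print(f"ISO week {week:>2}  {engineer:<8} pages={pages}  {verdict}")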

The Zerodha-style trading-platform constraint sharpens the budget further: market-hours pages cost 3× the engineering-time of off-hours pages (because a market-hours page also costs trading-system attention from the entire SRE team during peak risk window), and post-market-close pages cost 0.4× (because the system is in a low-risk state where on-call latency tolerance is higher). Teams that segment the budget by clock — "no market-hours pages of severity SEV2 or below" — extract a further 40% reduction in disruption without changing the alert ruleset, just by routing low-severity symptom alerts to delayed delivery. This is downstream of the symptom rule but only possible because the rule first made page count sensible.

Common confusions

  • "A symptom alert is just a customer-experience alert." Not exactly. A symptom is anything in the SLI-SLO contract — including internal contracts with downstream services that themselves serve users. The webhook-delivery success rate to a merchant is a symptom; the merchant's user feels it indirectly. The rule operates at every contract boundary, not only at the consumer-facing edge.
  • "Cause alerts give you earlier warning than symptom alerts." Sometimes — but the simulation in §2 shows that on real workloads cause alerts produce 49× more pages while catching 0 real incidents. Earlier paging is not the same as earlier detection; multi-window burn-rate symptom alerts (Part 10 ch.65) detect incidents 1–5 minutes after they start, which beats almost every cause-based prediction in practice.
  • "You should not collect cause metrics." The opposite. Cause metrics are essential — for dashboards, capacity planning, post-incident debugging, and the rare warning-tier page. The rule is about which signals page humans, not which signals get measured.
  • "Symptom-based means low cardinality." No relationship. A symptom alert can be high-cardinality (per-merchant capture-success rate) or low-cardinality (global checkout error rate). The cardinality budget (Part 6) and the symptom/cause distinction are orthogonal axes.
  • "This applies only to user-facing services, not data pipelines." Data pipelines have user-visible contracts too — "the daily payouts file is delivered by 09:00 IST" is a symptom; "the Spark job has 3 failed tasks" is a cause. Late-arrival-of-output is the symptom; task-level failures are causes that may or may not produce a late delivery. The rule transfers cleanly to data engineering.
  • "If we follow this we will miss internal problems before they hurt users." The rule does not say "ignore internal problems"; it says "do not wake humans for them". The capacity-ticket tier (§3) is exactly the path that catches internal problems before they hurt users — without paging.

Going deeper

The Site Reliability Workbook's four-question test and the Shakespeare worked example

Chapter 5 of the Site Reliability Workbook (the operational companion to the SRE book) gives a four-question test that every alert must pass before it earns paging status: Does it detect an otherwise undetected condition? Is the condition urgent? Is the on-call's only effective response a human action? Does that action reduce harm? The first question is the symptom-vs-cause filter — a cause-based alert frequently fails it because the symptom alert already detects the same condition. The second filters out warnings that should be tickets. The third filters out auto-recoverable conditions (the Kafka lag that the consumer drains in 30 seconds). The fourth filters out alerts whose only response is "look at it and confirm it's still happening". Run the test on every cause-based alert in your repo; most of them fail at least one of the four.

The same chapter walks through alert design for a fictional "Shakespeare" search service end-to-end as the worked example, and the final symptom set is illuminating: out of 12 measured signals, exactly 4 page (search-success rate, search-latency p99, fresh-results staleness, search-availability) and the other 8 — connection pool, cache hit rate, GC pause, replica lag, Kafka backlog, queue depth, CPU, memory — feed dashboards and capacity tickets. The 4-page-worthy set was not chosen by intuition; it was derived from the 4 SLIs the team committed to with their internal users. This is the worked answer to "how do I know which signals deserve to page": derive them from the SLI-SLO contract, never invent them from an inventory of "things that could go wrong".

How the rule interacts with the multi-window burn-rate scheme

The multi-window-multi-burn-rate scheme (Part 10 ch.65) is itself a symptom alert — it pages on error-budget consumption rate, which is derived from the user-visible SLI. The two disciplines compose: every burn-rate alert is symptom-based; not every symptom alert is burn-rate-based (a sharp absence-of-traffic alert, for instance, is symptom-based but threshold-based, not burn-rate-based). When a team migrates from cause-heavy alerting to the SRE-book regime, the symptom alerts they keep often migrate again — from threshold to burn-rate — because the burn-rate window collapses the noise the threshold rule produces. The two migrations are sequential, not the same migration. Part 11 chapters 70 onwards walk through both.

A specific composition worth knowing: the fast-burn window (1h, threshold 14.4×) catches sharp incidents that any threshold-based symptom alert would also catch, but with a tighter false-positive rate because it integrates against budget rather than against absolute rate. The slow-burn window (6h, threshold 6×) catches degradations that no threshold-based symptom alert can catch — the 4-hour 0.5%-error degradation that the simulation in §2 missed is exactly the case the slow-burn window is built for. The take-away for a team adopting symptom-based alerting today: the threshold rule is the simple version of the discipline; the burn-rate rule is the version that scales to the long-window incidents.
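
A sketch of the same comparison run through burn-rate arithmetic, reusing the incident shapes from the earlier simulation (re-seeded here so the snippet runs standalone); the window sizes are the common fast/slow pair, and the printed peaks depend on the seeded incident shapes, so treat the numbers as illustrative.

# burn_rate_windows.py — rolling burn rate over the same 30-day error series
# (burn rate = windowed error rate / error budget; windows are illustrative)
import numpy as np

SECONDS_PER_DAY, DAYS = 86400, 30
n = SECONDS_PER_DAY * DAYS
SLO = 0.999
budget = 1 - SLO

# same incident shapes as alert_rate_simulation.py
error_rate = np.full(n, 0.0003)
inc1 = 6 * SECONDS_PER_DAY + 11 * 3600 + 20 * 60
error_rate[inc1:inc1 + 14 * 60] = 0.08        # sharp: 14 minutes at 8%
inc2 = 14 * SECONDS_PER_DAY + 14 * 3600
error_rate[inc2:inc2 + 4 * 3600] = 0.005      # slow burn: 4 hours at 0.5%

def trailing_mean(x: np.ndarray, window_s: int) -> np.ndarray:
    """Trailing mean over window_s seconds, via cumulative sums."""
    c = np.cumsum(np.insert(x, 0, 0.0))
    out = np.full(len(x), np.nan)
    out[window_s - 1:] = (c[window_s:] - c[:-window_s]) / window_s
    return out

burn_1h = trailing_mean(error_rate, 3600) / budget
burn_6h = trailing_mean(error_rate, 6 * 3600) / budget

print(f"peak 1h burn rate: {np.nanmax(burn_1h):.1f}x")   # dominated by the sharp spike
print(f"peak 6h burn rate: {np.nanmax(burn_6h):.1f}x")   # lifted by the 4h slow burn
# the slow burn never crosses the 1% threshold rule, but it consumes budget at
# roughly 5x for hours -- exactly the signal the longer burn-rate windows integrate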

The Hotstar IPL final case — why a hypothetical 80-cause regime would have failed

Imagine a Hotstar-pattern streaming team approaching an IPL final at 25M expected concurrent viewers, 80 microservices in the request path, 6 regions. If a team had shipped 80 cause-based alerts (one per service-region matrix entry: CPU saturation, replica lag, queue depth) the moment the toss happened and traffic surged, dozens of cause alerts would have fired simultaneously — every regional pod-autoscaler would have triggered CPU thresholds, every replica would have lagged briefly during the warmup, every Kafka consumer would have queued during the surge. The on-call rotation would have been buried. By contrast, a team that kept paging restricted to 6 symptom alerts (one per region: stream-start success rate, p99 frame latency, ad-completion rate) would receive 0–2 pages during the surge — only when a region's user-visible stream-start rate genuinely dropped. The cause data is still on the dashboards for diagnosis when the symptom fires; the cause data is not on the on-call's phone. This is the regime that survives surge events; the cause-heavy regime collapses under them.

The mechanism behind the collapse is straightforward: traffic surge events shift internal-state distributions far from baseline (CPU goes from 35% to 78%, replicas catch up at the rate of the WAL, autoscalers add pods that warm up over 90 seconds) without necessarily shifting user-visible outcomes. Cause alerts trip on the internal shift; symptom alerts wait for the user-visible shift, which during a well-engineered surge does not happen. The teams that survive IPL finals, Big Billion Days, Tatkal hours, and T20 toss spikes are the teams that have learned not to wake humans for expected internal-state changes — and learning that requires the symptom rule first.

Handling false-symptom alerts (NPCI-side errors, mis-instrumented symptom metrics)

Symptom alerts are not immune to false positives. A UPISuccessRate < 99% alert can fire because NPCI itself is degraded — the symptom is real, the user does feel it, but the engineering team cannot fix it. The right response is not to remove the symptom alert; it is to enrich the alert annotation with a fast-path runbook ("if NPCI status page is red, escalate to NPCI liaison; otherwise treat as our incident"). Removing symptom alerts because some incidents are vendor-side is the path back to cause-based regret. The SRE book is explicit on this: a symptom that the user feels deserves a page even if the response is to file an upstream-vendor ticket — the cost of the page is the on-call's accountability for knowing about the user impact, regardless of who fixes it.

A second false-symptom shape is the measurement-side false positive — the symptom metric itself is mis-instrumented, miscounting errors that did not happen or missing errors that did. A common version: a Flask app exports http_requests_total{status="500"} from the WSGI middleware, but a downstream timeout returns 504 and the middleware does not classify it as an error because Flask never raised an exception. The symptom alert reads 0% errors during a real outage. The fix is not "switch to cause-based" but "fix the symptom instrumentation" — usually by computing the SLI from the load-balancer's view rather than the application's view, since the load balancer sees the user experience more honestly than any application middleware can. The discipline of trusting the symptom signal depends on the symptom signal being accurate; instrumenting it correctly is the precondition the SRE book assumes but does not always make explicit.
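
A sketch of the load-balancer-view computation, assuming an access log in nginx's default combined format; the status-field index changes with the log format, and a production pipeline would stream this continuously rather than re-reading a file.

# lb_view_sli.py — compute the error-rate SLI from the load balancer's access
# log, so gateway-generated 504s count even when the app never raised
# (status_field=8 fits nginx's combined format; adjust for your LB)
def sli_from_access_log(path: str, status_field: int = 8) -> float:
    total = errors = 0
    with open(path) as fh:
        for line in fh:
            parts = line.split()
            if len(parts) <= status_field:
                continue
            total += 1
            if parts[status_field].startswith("5"):
                errors += 1
    return (1.0 - errors / total) if total else 1.0

# example (hypothetical path):
# print(f"capture-success SLI (LB view): {sli_from_access_log('access.log'):.4f}")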

The cultural argument — why symptom-based alerting changes engineering decisions, not just paging

The deeper effect of the rule is on what engineering teams decide to fix. Under cause-heavy alerting, the team's attention is dragged toward whichever cause alert fires loudest — a CPU saturation that pages 12 times a week becomes the focus of capacity work even when no user is feeling it, while a slow degradation in a low-traffic regional endpoint goes unaddressed because no cause alert covers it. Engineering attention is paged into the wrong place. Under symptom-based alerting, engineering attention follows user pain by construction: the regional slow burn produces a burn-rate symptom page that is the only page that fires that week, and the team fixes the user-facing problem because that is the only thing alerting is asking them to fix. The migration from cause-heavy to symptom-based is also a migration from "fix what's loud" to "fix what users feel" — and the second is what the SRE book means when it says alerting is a tool for engineering culture, not just for incident response.

# Reproduce this on your laptop
# requires Python 3.11+
python3 -m venv .venv && source .venv/bin/activate
pip install numpy
python3 alert_rate_simulation.py
# expect: 1 symptom page, 49 cause pages, 30-day window
# then mutate the seeded incidents: try a slow burn that exceeds 5 hours,
# add a cause excursion that does correlate with the symptom, and watch
# how the symptom alert continues to catch real incidents while cause
# alerts continue to produce ~50× the page count.

The exact numbers shift with the random seed and the synthetic incident shapes, but the ratio is robust — across reasonable workload assumptions, cause rulesets produce 30–60× more pages than symptom rulesets while catching the same or fewer real incidents.

Where this leads next

The next chapter — /wiki/cause-based-alerts-and-their-failure-modes — inverts this view: a deep look at what goes wrong in the teams that have not yet adopted the rule, including the page-fatigue feedback loop and the "every dashboard panel becomes an alert" anti-pattern. After that, Part 11 builds out the practical apparatus: alert routing (/wiki/alert-routing-and-on-call-rotation), runbook design, the four-question test as a checklist, and the on-call-sanity discipline that the symptom rule makes possible.

The deeper composition with Part 10 — /wiki/burn-rate-alerting and /wiki/multi-window-multi-burn-rate-alerts — shows how the burn-rate scheme inherits the symptom-vs-cause discipline and refines the symptom side further; the chapter on /wiki/socializing-slos-without-bureaucracy covers how to get organisational buy-in for cutting cause alerts when senior engineers have reflexively shipped them for years.

A subtle forward-link: the chapter on synthetic probes (/wiki/synthetic-monitoring-and-blackbox-probes, planned for later in Part 11) closes the low-traffic gap named in the edge-cases section above — manufactured traffic that produces symptom signal during off-peak hours. Without synthetic probes, the symptom rule has a blind spot at low QPS; with them, the rule scales down to the smallest endpoints in the service catalogue. The two chapters are paired and the reader who only adopts one will rediscover the gap the second one closes.

A final downstream pointer: once the symptom-rule and burn-rate refinement are in place, the next natural question is who gets the page — Part 11's chapters on rotation design and severity-based routing — and that question becomes tractable only because the page-rate is now sensible. A team paging 49 times a week on causes cannot sensibly design rotations because every shape collapses under that load; a team paging 4 times a quarter on symptoms can debate primary-vs-secondary, follow-the-sun, hybrid escalation, and severity-based routing as design choices rather than survival tactics. The symptom rule is the precondition that makes the rest of Part 11 worth reading.

References