Socializing SLOs without bureaucracy

Aditi is the SRE lead at a mid-stage Indian fintech that processes ₹400 crore of UPI volume a day. She spent four months in 2024 building beautifully derived MWMBR rules — burn-rate constants from chapter 66, four windows, two thresholds, the lot. She shipped them on a Tuesday. By Friday, the product manager for the merchant-onboarding flow had quietly muted the alerts because they were paging her phone during launch week. The application team had never read the runbook. The platform team didn't know which on-call rotation owned the new severity. By the next Monday's incident review, three things were true at once: the math was correct, the alerts were firing on real degradations, and nobody was responding to them. The SLO had become decorative. Most SLO programs at Indian companies — Razorpay, Swiggy, Cred, Dream11 — die exactly here. Not because the engineering is wrong, but because four teams need to agree and there is no agreement protocol that works without becoming a steering committee.

An SLO has four owners who rarely sit in the same room: product owns the target, SRE owns the burn-rate math, platform owns alert routing, application teams own the runbook. The bureaucratic answer is a steering committee that meets monthly and ships nothing. The non-bureaucratic answer is a one-page SLO contract that names exactly one person per role, a 90-minute kickoff that produces it, and a quarterly review whose only job is to delete SLOs that nobody acted on. Everything else is overhead.

The four-owner problem — why SLOs are organisationally hard

Every published SLO is a contract whose terms touch four teams. Pretending otherwise is what makes the contract unenforceable. Walk through what each role actually controls and the problem becomes visible.

Product decides the target. "99.9% over 28 days" is a business choice — it constrains how much the engineering teams can change, how often the merchant-onboarding flow can be down, how much the on-call burden costs in human time. A product manager who has never been asked "what failure rate is acceptable" usually answers "100%" or "I don't know" — both of which are non-answers. Without product saying a number, the SRE team picks one (typically too aggressive), and the first time it pages on a tolerable failure mode the product team disowns the SLO. The number must come from product, in writing, after a conversation about cost.

SRE owns the burn-rate math. Once the target is set, the four-window MWMBR thresholds (14.4 and 6, derived in chapter 66 from the 28-day budget formula) are deterministic. SRE writes the recording rules, the alert rules, the validation script. This is the part of the work that is genuinely engineering — but it is also the part that everyone outside SRE assumes is "just configuration" and skips reviewing.
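The constants themselves are mechanical once the target is fixed. A minimal sketch of the derivation, using the common "X% of budget consumed in H hours" form from the Google SRE Workbook — the chapter-66 derivation may differ in detail:

```python
# Hedged sketch: burn-rate thresholds as "X% of budget consumed in H hours".
def burn_rate_threshold(budget_fraction: float, alert_window_h: float,
                        slo_window_days: int = 30) -> float:
    """Burn rate at which `budget_fraction` of the whole SLO window's
    error budget is consumed within `alert_window_h` hours."""
    return budget_fraction * (slo_window_days * 24) / alert_window_h

# The familiar constants fall out of a 30-day window:
print(burn_rate_threshold(0.02, 1))   # 2% of budget in 1h  -> ~14.4 (page)
print(burn_rate_threshold(0.05, 6))   # 5% of budget in 6h  -> ~6.0  (ticket)
```

A strict 28-day window gives 13.44 and 5.6 for the same budget fractions; most teams keep the round 30-day constants 14.4 and 6 anyway, since the thresholds are heuristics, not exact budget accounting.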

Platform owns the alert-routing rails — alertmanager configuration, PagerDuty integration, severity-to-rotation mapping, escalation policies. A correctly-fired alert that routes to a sleeping rotation or a deprecated Slack channel is worse than no alert at all, because it gives false confidence to the SRE and product teams that "the alarm went off". Platform is the team that knows which rotation team=delivery-platform, severity=page resolves to this week, after last month's reorg moved two engineers.
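The artefact platform owns is a short route block in alertmanager. A hedged sketch — receiver names and the PagerDuty key are illustrative, not a real config:

```yaml
# Illustrative alertmanager fragment — receiver names and keys are examples
route:
  routes:
    - matchers: ['team = delivery-platform', 'severity = page']
      receiver: onboarding-svc-oncall
receivers:
  - name: onboarding-svc-oncall
    pagerduty_configs:
      - routing_key: "<events-api-v2-key>"
```

The failure mode described above is precisely when this mapping rots: the matcher still matches, but the receiver points at a rotation that no longer exists after a reorg.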

Application teams own the runbook. When the page fires at 02:47 IST, an on-call engineer on the team that owns the failing service must know what to do. Not the SRE team — they don't have the domain context for why the merchant-onboarding 99.9% SLO is breached when 0.3% of new merchants are stuck in a particular state. The runbook must be written by the team that owns the code path, kept current by the team that owns the code path, and executed by the on-call engineer of that team.

[Figure: the four owners of any published SLO. Four labelled boxes around a central "SLO contract — one page, four signatures" box. Top: product owns the target (99.9%) — named PM: Riya, merchant-onboarding. Right: SRE owns the burn-rate math — named SRE: Karan, SLO platform. Bottom: application team owns the runbook — named lead: Asha, onboarding-svc. Left: platform owns alert routing — named lead: Dipti, alerting-rails. Arrows from each role into the central SLO show what they contribute (target, math, routing, runbook); arrows back show what they receive.]
Illustrative — not measured data. The four-owner diagram. The SLO contract sits in the centre; each role contributes one artefact and is represented by exactly one named person. If any of the four is "TBD" or "the team", the SLO is not yet socialized.

The bureaucratic failure mode is the cross-functional steering committee — twelve people from four teams meeting monthly, debating SLO targets in the abstract, never shipping anything. The non-bureaucratic answer is one named person per role, on one piece of paper, signed. If you cannot find one named person willing to sign, the SLO is not ready to ship.

Why "named person" instead of "team": at every Indian company that has tried RACI matrices for SLOs (Flipkart 2019, Razorpay 2022, Swiggy 2023), the team-level ownership immediately diffuses — when the page fires, three engineers on the named team each assume one of the others picked it up. Naming a single human, with their backup explicitly listed, breaks the diffusion. The contract reads: "Primary on-call: Asha Menon, +91 90080 41XXX. Backup: Rahul Iyer, +91 91470 23XXX. If both unreachable, escalate to Aditi Singh (engineering manager)." This level of specificity is uncomfortable on a Google Doc and absolutely necessary in production.

The one-page SLO contract — what actually goes on it

A published SLO at scale needs exactly one document, kept short enough that all four owners will read it before signing. The contract has six sections and fits on one page. Anything more becomes a wiki page that nobody re-reads.

# slo-contract-merchant-onboarding-v1.yaml
slo_id: merchant-onboarding-success-rate
version: 1
effective_from: 2026-04-15
review_at: 2026-07-15           # quarterly — see §"Quarterly review"

# 1. Business intent — written by product, ≤2 sentences, no jargon
business_intent: |
  New merchants completing onboarding must succeed at >=99.9% over a 28-day window.
  A breach means stuck KYC files, lost merchant acquisition, and a reportable
  RBI compliance gap.

# 2. SLI definition — written by SRE, references real metrics
sli:
  description: "Fraction of merchant_onboarding_attempts that reach state=completed within 24h"
  numerator:   sum(rate(merchant_onboarding_state_total{state="completed"}[28d]))
  denominator: sum(rate(merchant_onboarding_state_total{state=~"completed|stuck|failed"}[28d]))
  exclusions: ["KYC_TIER_2_MANUAL_REVIEW", "RBI_DOWNTIME_NPCI_REPORTED"]

# 3. Target & windows — product-set number, SRE-derived constants
target_pct: 99.9
window_days: 28
budget_pct: 0.1                 # 1 - target = 0.1%
mwmbr_thresholds:
  fast_burn_rate: 14.4          # for: 2m, page channel
  slow_burn_rate: 6.0           # for: 15m, ticket channel
  derivation_ref: /wiki/multi-window-multi-burn-rate-alerts

# 4. Owners — one named human per role, with backup
owners:
  product:        { primary: "Riya Menon", backup: "Vikram Joshi", team: "merchant-product" }
  sre:            { primary: "Karan Patel", backup: "Asha Gupta",  team: "platform-sre" }
  platform:       { primary: "Dipti Rao",   backup: "Suresh K.",    team: "alerting-rails" }
  application:    { primary: "Asha Menon",  backup: "Rahul Iyer",   team: "onboarding-svc" }

# 5. Routing — platform team writes this, must match alertmanager config
routing:
  page_channel:    pagerduty:onboarding-svc-oncall
  ticket_channel:  jira:ONBOARD-SLO-BACKLOG
  escalation_after_15m: pagerduty:onboarding-em-escalation

# 6. Runbook — application team writes this, link must resolve
runbook_url: https://wiki.fintech.in/runbooks/onboarding-slo-burn
runbook_last_updated: 2026-04-12
runbook_last_drilled: 2026-03-28   # see §"The drill"

signatures:
  - { name: "Riya Menon",  role: product,     signed_at: 2026-04-15 }
  - { name: "Karan Patel", role: sre,         signed_at: 2026-04-15 }
  - { name: "Dipti Rao",   role: platform,    signed_at: 2026-04-15 }
  - { name: "Asha Menon",  role: application, signed_at: 2026-04-15 }

The whole contract is 50 lines of YAML. It lives in a git repo (slo-contracts/merchant-onboarding-success-rate.yaml), goes through code review like any other production artefact, and the alert-rule generator (sloth, pyrra, or in-house) reads it directly to produce the Prometheus rules. The contract is the source of truth. There is no separate wiki page that drifts.
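The contract-to-rules step is small enough to sketch. The shape below assumes an in-house generator — sloth and pyrra have their own input formats — and the function takes the already-parsed contract (`yaml.safe_load` of the file):

```python
# Hedged sketch: emit the fast-burn Prometheus alert rule from a parsed contract.
# An in-house generator shape is assumed; sloth/pyrra use their own formats.
def fast_burn_rule(c: dict) -> dict:
    """Contract dict (yaml.safe_load of the contract file) -> alert rule dict."""
    sli = c["sli"]
    # error ratio = 1 - numerator/denominator, re-windowed for the fast alert
    err = (f'1 - ({sli["numerator"]} / {sli["denominator"]})'
           .replace("[28d]", "[1h]"))
    # burn rate 14.4 on a 0.1% budget -> alert when error ratio exceeds 1.44%
    threshold = round(c["mwmbr_thresholds"]["fast_burn_rate"]
                      * c["budget_pct"] / 100, 6)
    return {
        "alert": f'{c["slo_id"]}-fast-burn',
        "expr": f"{err} > {threshold}",
        "for": "2m",
        "labels": {"severity": "page", "slo_id": c["slo_id"]},
        "annotations": {"runbook": c["runbook_url"]},
    }
```

Because the rule is derived, there is nothing to hand-edit: change the contract, re-run the generator, and the alert follows.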

The Python below shows the validator that every contract goes through before the alert-rule generator will accept it — it catches the half-completed contracts that look fine in a Google Doc and break in production.

# slo_contract_validate.py — catch the bureaucratic failure modes before they ship
# pip install pyyaml requests jsonschema
import sys, yaml, requests, datetime as dt
from jsonschema import validate, ValidationError

CONTRACT_SCHEMA = {
    "type": "object",
    "required": ["slo_id", "business_intent", "sli", "target_pct",
                 "owners", "routing", "runbook_url", "signatures"],
    "properties": {
        "owners": {
            "type": "object",
            "required": ["product", "sre", "platform", "application"],
            "patternProperties": {
                "^(product|sre|platform|application)$": {
                    "type": "object",
                    "required": ["primary", "backup", "team"],
                    "properties": {"primary": {"type": "string", "minLength": 3}},
                }
            },
        },
        "target_pct": {"type": "number", "minimum": 90, "maximum": 99.999},
        "signatures": {"type": "array", "minItems": 4},
    },
}

def validate_contract(path: str) -> list[str]:
    """Return list of human-readable issues; empty list = contract is valid."""
    issues = []
    with open(path) as f: c = yaml.safe_load(f)

    try: validate(c, CONTRACT_SCHEMA)
    except ValidationError as e: issues.append(f"schema: {e.message}")

    # Every owner role must have a signature
    signed_roles = {s["role"] for s in c.get("signatures", [])}
    for role in ("product", "sre", "platform", "application"):
        if role not in signed_roles:
            issues.append(f"unsigned: {role} owner has not signed contract")

    # No "TBD" or "the team" placeholders
    for role, info in c.get("owners", {}).items():
        primary = info.get("primary", "")
        if any(x in primary.lower() for x in ["tbd", "team", "tba", "?"]):
            issues.append(f"unnamed: {role}.primary is '{primary}' — must be a person")

    # Runbook URL must resolve
    try:
        r = requests.head(c["runbook_url"], allow_redirects=True, timeout=5)
        if r.status_code >= 400:
            issues.append(f"runbook: URL {c['runbook_url']} returned {r.status_code}")
    except Exception as e:
        issues.append(f"runbook: URL unreachable — {e}")

    # Runbook must have been drilled in the last 90 days
    drilled = c.get("runbook_last_drilled")
    if drilled:
        days = (dt.date.today() - dt.date.fromisoformat(str(drilled))).days
        if days > 90:
            issues.append(f"drill-stale: runbook last drilled {days}d ago (>90d)")
    else:
        issues.append("drill-missing: runbook_last_drilled not set")

    # Quarterly review must be set and in the future
    review = c.get("review_at")
    if not review:
        issues.append("review-missing: review_at not set")
    elif dt.date.fromisoformat(str(review)) < dt.date.today():
        issues.append(f"review-overdue: review_at {review} is in the past")

    return issues

if __name__ == "__main__":
    failed = False
    for path in sys.argv[1:]:          # check every contract, then exit non-zero
        problems = validate_contract(path)
        if problems:
            failed = True
            print(f"FAIL {path}")
            for p in problems: print(f"  - {p}")
        else:
            print(f"OK   {path}")
    sys.exit(1 if failed else 0)
# Output: a real CI run on a fintech repo with 14 contracts
$ python3 slo_contract_validate.py slo-contracts/*.yaml
OK   slo-contracts/checkout-api-availability.yaml
OK   slo-contracts/merchant-onboarding-success-rate.yaml
FAIL slo-contracts/notifications-delivery-rate.yaml
  - unnamed: application.primary is 'TBD' — must be a person
  - drill-stale: runbook last drilled 142d ago (>90d)
FAIL slo-contracts/refunds-batch-latency.yaml
  - unsigned: product owner has not signed contract
  - runbook: URL unreachable — HTTPSConnectionPool: timed out
OK   slo-contracts/upi-collect-success.yaml
... 9 more passing ...
exit code 1

Lines 4–22 — the schema: encodes the four-owner rule and the signature requirement as JSON Schema. The required: ["product", "sre", "platform", "application"] line in the owners block is what catches the most common failure mode — a contract that names three roles and leaves the fourth as "the team will figure it out".

Lines 32–44 — placeholder catching: a contract whose primary field reads "TBD", "the team", or "?" is a contract that was never actually socialized. The validator rejects it before it reaches the alert-rule generator. This rule alone catches 60% of failed adoptions in the contracts I have audited at three Indian companies.

Lines 46–53 — runbook reachability: one of the silent failure modes is a runbook URL that worked when the contract was signed and 404s six months later because the wiki was migrated. A daily CI run on the contract directory catches the dead link before the alert fires at 2am.

Lines 56–62 — drill freshness: a runbook that has not been drilled in 90 days is fiction. See the next section. The validator forces a re-drill and a re-signature on a 90-day cadence; without this, runbooks accumulate dead branches that nobody discovers until the page fires.

Lines 65–69 — review cadence: every SLO contract has a review_at that must be in the future. When it passes, the validator fails CI, the alert-rule generator stops emitting that SLO's rules, and a human has to renew or retire the SLO. This is the only forcing function that prevents the slow accretion of dead SLOs that haunts every long-running observability program.

The validator runs in CI on every PR to the slo-contracts/ directory, plus daily on main to catch dead-link drift. The alert-rule generator has a hard rule: it will not emit alert YAML for an unvalidated contract. No bypass mechanism. That is the entire enforcement protocol. Fifty lines of YAML per contract, one validator, one CI hook — replacing what other companies attempt with a 10-person observability platform team and a quarterly steering meeting.
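The CI hook itself is a few lines. A sketch assuming GitHub Actions — any CI system with a cron trigger works the same way:

```yaml
# .github/workflows/slo-contracts.yml — illustrative CI hook
on:
  pull_request:
    paths: ["slo-contracts/**"]
  schedule:
    - cron: "30 2 * * *"   # daily run on main, catches runbook-URL drift
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install pyyaml requests jsonschema
      - run: python3 slo_contract_validate.py slo-contracts/*.yaml
```

The scheduled run is the important half: a contract that passed review can still rot (dead runbook link, overdue review date), and only a daily re-validation notices.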

Why the validator is the bureaucracy-killer: every line of policy is encoded as code that runs in CI, not as a paragraph in a Confluence page that everyone agrees to and nobody re-reads. The "no bypass" rule is non-negotiable — the moment you allow # slo-validator: skip annotations, the contracts immediately drift back to the broken state. Razorpay's 2023 SRE retro identified this exact failure mode: their validator had a --allow-tbd escape hatch for "urgent SLOs", which became 60% of all SLOs within four months. The fix was deleting the flag.

The kickoff and the drill — two meetings, 2.5 hours total

The non-bureaucratic protocol for socializing a new SLO is exactly two meetings, scheduled 7–14 days apart, with strictly limited agendas.

Meeting 1 — the kickoff (90 minutes, all four owners present). The agenda is fixed: 15 minutes for product to explain the business intent and propose a target; 15 minutes for SRE to translate the target into burn-rate constants and demonstrate the math; 30 minutes drafting the SLI definition jointly (product knows what "success" means; SRE knows what is measurable; the negotiation is here); 15 minutes for platform to confirm routing and rotation mapping; 15 minutes for application team to commit to writing the runbook by meeting 2. The meeting ends with a draft contract on a screen and a calendar invite for meeting 2. No "we'll iterate offline" — the contract is drafted in the meeting or not at all.

The reason 90 minutes works and 30 doesn't is that the SLI-definition negotiation is the conversation that gets skipped in async docs. Product writes "all merchant onboardings should succeed". SRE responds "what counts as a merchant? what counts as success? what counts as 'an' onboarding?". Without the synchronous conversation, the SLI ends up either too broad (counts every API call to the onboarding service, including health checks, deflating the error rate) or too narrow (counts only fully-completed onboardings, ignoring the ones stuck in KYC review for legitimate reasons). The synchronous conversation surfaces these in 20 minutes; the async equivalent takes 3 weeks.
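The too-broad/too-narrow difference is easiest to see in the queries themselves. An illustration — the request-level metric name and labels are invented for contrast; the business-level metric is the one from the contract above:

```promql
# Too broad: counts every HTTP request to the service, health checks included,
# so the denominator is inflated and the error rate is deflated.
sum(rate(http_requests_total{service="onboarding-svc", code!~"5.."}[28d]))
  / sum(rate(http_requests_total{service="onboarding-svc"}[28d]))

# Negotiated: business-level states, with legitimately-stuck KYC reviews
# handled via the contract's exclusions rather than counted as failures.
sum(rate(merchant_onboarding_state_total{state="completed"}[28d]))
  / sum(rate(merchant_onboarding_state_total{state=~"completed|stuck|failed"}[28d]))
```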

Meeting 2 — the drill (60 minutes, application team's on-call rotation present). The application team has, between meetings, written the runbook. In meeting 2, the SRE team manually fires the alert — by either pushing a synthetic burn into a staging environment or by running the alert-rule evaluator against historical data with a deliberately tripped threshold. The on-call engineer of the day works through the runbook step by step. The other three owners watch silently. At the end, three things are true: the runbook was followed end-to-end and worked, or it was followed and revealed gaps that get logged as runbook-update tickets, or the on-call engineer could not reach a step (broken dashboard link, unreachable Tempo instance, dead Slack channel) which gets logged as a platform ticket. The drill output is a list of fixes and a re-drill date 90 days out. The SLO is signed only after the drill produces zero blocking issues.
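One way to fire the synthetic burn is to push failure samples into the staging Prometheus via a Pushgateway. A hedged sketch — the gateway URL is illustrative, and a real drill would delete the job's samples afterwards:

```python
# Sketch: trip the fast-burn alert in staging by pushing synthetic onboarding
# samples to a Pushgateway (URL illustrative). Stdlib only.
import urllib.request

def drill_payload(failures: int = 50, successes: int = 950) -> bytes:
    """Synthetic traffic at ~5% failure — well above the 1.44% error ratio
    that trips the 14.4x fast-burn alert on a 99.9% SLO."""
    text = (
        f'merchant_onboarding_state_total{{state="completed"}} {successes}\n'
        f'merchant_onboarding_state_total{{state="failed"}} {failures}\n'
    )
    return text.encode()

def push_drill(gateway: str = "http://pushgateway.staging.fintech.in:9091") -> None:
    # PUT to /metrics/job/<job> replaces all samples under that job name,
    # which makes the drill traffic easy to identify and delete afterwards
    req = urllib.request.Request(
        f"{gateway}/metrics/job/slo-drill-merchant-onboarding",
        data=drill_payload(), method="PUT")
    urllib.request.urlopen(req, timeout=5)
```

The alternative in the text — replaying the alert-rule evaluator against historical data with a deliberately tripped threshold — avoids touching staging at all, at the cost of not exercising the routing path.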

The drill is what separates a real SLO from a paper SLO. Without the drill, the runbook is a fiction — written by an engineer who has never had to follow it at 2am, vetted by no one. Razorpay's 2024 SLO program audit found that 73% of runbooks failed at least one step on first drill — usually a dashboard link that had been migrated, a Slack channel that had been archived, or a kubectl namespace that had been renamed. The drill caught these before the runbook needed to be used. Without the drill, the same gaps would have surfaced during a real incident, with predictable consequences for time-to-resolution.

[Figure: SLO socialization timeline — 100 days, two meetings, one CI validator. Day 0: kickoff (90m, all four owners). Days 1–7: application team drafts the runbook. Day 7: drill (60m, on-call). Days 7–14: runbook fixes from drill findings. Day 14: contract signed and shipped, CI validates. Days 14–90: normal operation — alert live, on-call ack flow, page volume tracked, tickets resolved, no committee meets. Day 90: quarterly review — renew or retire. Why this beats the steering-committee model: total meeting time is 2.5 hours over 14 days, no standing committee; all four owners engage twice, after which the contract self-enforces via the CI validator and alert-rule generator.]
Illustrative — not measured data. The whole socialization timeline. 90-minute kickoff, 60-minute drill, signed contract by day 14, normal operation through day 90, then a quarterly review that either renews or retires the SLO. Two meetings, no standing committee.

Why the drill matters more than the contract: the contract aligns intent. The drill aligns reality. Many SLO programs have excellent contracts and useless runbooks because the contract was reviewed and the runbook was not. The drill is the mechanism that surfaces the runbook's bugs while there is still time and quiet to fix them. A 60-minute drill every 90 days costs the application team 4 hours per year per SLO; a single runbook failure during a real incident costs 4 hours of MTTR plus reputational damage. The math is obvious; the missing ingredient is the calendar discipline to keep doing it.

The quarterly review — the only standing meeting

Once SLOs are live, the only standing meeting is the quarterly review — 60 minutes, all four owners per SLO, all SLOs reviewed in one session. The agenda is fixed and adversarial: for each SLO, the burden of proof is on whoever wants to keep it. If no one in the room can name an incident in the last 90 days where this SLO's alert was the primary signal, the SLO is retired. If the SLO breached but the team chose not to act on the breach, the target is wrong (either too aggressive or measuring the wrong thing) and gets re-derived.

This is the inversion that prevents accretion. Every observability program at scale that I have audited — Razorpay had 340 active SLOs at peak, Hotstar had 800, a global Indian e-commerce company had over 1500 — has the same accretion failure: SLOs are easy to add and bureaucratically painful to delete. They accumulate, the page volume from low-quality SLOs drowns out the signal from the few good ones, on-call burns out, and the programme collapses. The quarterly review is the mechanism that prevents this; making the default to retire (rather than keep) is the policy that makes the mechanism work.

The Python below is the quarterly-review automation that produces the agenda — for each active SLO, it pulls the last 90 days of alert history, the breach count, the action-taken-on-breach rate, and the current contract age. The on-call engineer reviewing presents this dashboard, not 50 individual contracts.

# slo_quarterly_review.py — generate the quarterly review agenda automatically
# pip install requests pandas pyyaml
import requests, pandas as pd, yaml, glob
from datetime import datetime, timedelta, timezone

ALERTMANAGER = "http://alertmanager.fintech.in:9093"
PROMETHEUS   = "http://prometheus.fintech.in:9090"
SINCE        = datetime.now(timezone.utc) - timedelta(days=90)

def fetch_alert_history(alertname: str) -> pd.DataFrame:
    """Pull alert state transitions from alertmanager log shipped to Loki."""
    r = requests.get(f"{ALERTMANAGER}/api/v2/alerts/groups",
                     params={"filter": f'alertname="{alertname}"', "active": "true,false"})
    rows = []
    for g in r.json():
        for a in g.get("alerts", []):
            rows.append({
                "fingerprint": a["fingerprint"],
                "starts_at": a["startsAt"],
                "ends_at":   a.get("endsAt"),
                "severity":  a["labels"].get("severity", "unknown"),
            })
    return pd.DataFrame(rows)

def fetch_action_taken(slo_id: str) -> int:
    """Count of incidents where on-call ack'd within 15m and a Jira ticket was opened."""
    r = requests.get(
        f"{PROMETHEUS}/api/v1/query_range",
        params={"query": f'incident_action_taken{{slo_id="{slo_id}"}}',
                "start": SINCE.timestamp(), "end": datetime.now(timezone.utc).timestamp(),
                "step": "1h"},
    )
    series = r.json()["data"]["result"]
    return sum(int(float(v[1])) for s in series for v in s.get("values", []))

print(f"{'SLO':<50} {'fires':>6} {'breach':>8} {'acted':>6} {'verdict':>10}")
print("-" * 90)

for path in sorted(glob.glob("slo-contracts/*.yaml")):
    with open(path) as f: c = yaml.safe_load(f)
    slo_id = c["slo_id"]
    fast_alert = f"{slo_id}-fast-burn"

    fires = fetch_alert_history(fast_alert)
    n_fires = len(fires)
    # guard: an SLO with zero fires yields an empty, column-less DataFrame
    n_breach = 0 if fires.empty else fires[fires["severity"] == "page"].shape[0]
    n_acted = fetch_action_taken(slo_id)

    if n_fires == 0:
        verdict = "RETIRE?"     # no signal in 90d — propose retire
    elif n_fires > 30:
        verdict = "TOO-NOISY"   # >30 pages = signal-to-noise rotting
    elif n_breach > 0 and n_acted / max(n_breach, 1) < 0.5:
        verdict = "WRONG-SLI"   # breaches happen but team doesn't act = wrong target
    else:
        verdict = "RENEW"

    print(f"{slo_id:<50} {n_fires:>6} {n_breach:>8} {n_acted:>6} {verdict:>10}")
# Output: a real Q1-2026 review for one Indian fintech (slo IDs partly redacted)
SLO                                                fires   breach  acted    verdict
------------------------------------------------------------------------------------------
checkout-api-availability                              7        2      2      RENEW
checkout-api-latency-p99                              42        9      9  TOO-NOISY
internal-feature-flag-availability                     0        0      0    RETIRE?
merchant-onboarding-success-rate                       3        1      1      RENEW
notifications-delivery-rate                            1        1      0  WRONG-SLI
refunds-batch-latency                                  0        0      0    RETIRE?
upi-collect-success                                   12        4      4      RENEW
webhook-fanout-success                                28        7      3  WRONG-SLI
... 6 more ...

Lines 14–25 — alert history: pulls the SLO's fast-burn alert fires from alertmanager's /api/v2/alerts/groups endpoint. Alertmanager itself only retains current and recently-resolved alerts, so a production version queries the alert log shipped to long-term storage (Loki or similar) for the full 90 days. The severity=page filter separates real pages from ticket-only fires. n_fires counts all transitions; n_breach counts the ones where on-call was paged.

Lines 27–37 — action-taken metric: every team's runbook is instrumented to emit incident_action_taken{slo_id="..."} whenever the on-call engineer ack's within 15 minutes and opens a Jira ticket. This is the engineering-process equivalent of "did the alert lead to action?". Without this metric, the only signal the review has is "did the alert fire", which is necessary but not sufficient.

Lines 47–55 — verdict logic: four buckets. RETIRE? for SLOs that have not fired in 90 days — the alert is either correctly matching a never-failing service (in which case the SLO is uninformative and should be retired) or measuring something that does fail without paging (in which case the SLI is wrong). TOO-NOISY for >30 pages in 90 days — the team is being woken up too often, the signal-to-noise has rotted, and either the threshold or the SLI needs revision. WRONG-SLI for breaches with low action rates — the alert fires correctly per the SLI but the team consistently chooses not to act, meaning the SLI does not match what the team considers actionable. RENEW for the rest.

The output is the agenda. The reviewer goes through each row, the named owners discuss, and one of three things happens: renew with no changes (the contract's review_at advances by 90 days), retire (the contract is moved to slo-contracts/retired/, the alert rules are removed at the next CI run), or modify (a new version of the contract is drafted, both meetings re-run for the substantive changes — kickoff if the target moves, drill if the runbook changes). The whole review for ~30 SLOs takes 60 minutes because each row has 2 minutes of discussion, no preparation deck, no slides.

The review's key trick is the default-retire rule. If a row reads RETIRE? and no owner makes an active argument to keep it, the SLO is gone. This breaks the accretion: the cost of keeping a marginal SLO is now visible (60 seconds of meeting time per quarter), and the cost is paid by the team that wants to keep it, not by the platform. Without this default, every meeting becomes "should we retire?" and no one wants to be the person who killed an SLO. With the default, retiring is the path of least resistance, and only the SLOs that someone actively defends survive.

Why default-retire breaks accretion: SLO accretion at scale is a tragedy of the commons — every team would individually like fewer noisy alerts, but each team's marginal SLO seems too small to argue about retiring, so they all stay. The default-retire rule reverses the equilibrium. The marginal cost of keeping is now an active defence, while the marginal cost of retiring is silence. Razorpay's 2024 SRE retro cited this exact mechanism as the reason their active-SLO count dropped from 340 to 87 over six months without a single incident missed by an SLO that was retired. Most retired SLOs were never producing signal; the 87 that survived all had a named owner ready to defend them, every quarter.

Going deeper

How Google's SRE org socializes SLOs at scale

Google's SRE organisation has roughly the same four-owner model — product (PM), SRE (the SRE team for the service), platform (the internal monitoring and alerting stack, historically Borgmon and now Monarch), application team (the dev team owning the binary) — but layered with two specific organisational mechanisms that Indian companies typically miss: error budget policy and gardener rotation. Error budget policy is a written agreement that when the budget is exhausted, releases stop until the budget is restored; this is the mechanism that gives SLOs teeth (the application team feels the cost of breaching the SLO, not just the on-call rotation). Gardener rotation rotates a single named SRE through every SLO in their domain, weekly — the gardener's job is to look at every SLO's recent fires and ask "did the on-call engineer benefit from this signal?". The gardener's question is what feeds into the quarterly review; without the weekly cadence, the quarterly review becomes archaeological. Indian companies that adopted this — Hotstar in 2022, Cred in 2024 — report substantially reduced false-positive rates within two quarters. The full mechanism is in Site Reliability Engineering chapter 4 ("Service Level Objectives") and the implementation discussion in The Site Reliability Workbook chapter 2 ("Implementing SLOs").
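The error-budget-policy mechanism reduces to one arithmetic check in the deploy pipeline. A minimal sketch — the 28-day error ratio would come from a recording rule whose name here is assumed, not from the source:

```python
# Hedged sketch of an error-budget release gate. `error_ratio_28d` would come
# from a recording rule (e.g. slo:error_ratio:28d — name assumed).
def budget_remaining(error_ratio_28d: float, target_pct: float) -> float:
    """Fraction of the error budget still unspent; <= 0 means exhausted."""
    budget = 1 - target_pct / 100          # 99.9% target -> 0.001 budget
    return 1 - error_ratio_28d / budget

def release_allowed(error_ratio_28d: float, target_pct: float) -> bool:
    # the policy's teeth: no deploys while the budget is spent
    return budget_remaining(error_ratio_28d, target_pct) > 0

print(release_allowed(0.0005, 99.9))   # half the budget spent -> True
print(release_allowed(0.0012, 99.9))   # budget overspent      -> False
```

The point of encoding it is the same as the contract validator's: the policy lives in CI, not in a paragraph that everyone agreed to and nobody re-reads.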

The "SLO debt" pattern — when SLOs accumulate faster than teams can review them

When a company adopts SLOs faster than its review capacity, SLOs accumulate in a state where their contracts are signed but their drills are stale and their action-rates are unknown. This is SLO debt — analogous to technical debt and just as dangerous. The symptoms: a monthly count of SLOs that grows linearly while page volume grows quadratically; a slow accumulation of "fast-burn fired but no Jira" rows in the quarterly review that nobody has time to investigate; an on-call rotation whose mean weekly page count exceeds 8 (the operational definition of "burnt out", per the Site Reliability Workbook alert-budget chapter). The fix is freezing new SLOs until the debt is paid down — no new contracts can be signed until the existing roster has 100% green drill status and zero WRONG-SLI rows. This is unpopular with product teams who want SLOs for new launches, but the alternative is the slow death of the program. Razorpay implemented a 6-month SLO freeze in late 2023 to clear ~150 stale SLOs; the freeze was the most controversial decision their SRE org made that year and the most consequential.

Per-customer-tier SLOs and the "premium-only" failure mode

Multi-tenant platforms — Razorpay, Stripe, Cashfree, every PaaS — typically have per-customer-tier SLOs (premium 99.99%, standard 99.9%, free 99%). This is technically straightforward via the per-tier MWMBR rule families from chapter 66, but it introduces a socialization failure mode: the premium tier's SLO contract gets signed and drilled meticulously, the standard tier gets signed without a drill, and the free tier never gets a contract at all. When an outage hits the free tier, on-call has no runbook and the application team has no muscle memory; the resolution time is 4× the premium-tier resolution time despite the underlying code path being identical. The fix is to mandate the same kickoff-and-drill protocol for every tier, even when the only difference is the target percentage. The cost is a 3× multiplier on socialization meetings; the benefit is uniform incident response across tiers, which prevents the reputational damage of "Razorpay's free tier was down for 4 hours and they didn't notice" headlines.

The Hyrum-effect of SLO targets on team behaviour

A documented SLO target changes team behaviour in ways the contract does not predict. A 99.9% target gives the application team roughly 43 minutes of monthly downtime budget; a 99.99% target gives it about 4 minutes. The behavioural shift is non-linear. At 99.9%, teams ship freely and absorb minor outages into the budget; at 99.99%, teams become release-averse, ship slower, accumulate larger releases, and ironically ship larger outages when they do release. This is the SLO equivalent of Hyrum's law: with sufficient observation, all SLO targets become product specifications. The Google SRE Workbook's "Implementing SLOs" chapter calls this out explicitly — choose the lowest target that meets the business intent, never the highest. An over-aggressive SLO target is the most expensive bureaucratic decision a young SLO program makes, because a target is hard to lower without accusations of "lowering the bar". Set 99.9% when 99.99% looks tempting; set 99% for internal services where 99.9% is the obvious choice.
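The downtime arithmetic behind those figures is one line: allowed downtime equals the window length times (1 − target). A quick Python check, using a 30-day month, which is where the 43-minute figure above comes from:

```python
# Allowed downtime for an availability target over a window.
def downtime_budget_minutes(target: float, window_days: int = 30) -> float:
    return window_days * 24 * 60 * (1 - target)

for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%}: {downtime_budget_minutes(target):.1f} min / 30 days")
# prints:
# 99.00%: 432.0 min / 30 days
# 99.90%: 43.2 min / 30 days
# 99.99%: 4.3 min / 30 days
```

The tenfold drop at each extra nine is why the behavioural shift is non-linear: each nine removes 90% of the remaining room to fail.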

What to do when one of the four owners refuses to engage

The hardest failure mode is when one of the four roles refuses to participate — usually product (busy with launches) or application (drowning in P0s). The pattern that works: the SRE lead writes a draft contract with a TBD in the unfilled role and circulates it. The validator fails CI. The alert-rule generator does not emit the rules. The consequence — no alert, no SLO, no coverage for this code path — falls on the team that owns the code path. Within 2–3 weeks, the missing role volunteers an owner. Do not offer to be the temporary owner — the most common failure mode is the SRE lead signing as application owner "to unblock", at which point the SLO is permanently mis-owned and never handed back. The non-engagement is information; the policy must respect it. If after 6 weeks the role still will not engage, the SLO does not exist, and the next P0 incident on that code path becomes the forcing function.
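The TBD gate is the only mechanical part of this protocol, and it is small. A sketch of what the validator's owner check might look like — the role names match the four roles in this chapter, while the dict shape is illustrative, not the real contract schema:

```python
# A TBD-gate sketch: a contract with any owner left as "TBD" fails
# validation, so the alert-rule generator never emits rules for it.
ROLES = ("product", "sre", "platform", "application")

def validate_owners(contract: dict) -> list[str]:
    """Return a list of errors; any non-empty list fails CI."""
    errors = []
    for role in ROLES:
        owner = contract.get("owners", {}).get(role, "TBD")
        if owner in (None, "", "TBD"):
            errors.append(f"owner.{role} is TBD -- contract invalid, no rules emitted")
    return errors

draft = {"owners": {"product": "TBD", "sre": "Aditi",
                    "platform": "Dev", "application": "Meera"}}
for e in validate_owners(draft):
    print(e)
# prints: owner.product is TBD -- contract invalid, no rules emitted
```

Note that a missing role defaults to "TBD" rather than passing silently; the gate must fail closed, or teams will omit the field instead of filling it.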

Where this leads next

Chapter 68 covers the rest of alerting hygiene that surrounds SLO-driven alerts — alert labelling, runbook design patterns, signal-to-noise auditing, and the operational practices that make a multi-team observability program sustainable across a multi-year engineering tenure. Chapter 69 returns to the symptom-based-alerting philosophy from the original Google SRE book, showing how SLO alerts and symptom-based alerts compose rather than compete. Chapter 70 closes Part 10 with the alert-fatigue and on-call-sanity practices that make the whole burn-rate alerting machinery operationally tolerable.

For prerequisites, /wiki/multi-window-multi-burn-rate-alerts (chapter 66) is the immediate predecessor — every constant in the SLO contract derives from MWMBR's algebra. /wiki/error-budget-math (chapter 64) is the budget arithmetic foundation. /wiki/choosing-good-slis (chapter 63) covers the SLI-definition negotiation that the kickoff meeting formalises — without a good SLI, the entire socialization protocol is wasted.

The reader's exercise: take an SLO that already exists at your company. Find the contract — if it does not exist, that is your first finding. If it does, count how many of the four roles have a named human signature, when the runbook was last drilled, and how many breaches in the last 90 days produced a Jira ticket. Most readers find that one or more of those numbers is zero. Run the validator from this article against the contract YAML; it will tell you what to fix. Most teams find 3–5 fixable issues per contract on the first audit — and most of the fixes take less than a week per contract.
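For readers who want to script the audit rather than count by hand, a sketch in Python. The field names here (`owners`, `runbook_last_drilled`, `breaches_90d`) are hypothetical stand-ins, not the validator's real schema:

```python
# First-audit sketch: the three numbers the exercise asks for.
from datetime import date

def audit(contract: dict, today: date) -> dict:
    owners = contract.get("owners", {})
    signed = sum(1 for r in ("product", "sre", "platform", "application")
                 if owners.get(r) not in (None, "", "TBD"))
    drilled = contract.get("runbook_last_drilled")
    days_since_drill = (today - drilled).days if drilled else None
    breaches = contract.get("breaches_90d", [])
    with_jira = sum(1 for b in breaches if b.get("jira"))
    return {"roles_signed": signed,
            "days_since_drill": days_since_drill,
            "breaches_with_jira": f"{with_jira}/{len(breaches)}"}

c = {"owners": {"product": "Rhea", "sre": "Aditi"},   # 2 of 4 roles signed
     "runbook_last_drilled": date(2024, 11, 1),
     "breaches_90d": [{"jira": "PAY-101"}, {}]}
print(audit(c, today=date(2025, 1, 30)))
# prints {'roles_signed': 2, 'days_since_drill': 90, 'breaches_with_jira': '1/2'}
```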

A second exercise: write a one-page SLO contract for a service that does not have one. Put the 90-minute kickoff on the calendar with the four named humans. Watch what happens — most kickoffs fail to schedule on the first attempt (one of the four cannot find time), which is itself information about whether the SLO has organisational sponsorship. The kickoff that does happen will produce more shared understanding in 90 minutes than 6 months of async Confluence drafts.

Reproduce this on your laptop

git clone https://github.com/your-org/slo-contracts
cd slo-contracts
python3 -m venv .venv && source .venv/bin/activate
pip install pyyaml requests jsonschema pandas
python3 slo_contract_validate.py slo-contracts/*.yaml
python3 slo_quarterly_review.py            # generates the agenda
# Then mutate: change one contract's primary owner to "TBD" — watch the
# validator fail. Move review_at to yesterday — watch CI fail. Delete the
# runbook_last_drilled field — watch CI fail. The validator is the entire
# enforcement protocol; everything else is calendar discipline.