Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Playbooks, post-mortems, and blameless culture

It is 03:11 at Yatrika and Karan is staring at a PagerDuty page that says RiderPositioningIngestionStalled for the seventeenth time this quarter. The runbook link opens a Confluence page last edited fourteen months ago by an engineer who has since left. Step 3 links to a Grafana URL that returns 404 because the dashboard was renamed during the Mimir migration. Step 5 says "escalate to #data-platform" — that channel was archived in January. Karan does what every on-call eventually does — closes the runbook, opens a fresh terminal, runs kubectl logs and kafka-consumer-groups.sh from muscle memory, finds a stuck rebalance in eight minutes, restarts the consumer, and goes back to sleep. The runbook did not help. Worse, the post-mortem from the last time this fired — six weeks ago — had an action item titled "update runbook" assigned to an engineer now on parental leave. Three days from now the team will hold another post-mortem; if they run it like the previous one, the same action item will be reopened, reassigned, and reach the same fate. The discipline of playbooks and post-mortems is the discipline of refusing that loop. Most teams don't.

A playbook is the document an exhausted engineer executes at 03:00 — every step verifiable, every dashboard link healthy, every escalation path tested in the last 60 days. A post-mortem is a structured artefact that converts an incident into changes to the system, not into character assessments of the engineer who pushed the button. Blameless culture is not a vibe; it is a written set of rules about what cannot appear in a post-mortem document and a manager whose job is to enforce them. Without all three — runnable playbook, structured post-mortem template, enforced blameless rule — your incidents teach your team nothing.

What a real playbook looks like — and why most fail

A "runbook" or "playbook" is a step-by-step procedure for resolving a specific class of incident. The default failure mode: write it once, paste it into Confluence, link it from PagerDuty, never run it again. By month six it has rotted — dashboard URLs 404, kubectl namespaces renamed, the Slack escalation channel archived, the metric in step 4 renamed during the OpenMetrics migration. The on-call who relies on it at 03:00 wastes seven minutes discovering it is broken before falling back to first-principles debugging. The playbook is supposed to save those seven minutes; instead it costs them.

The corrective is not "write better runbooks". It is to treat the playbook as executable code that gets tested on a schedule, with the same CI discipline as any other code:

Figure — Anatomy of a playbook that works versus one that rots. Left panel: a Confluence playbook that rots in six months (stale draft title, 404 dashboard link, kubectl command against a renamed namespace, escalation to an archived Slack channel, a contact who has left, last edited 14 months ago). Right panel: a git-tracked, CI-verified playbook (preconditions checked, every link CI-tested weekly, every command paired with an expected-output check, escalation contacts carrying a name, phone number, and tested-on date, and a closing verify step that asserts lag < 100 before the incident is closed). Bottom: a CI pipeline running every Sunday at 02:00 IST, exercising every link and every command against a staging cluster.
Illustrative — the difference between a rotting Confluence runbook and a CI-verified playbook. The right side is not "more discipline"; it is the same discipline applied to a different artefact (markdown in git rather than a Confluence page) so that existing CI tooling can enforce it.

Why CI verification matters more than playbook quality: a perfectly written playbook decays at the rate the underlying system changes — typically 5–10 broken references per month with two weekly deploys. Confluence-hosted playbooks have no detector and converge on the same broken steady state regardless of writing quality. Git-hosted playbooks tested by CI have a forcing function — when a dashboard URL 404s on Sunday's CI run, an issue is opened Monday morning before the next 03:00 page. Skip the forcing function and the playbook quality is irrelevant; with it, even mediocre playbook prose stays runnable.

The real shape of a playbook that survives:

# verify_playbook.py — CI-verifiable playbook for "RiderPositioningIngestionStalled"
# pip install requests pyyaml kubernetes prometheus-api-client
import sys, time, requests, yaml
from kubernetes import client, config
from prometheus_api_client import PrometheusConnect

# The playbook is data, not prose. CI loads this YAML, walks each step, and fails the build
# if any precondition, link, or expected-output assertion no longer holds.
PLAYBOOK = yaml.safe_load(open("playbooks/rider-positioning-ingestion-stalled.yaml"))

def check_link(url: str) -> tuple[bool, str]:
    """Every link in the playbook must return 2xx; 404 means the dashboard was renamed."""
    try:
        r = requests.get(url, timeout=5, allow_redirects=True)
        return (r.status_code < 400, f"{r.status_code} {url}")
    except Exception as e:
        return (False, f"ERR {url}: {e}")

def check_kubectl_target(ctx: str, ns: str, deployment: str) -> tuple[bool, str]:
    """Every kubectl reference must resolve — namespace exists, deployment exists, has ready pods."""
    try:
        # A missing kube context or renamed namespace is itself playbook rot:
        # let it surface as a recorded failure rather than crash the whole CI run.
        config.load_kube_config(context=ctx)
        v1 = client.AppsV1Api()
        d = v1.read_namespaced_deployment(deployment, ns)
        ready = d.status.ready_replicas or 0
        return (ready > 0, f"{ctx}/{ns}/{deployment} ready={ready}")
    except Exception as e:
        return (False, f"{ctx}/{ns}/{deployment} ERR {e}")

def check_promql(prom_url: str, query: str) -> tuple[bool, str]:
    """The PromQL queries the playbook tells the on-call to run must still parse and return."""
    p = PrometheusConnect(url=prom_url, disable_ssl=False)
    try:
        result = p.custom_query(query)
        return (result is not None, f"OK ({len(result)} series) :: {query[:60]}")
    except Exception as e:
        return (False, f"ERR :: {query[:60]} :: {e}")

def check_escalation(escalation: dict) -> tuple[bool, str]:
    """Escalation contact must have been verified within the last 90 days."""
    last_verified = escalation.get("last_verified_at")
    if not last_verified:
        return (False, f"{escalation['name']} never verified")
    age_days = (time.time() - time.mktime(time.strptime(last_verified, "%Y-%m-%d"))) / 86400
    return (age_days < 90, f"{escalation['name']} verified {int(age_days)}d ago")

failures = []
for step in PLAYBOOK["steps"]:
    name = step["name"]
    for link in step.get("links", []):
        ok, msg = check_link(link)
        if not ok: failures.append(f"[{name}] LINK: {msg}")
    for k in step.get("kubectl_targets", []):
        ok, msg = check_kubectl_target(k["context"], k["namespace"], k["deployment"])
        if not ok: failures.append(f"[{name}] K8S: {msg}")
    for q in step.get("promql_queries", []):
        ok, msg = check_promql(PLAYBOOK["prometheus_url"], q)
        if not ok: failures.append(f"[{name}] PROM: {msg}")

for esc in PLAYBOOK.get("escalation_chain", []):
    ok, msg = check_escalation(esc)
    if not ok: failures.append(f"[escalation] {msg}")

if failures:
    print(f"PLAYBOOK ROT DETECTED ({len(failures)} issues):")
    for f in failures: print(f"  - {f}")
    sys.exit(1)
print(f"PLAYBOOK OK :: {len(PLAYBOOK['steps'])} steps verified")

Sample run at 02:14 IST on a Sunday, after the Grafana dashboard rename and the riders namespace refactor:

PLAYBOOK ROT DETECTED (3 issues):
  - [check-consumer-lag] LINK: 404 https://grafana.yatrika.in/d/rider-pos-v2
  - [restart-consumer] K8S: prod-1/riders/rider-positioning-consumer ERR not found
  - [escalation] Aditi Rao verified 142d ago

Walking the load-bearing lines:

  • PLAYBOOK = yaml.safe_load(...) — the playbook is structured data, not prose. The on-call still reads a rendered markdown view, but CI walks the YAML. This is the single change that turns a runbook from a Confluence relic into a tested artefact. Why YAML beats Markdown for this purpose: markdown is unstructured — a parser cannot tell the difference between "a link the on-call should click" and "a link to a related document". YAML forces the author to declare which links are operational (must work) versus referential (informational). The on-call's experience is identical because the markdown view is still generated from the YAML; the difference is invisible to humans but loud to CI. A sketch of the YAML's shape follows this list.
  • check_kubectl_target(...) — the kubectl steps are not just commands to copy-paste; they include the namespace and deployment that must exist. When a refactor renames riders to rider-positioning, CI catches it before any human does. Every command in a playbook implicitly asserts something about the system's shape; writing those assertions down is what makes the playbook part of the system rather than a description of it.
  • check_escalation(...) with the 90-day verification clock — escalation contacts decay faster than any other field in a playbook. The 90-day verification is a forced calendar event: once a quarter, the playbook owner picks up the phone, confirms the number works, updates last_verified_at. Without this clock, the escalation chain quietly rots until the night you actually need it.
  • sys.exit(1) — the CI job fails, which means the team's standard alerting fires. The playbook's broken state becomes an actionable signal, not a background fact. The broken playbook generates the same kind of page that the broken system would, so it gets fixed with the same urgency.
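For concreteness, a sketch of the YAML the checker walks — the keys (prometheus_url, steps, links, kubectl_targets, promql_queries, escalation_chain, last_verified_at) are exactly what verify_playbook.py reads above; the URLs, contexts, queries, and contact details are illustrative placeholders:

# playbooks/rider-positioning-ingestion-stalled.yaml — illustrative; the key names match
# what verify_playbook.py reads, the values are placeholders.
prometheus_url: https://mimir.yatrika.in/prometheus

steps:
  - name: check-consumer-lag
    instruction: "Open the dashboard and confirm consumer lag > 5m."   # rendered for humans, not checked by CI
    links:
      - https://grafana.yatrika.in/d/rider-pos-v2
    promql_queries:
      - max(kafka_consumergroup_lag{group="rider-positioning"})
  - name: restart-consumer
    instruction: "Restart the consumer deployment and watch the rebalance complete."
    kubectl_targets:
      - context: prod-1
        namespace: riders
        deployment: rider-positioning-consumer
  - name: verify
    instruction: "Do not close the incident until lag is back under 100."
    promql_queries:
      - max(kafka_consumergroup_lag{group="rider-positioning"}) < 100

escalation_chain:
  - name: Aditi Rao
    phone: "+91-XXXXXXXXXX"          # placeholder
    last_verified_at: "2026-02-03"   # the 90-day clock in check_escalation reads this

The markdown the on-call reads is rendered from this file, so the human-facing instructions stay next to the machine-checked targets they refer to.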

The Sunday 02:00 IST CI run is deliberately timed — traffic is low, and it falls just before the Monday standup, so any failures land in a meeting where they can be assigned. Across a 40-team Razorpay-shape org, weekly playbook CI catches roughly 60–80 stale references per quarter that would otherwise have been discovered at 03:00. That number is the entire ROI argument.
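The schedule itself is a few lines of CI config. A hypothetical GitHub Actions workflow is sketched below — any scheduler with a weekly cron works the same way (20:30 UTC on Saturday is 02:00 IST on Sunday), assuming a runner with network access to Grafana, Mimir, and the staging cluster:

# .github/workflows/playbook-ci.yml — hypothetical; adapt to whatever CI system you run
name: playbook-ci
on:
  schedule:
    - cron: "30 20 * * 6"        # 20:30 UTC Saturday == 02:00 IST Sunday
  pull_request:
    paths: ["playbooks/**"]      # also verify on every playbook edit
jobs:
  verify:
    runs-on: [self-hosted, linux]   # needs VPC access to Grafana, Mimir, and the staging kube context
    steps:
      - uses: actions/checkout@v4
      - run: pip install pyyaml requests prometheus-api-client kubernetes
      - run: python3 verify_playbook.py   # non-zero exit fails the job and alerts the playbook owner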

The post-mortem template — what makes one useful

A post-mortem is the artefact that converts an incident into change. The default failure modes are nearly universal: the document is a narrative ("on Wednesday at 14:32 the alert fired, then Karan investigated, then..."), the root cause section blames a person ("Karan deployed without testing"), the action items are vague ("improve testing"), and nothing changes because nothing was specific enough to change. Six weeks later, the same incident recurs. The post-mortem from that incident references the previous one, recycles the same action items, and ships nothing new.

The corrective is a structured template that forces specificity at every step. A useful post-mortem has these sections, in this exact order:

Yatrika post-mortem template — every SEV-2 and SEV-3 incident, owned by the Incident Commander.

1. Summary (3 sentences max). What broke, when, customer impact in concrete numbers. e.g. "On 2026-04-22 between 14:32 and 15:08 IST, rider-positioning ingestion lag exceeded 8 minutes due to a stuck Kafka consumer rebalance. 312 trips in Bengaluru saw delayed driver-allocation; 41 customer complaints; no revenue impact."

2. Timeline. Every event with HH:MM:SS, event, source (alert / human action / chat / commit). No prose, no judgement — raw record only. Must be reconstructable from chat log + alert log + git log; if it isn't, observability has a gap.

3. Detection. Time-to-detect = first alert minus first customer impact. Was the detection too slow? Did a customer report it before any alert fired? Drives changes to alerting.

4. Mitigation. Time-to-mitigate = mitigation minus detection. Was the mitigation in the playbook? Was the playbook current? Drives changes to playbooks.

5. Root cause analysis (5 whys, written down). Each why is one sentence. Stop at the systemic cause, not the human action.

6. What went well / poorly / where we got lucky. Three short lists. "Got lucky" surfaces near-misses — "we got lucky customer-support flagged it before the SLO burn-rate alert fired" is a confession that the SLO alert is too slow.

7. Action items (owner, due date, link). A single concrete change, owned by one person, due within 30 days, linked to a tracked ticket. No "improve X". No "investigate Y". If the action item is not a code/config/process change, it is not yet specified.

8. Followups. Things to monitor or revisit, with a date — checkpoints, not changes.

The order is load-bearing. Summary first because most readers (engineering directors, security, compliance) only read that section. Timeline before analysis because the raw record must exist independently of the team's interpretation of it. Action items second-to-last because they should be derived from the analysis, not pre-decided.
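Rendered as the YAML document that the blameless checker later in this article consumes, a skeleton might look like this — the section keys match the checker's REQUIRED_SECTIONS; the incident details echo the running example and the ticket ID is a placeholder:

# drafts/2026-04-22-rider-positioning.yaml — illustrative skeleton
summary: >
  On 2026-04-22 between 14:32 and 15:08 IST, rider-positioning ingestion lag exceeded
  8 minutes due to a stuck Kafka consumer rebalance. 312 trips in Bengaluru saw delayed
  driver-allocation; 41 customer complaints; no revenue impact.
timeline:
  - "14:32:10 alert RiderPositioningIngestionStalled fired (source: alert)"
  - "14:39:02 on-call acknowledged, war-room channel opened (source: chat)"
detection: "Time-to-detect 4m; no customer report preceded the alert."
mitigation: "Consumer restarted 15:05, lag recovered 15:08; playbook step 3 link was stale."
root_cause:
  whys:
    - "The Kafka consumer was stuck in a rebalance loop."
    - "The liveness probe killed pods every 30s during a rebalance that takes 90s."
    - "The probe timeout was set to 30s; the actual rebalance window is 90s."
    - "The default Helm chart value was never overridden by the consuming team."
    - "The chart was adopted in 2024 with a TODO that was never closed."
    - "There is no review process for service-team Helm chart adoption."
what_went_well:
  well: ["Alert fired before any customer report"]
  poorly: ["Playbook dashboard link was stale"]
  lucky: ["The rebalance resolved on the first restart"]
action_items:
  - title: "Set the liveness probe timeout to 120s in the rider-positioning Helm values"
    owner: "asha"
    due_date: "2026-05-15"
    ticket: "PLAT-1142"            # placeholder ticket ID
followups:
  - "Revisit Helm chart adoption review at the June platform review"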

Figure — The 5-whys chain: stopping at a system gap, not a human action. A vertical ladder descends from the symptom (rider-positioning ingestion stalled 14:32–15:08 IST, 312 trips delayed) through the whys: the Kafka consumer was stuck in a rebalance loop; the liveness probe killed pods every 30s during a rebalance that takes 90s; the probe timeout was 30s against an actual rebalance window of 90s; the default Helm chart value was never overridden by the consuming team; the chart was adopted in 2024 with a TODO that was never closed; and, finally, there is no review process for service-team Helm chart adoption — the stopping point. A side bar marks the wrong stopping point, "the on-call did not check the dashboard", which blames a human action instead of naming a system gap.
Illustrative — the why-chain that ends at "no review process" (a system gap) produces an action item that prevents the recurrence; the chain that ends at "the on-call did not check the dashboard" (a human action) produces an action item that retrains a person and changes nothing about the system. The discipline is to keep descending until you reach a sentence that begins with "the system did not".

Why "5 whys" stops at systemic cause, not human action: the natural ending point for a human investigator is "because the engineer made a mistake". This is always true and never useful. The discipline is to keep asking why the system allowed the mistake — why was the timeout default wrong, why was there no review, why was the failure mode not caught by tests. The engineer being human is the input to the analysis, not the cause. The system change prevents the next incident; the engineer's mistake will recur with a different engineer next quarter regardless. Stop the why-chain when you reach a sentence that begins with "The system did not".

The format is rigid; the writing is short. A good Yatrika post-mortem is 2–3 pages of dense, specific prose — not 12 pages of narrative. The structure compresses better than freeform writing because each section answers exactly one question.

What blameless actually means — and how it gets enforced

"Blameless culture" is the most-talked-about, least-implemented concept in observability discourse. The naive version — "we don't blame people in post-mortems" — is a slogan that survives until the first SEV-1, at which point an executive asks "who pushed the bad commit" and the team realises blamelessness was never written down anywhere and so does not exist. The version that survives is a written set of rules about what cannot appear in a post-mortem document, plus a manager whose explicit job is to enforce them.

The Yatrika blameless rules — used as a checklist by the Incident Commander before any post-mortem is published:

# blameless_check.py — pre-publish checks for a post-mortem document
# pip install pyyaml
import re, sys, yaml

# Patterns that must NOT appear in the post-mortem narrative.
# Each pattern represents one of the four blame anti-patterns.
BLAME_PATTERNS = [
    # Anti-pattern 1: Naming an individual as the cause
    (r"\b(Karan|Aditi|Riya|Dipti|Jishant|Asha|Rahul|Kiran)\s+(deployed|pushed|broke|caused|forgot|missed|failed)\b",
     "names a specific engineer as cause"),

    # Anti-pattern 2: Hindsight-loaded modal verbs ("should have known")
    (r"\b(should have|ought to have|would have|could have)\s+(known|noticed|tested|caught|seen|realised|realized)\b",
     "uses hindsight-loaded modal verbs"),

    # Anti-pattern 3: Character judgement
    (r"\b(careless|negligent|sloppy|inexperienced|junior|inattentive|incompetent)\b",
     "uses character-judgement language"),

    # Anti-pattern 4: Blame disguised as passive voice
    (r"\b(was not properly|was not adequately|was insufficiently|failed to be properly)\b",
     "uses blame-laundering passive voice"),
]

# Required sections (the structural shape the template enforces)
REQUIRED_SECTIONS = ["summary", "timeline", "detection", "mitigation",
                     "root_cause", "what_went_well", "action_items", "followups"]

pm = yaml.safe_load(open(sys.argv[1]))

violations = []

# 1. Structural check: every required section present and non-empty
for s in REQUIRED_SECTIONS:
    if s not in pm or not pm[s]:
        violations.append(f"MISSING SECTION: {s}")

# 2. Blame-pattern check: scan all text fields for forbidden phrasings
def scan(text: str, location: str):
    for pattern, description in BLAME_PATTERNS:
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            violations.append(f"[{location}] BLAME PATTERN ({description}): '{m.group(0)}'")

for section in REQUIRED_SECTIONS:
    body = pm.get(section, "")
    if isinstance(body, str):
        scan(body, section)
    elif isinstance(body, list):
        for i, item in enumerate(body):
            scan(str(item), f"{section}[{i}]")

# 3. Action-item shape check: each AI has owner, due_date, ticket
for i, ai in enumerate(pm.get("action_items", [])):
    for f in ("owner", "due_date", "ticket"):
        if not ai.get(f):
            violations.append(f"action_items[{i}] missing field: {f}")
    if "improve" in ai.get("title", "").lower() or "investigate" in ai.get("title", "").lower():
        violations.append(f"action_items[{i}] vague verb: '{ai['title']}'")

# 4. 5-whys depth check: did the why-chain reach a system cause?
rc = pm.get("root_cause", {})
whys = rc.get("whys", [])
if len(whys) < 5:
    violations.append(f"root_cause: only {len(whys)} whys; need at least 5")
if whys and not whys[-1].lower().startswith(("the system", "we did not", "there is no", "no process", "no review")):
    violations.append(f"root_cause: final why does not name a system gap: '{whys[-1][:80]}'")

if violations:
    print(f"POST-MORTEM CANNOT BE PUBLISHED ({len(violations)} issues):")
    for v in violations: print(f"  - {v}")
    sys.exit(1)
print(f"OK :: post-mortem passes blameless and structural checks")

Sample run on a draft post-mortem before the Incident Commander review:

POST-MORTEM CANNOT BE PUBLISHED (5 issues):
  - [timeline[7]] BLAME PATTERN (names a specific engineer as cause): 'Karan deployed'
  - [root_cause] BLAME PATTERN (uses hindsight-loaded modal verbs): 'should have noticed'
  - [mitigation] BLAME PATTERN (uses blame-laundering passive voice): 'was not properly'
  - action_items[2] vague verb: 'Improve consumer lag monitoring'
  - root_cause: final why does not name a system gap: 'The on-call did not check the dashboard'

The script is mechanical — it cannot detect every blame instance, and it can produce false positives ("Karan deployed the fix at 14:48" is a perfectly fine timeline entry). The point is not to be a perfect filter; the point is to give the Incident Commander a checklist that forces the conversation about each flagged phrase. Sometimes the answer is "this one is fine, the timeline genuinely needs the name". Often the answer is "you're right, let me rewrite that line". Either way, the team is now talking about whether the language matches their stated values, which is exactly what cultural enforcement means.

Why a CI-enforceable blameless check beats a culture statement: a written value ("we are blameless") is enforced only when a senior engineer notices a violation and pushes back — which selects for senior engineers willing to push back, a small fraction of any team. A CI check is enforced every time, by every Incident Commander, regardless of seniority. A Razorpay-shape implementation, on a 40-team org, reduced post-mortem revisions for blame language from "common" to "near-zero" within one quarter — not because team character improved, but because the friction of writing blame language increased. Same dynamic as linters: you stop writing the bad pattern because the tool refuses to ship it, and over time you stop reaching for it.
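The wiring is one CI step. A sketch, assuming post-mortem drafts live under drafts/ in the same repository as the checker:

# CI step (sketch): lint every post-mortem draft touched by the pull request
for f in $(git diff --name-only origin/main -- 'drafts/*.yaml'); do
  python3 blameless_check.py "$f" || exit 1
done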

The four anti-patterns are not exhaustive but they are the four that account for ~80% of accidental blame in real post-mortems. Naming an individual is the obvious one. Hindsight-loaded modal verbs are the sneakier one — "should have noticed" assumes the engineer had information they did not have at the time. Character judgement ("careless", "junior") is the explicit one teams are most embarrassed by once it's flagged. Blame-laundering passive voice ("was not adequately tested") is the one teams use when they think they're being blameless but are actually just hiding the noun — the sentence still asserts that someone failed to do something, just without naming them, and the team in the room knows exactly who.

Common confusions

  • "Blameless means no one is responsible for anything." Blameless means no individual is blamed in the post-mortem document; it does not mean nobody is accountable. Action items have owners. Owners are responsible for delivering the change by the due date. The Incident Commander is responsible for running a clean post-mortem. The platform team is responsible for the systems whose gaps were exposed. The blameless framing applies to the analysis of cause, not to the assignment of future work — a team that conflates the two ends up with no owners on action items, and nothing ships.
  • "A playbook is the same as a runbook." Industry usage is messy. In this curriculum: a runbook is a document; a playbook is a runbook that has been verified by CI and is treated as part of the system. The terminological distinction is less important than the practical one — does the artefact have a forcing function that catches its own decay, or doesn't it? If it doesn't, you have a Confluence page that will rot regardless of what you call it.
  • "A post-mortem's job is to find the root cause." A post-mortem's job is to change the system. The root cause section exists to drive action items; if it doesn't, the section is decoration. Many teams produce beautifully written 5-whys analyses that conclude with action items like "improve documentation" and ship nothing. The metric for a post-mortem's quality is not the depth of analysis but the count of code/config/process changes that landed within 30 days.
  • "Action items should be ambitious." Action items should be specific and shippable in 30 days. "Re-architect the consumer to be lag-resilient" is not an action item; it is a project that needs its own design doc. The post-mortem's action item should be something concrete the team can land in two sprints — "set the liveness probe timeout to 120s in the rider-positioning Helm values, deploy by 2026-05-15". Larger projects belong in followups, with a date for the design discussion.
  • "Post-mortems are only for SEV-1 incidents." Post-mortems are for any incident that taught the team something new. A SEV-3 alert that turned out to be a false positive deserves a post-mortem about why the alert design was wrong. A near-miss — an incident caught before customer impact — is often the most valuable post-mortem, because the team has cognitive bandwidth to do it carefully. Razorpay-shape mature platforms run near-miss post-mortems at roughly 1.4× the rate of customer-impacting ones.
  • "The 5 whys are a literal count." The 5 whys are a rule of thumb for chain depth, not a target. Some chains reach the systemic cause in 3 whys; some need 7. The discipline is to keep asking until the answer names a system gap and stop exactly there. Mechanical 5-whys pads the chain with filler; honest 5-whys reaches the bottom in however many whys it takes.

Going deeper

Incident Commander as a rotated, trained role

The Incident Commander (IC) is the single person who runs the incident response — declares severity, opens the war-room channel, assigns roles (comms, scribe, technical leads), keeps the timeline, decides when to escalate, decides when the incident is over. The IC is not the engineer fixing the bug; they are the meta-engineer running the room. The IC role is rotated weekly across the platform team and senior engineers on the product teams, and IC duty is paid the same as on-call duty. What separates a working IC programme from a broken one is the training: every new IC shadows two incidents, runs their first with a senior IC observing, and joins a quarterly "ghost incident" tabletop. Without the training pipeline, IC duty falls on whoever is loudest in the channel, which selects for confidence rather than competence. PaisaBridge-shape platforms with mature IC programmes report mean time to resolve roughly 2.1× faster than teams without trained ICs at the same scale.

The post-mortem follow-through audit

Action items have a 30-day due date. The discipline is enforcing it. The mechanism: a weekly automated audit that opens a Slack thread in #incidents-followup with every overdue action item, the original incident link, and the owner. The owner replies with (a) shipped, with the link, (b) re-scoped, with the new due date and reason, or (c) cancelled, with the reason. After 90 days, an action item must be one of these three or it gets surfaced in the platform-team's monthly review with engineering leadership. Without the audit, roughly 40% of action items in a typical platform team are silently abandoned within a quarter; with the audit, the rate drops to ~8%, and the abandoned ones are now consciously abandoned — which is itself a useful signal about the team's actual capacity.
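A sketch of that audit, assuming published post-mortems live as YAML under postmortems/, that owners record a status field when they reply, and that the Slack webhook URL comes from the environment — all assumptions beyond the artefacts above:

# followup_audit.py — weekly overdue action-item audit (sketch)
# pip install pyyaml requests
import glob, os, datetime, yaml, requests

WEBHOOK = os.environ["SLACK_WEBHOOK_URL"]   # incoming webhook for #incidents-followup
today = datetime.date.today()

overdue = []
for path in glob.glob("postmortems/*.yaml"):
    pm = yaml.safe_load(open(path))
    for ai in pm.get("action_items", []):
        # owners reply shipped / re-scoped / cancelled; anything else past its due date is overdue
        if ai.get("status") in ("shipped", "cancelled"):
            continue
        due = datetime.date.fromisoformat(ai["due_date"])
        if due < today:
            overdue.append(f"- {ai['title']} (owner: {ai['owner']}, due {due}, "
                           f"{(today - due).days}d overdue) :: {path}")

if overdue:
    requests.post(WEBHOOK, json={
        "text": f"{len(overdue)} overdue post-mortem action items:\n" + "\n".join(overdue)
    }, timeout=5)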

Tabletop exercises and chaos game days

A team that only runs post-mortems on real incidents learns at the rate that incidents occur — slow, expensive, and at unfortunate hours. Tabletop exercises (the IC reads out a scenario, the team walks through what they would do, no real change to the system) and chaos game days (the SRE team breaks a non-critical surface in staging, the on-call practices the response with full tooling) let the team learn at the rate they can schedule. Hotstar runs quarterly chaos game days against their non-IPL workload — simulating Tempo backend failure, Mimir tenant cardinality breach, OTLP collector drop — and the post-mortems from these exercises produce real action items the same way real incidents do. Skipping game days because "we have enough real incidents" is a cost-trap: real incidents teach lessons at 03:00 with customers watching; game days teach the same lessons at 14:00 on a Wednesday with coffee.

Reproduce this on your laptop

python3 -m venv .venv && source .venv/bin/activate
pip install pyyaml requests prometheus-api-client kubernetes
# Run the playbook CI checker against the sample YAML
python3 verify_playbook.py
# Run the blameless lint against a draft post-mortem YAML
python3 blameless_check.py drafts/2026-04-22-rider-positioning.yaml

Both scripts are under 100 lines and run in under 5 seconds for a typical playbook or post-mortem. Dropping them into your team's CI pipeline is roughly an afternoon of work and converts both artefacts from "things that decay quietly" into "things that fail loudly when they decay".

Where this leads next

/wiki/incident-response-tooling covers the tooling layer that supports everything in this article — the war-room channel templates that the IC opens, the SEV-level definitions that drive paging behaviour, the comms scripts that go to customer support, the timeline-recording bot that captures every chat message and command. The cultural rules in this article (blameless template, action-item discipline) and the tooling chapter together form the operational substrate of an incident-response practice.

/wiki/the-observability-maturity-model places this article's artefacts on a maturity scale: "team has CI-verified playbooks", "team enforces a blameless lint on post-mortems", "team runs quarterly tabletop exercises" are all concrete maturity checkpoints. /wiki/the-30-year-arc closes Part 17 and the curriculum, looking at how the discipline of incident response has evolved from "the senior engineer fixes it and writes an email" in the 1990s to the structured-and-automated practice this article describes, and where it goes next.

References