Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
Playbooks, post-mortems, and blameless culture
It is 03:11 at Yatrika and Karan is staring at a PagerDuty page that says RiderPositioningIngestionStalled for the seventeenth time this quarter. The runbook link opens a Confluence page last edited fourteen months ago by an engineer who has since left. Step 3 links to a Grafana URL that returns 404 because the dashboard was renamed during the Mimir migration. Step 5 says "escalate to #data-platform" — that channel was archived in January. Karan does what every on-call eventually does — closes the runbook, opens a fresh terminal, runs kubectl logs and kafka-consumer-groups.sh from muscle memory, finds a stuck rebalance in eight minutes, restarts the consumer, and goes back to sleep. The runbook did not help. Worse, the post-mortem from the last time this fired — six weeks ago — had an action item titled "update runbook" assigned to an engineer now on parental leave. Three days from now the team will hold another post-mortem; if they run it like the previous one, the same action item will be reopened, reassigned, and reach the same fate. The discipline of playbooks and post-mortems is the discipline of refusing that loop. Most teams don't.
A playbook is the document an exhausted engineer executes at 03:00 — every step verifiable, every dashboard link healthy, every escalation path tested in the last 60 days. A post-mortem is a structured artefact that converts an incident into changes to the system, not into character assessments of the engineer who pushed the button. Blameless culture is not a vibe; it is a written set of rules about what cannot appear in a post-mortem document and a manager whose job is to enforce them. Without all three — runnable playbook, structured post-mortem template, enforced blameless rule — your incidents teach your team nothing.
What a real playbook looks like — and why most fail
A "runbook" or "playbook" is a step-by-step procedure for resolving a specific class of incident. The default failure mode: write it once, paste it into Confluence, link it from PagerDuty, never run it again. By month six it has rotted — dashboard URLs 404, kubectl namespaces renamed, the Slack escalation channel archived, the metric in step 4 renamed during the OpenMetrics migration. The on-call who relies on it at 03:00 wastes seven minutes discovering it is broken before falling back to first-principles debugging. The playbook is supposed to save those seven minutes; instead it costs them.
The corrective is not "write better runbooks". It is to treat the playbook as executable code that gets tested on a schedule, with the same CI discipline as any other code:
Why CI verification matters more than playbook quality: a perfectly written playbook decays at the rate the underlying system changes — typically 5–10 broken references per month with two weekly deploys. Confluence-hosted playbooks have no detector and converge on the same broken steady state regardless of writing quality. Git-hosted playbooks tested by CI have a forcing function — when a dashboard URL 404s on Sunday's CI run, an issue is opened Monday morning before the next 03:00 page. Skip the forcing function and the playbook quality is irrelevant; with it, even mediocre playbook prose stays runnable.
The real shape of a playbook that survives:
# verify_playbook.py — CI-verifiable playbook for "RiderPositioningIngestionStalled"
# pip install requests pyyaml kubernetes prometheus-api-client
import sys, time, requests, yaml
from kubernetes import client, config
from prometheus_api_client import PrometheusConnect

# The playbook is data, not prose. CI loads this YAML, walks each step, and fails the build
# if any precondition, link, or expected-output assertion no longer holds.
PLAYBOOK = yaml.safe_load(open("playbooks/rider-positioning-ingestion-stalled.yaml"))

def check_link(url: str) -> tuple[bool, str]:
    """Every link in the playbook must return 2xx; 404 means the dashboard was renamed."""
    try:
        r = requests.get(url, timeout=5, allow_redirects=True)
        return (r.status_code < 400, f"{r.status_code} {url}")
    except Exception as e:
        return (False, f"ERR {url}: {e}")

def check_kubectl_target(ctx: str, ns: str, deployment: str) -> tuple[bool, str]:
    """Every kubectl reference must resolve — namespace exists, deployment exists, has ready pods."""
    config.load_kube_config(context=ctx)
    v1 = client.AppsV1Api()
    try:
        d = v1.read_namespaced_deployment(deployment, ns)
        ready = d.status.ready_replicas or 0
        return (ready > 0, f"{ctx}/{ns}/{deployment} ready={ready}")
    except Exception as e:
        return (False, f"{ctx}/{ns}/{deployment} ERR {e}")

def check_promql(prom_url: str, query: str) -> tuple[bool, str]:
    """The PromQL queries the playbook tells the on-call to run must still parse and return."""
    p = PrometheusConnect(url=prom_url, disable_ssl=False)
    try:
        result = p.custom_query(query)
        return (result is not None, f"OK ({len(result)} series) :: {query[:60]}")
    except Exception as e:
        return (False, f"ERR :: {query[:60]} :: {e}")

def check_escalation(escalation: dict) -> tuple[bool, str]:
    """Escalation contact must have been verified within the last 90 days."""
    last_verified = escalation.get("last_verified_at")
    if not last_verified:
        return (False, f"{escalation['name']} never verified")
    age_days = (time.time() - time.mktime(time.strptime(last_verified, "%Y-%m-%d"))) / 86400
    return (age_days < 90, f"{escalation['name']} verified {int(age_days)}d ago")

failures = []
for step in PLAYBOOK["steps"]:
    name = step["name"]
    for link in step.get("links", []):
        ok, msg = check_link(link)
        if not ok: failures.append(f"[{name}] LINK: {msg}")
    for k in step.get("kubectl_targets", []):
        ok, msg = check_kubectl_target(k["context"], k["namespace"], k["deployment"])
        if not ok: failures.append(f"[{name}] K8S: {msg}")
    for q in step.get("promql_queries", []):
        ok, msg = check_promql(PLAYBOOK["prometheus_url"], q)
        if not ok: failures.append(f"[{name}] PROM: {msg}")

for esc in PLAYBOOK.get("escalation_chain", []):
    ok, msg = check_escalation(esc)
    if not ok: failures.append(f"[escalation] {msg}")

if failures:
    print(f"PLAYBOOK ROT DETECTED ({len(failures)} issues):")
    for f in failures: print(f" - {f}")
    sys.exit(1)
print(f"PLAYBOOK OK :: {len(PLAYBOOK['steps'])} steps verified")
Sample run on Sunday 02:14 IST after the Mimir namespace rename:
PLAYBOOK ROT DETECTED (3 issues):
- [check-consumer-lag] LINK: 404 https://grafana.yatrika.in/d/rider-pos-v2
- [restart-consumer] K8S: prod-1/riders/rider-positioning-consumer ERR not found
- [escalation] Aditi Rao verified 142d ago
Walking the load-bearing lines:
- PLAYBOOK = yaml.safe_load(...) — the playbook is structured data, not prose. The on-call still reads a rendered markdown view, but CI walks the YAML. This is the single change that turns a runbook from a Confluence relic into a tested artefact. Why YAML beats Markdown for this purpose: markdown is unstructured — a parser cannot tell the difference between "a link the on-call should click" and "a link to a related document". YAML forces the author to declare which links are operational (must work) versus referential (informational). The on-call's experience is identical because the markdown view is still generated from the YAML; the difference is invisible to humans but loud to CI.
- check_kubectl_target(...) — the kubectl steps are not just commands to copy-paste; they include the namespace and deployment that must exist. When a refactor renames riders to rider-positioning, CI catches it before any human does. Every command in a playbook implicitly asserts something about the system's shape; writing those assertions down is what makes the playbook part of the system rather than a description of it.
- check_escalation(...) with the 90-day verification clock — escalation contacts decay faster than any other field in a playbook. The 90-day verification is a forced calendar event: once a quarter, the playbook owner picks up the phone, confirms the number works, and updates last_verified_at. Without this clock, the escalation chain quietly rots until the night you actually need it.
- sys.exit(1) — the CI job fails, which means the team's standard alerting fires. The playbook's broken state becomes an actionable signal, not a background fact. The broken playbook generates the same kind of page that the broken system would, so it gets fixed with the same urgency.
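For concreteness, here is a minimal sketch of what the YAML the checker walks might contain — shown as the Python dict that yaml.safe_load would return. Every step name, URL, query, and contact below is hypothetical; the real file would carry whatever your playbook's steps actually reference.

```python
# Sketch of playbooks/rider-positioning-ingestion-stalled.yaml, shown as the dict
# yaml.safe_load would produce. All names, URLs, and contacts are hypothetical.
PLAYBOOK = {
    "prometheus_url": "https://prometheus.yatrika.in",  # hypothetical endpoint
    "steps": [
        {
            "name": "check-consumer-lag",
            # operational link: CI must get a 2xx from it every week
            "links": ["https://grafana.yatrika.in/d/rider-pos-v2"],
            # query the on-call is told to run; CI re-runs it to prove it still parses
            "promql_queries": ['sum(kafka_consumergroup_lag{group="rider-positioning"})'],
        },
        {
            "name": "restart-consumer",
            # kubectl reference: namespace and deployment must exist with ready pods
            "kubectl_targets": [
                {"context": "prod-1", "namespace": "riders",
                 "deployment": "rider-positioning-consumer"},
            ],
        },
    ],
    "escalation_chain": [
        # last_verified_at drives the 90-day clock in check_escalation()
        {"name": "Aditi Rao", "phone": "+91-00000-00000",  # placeholder number
         "last_verified_at": "2026-03-01"},
    ],
}

# The same structural walk verify_playbook.py performs, minus network/cluster calls:
assert all("name" in step for step in PLAYBOOK["steps"])
assert all("last_verified_at" in e for e in PLAYBOOK["escalation_chain"])
```

The point of the shape is that every field is machine-checkable: links, queries, kubectl targets, and escalation contacts each map to exactly one check function in the CI script.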
The Sunday 02:00 IST CI run is deliberately timed — low traffic, and just before the Monday standup, so any failures arrive with a meeting already on the calendar to discuss them. Across a 40-team Razorpay-shape org, weekly playbook CI catches roughly 60–80 stale references per quarter that would otherwise have been discovered at 03:00. That number is the entire ROI argument.
The post-mortem template — what makes one useful
A post-mortem is the artefact that converts an incident into change. The default failure modes are nearly universal: the document is a narrative ("on Wednesday at 14:32 the alert fired, then Karan investigated, then..."), the root cause section blames a person ("Karan deployed without testing"), the action items are vague ("improve testing"), and nothing changes because nothing was specific enough to change. Six weeks later, the same incident recurs. The post-mortem from that incident references the previous one, recycles the same action items, and ships nothing new.
The corrective is a structured template that forces specificity at every step. A useful post-mortem has these sections, in this exact order:
Yatrika post-mortem template — every SEV-2 and SEV-3 incident, owned by the Incident Commander.
1. Summary (3 sentences max). What broke, when, customer impact in concrete numbers. e.g. "On 2026-04-22 between 14:32 and 15:08 IST, rider-positioning ingestion lag exceeded 8 minutes due to a stuck Kafka consumer rebalance. 312 trips in Bengaluru saw delayed driver-allocation; 41 customer complaints; no revenue impact."
2. Timeline. Every event with HH:MM:SS, event, source (alert / human action / chat / commit). No prose, no judgement — raw record only. Must be reconstructable from chat log + alert log + git log; if it isn't, observability has a gap.
3. Detection. Time-to-detect = first alert minus first customer impact. Was the detection too slow? Did a customer report it before any alert fired? Drives changes to alerting.
4. Mitigation. Time-to-mitigate = mitigation minus detection. Was the mitigation in the playbook? Was the playbook current? Drives changes to playbooks.
5. Root cause analysis (5 whys, written down). Each why is one sentence. Stop at the systemic cause, not the human action.
6. What went well / poorly / where we got lucky. Three short lists. "Got lucky" surfaces near-misses — "we got lucky customer-support flagged it before the SLO burn-rate alert fired" is a confession that the SLO alert is too slow.
7. Action items (owner, due date, link). A single concrete change, owned by one person, due within 30 days, linked to a tracked ticket. No "improve X". No "investigate Y". If the AI is not a code/config/process change, it is not yet specified.
8. Followups. Things to monitor or revisit, with a date — checkpoints, not changes.
The order is load-bearing. Summary first because most readers (engineering directors, security, compliance) only read that section. Timeline before analysis because the raw record must exist independently of the team's interpretation of it. Action items second-to-last because they should be derived from the analysis, not pre-decided.
Why "5 whys" stops at systemic cause, not human action: the natural ending point for a human investigator is "because the engineer made a mistake". This is always true and never useful. The discipline is to keep asking why the system allowed the mistake — why was the timeout default wrong, why was there no review, why was the failure mode not caught by tests. The engineer being human is the input to the analysis, not the cause. The system change prevents the next incident; the engineer's mistake will recur with a different engineer next quarter regardless. Stop the why-chain when you reach a sentence that begins with "The system did not".
The format is rigid; the writing is short. A good Yatrika post-mortem is 2–3 pages of dense, specific prose — not 12 pages of narrative. The structure compresses better than freeform writing because each section answers exactly one question.
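Rendered as the structured document that a lint script could consume, a skeleton of the template might look like this — a sketch only, with every incident detail, owner, and ticket number illustrative. Note how the final why names a system gap, not a person:

```python
# Skeleton of a post-mortem in the template's shape, shown as a parsed dict.
# Incident details, configs, owners, and ticket IDs are all illustrative.
POSTMORTEM = {
    "summary": "Rider-positioning ingestion lag exceeded 8 minutes; 312 trips delayed.",
    "timeline": [  # raw record: HH:MM:SS, event, source — no judgement
        "14:32:05 alert RiderPositioningIngestionStalled fired (PagerDuty)",
        "14:40:11 stuck Kafka consumer rebalance identified (chat)",
    ],
    "detection": "TTD 6m: burn-rate alert fired before any customer report.",
    "mitigation": "TTM 28m: consumer restart, per playbook step 4.",
    "root_cause": {
        "whys": [  # each why one sentence; stop at the systemic cause
            "Ingestion stalled because the consumer group was stuck in a rebalance.",
            "The rebalance stuck because the session timeout sat below the GC pause ceiling.",
            "The timeout was wrong because the default was copied from an unrelated service.",
            "The copy went unreviewed because consumer configs sit outside code review.",
            "The system did not gate consumer-config changes behind any review or test.",
        ],
    },
    "what_went_well": ["Alert fired before customer reports."],
    "action_items": [  # one concrete change, one owner, a date, a ticket
        {"title": "Raise consumer session timeout to 45s in rider-positioning Helm values",
         "owner": "platform-oncall", "due_date": "2026-05-15", "ticket": "OBS-1234"},
    ],
    "followups": ["Revisit the consumer-config review gate on 2026-06-01."],
}

# The discipline from the template, expressed as a check: the why-chain
# must terminate at a sentence that names a system gap.
assert POSTMORTEM["root_cause"]["whys"][-1].lower().startswith("the system")
```

Each section answers exactly one question, which is why this compresses to 2–3 pages where a freeform narrative sprawls to 12.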
What blameless actually means — and how it gets enforced
"Blameless culture" is the most-talked-about, least-implemented concept in observability discourse. The naive version — "we don't blame people in post-mortems" — is a slogan that survives until the first SEV-1, at which point an executive asks "who pushed the bad commit" and the team realises blamelessness was never written down anywhere and so does not exist. The version that survives is a written set of rules about what cannot appear in a post-mortem document, plus a manager whose explicit job is to enforce them.
The Yatrika blameless rules — used as a checklist by the Incident Commander before any post-mortem is published:
# blameless_check.py — pre-publish checks for a post-mortem document
# pip install pyyaml
import re, sys, yaml

# Patterns that must NOT appear in the post-mortem narrative.
# Each pattern represents one of the four blame anti-patterns.
BLAME_PATTERNS = [
    # Anti-pattern 1: Naming an individual as the cause
    (r"\b(Karan|Aditi|Riya|Dipti|Jishant|Asha|Rahul|Kiran)\s+(deployed|pushed|broke|caused|forgot|missed|failed)\b",
     "names a specific engineer as cause"),
    # Anti-pattern 2: Hindsight-loaded modal verbs ("should have known")
    (r"\b(should have|ought to have|would have|could have)\s+(known|noticed|tested|caught|seen|realised|realized)\b",
     "uses hindsight-loaded modal verbs"),
    # Anti-pattern 3: Character judgement
    (r"\b(careless|negligent|sloppy|inexperienced|junior|inattentive|incompetent)\b",
     "uses character-judgement language"),
    # Anti-pattern 4: Blame disguised as passive voice
    (r"\b(was not properly|was not adequately|was insufficiently|failed to be properly)\b",
     "uses blame-laundering passive voice"),
]

# Required sections (the structural shape the template enforces)
REQUIRED_SECTIONS = ["summary", "timeline", "detection", "mitigation",
                     "root_cause", "what_went_well", "action_items", "followups"]

pm = yaml.safe_load(open(sys.argv[1]))
violations = []

# 1. Structural check: every required section present and non-empty
for s in REQUIRED_SECTIONS:
    if s not in pm or not pm[s]:
        violations.append(f"MISSING SECTION: {s}")

# 2. Blame-pattern check: scan all text fields for forbidden phrasings
def scan(text: str, location: str):
    for pattern, description in BLAME_PATTERNS:
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            violations.append(f"[{location}] BLAME PATTERN ({description}): '{m.group(0)}'")

for section in REQUIRED_SECTIONS:
    body = pm.get(section, "")
    if isinstance(body, str):
        scan(body, section)
    elif isinstance(body, list):
        for i, item in enumerate(body):
            scan(str(item), f"{section}[{i}]")
    else:
        scan(str(body), section)  # nested mappings (e.g. root_cause) scanned as flat text

# 3. Action-item shape check: each AI has owner, due_date, ticket
for i, ai in enumerate(pm.get("action_items", [])):
    for field in ("owner", "due_date", "ticket"):
        if not ai.get(field):
            violations.append(f"action_items[{i}] missing field: {field}")
    if "improve" in ai.get("title", "").lower() or "investigate" in ai.get("title", "").lower():
        violations.append(f"action_items[{i}] vague verb: '{ai['title']}'")

# 4. 5-whys depth check: did the why-chain reach a system cause?
rc = pm.get("root_cause", {})
whys = rc.get("whys", [])
if len(whys) < 5:
    violations.append(f"root_cause: only {len(whys)} whys; need at least 5")
if whys and not whys[-1].lower().startswith(("the system", "we did not", "there is no", "no process", "no review")):
    violations.append(f"root_cause: final why does not name a system gap: '{whys[-1][:80]}'")

if violations:
    print(f"POST-MORTEM CANNOT BE PUBLISHED ({len(violations)} issues):")
    for v in violations: print(f" - {v}")
    sys.exit(1)
print("OK :: post-mortem passes blameless and structural checks")
Sample run on a draft post-mortem before the Incident Commander review:
POST-MORTEM CANNOT BE PUBLISHED (5 issues):
- [timeline[7]] BLAME PATTERN (names a specific engineer as cause): 'Karan deployed'
- [root_cause] BLAME PATTERN (uses hindsight-loaded modal verbs): 'should have noticed'
- [mitigation] BLAME PATTERN (uses blame-laundering passive voice): 'was not properly'
- action_items[2] vague verb: 'Improve consumer lag monitoring'
- root_cause: final why does not name a system gap: 'The on-call did not check the dashboard'
The script is mechanical — it cannot detect every blame instance, and it can produce false positives ("Karan deployed the fix at 14:48" is a perfectly fine timeline entry). The point is not to be a perfect filter; the point is to give the Incident Commander a checklist that forces the conversation about each flagged phrase. Sometimes the answer is "this one is fine, the timeline genuinely needs the name". Often the answer is "you're right, let me rewrite that line". Either way, the team is now talking about whether the language matches their stated values, which is exactly what cultural enforcement means.
Why a CI-enforceable blameless check beats a culture statement: a written value ("we are blameless") is enforced only when a senior engineer notices a violation and pushes back — which selects for senior engineers willing to push back, a small fraction of any team. A CI check is enforced every time, by every Incident Commander, regardless of seniority. A Razorpay-shape implementation, on a 40-team org, reduced post-mortem revisions for blame language from "common" to "near-zero" within one quarter — not because team character improved, but because the friction of writing blame language increased. Same dynamic as linters: you stop writing the bad pattern because the tool refuses to ship it, and over time you stop reaching for it.
The four anti-patterns are not exhaustive but they are the four that account for ~80% of accidental blame in real post-mortems. Naming an individual is the obvious one. Hindsight-loaded modal verbs are the sneakier one — "should have noticed" assumes the engineer had information they did not have at the time. Character judgement ("careless", "junior") is the explicit one teams are most embarrassed by once it's flagged. Blame-laundering passive voice ("was not adequately tested") is the one teams use when they think they're being blameless but are actually just hiding the noun — the sentence still asserts that someone failed to do something, just without naming them, and the team in the room knows exactly who.
Common confusions
- "Blameless means no one is responsible for anything." Blameless means no individual is blamed in the post-mortem document; it does not mean nobody is accountable. Action items have owners. Owners are responsible for delivering the change by the due date. The Incident Commander is responsible for running a clean post-mortem. The platform team is responsible for the systems whose gaps were exposed. The blameless framing applies to the analysis of cause, not to the assignment of future work — a team that conflates the two ends up with no owners on action items, and nothing ships.
- "A playbook is the same as a runbook." Industry usage is messy. In this curriculum: a runbook is a document; a playbook is a runbook that has been verified by CI and is treated as part of the system. The terminological distinction is less important than the practical one — does the artefact have a forcing function that catches its own decay, or doesn't it? If it doesn't, you have a Confluence page that will rot regardless of what you call it.
- "A post-mortem's job is to find the root cause." A post-mortem's job is to change the system. The root cause section exists to drive action items; if it doesn't, the section is decoration. Many teams produce beautifully written 5-whys analyses that conclude with action items like "improve documentation" and ship nothing. The metric for a post-mortem's quality is not the depth of analysis but the count of code/config/process changes that landed within 30 days.
- "Action items should be ambitious." Action items should be specific and shippable in 30 days. "Re-architect the consumer to be lag-resilient" is not an action item; it is a project that needs its own design doc. The post-mortem's action item should be something concrete the team can land in two sprints — "set the liveness probe timeout to 120s in the rider-positioning Helm values, deploy by 2026-05-15". Larger projects belong in followups, with a date for the design discussion.
- "Post-mortems are only for SEV-1 incidents." Post-mortems are for any incident that taught the team something new. A SEV-3 alert that turned out to be a false positive deserves a post-mortem about why the alert design was wrong. A near-miss — an incident caught before customer impact — is often the most valuable post-mortem, because the team has cognitive bandwidth to do it carefully. Razorpay-shape mature platforms run near-miss post-mortems at roughly 1.4× the rate of customer-impacting ones.
- "The 5 whys are a literal count." The 5 whys are a rule of thumb for chain depth, not a target. Some chains reach the systemic cause in 3 whys; some need 7. The discipline is to keep asking until the answer names a system gap and stop exactly there. Mechanical 5-whys pads the chain with filler; honest 5-whys reaches the bottom in however many whys it takes.
Going deeper
Incident Commander as a rotated, trained role
The Incident Commander (IC) is the single person who runs the incident response — declares severity, opens the war-room channel, assigns roles (comms, scribe, technical leads), keeps the timeline, decides when to escalate, decides when the incident is over. The IC is not the engineer fixing the bug; they are the meta-engineer running the room. The IC role is rotated weekly across the platform team and the senior engineers across product teams, and IC duty is paid the same as on-call duty. What separates a working IC programme from a broken one is the training: every new IC shadows two incidents before running their own, runs their first with a senior IC observing, and runs a "ghost incident" tabletop quarterly. Without the training pipeline, IC duty falls on whoever is loudest in the channel, which selects for confidence-not-competence. PaisaBridge-shape platforms with mature IC programmes report mean-time-to-resolve roughly 2.1× faster than teams without trained ICs at the same scale.
The post-mortem follow-through audit
Action items have a 30-day due date. The discipline is enforcing it. The mechanism: a weekly automated audit that opens a Slack thread in #incidents-followup with every overdue action item, the original incident link, and the owner. The owner replies with (a) shipped, with the link, (b) re-scoped, with the new due date and reason, or (c) cancelled, with the reason. After 90 days, an action item must be one of these three or it gets surfaced in the platform-team's monthly review with engineering leadership. Without the audit, roughly 40% of action items in a typical platform team are silently abandoned within a quarter; with the audit, the rate drops to ~8%, and the abandoned ones are now consciously abandoned — which is itself a useful signal about the team's actual capacity.
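The classification logic at the heart of that audit is small; a sketch under assumptions — the Slack posting is stubbed to a print, and the action-item store is an in-memory list where in practice it would be pulled from the ticket tracker:

```python
# Sketch of the weekly follow-through audit. The ticket-tracker query and
# Slack posting are stubbed; the overdue-classification logic is the point.
from datetime import date

def audit(action_items: list[dict], today: date) -> list[str]:
    """Return one audit line per overdue action item not in a terminal state."""
    lines = []
    for ai in action_items:
        # shipped / re-scoped / cancelled are the three acceptable terminal states
        if ai.get("status") in ("shipped", "re-scoped", "cancelled"):
            continue
        due = date.fromisoformat(ai["due_date"])
        if due < today:
            overdue = (today - due).days
            lines.append(f"OVERDUE {overdue}d :: {ai['title']} :: "
                         f"owner={ai['owner']} :: {ai['incident_link']}")
    return lines

# Hypothetical items; in practice fetched from the tracker each Monday.
items = [
    {"title": "Raise consumer session timeout", "owner": "asha",
     "due_date": "2026-05-15", "status": "open",
     "incident_link": "https://wiki.yatrika.in/pm/2026-04-22"},
    {"title": "Add lag burn-rate alert", "owner": "rahul",
     "due_date": "2026-05-01", "status": "shipped",
     "incident_link": "https://wiki.yatrika.in/pm/2026-04-22"},
]

for line in audit(items, date(2026, 6, 1)):
    print(line)  # in production: post each line to #incidents-followup
```

Only the open, past-due item surfaces; the shipped one is silent. The design choice worth copying is that "overdue" is computed fresh each week from the tracker, never from a manually maintained list — the audit cannot itself rot the way a runbook does.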
Tabletop exercises and chaos game days
A team that only runs post-mortems on real incidents learns at the rate that incidents occur — slow, expensive, and at unfortunate hours. Tabletop exercises (the IC reads out a scenario, the team walks through what they would do, no real change to the system) and chaos game days (the SRE team breaks a non-critical surface in staging, the on-call practices the response with full tooling) let the team learn at the rate they can schedule. Hotstar runs quarterly chaos game days against their non-IPL workload — simulating Tempo backend failure, Mimir tenant cardinality breach, OTLP collector drop — and the post-mortems from these exercises produce real action items the same way real incidents do. Skipping game days because "we have enough real incidents" is a cost-trap: real incidents teach lessons at 03:00 with customers watching; game days teach the same lessons at 14:00 on a Wednesday with coffee.
Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install pyyaml requests prometheus-api-client kubernetes
# Run the playbook CI checker against the sample YAML
python3 verify_playbook.py
# Run the blameless lint against a draft post-mortem YAML
python3 blameless_check.py drafts/2026-04-22-rider-positioning.yaml
Both scripts are 80–100 lines each and run in under 5 seconds for a typical playbook or post-mortem. Dropping them into your team's CI pipeline is roughly an afternoon of work and converts both artefacts from "things that decay quietly" into "things that fail loudly when they decay".
Where this leads next
/wiki/incident-response-tooling covers the tooling layer that supports everything in this article — the war-room channel templates that the IC opens, the SEV-level definitions that drive paging behaviour, the comms scripts that go to customer support, the timeline-recording bot that captures every chat message and command. The cultural rules in this article (blameless template, action-item discipline) and the tooling chapter together form the operational substrate of an incident-response practice.
/wiki/the-observability-maturity-model places this article's artefacts on a maturity scale: "team has CI-verified playbooks", "team enforces a blameless lint on post-mortems", "team runs quarterly tabletop exercises" are all concrete maturity checkpoints. /wiki/the-30-year-arc closes Part 17 and the curriculum, looking at how the discipline of incident response has evolved from "the senior engineer fixes it and writes an email" in the 1990s to the structured-and-automated practice this article describes, and where it goes next.
References
- John Allspaw, "Blameless Post-Mortems and a Just Culture" (Etsy code-as-craft, 2012) — the canonical articulation of blameless post-mortems; the four-anti-pattern frame in this article extends Allspaw's argument with code-enforceable checks.
- Google SRE Workbook, "Postmortem Culture" (Chapter 10) and "On-Call" (Chapter 8) — the structural template (timeline, detection, mitigation, root cause, action items) used in this article is adapted directly from Google's published workbook.
- Sidney Dekker, The Field Guide to Understanding Human Error (CRC Press, 3rd ed., 2014) — the philosophical underpinning of "stop the why-chain at the systemic cause"; informs the blameless-check rules about hindsight-loaded modal verbs and character judgement.
- Charity Majors, Liz Fong-Jones, George Miranda, Observability Engineering (O'Reilly, 2022) — Chapter 14 on "Service Level Objectives" and the production-debugging chapters frame the relationship between alerting design and post-mortem learning.
- PagerDuty Incident Response documentation — the public-domain runbook for IC role responsibilities, severity definitions, and comms patterns; the IC training programme structure in §Going deeper draws on this.
- Atlassian, "Incident Management Handbook" — counterweight reference; the action-item-with-owner-and-due-date discipline appears in similar form here.
- /wiki/building-the-team — internal: the team structure (separate observability on-call rotation, written charter) that this article's rituals run inside. The post-mortem culture cannot exist without the team-shape that owns it.
- /wiki/wall-the-discipline-ties-this-all-together — internal: the Part-16 wall that frames observability as a discipline whose three artefacts (CI gate, on-call rotation, quarterly review) include the playbook-CI and post-mortem-lint introduced here.