Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
Game days: rehearsing failure with humans in the room
It is 2:00pm on a Wednesday at CricStream's Bengaluru office and a printout is taped to the conference room wall: "Today: the south-1 region's video transcode tier loses 30% of its workers at 2:15pm. The on-call engineer does not know it's planned." The platform team has staged the fault in a chaos-platform draft. The reliability lead has a stopwatch. Six people — the on-call, two SREs, a product manager, the head of platform, and a senior backend engineer who has never seen this dashboard before — are seated around a table with laptops open. At 2:15pm, the platform fires. The SLO graph turns yellow at 2:15:03. The on-call gets paged at 2:16:41. The reliability lead writes down both timestamps. The detection took 1 minute 38 seconds, the runbook page had a broken link, and the senior backend engineer asked four questions in the first ten minutes that nobody on the on-call rota could answer. That is the data the game day produces. The fault was staged; the humans were the experiment.
A game day is a planned, scheduled rehearsal where a real fault is injected into a real environment while a human team responds in real time. Its purpose is not to find new bugs in the code — that's what platform-level fault injection does. Its purpose is to find bugs in the response system: the runbook, the dashboards, the paging tree, the assumed knowledge of the on-call. Game days run for 2–4 hours, follow a specific structure (brief → inject → respond → debrief), and produce action items measured in days-to-fix, not lines of code. The output is a sharper team, not a fixed bug.
What a game day actually is, and what it is not
A game day is the chaos-engineering equivalent of a fire drill. The fire is fake; the alarm, the evacuation route, the assembly point, and the fire warden's headcount are real. You do not run a fire drill to test whether fire exists — you run it to test whether the people in the building know what to do when the alarm rings.
A game day is not a chaos-platform experiment. The platform-injection chapter described an automated, hypothesis-driven, SLO-bounded test that runs without humans. A game day flips that on its head: the humans are the system under test. The fault is the controlled variable; the response — detection, triage, communication, mitigation, recovery — is what gets measured. Specifically, a game day tests four things the platform cannot:
- Did the right person get paged at the right time? Not just "did the alert fire" — but did the page reach a human who has context, in less time than the SLO budget allows.
- Was the runbook usable under stress? A runbook that reads fine on a quiet Tuesday may be incomprehensible at 2am with a real outage and three Slack channels lighting up.
- Do the dashboards tell the story the responders need? A graph that goes red is necessary but not sufficient; the responder needs to see which shard, which dependency, which deploy.
- Does the team's mental model of the system match reality? When five engineers in a room disagree about how a component fails, the disagreement itself is the most valuable artefact of the day.
The four phases of a well-run game day
A game day that drifts becomes a stressful afternoon with no useful output. The structure that prevents drift is borrowed from emergency-response drills: brief, inject, respond, debrief. Each phase has a fixed time-box and a different facilitator role.
Phase 1: Brief (30 minutes). The facilitator gathers the responders, walks through the purpose (what we're testing — usually one of the four questions above), introduces the scope (which services, which region, which time window), and reads out the rules of engagement. Crucially, the brief tells responders what they will not be told: the on-call does not know which fault, which target, or which time. Observers know everything. The brief also names the abort conditions — if real customer impact exceeds threshold X, the facilitator stops the day. Without a written abort condition, the team will rationalise running through real harm.
Phase 2: Inject (5 minutes). At a time pre-agreed with the observers (the on-call is not told), the platform fires the fault. The facilitator notes the wall-clock injection time. From this moment, observers stop talking — they take notes. The on-call sees what they always see: an alert (or not), a graph (or not), a Slack ping (or not). The fault is real. The system's response is real. The pages go to real phones.
Phase 3: Respond (60–120 minutes). This is the meat. The on-call works the incident with whatever tools they normally have — runbook, dashboards, Slack, phone. Observers record:
- Time-to-detect (fault fires → first alert acknowledged).
- Time-to-context (alert acknowledged → responder identifies which service / shard / dependency).
- Time-to-mitigation (context found → SLO recovers or fault isolated).
- Time-to-resolution (mitigation in place → fault fully reverted, system green).
- Every question the responder asks, every wrong turn, every dashboard they wish they had.
The facilitator does not help. If the responder is stuck, the facilitator may grant a single "hint" after a pre-agreed timer (often 30 minutes) — but every hint granted is itself an action item ("the team needed 30 minutes to find the shard ID; the dashboard must surface this by default").
Phase 4: Debrief (30–60 minutes). Within an hour of the fault clearing, the team gathers. The facilitator reads back the timeline. The responders narrate what they thought was happening at each timestamp. The gap between the two narratives is the gold. Action items are written on the board with owners and dates. No blame. The blameless framing is not optional — the moment a game day becomes a performance review, every future game day produces choreographed responses instead of honest ones.
The four-phase structure has one further property worth naming: the phases are sequential and indivisible. You cannot debrief in pieces over the next week; the freshness of memory decays in hours, not days. You cannot extend the response phase indefinitely just because something interesting is unfolding; the team will be tired and the debrief will be thin. The time-boxes are not arbitrary suggestions — they are calibrated to how long humans can sustain focused incident response (roughly two hours peak) and how quickly memories of subjective experience fade (roughly half-life of one day for narrative detail). A facilitator who lets a phase overrun is sacrificing the most expensive part of the day, which is the debrief. Why: action items written within an hour of the response phase capture detail that disappears overnight — exact wording the responder used, the dashboard panel they wished existed, the moment they realised their mental model was wrong. A debrief written 24 hours later is a summary; a debrief within the hour is a transcript.
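The four durations the phase-3 observers record reduce to arithmetic once the five timestamps are written down. A minimal sketch of that reduction (the parsing helper is ours, and the timestamps after the page are illustrative, not from the opening scene):
from datetime import datetime

def _t(s):
    # Parse a wall-clock timestamp such as "14:16:41".
    return datetime.strptime(s, "%H:%M:%S")

def response_metrics(injected, acked, context_found, mitigated, resolved):
    """Turn the five timestamps an observer writes down into the four
    debrief metrics, in seconds."""
    ts = [_t(x) for x in (injected, acked, context_found, mitigated, resolved)]
    names = ("time_to_detect", "time_to_context",
             "time_to_mitigation", "time_to_resolution")
    return {n: (b - a).total_seconds() for n, a, b in zip(names, ts, ts[1:])}

print(response_metrics("14:15:00", "14:16:41", "14:24:10", "14:31:00", "14:48:30"))
# {'time_to_detect': 101.0, 'time_to_context': 449.0,
#  'time_to_mitigation': 410.0, 'time_to_resolution': 1050.0}
Reading the numbers back in the debrief is what turns them into action items: a time_to_context of 449 seconds is a dashboard finding, not a responder finding.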
A scheduler that picks the right fault for the team you have
Not every team is ready for every fault. A team that has never run a game day should not start with cross-region partition; they should start with "one pod restarts unexpectedly". Picking the right fault for the team's maturity is a judgement call, but a small piece of code can encode the rule of thumb: pick the easiest fault the team has not yet exercised, and only graduate to the next tier when the team's response time on the current tier is below the SLO budget.
from dataclasses import dataclass, field

@dataclass(order=True)
class Fault:
    # Ordering compares on difficulty alone; the other fields are metadata.
    difficulty: int
    name: str = field(compare=False)
    blast: str = field(compare=False)
    expected_detect_s: int = field(compare=False)

CATALOGUE = [
    Fault(1, "single-pod-restart", "1 pod", 30),
    Fault(2, "single-az-network-blip", "1 AZ, 30s", 60),
    Fault(3, "dependency-latency-200ms", "1 service", 90),
    Fault(4, "dependency-503-rate-30pct", "1 service", 120),
    Fault(5, "primary-replica-failover", "1 shard", 180),
    Fault(6, "cross-az-partition-2min", "1 region", 300),
    Fault(7, "cross-region-partition-5min", "global", 600),
]

def pick_next(team_history):
    """team_history: list of (fault_name, observed_detect_s)."""
    completed = {name: detect_s for name, detect_s in team_history}
    for f in sorted(CATALOGUE):  # easiest tier first
        if f.name not in completed:
            return f, "first time at this tier"
        if completed[f.name] > f.expected_detect_s:
            return f, (f"team detected in {completed[f.name]}s, "
                       f"target {f.expected_detect_s}s — repeat")
    return None, "team has graduated all tiers"

# CricStream's mobile-streaming team history
history = [
    ("single-pod-restart", 18),
    ("single-az-network-blip", 45),
    ("dependency-latency-200ms", 140),  # over the 90s target
]
next_fault, reason = pick_next(history)
print(f"NEXT: {next_fault.name}")
print(f"  blast: {next_fault.blast}")
print(f"  target detect: {next_fault.expected_detect_s}s")
print(f"  reason: {reason}")
Output:
NEXT: dependency-latency-200ms
  blast: 1 service
  target detect: 90s
  reason: team detected in 140s, target 90s — repeat
Walking through it: CATALOGUE is the team-readiness ladder — seven fault types, increasing in blast radius and expected difficulty. pick_next walks the ladder from easiest to hardest and returns the first tier the team has either never run (graduation by exposure) or ran slower than the target detection time (a repeat, until the time is under target). The CricStream mobile-streaming team has run the first three tiers; on the third, they took 140s when the target was 90s. Why: a team that detects a fault slower than the SLO budget allows would not have caught the same fault in production fast enough to mitigate within the budget — repeating the tier until the time is under target is the only way to verify the gap has closed. Only after the third-tier time is under 90s does the scheduler advance to tier 4. The rule encodes a real principle: difficulty is a function of the team's current ability, not a global constant. A team that has run quarterly game days for two years can start at tier 5; a team running their first ever game day starts at tier 1, full stop.
How observers should take notes (the role nobody trains for)
There is one role on a game day that nobody is taught how to do: observer. Five engineers in a room watching one engineer respond will, by default, take terrible notes. They write things they think are interesting, they miss timestamps, they argue with each other in real time, and they form opinions that bias the debrief. A good observer follows three rules. First, write the timestamp before the observation, every time — "14:17:03 — responder opens fraud-scoring dashboard, scrolls to error-rate panel, says 'this is wrong'". The bare narrative without timestamps is worthless because the debrief cannot reconstruct the timeline. Second, distinguish quotes from interpretation — quotes go in double quotes, interpretation goes in square brackets. An observer who writes "responder confused about shard topology" instead of "responder said: 'wait, which shard is this'" has injected their own reading into the record. Third, never speak during the response phase. A whispered comment between observers contaminates the data — the responder hears it, adjusts behaviour, and the room is now testing a hybrid of the responder and the observers. Observers who cannot stay silent should leave the room.
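The three rules are simple enough to encode as a note format. A sketch of a structured observation record, assuming nothing beyond the conventions just described (the field names are illustrative, not a standard):
from dataclasses import dataclass

@dataclass
class Observation:
    timestamp: str            # rule 1: the wall-clock time comes first, always
    quote: str = ""           # rule 2: verbatim responder speech, in double quotes
    interpretation: str = ""  # rule 2: the observer's reading, kept separate

    def render(self):
        parts = [self.timestamp, "—"]
        if self.quote:
            parts.append(f'"{self.quote}"')
        if self.interpretation:
            parts.append(f"[{self.interpretation}]")
        return " ".join(parts)

log = [
    Observation("14:17:03", quote="wait, which shard is this",
                interpretation="responder unsure of shard topology"),
    Observation("14:19:40", interpretation="opens three dashboards, closes two"),
]
for obs in log:
    print(obs.render())
# 14:17:03 — "wait, which shard is this" [responder unsure of shard topology]
# 14:19:40 — [opens three dashboards, closes two]
Rule three has no field because it is not a note-taking rule; silence is enforced by the facilitator, not the format.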
CricStream's first cross-region partition: what the runbook said vs what they did
CricStream's reliability team scheduled their first cross-region partition game day for a Wednesday afternoon, six months into their game-day programme. The team had passed tiers 1 through 6 with detection times under target. The fault: simulate a 5-minute partition between the south-1 (Bengaluru) and west-1 (Mumbai) regions for the live-stream metadata service. The runbook for cross-region partition existed, was three pages long, and had been written by a senior engineer who had since left the company.
At 2:30pm the partition fired. The on-call detected the alert in 47 seconds — well under target. The next 18 minutes were a study in runbook decay. The runbook said "open the regional topology dashboard at /grafana/d/regional-topo" — that dashboard had been renamed nine months earlier. The runbook said "drain south-1 traffic via paas-cli regional drain south-1" — the CLI subcommand had been refactored into paas-cli traffic shift --from=south-1 --pct=0. The runbook referenced a Slack channel #sre-network-partitions that had been archived. The on-call ended up reverse-engineering the correct procedure live, which took the team into a real customer-facing degradation window — the SLO was breached by 4 minutes 12 seconds, and the facilitator was 30 seconds away from calling the abort.
The debrief produced 14 action items. Three of them were "fix the runbook" (small). One of them was "every runbook gets a 90-day staleness alert that pages the team that owns it" (large). Why: a runbook is code with no tests — without an automated trigger that forces it to be re-read on a cadence, every refactor of the underlying CLI or dashboard silently breaks it, and the breakage is only discovered during the very incident the runbook was meant to handle. The largest finding, though, was something the team wrote on a sticky note and stuck to the meeting-room door for a year afterwards: ₹0 was spent fixing the underlying system; ₹0 was spent fixing the partition handling; the entire game day's value was in finding that a runbook the team would have bet money on was wrong in three places. The next quarterly game day repeated tier 7 with the corrected runbook and recorded a detection-to-mitigation time of 4 minutes 30 seconds — under the 10-minute target.
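The large action item is small in code. A sketch of the 90-day staleness check, assuming each runbook carries a last-reviewed date in its metadata (the repo layout, team names, and dates here are hypothetical):
from datetime import date, timedelta

STALENESS_LIMIT = timedelta(days=90)

def stale_runbooks(runbooks, today=None):
    """runbooks: list of (path, owner_team, last_reviewed) tuples.
    Returns the ones overdue for a re-read; each result becomes a page
    to the owning team, not a line on a dashboard nobody checks."""
    today = today or date.today()
    return [(path, team) for path, team, reviewed in runbooks
            if today - reviewed > STALENESS_LIMIT]

books = [
    ("runbooks/cross-region-partition.md", "stream-metadata", date(2024, 1, 10)),
    ("runbooks/single-pod-restart.md", "platform", date(2024, 9, 1)),
]
for path, team in stale_runbooks(books, today=date(2024, 10, 1)):
    print(f"PAGE {team}: {path} not re-read in over 90 days")
The page, not an email, is the point: the alert has to interrupt the owning team the same way the broken runbook would otherwise interrupt an incident.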
Common confusions
- "A game day is the same as a chaos experiment." A chaos experiment runs without humans; a game day is about the humans. They use the same injection mechanism but answer different questions.
- "Game days find new bugs in the system." Sometimes, but usually not. They find bugs in the response system — runbooks, dashboards, paging trees, mental models. If you want to find code bugs, run platform-level injection.
- "You should not announce a game day in advance — it should be a surprise." Wrong, twice. The on-call is unaware of the specific fault, but the team and the wider org know the day is happening. Surprise game days look like real incidents to leadership and produce panic, escalation, and the wrong kind of pressure.
- "Game days only work for SREs." Game days work better when product, engineering, and customer-support people observe — they discover that their dashboards do not show what they thought, and their assumptions about how the system fails are different from how it actually fails.
- "If the team handles the fault perfectly, the game day was a waste." A perfect run means you ran the wrong tier. The next game day raises difficulty.
Going deeper
The original Amazon GameDay programme (Jesse Robbins, 2004)
The phrase "GameDay" was coined by Jesse Robbins at Amazon in 2004 — Robbins, a former volunteer firefighter, brought the fire-drill metaphor explicitly. The first GameDays at Amazon disconnected entire data centres for hours at a time. The lesson Robbins emphasised in later talks was that the rehearsal value comes from the combination of real fault and human response — disconnecting a data centre at 3am with nobody watching teaches you nothing; doing it at 3pm with the team in a room watching the dashboards teaches you everything. Why: the failure modes that destroy systems are not the ones the code handles badly; they are the ones the team handles badly under stress, and stress only exists when humans are responding in real time. Modern game-day programmes at Netflix, Google, and Microsoft all trace lineage to Robbins's framing.
The "free play" failure mode
A common drift pattern: a game day starts with a clear hypothesis ("test the database failover runbook") but the on-call ends up debugging a tangentially related issue they noticed during the response, and the day becomes 90 minutes of free-form debugging. This is the free play failure mode — interesting work happens but the original hypothesis goes untested. The fix is the facilitator's job: if the responder drifts, the facilitator notes the side-quest as a separate ticket and steers them back to the hypothesis. Free play is fine as a planned phase 5 (the "open exploration" phase, often after debrief), but it must not consume phase 3.
Game days as the substrate for incident-response training
New hires at PaySetu shadow three game days before they go on the on-call rota. The shadow learns the dashboard layout, the paging tree, the runbook conventions, and — most importantly — how senior engineers reason about an unfolding incident. The same data the game day produces (timeline, narrative, decision points) becomes the curriculum for next quarter's training. The cost of running a game day for training is essentially zero on top of the cost of running it for the team's own benefit; the marginal benefit is enormous. KapitalKite went further and started recording game days (with consent) as training videos for new SREs.
Why game days fail when the company is in crunch mode
Game days require slack — four hours of focused time from a senior team. When the business is in a release-crunch period, game days are the first thing cut. The pattern is so predictable that several reliability teams now schedule game days as blocking on the engineering calendar — the team commits to four hours that cannot be reclaimed, and if a release blocks the slot the release moves, not the game day. This is a cultural fight more than a technical one. Without leadership buy-in, the programme degrades within two quarters of the first crunch.
The relationship to the chaos-engineering principles
The five Netflix principles (build a hypothesis, vary real-world events, run in production, automate experiments, minimise blast radius) apply to platform injection. Game days extend the principles with three more, specific to the human element: rehearse with the people who will respond at 3am, not their managers; treat the runbook as an artefact under test; and never run a game day you cannot abort within 30 seconds. The third is the safety rule that distinguishes a game day from an actual incident — the facilitator must always have a one-keystroke kill switch.
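The 30-second abort rule implies a specific shape: the revert path is armed and verified before injection, so that aborting requires no decisions under stress. A minimal sketch against a hypothetical chaos-platform client (none of these method names come from a real tool):
class KillSwitch:
    """Arm the revert before the fault fires; abort is then a single call."""

    def __init__(self, platform, experiment_id):
        self.platform = platform          # hypothetical chaos-platform client
        self.experiment_id = experiment_id
        self.armed = False

    def arm(self):
        # Prove the halt path works *before* injection, e.g. via a dry run.
        self.platform.dry_run_halt(self.experiment_id)
        self.armed = True

    def fire(self):
        if not self.armed:
            raise RuntimeError("refusing to inject: kill switch never armed")
        self.platform.start(self.experiment_id)

    def abort(self):
        # The one keystroke. Completes in seconds, never prompts for input.
        self.platform.halt(self.experiment_id)
Demonstrating abort() to the room before injection, as the kit list at the end of this chapter asks, is what turns the rule from policy into habit.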
What goes on the wall — the artefact a good game day produces
The most useful artefact a game day produces is not the timeline or the action items — it is the one-page summary taped to the team's wall the morning after. A good summary fits on A4: hypothesis at the top, the timeline as a horizontal strip with the four key timestamps, three things the team did well, three things the team did poorly, and one row of action items with owners and due dates. The reason the wall matters is durability: action items in Jira drift; a printed page that the team walks past every morning does not. MealRush's reliability team has thirty-one of these pages on a wall along the corridor outside the on-call room, ordered chronologically. A new hire who walks the corridor learns more about the platform's failure modes in fifteen minutes than from a week of reading documentation. Why: failure modes are episodic and contextual, and a chronological wall preserves both — what happened, who was on-call, what we changed, what failed differently next time.
The "tabletop" variant for very expensive systems
Some systems are too expensive to fault for real — a planetary-scale CDN, a primary database tier handling every single user write, the order book of an exchange during market hours. For these you run the tabletop variant: the facilitator describes a hypothetical fault verbally ("at 2:15pm, the primary write-tier of the matching engine drops 50% of incoming orders for 90 seconds") and the team narrates what they would do, while the facilitator simulates the responses of dependent systems ("the trade-confirmation queue depth grows to 4 lakh messages, what next?"). Tabletops are 60% as effective as live game days and 5% as expensive. They miss the physical lessons — the runbook link being broken, the dashboard being slow to load — but they catch the cognitive lessons: who do we page, what do we tell the regulator, when do we declare a market-wide halt. KapitalKite runs tabletops monthly and live game days only on weekends when the market is closed; the combination is what allows them to rehearse without ever putting a paying customer's order at risk. The discipline is to be honest about which lessons each format produces and not pretend that a tabletop substitutes for the real thing.
When a game day surfaces something that shouldn't go in the runbook
Occasionally a game day reveals a finding that is too sensitive to put in a written runbook — for example, a race condition in a third-party dependency that the vendor has not patched, or a manual override credential that only two engineers have. The right response is not to write it into the runbook: record it in the game-day debrief notes, name the two people who hold the knowledge, and add an action item to make the dependency-fix or credential-rotation explicit. A runbook that contains escalation paths involving named individuals becomes wrong the moment one of those individuals leaves. The discipline is to treat every "ask Anita, she knows how to do that" as a bug, not a feature, and file the action item to remove the dependency on Anita.
How small teams should run game days when they only have one on-call
A four-engineer startup with one on-call rotation cannot run the canonical four-hour, six-observer game day. The constraint is real and the response should not be "skip game days until we are bigger". Instead, run a light variant: one facilitator (a senior engineer or founder), one responder (the on-call), one observer (taking notes), and a one-hour total time-box. The fault is small (tier 1 or 2 from the catalogue above). The debrief is fifteen minutes and produces at most three action items. The discipline is exactly the same — written hypothesis, real injection, blameless retrospective — only the scale is smaller. PaisaCard ran light game days monthly during their first year and graduated to full quarterly game days only when they hit ten engineers. The lesson the founder drew was that early game days invented the on-call culture rather than merely testing it: the act of rehearsing made the team behave like an on-call team months before they otherwise would have.
Where this leads next
Game days produce action items; the platform that processes those action items is incident-response tooling — the dashboards, runbook engines, and paging policies covered in the next chapter. The two are tightly coupled: a game day surfaces a runbook gap, and the incident-response tooling chapter covers how to close that kind of gap systematically. After that, the curriculum closes with the observability maturity model, which puts game days in their place: somewhere between "we have dashboards" and "we have predictive failure detection" on a maturity spectrum.
Game days also feed back into platform injection. A fault that surfaced a real human failure mode during a game day deserves to be added to the platform's automated catalogue, so that the next regression in the runbook gets caught before the next game day. See /wiki/incident-response-tooling, /wiki/automating-chaos-in-ci-cd, and /wiki/the-observability-maturity-model for the path forward.
References
- Robbins, Jesse. "GameDay: Creating Resiliency Through Destruction." Velocity Conf, 2012. — the original talk that named the practice.
- Allspaw, John. "Fault Injection in Production: Making the Case for Resilience Testing." ACM Queue, 2012. — the cultural argument for game days.
- Rosenthal, Casey & Jones, Nora. "Chaos Engineering: System Resiliency in Practice." O'Reilly, 2020. Chapter 9 covers game-day operations.
- Beyer, Betsy et al. "Site Reliability Engineering: How Google Runs Production Systems." O'Reilly, 2016. Chapter 28 ("Accelerating SREs to On-Call and Beyond") covers training rotations that mirror game days.
- AWS Well-Architected Framework: "Reliability Pillar — Test reliability." The managed-cloud version of the practice.
- /wiki/fault-injection-at-the-platform-level — the automated counterpart that game days sit on top of.
- /wiki/the-principles-netflix — the five tenets, extended here with the human-element corollaries.
- /wiki/automating-chaos-in-ci-cd — where action items from game days end up encoded.
- /wiki/blast-radius-and-recovery — the conceptual ancestor of the abort-condition rule.
- Limoncelli, Tom et al. "The Practice of Cloud System Administration." Vol. 2, 2014. Section on incident management drills.
A short list of what to bring to your first game day
Practical kit, learned the hard way:
- A printed copy of the current runbook (so the team cannot edit it mid-incident and pretend it always said that).
- A fresh whiteboard or shared doc with the four key timestamps pre-templated for the observers.
- A phone with the paging app installed and verified an hour before — the paging path is part of the test.
- A written hypothesis in a sealed envelope opened at the brief (theatre, but the theatre forces a clean separation between planning and execution).
- A one-keystroke kill switch the facilitator owns and has demonstrated to the room before injection.
The kit is the same whether it's your first game day or your hundredth. The thing that changes is what you write down at the end.