Cloudflare and the blog-post post-mortem culture
At 13:42 UTC on 2 July 2019 a Cloudflare engineer pushed a new WAF rule. By 13:43 every Cloudflare edge node on the planet was burning 100% CPU inside a single regex's catastrophic backtracking. From 13:42 to 14:09 — 27 minutes — requests to sites behind Cloudflare overwhelmingly returned 502 Bad Gateway, including Discord, Medium, and a long tail of 16 million domains. The fix was a global kill-switch on the WAF. The interesting part of the story is not the regex. The interesting part is that on 12 July, ten days later, Cloudflare published a 4,000-word post on their public engineering blog with the engineer's name on it, the bad regex shown verbatim, the traffic impact quantified, and a list of seven specific changes to their deploy pipeline. The blog post is the artefact this chapter is about. It is also the artefact most Indian platforms still do not produce, and the absence is load-bearing.
A post-mortem is not a debrief; a debrief is internal, retrospective, and optional. A post-mortem is a public, externally-published, structurally-mandated document that names the failure, the cause, and the fix. Cloudflare's culture is unusual in that the post-mortem is treated as a deliverable on par with the fix itself — and the writing of it has more long-term value than the fix, because it changes which engineers want to work there, which mistakes the industry stops making, and what the customer can demand from the next vendor. The cost is the engineer's time and the embarrassment; the return is compounding trust.
What happened on 2 July 2019 — the 27 minutes in detail
The deploy was routine. A Cloudflare engineer added a new rule to the Web Application Firewall to block a specific class of XSS payloads observed in the wild. The rule was a regular expression. It compiled. It passed the staging test suite. It rolled out to all 194 Cloudflare edge nodes within two minutes — that rollout speed is itself the property that made the WAF valuable, and is why the deploy pipeline did not gate on a longer canary. At 13:42 UTC, every edge node started executing the new rule against every HTTP request. At 13:43 UTC, every edge node was at 100% CPU, refusing requests with 502 Bad Gateway. The CPU was not in handler code; it was inside the regex engine, evaluating a pattern that had catastrophic backtracking on inputs the staging tests had not exercised.
The bad regex was published in the post-mortem verbatim:
(?:(?:\"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|`|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*)))
The pathological case is the (?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*) tail. Why this specific regex melts on long inputs: the three unbounded * quantifiers in sequence — (?:\s|...)*, .*, .* — combined with the lack of an anchor on either end, mean the regex engine has to try every possible split of the input string between the three repeating groups. For an input of length n, the number of attempted splits grows at least quadratically, and exponentially when the alternation inside the first group has overlapping matches. On a 1,200-byte HTTP header — the kind sent by a normal browser — the engine performs roughly 10⁸ comparisons before either matching or failing. PCRE on a modern x86 core does ~10⁹ regex steps per second; 10⁸ steps is 100 ms of CPU per request. At Cloudflare's per-node request rate of ~10,000 req/s, the node's CPU budget is exhausted before the first second of traffic finishes evaluating.
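To see how quickly that arithmetic eats a machine, here is a back-of-envelope sketch in Python using the figures above; the step count, step rate, and request rate are this section's rough estimates, and the core count is a hypothetical number added purely for illustration.

# Back-of-envelope: how a 100 ms-per-request regex saturates a node's CPU.
# All inputs are the rough estimates quoted above, not measured values.
STEPS_PER_REQUEST = 1e8      # regex engine steps on a ~1,200-byte header
STEPS_PER_SECOND = 1e9       # rough PCRE throughput on one x86 core
REQUESTS_PER_SEC = 10_000    # assumed per-node request rate
CORES_PER_NODE = 32          # hypothetical core count, for illustration only

cpu_seconds_per_request = STEPS_PER_REQUEST / STEPS_PER_SECOND    # 0.1 s of CPU per request
cpu_demand = cpu_seconds_per_request * REQUESTS_PER_SEC           # core-seconds needed per wall-second
print(f"CPU demand: {cpu_demand:.0f} core-seconds per second of traffic")
print(f"Against {CORES_PER_NODE} cores, offered load is {cpu_demand / CORES_PER_NODE:.0f}x capacity")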
The rollback path was a feature flag named WAF_GLOBAL_DISABLE. An on-call engineer flipped it at 14:00 UTC. The flag took 9 minutes to propagate to all 194 edge nodes — the flag's own propagation pipeline used some of the same machinery the WAF lived inside, and the saturated CPU was slowing flag delivery. The last edge node returned to healthy at 14:09 UTC.
A second-order property of this incident shaped the post-mortem more than the regex itself: the engineer who pushed the rule, the engineer who diagnosed it, and the engineer who flipped the kill-switch were three different people on three different time zones (the deploy was during San Francisco morning, the diagnosis happened in London, the rollback was triggered by an Austin on-call). The internal incident channel had 47 active participants by 13:55 UTC. The post-mortem had to reconstruct a sequence of events that no single person had observed end-to-end. This is the universal shape of a global-scale incident — no individual saw the whole thing — and it is the shape Indian platforms running multi-region services should expect for their next outage too.
A third structural property worth naming: the deploy that caused the outage had passed every existing safety check. The regex compiled. The CI test suite passed. The staging cluster did not show the regression because the staging traffic was synthetic and did not include the long User-Agent and Cookie headers that real browsers send. The new rule was rolled out via the same global-deploy mechanism that had successfully deployed thousands of prior WAF rules; nothing about the deploy procedure was anomalous. This is the most important property to internalise about systems failures at scale — they typically do not happen because someone violated a process. They happen because the process did not anticipate the failure mode. A post-mortem that says "the engineer should have been more careful" is almost always wrong; the post-mortem that says "our deploy procedure trusted regex compilation as a proxy for safety, and that proxy was wrong" is the one that produces durable change.
What the post-mortem did — the document as artefact
Cloudflare published "Details of the Cloudflare outage on July 2, 2019" on 12 July, signed by John Graham-Cumming (CTO at the time). The document had a specific shape that is worth dissecting because the shape — not the contents — is the part that generalises. Six structural elements:
Named author. The post-mortem had a byline. Not "Cloudflare engineering" but a specific person who took responsibility on the public record. This costs the company one Hacker News thread of "name and shame" comments. It buys, in exchange, the credibility that comes from accountability not being abstracted away. Most Indian platforms publish post-mortems (when they publish them at all) under a corporate byline; the difference reads, to the technical audience, as the difference between "this was a mistake we own" and "this was an event that happened to us".
Verbatim cause. The bad regex was in the post. Not paraphrased. Not redacted. The reader could copy-paste it into PCRE and reproduce the catastrophic backtracking themselves. This is the technical-credibility bar: you do not need to know the proprietary internals of the WAF to verify the failure mode, because the failure mode has been published in primitive form. Why this matters for industry-level learning: a post-mortem that says "a misconfigured rule caused CPU saturation" teaches nothing; a post-mortem that publishes the specific regex and the specific backtracking math teaches every other team in the world to audit their own regex deploys for the same pathology. The first form is theatre; the second form is genuine knowledge transfer. The economic argument for the second form: every other team that catches this in their own codebase is one fewer 502-storm the internet collectively suffers, and the marginal cost to Cloudflare of publishing the regex is zero.
Revenue impact, named. The post quantified the impact: a 27-minute outage during which global traffic through Cloudflare dropped by 82%. Cloudflare did not publish a rupee/dollar amount, but the percentage was specific enough that any reader could multiply against Cloudflare's then-public revenue and arrive at an estimate. Most Indian platform post-mortems redact this. The redaction signals to engineering teams that the post-mortem is a PR document, not an engineering document, and that the right reaction to incidents is to manage the narrative rather than to learn from the failure. This is an avoidable cultural cost.
Specific remediation list. The post listed seven changes Cloudflare would make: regex pattern static-analysis in CI, bounded execution time on WAF rules, kill-switch propagation isolated from the saturated path, canary deploy at 1% of traffic before global rollout, and three more. Each remediation was specific enough that a reader could ask "did Cloudflare actually do this?" six months later. The act of listing them publicly is a commitment device — once you have written "we will gate WAF rules on a 1% canary" on a public engineering blog, the next quarterly review has the artefact to reference.
No personal blame. The post was clear that the engineer who pushed the rule was not the cause; the cause was a deploy pipeline that allowed an unsafe regex to reach production without bounded execution time. This is the blameless post-mortem pattern that Etsy's John Allspaw famously articulated in 2012 — but it is worth naming in this curriculum because the instinct in Indian engineering organisations is often the opposite (find the engineer, communicate the consequence). A blameless post-mortem is not a soft-skills nicety; it is the precondition for the engineer who pushed the rule to write down what they actually did, instead of writing down a sanitised story that protects them.
Time-to-publication. Ten days. Not three months. The 10-day window was just long enough for the engineering investigation to converge but short enough that the audience that cared about the incident was still paying attention. Most Indian platform post-mortems land 60-90 days after the incident, by which time the Hacker News thread is forgotten, the customer has either churned or moved on, and the document reads as a corporate hindsight exercise rather than as a contemporaneous account.
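Of the six elements, the remediation list is the one concrete enough to sketch in code. Here is a minimal illustration of the shape a 1%-canary gate can take; the FleetHealth structure, metric names, and thresholds are illustrative assumptions, not Cloudflare's actual pipeline.

# Illustrative canary gate: ship a new WAF rule to ~1% of traffic, compare the
# canary fleet's health against the baseline fleet, and only then go global.
from dataclasses import dataclass

@dataclass
class FleetHealth:
    error_rate: float   # fraction of 5xx responses
    cpu_util: float     # mean CPU utilisation, 0.0 to 1.0

def canary_gate(baseline: FleetHealth, canary: FleetHealth,
                max_error_delta: float = 0.001,
                max_cpu_delta: float = 0.05) -> bool:
    """Return True only if the canary fleet is not measurably worse than baseline."""
    if canary.error_rate - baseline.error_rate > max_error_delta:
        return False
    if canary.cpu_util - baseline.cpu_util > max_cpu_delta:
        return False
    return True

# A rule that melts CPU on real headers fails the gate even if it never produces
# an error response; the CPU delta alone blocks the global rollout.
print(canary_gate(FleetHealth(0.0001, 0.40), FleetHealth(0.0001, 0.99)))  # False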
The take-away is that the post-mortem is a deliverable, with a specification, a deadline, and a quality bar that an engineering organisation either meets or does not. Cloudflare meets it. Most teams do not. The rest of this chapter is about why the gap exists and what to do about it.
A useful internal benchmark: when your team next ships an incident, ask the post-mortem author to write the document as though it will be published externally — even if the decision to publish has not yet been made. The exercise of writing for the external audience forces a level of specificity that the internal version often skips. Phrases like "the database had issues" do not survive the external-audience filter; they get rewritten into "PostgreSQL connection pool exhaustion at 18:42 IST when the connection-reaper goroutine deadlocked on a mutex held by the metrics-export path". The discipline of writing for the external audience produces a better internal document, even if the document never leaves the company. Razorpay's published practice (per their 2023 engineering newsletter) is to write every internal RCA in the external-publishable format, then make the publication decision separately. The cost is small (the writing is the same effort either way); the benefit is that internal RCAs reach a consistently higher technical bar.
# regex_backtracking_demo.py
# Reproduce the catastrophic-backtracking shape of the Cloudflare 2019 regex
# without using PCRE's full pattern (which is genuinely slow). This is a
# minimal pathological regex against inputs of growing length.
#
# Run: python3 regex_backtracking_demo.py
import re
import time
# A simplified pathological pattern with the same structural problem:
# nested quantifiers without an anchor produce O(2^n) backtracking
# on inputs that fail to match.
PATHOLOGICAL = re.compile(r"(a|a)*b")
# A safe equivalent that matches the same language but with no backtracking
# (no overlap in the alternation, atomic structure).
SAFE = re.compile(r"a*b")
def benchmark(pattern, label, max_n=24):
    print(f"\n{label}: pattern = {pattern.pattern!r}")
    print(f"{'n':>4} {'time_ms':>10}")
    for n in range(8, max_n + 1):
        s = "a" * n  # input that does NOT match (no trailing 'b')
        t0 = time.perf_counter()
        pattern.match(s)
        elapsed_ms = (time.perf_counter() - t0) * 1000
        print(f"{n:>4} {elapsed_ms:>10.3f}")
        if elapsed_ms > 5000:  # don't blow up the demo
            print("     (aborting — exponential growth confirmed)")
            return

if __name__ == "__main__":
    benchmark(PATHOLOGICAL, "PATHOLOGICAL — nested quantifiers")
    benchmark(SAFE, "SAFE — atomic match")
Sample run on a 2024 MacBook M3 Pro:
$ python3 regex_backtracking_demo.py
PATHOLOGICAL: pattern = '(a|a)*b'
n time_ms
8 0.092
12 1.421
16 22.840
20 358.110
24 5731.200
(aborting — exponential growth confirmed)
SAFE: pattern = 'a*b'
n time_ms
8 0.001
12 0.001
16 0.002
20 0.002
24 0.002
Walk through the lines that carry the lesson:
- PATHOLOGICAL = re.compile(r"(a|a)*b") — the alternation (a|a) lets the regex engine match the same character via two distinct paths. When the regex fails (no trailing b), the engine has to try every possible assignment of input characters to the two paths. The number of assignments is 2^n. Why this is the same shape as Cloudflare's regex: the production regex did not have a literal (a|a) but had (?:\s|-|~|!|{}|\|\||\+)* where some of the alternation branches could match the same character (the \|\| branch overlaps with \| if the input had pipes). The overlap creates exponential paths, and the failed-match case forces the engine to enumerate all of them.
- pattern.match(s) with input s = "a" * n — the input is engineered to fail the match. The engine only does the full backtracking enumeration on failed matches; matches succeed on the first viable assignment. This is the trap: positive test cases may pass quickly, while inputs designed to fail (or that happen to fail the way the regex was structured) trigger the exponential blowup.
- The n=24 row taking 5.7 seconds — Python's re module is not particularly fast, but the shape generalises. PCRE on the same input is ~3× faster but still exponential. The point is not the constant; the point is that a 16-character difference in input length produced a 60,000× slowdown.
- SAFE = re.compile(r"a*b") — same matched language, no nesting, no overlap. Matches in microseconds regardless of input length. The lesson for any team running regex in a hot path: anchor the regex (^, $), avoid nested quantifiers, and prefer atomic groups (?>...) or possessive quantifiers *+ where the regex flavour supports them (a minimal sketch follows below). The Cloudflare remediation included a static analyser that flagged the nesting pattern in any new WAF rule before it could deploy.
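The atomic-group and possessive-quantifier fix named in the last bullet has been available in CPython's own re module since Python 3.11; here is a minimal sketch of the difference on the toy pattern (timings will vary by machine, but only the backtracking variant grows with input length):

# Requires Python 3.11+ (the first stdlib release with atomic groups and
# possessive quantifiers in re). The backtracking variant takes on the order of
# a second at n=22; the other two fail in microseconds, because the engine is
# not allowed to re-split the 'a's once the quantified group has consumed them.
import re
import time

def time_match_ms(pattern: str, text: str) -> float:
    t0 = time.perf_counter()
    re.match(pattern, text)
    return (time.perf_counter() - t0) * 1000

text = "a" * 22  # non-matching input: no trailing 'b'
print(f"backtracking  (a|a)*b    : {time_match_ms(r'(a|a)*b', text):>10.3f} ms")
print(f"possessive    (a|a)*+b   : {time_match_ms(r'(a|a)*+b', text):>10.3f} ms")
print(f"atomic group  (?>(a|a)*)b: {time_match_ms(r'(?>(a|a)*)b', text):>10.3f} ms")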
The reason this benchmark is worth running on your laptop is that the exponential shape is hard to imagine until you see the wall-clock numbers climb. The first column moves linearly; the second column moves exponentially. Engineers reading a static description of "exponential backtracking" often nod and move on; engineers watching the seconds column climb from 22 ms to 5.7 s in three rows internalise the shape and start auditing their own regexes.
The failure modes that public post-mortems uniquely surface
Some classes of failure are diagnosable only by comparing notes across companies, and post-mortem culture is the only durable mechanism for that comparison. Three classes worth naming:
Cross-vendor pathology. The Cloudflare regex pathology was not unique to Cloudflare. AWS WAF had a similar bug class (caught at internal review only because an engineer had read the Cloudflare post). Akamai's Kona Site Defender had structurally similar exposure. None of these companies would have known about the others' near-misses without public post-mortems. The class of bug — "user-supplied pattern compiled by us, evaluated against attacker-controlled input, on a hot path with no execution-time bound" — recurs across many systems (template engines, query parsers, configuration validators), and the public post-mortem is the cheapest mechanism for the industry to converge on the fix. Why no internal mechanism substitutes: each company's internal review only sees its own incidents, which means each company's internal lessons are a strict subset of the industry's accumulated lessons. The Cloudflare engineer who pushed the bad regex had not seen a similar pattern internally because Cloudflare had not had this class of incident before. Reading another company's post-mortem is the only way to learn from incidents the company has not yet had.
Slow-burn drift. The opposite of a sharp incident is a regression that creeps in over months — a service whose p99 climbs from 50 ms to 80 ms to 120 ms across three quarters until it crosses the SLO. These are visible in the data but not in the alarm flow, and post-mortems on them tend to read as "we noticed our p99 had drifted" rather than as a single-incident narrative. The discipline of writing the slow-burn drift as a post-mortem forces the team to identify the specific commits, deploys, or workload changes that contributed to the drift. Without the post-mortem framing, the team's response is usually a one-line ticket ("performance regression") that does not produce the specific-remediation list. With the framing, the team produces "we will add p99 alarms at 80% of SLO instead of 100%, and we will run a weekly perf-regression test against last quarter's golden trace". The shift is in specificity, and specificity is what produces durable change.
Coupled-system blast radius. When a single deploy takes down multiple unrelated services because of an undocumented dependency, the post-mortem is the place where the dependency gets named. The Cloudflare 9-minute kill-switch propagation gap is one example; a 2021 Razorpay incident where a deploy to the metrics service took down the payments service (because the payments service blocked synchronously on a metrics-emit call that timed out) is another. These dependencies are invisible in the architecture diagrams; they only surface when something fails. The post-mortem catalogues them, and over time the catalogue becomes a map of the actual coupling in the system — far more accurate than the official architecture documentation.
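A minimal sketch of the decoupling that removes this dependency class; the queue size, the drop-on-full policy, and the send_to_metrics_backend stub are illustrative assumptions, not Razorpay's actual fix.

# Illustrative decoupling: the payments hot path enqueues metrics and never
# blocks; a background thread talks to the metrics backend, and the queue drops
# points on overflow instead of back-pressuring the caller.
import queue
import threading
import time

def send_to_metrics_backend(name: str, value: float, timeout: float) -> None:
    # Stand-in for a real metrics client; imagine a network call that can hang.
    time.sleep(0.01)

_metrics_q: queue.Queue = queue.Queue(maxsize=10_000)

def emit(name: str, value: float) -> None:
    """Called from the hot path: never blocks, never raises."""
    try:
        _metrics_q.put_nowait((name, value))
    except queue.Full:
        pass  # dropping a metric point is cheaper than delaying a payment

def _flusher() -> None:
    while True:
        name, value = _metrics_q.get()
        try:
            send_to_metrics_backend(name, value, timeout=0.2)
        except Exception:
            pass  # a down metrics backend must not take payments down with it

threading.Thread(target=_flusher, daemon=True).start()
emit("payment.latency_ms", 42.0)  # returns immediately regardless of backend health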
These three classes of failure are not unique to large platforms; smaller teams encounter them too. The reason small teams underweight post-mortems is usually that the perceived cost-benefit favours skipping them ("we are too small for this to be worth the time"). The math is the opposite — small teams have less institutional memory and benefit more from each post-mortem, because each document is a larger fraction of the team's accumulated learning. A 20-engineer startup that has published five post-mortems has captured more institutional knowledge per engineer than a 2,000-engineer company that has published 50.
The asymmetry has practical consequences. A founding-team engineer who internalises the post-mortem habit early carries it through the company's growth from 5 to 500 engineers; the practice is then embedded by the time the company is large enough that introducing it later would face cultural resistance. Razorpay's published practice traces to its founding years; PhonePe's lack of a public practice traces to a similar period. The two companies are now different sizes but the path-dependence is visible — Razorpay can publish today because its early engineers built the muscle, and PhonePe would now struggle to introduce the practice against accumulated organisational habit. The lesson for any new company: start the post-mortem habit on incident #1, not on incident #50. The first post-mortem is the hardest to write and ship; every subsequent one is easier, and the cumulative cost of the habit is sub-linear in the number of incidents while the benefit compounds.
How a post-mortem changes incentives — the long-tail benefits
Why does a public, named, specific post-mortem matter beyond the immediate communication value? Three compounding incentive shifts that play out over years:
It changes hiring. Engineers with a choice between vendors read engineering blogs. A platform that publishes detailed post-mortems signals that engineering rigour is rewarded internally — the engineer whose deploy caused the outage is named on the post, not fired. The signal travels: senior engineers who have been burned by blame-cultures read Cloudflare's post-mortem and decide that this is the kind of place where they can take risks. The hiring effect is hard to measure but persistent. Cloudflare's engineering reputation, ten years on, is partly downstream of the post-mortem culture they established in their early years. Razorpay has begun to follow this pattern (their "Razorpay Engineering" blog now publishes detailed incident reviews); Zerodha publishes some, but redacts the engineer's name; most Indian platforms publish nothing and the recruiting consequence is exactly what you would predict — they cannot win bidding wars for senior engineers against companies whose engineering culture is publicly visible.
It changes customer trust. Cloudflare's customers, after the 2 July 2019 incident, did not churn en masse. The post-mortem was a substantial part of why. A vendor that takes you down for 27 minutes and says nothing is a vendor you cannot underwrite. A vendor that takes you down for 27 minutes and publishes a 4,000-word document explaining exactly what happened is a vendor whose failure modes you understand. You can underwrite "the WAF deploy pipeline now has bounded regex execution time"; you cannot underwrite a corporate apology. The asymmetry is most visible in B2B sales — the procurement team asks for the most recent post-mortem during the security review, and the vendor that has published one wins the deal. This is the under-appreciated revenue case for the post-mortem: it is a sales artefact.
It changes industry-level learning. The Cloudflare post-mortem made every WAF and IDS team in the world audit their regex pipelines. ModSecurity (the open-source WAF) added regex-complexity static analysis within months. AWS WAF added per-rule timeout limits. Akamai, a Cloudflare competitor, caught a similar incident at the canary stage in 2020 because their team had read the Cloudflare post and added the canary stage to their own pipeline. The cumulative cost-savings across the industry from one post-mortem are difficult to estimate but plausibly larger than Cloudflare's own annual security budget — the 2 July incident, by being publicly dissected, prevented an unknown number of similar incidents at other companies. This is the externality argument: the post-mortem produces value that the publishing company does not capture, but that the industry benefits from. A culture of publishing post-mortems is, at the industry level, a public good.
The argument for the post-mortem is not "transparency is virtuous". The argument is "transparency, in this specific form, is the most economically rational way to extract value from an incident that has already happened". The incident's cost is sunk; the post-mortem converts it into an asset. Skipping the post-mortem is leaving that asset on the table.
A subtler benefit, often underweighted: post-mortems compound internally too. An engineering organisation that has published 50 detailed post-mortems over five years has a corpus that new joiners can read on day one. The institutional memory of "things that went wrong and why" lives in the corpus instead of in the heads of senior engineers, which means the organisation tolerates churn better. When a senior engineer leaves, their accumulated incident wisdom does not leave with them — it lives in the post-mortems they wrote. Indian platforms that run internal-only RCAs (root cause analyses) lose this benefit in a particular way: when the senior engineer who wrote the RCA leaves, the next reader has to be granted access to whatever Confluence space the RCA lives in, and that access is often not granted promptly. A public blog post is, perversely, more accessible to your own future engineers than an internal Confluence document, because the public blog post does not require permissions. This is the institutional-memory case for the public form.
A fourth-order benefit, named in Google's SRE Workbook chapter on post-mortem culture: the act of writing a post-mortem produces better engineers. The discipline of converting a chaotic incident into a coherent narrative — with timestamps that are accurate, mechanisms that are named, and remediations that are specific — is itself a skill that transfers to design work, code review, and architectural decisions. An engineer who has authored five post-mortems over their career has a different mental toolkit than one who has authored none; they reason about failure modes in design review, propose specific instrumentation for new systems, and ask "what is the post-mortem we will write if this fails?" before shipping. The skill is rare enough that hiring managers can recognise it in interviews — the candidate who has authored published post-mortems answers "describe a time something went wrong" questions with structural clarity rather than narrative ramble. Indian platforms that produce no public post-mortems also produce engineers who lack this mental toolkit by default; the engineers can develop it, but the development is slower without the practice.
A complementary benefit at the team level: post-mortem reviews — the meeting where the team discusses the document before it ships — are an exceptionally high-bandwidth onboarding mechanism. A new joiner who attends three post-mortem reviews in their first month has learned more about the team's production architecture than they would learn from three months of code-reading. The reviews surface the implicit knowledge ("the auth service is owned by the platform team but the on-call rotation is shared with payments because of historical reasons") that would otherwise take quarters to acquire. Companies that run post-mortem reviews as the default forum for senior engineers and new joiners to interact build organisational competence faster than companies that route onboarding through formal training. The mechanism is the same one Cloudflare uses; the practice is replicable; the prerequisite is again the cultural one of decoupling post-mortem authorship from blame.
Indian platforms — where post-mortem culture stands today
The state of public post-mortems across Indian platforms varies, and the variance maps cleanly to engineering reputation. A snapshot from 2024:
Razorpay publishes detailed post-mortems for major incidents on engineering.razorpay.com. The 2022 post-mortem on a UPI payment double-debit (caused by a race condition in their idempotency layer) named the engineer, included the buggy SQL, and listed eight remediations. This is the closest Indian platform analog to Cloudflare's culture. The downstream effect is visible in Razorpay's hiring — the company is consistently in the top quartile of compensation for senior backend engineers in Bangalore, and the engineering blog is cited by candidates as a reason they take the offer.
Zerodha publishes incident notes on Twitter and brief technical posts on zerodha.tech. The notes are honest and specific (Nithin Kamath, the founder, has tweeted things like "we screwed up, here is what happened, here is the fix") but stop short of the named-engineer + verbatim-cause format. The compromise reflects regulatory caution: Zerodha is SEBI-regulated and conservative about anything that looks like an admission of liability. The trade-off is real — the regulatory environment in India is more cautious than the US — but it is also a partially self-imposed constraint. Other SEBI-regulated entities (most of the trading firms, all the AMCs) publish nothing, so Zerodha's level of disclosure is already a positive outlier in its sector.
Hotstar / JioCinema publish marketing post-mortems after big events ("we served 25M concurrent for the IPL final, here is the architecture") but do not publish incident post-mortems when the platform fails (the 2023 IPL final saw a 90-second video stutter for a measurable fraction of users; no post-mortem was published). The difference between a victory lap and a failure debrief is the difference between marketing and engineering, and the absence of the latter is conspicuous to engineers who watch the platform from outside.
Flipkart / PhonePe publish almost nothing in public for incidents. Both run internal RCA processes that, by the accounts of engineers who have worked there, are rigorous — but the artefacts do not leave the corporate boundary. The recruiting cost is real: senior engineers comparing offers between PhonePe and Razorpay see Razorpay's engineering blog and infer (correctly) that engineering accountability is more publicly visible there. The cost is hard to quantify but compounds over hiring cycles.
IRCTC / UIDAI / NPCI publish post-mortems at most reluctantly and after parliamentary intervention. The 2018 IRCTC Tatkal failure that took the booking system down for the morning rush produced a one-paragraph notice; the 2022 UPI outage that affected ~12% of transactions for 41 minutes produced an NPCI press release with no technical content. The public-sector pattern is consistent: post-mortems exist internally (RBI demands them), but the public artefact is bureaucratic rather than technical. The accumulated cost is to Indian developer-trust in public infrastructure, which is harder to rebuild than the equivalent private-sector trust.
The pattern across all of these: the platforms with the most public post-mortem activity (Razorpay, Zerodha to a lesser extent) are also the platforms with the strongest engineering reputations and the highest senior-engineer compensation. The causation runs both ways — post-mortems attract talent, talent produces fewer incidents, fewer incidents make post-mortems easier to publish — and the feedback loop compounds. Platforms that have not started this loop pay a permanent tax in hiring and trust; the tax does not show up in any single quarterly review, which is why it persists.
A useful exercise for any Indian platform engineer: search your company's engineering blog (if it has one) for "post-mortem" or "incident". If the most recent result is more than 12 months old, or if the document is a marketing piece rather than a technical incident review, the company has not yet committed to the practice. The conversation to have with leadership is not "should we publish post-mortems" — that question rarely produces movement. The conversation is "for our most recent significant incident, what is the specific reason we did not publish a public post-mortem?" The answers are usually variants of "PR concerns" or "legal review", and once articulated they can be negotiated. The legal review on Cloudflare's post-mortem reportedly took a single afternoon; the legal review on the equivalent Indian post-mortem takes weeks because the lawyers have not yet seen one done. The pattern breaks once the first one ships.
A second exercise, more measurable: count the number of public post-mortems your platform has published in the last 24 months that contain (a) a specific verbatim cause and (b) a numbered remediation list. The count for most Indian platforms is zero. For Razorpay it is roughly six. For Cloudflare it is 40+. The order-of-magnitude gap is the visible part of the cultural gap, and it is the part that recruiters, customers, and engineering candidates use as a proxy for the invisible part. Closing the gap is not free — each post-mortem is roughly 2 engineer-weeks of work between investigation, writing, legal review, and editorial pass — but the per-document cost amortises across the recruiting and trust returns over the document's lifetime. The two-week investment at publication compounds into multi-year hiring advantage; the unpublished version produces no such return.
A third structural observation that ties the Indian-platform survey together: the platforms that publish post-mortems are also the platforms that have invested in on-call compensation models that decouple incident response from career risk. Razorpay's on-call rotation pays a per-shift premium and explicitly does not factor incident-involvement into performance review. Zerodha's on-call structure is similar. The platforms that do not publish post-mortems also tend to be the platforms where being on-call during a major incident is a career-limiting event ("you were on the rotation when the outage happened"). The two patterns are linked: a culture that punishes on-call engineers cannot produce honest post-mortems, because the engineers will protect themselves. Fixing the publication gap therefore requires fixing the on-call-incentive gap first; the order matters. A team that tries to mandate post-mortem publication without first fixing the incident-blame culture produces sanitised documents that fail at every level — they neither inform the industry nor the internal team, and they wastefully consume engineer-time that would have been better spent on the fix itself. The cultural prerequisite is non-negotiable.
Common confusions
- "A post-mortem is the same as an internal RCA" An RCA (root cause analysis) is an internal engineering document. A post-mortem, in the Cloudflare sense, is a public artefact. They share content but not audience or function. The RCA exists to prevent recurrence within the team; the post-mortem exists to extract industry-level learning, build customer trust, and signal engineering culture. A company that produces RCAs but no post-mortems is doing 80% of the engineering work and capturing 30% of the value.
- "A post-mortem is the same as a status-page incident notice" A status-page notice is a customer communication during the incident; it tells the customer what is broken and when it will be fixed. A post-mortem is a forensic document published after the incident; it tells the customer (and the industry) why the incident happened. Both are necessary; conflating them is a category error that produces post-mortems that read like extended status-page notices ("we apologise for any inconvenience") and fail to contain technical content.
- "Naming the engineer is blame culture" The opposite. The Cloudflare post-mortem named the engineer who pushed the rule and explicitly absolved them ("the engineer followed our deploy procedure correctly; the procedure was wrong"). Naming the engineer is part of the blameless framing — it acknowledges the human in the loop and shifts the blame to the system that allowed the human's action to cause the outcome. Anonymising the engineer signals that the company is hiding the human, which usually means the company is preparing to blame them.
- "We can't publish because of regulatory risk" Sometimes true (SEBI, RBI, financial regulators do impose disclosure constraints) and often used as an excuse when it is not true. The test: would a minimal, sanitised version of the post-mortem actually expose the company to regulatory liability, or is the legal team uncomfortable because they have not seen one done before? The answer is usually the second. Most regulatory frameworks do not prohibit technical post-mortems; they prohibit specific kinds of customer-data disclosure, which a competent post-mortem does not include.
- "Post-mortems are nice-to-have, not necessary" They are necessary for the engineering hiring market that the company competes in. Senior backend engineers in 2024 read engineering blogs as part of their offer evaluation. A platform that does not publish post-mortems is, to that audience, signalling that the engineering culture is opaque. The cost is a perpetual hiring discount that the company pays in compensation premium to overcome.
- "We will publish a post-mortem when we have a really significant incident" The norm is built incrementally. A team that publishes detailed post-mortems for medium-severity incidents has the muscle to publish quickly when a major incident hits. A team that has never published one will struggle to produce the first one under pressure. Cloudflare's first post-mortems were for relatively minor incidents; the practice was already established when the major outage came. Waiting for "the right incident" is the wrong way to start.
Going deeper
The blameless post-mortem — its origin and the constraint it imposes
The blameless post-mortem is associated with John Allspaw at Etsy (2012, "Blameless PostMortems and a Just Culture"). The core claim is that an engineer who fears blame produces an inaccurate account, and an inaccurate account prevents systemic learning. The constraint is real: a blameless framing requires the company to commit, in advance, to not punishing the engineer named in the post-mortem. Companies that have not made this commitment cannot run a Cloudflare-style post-mortem culture, because the engineer will (correctly) refuse to be named. The first cultural change is therefore not "publish post-mortems" but "decouple post-mortems from performance review". Razorpay's published practice explicitly states that incident involvement is not a performance-review input; this is the structural precondition for engineers to write candidly. Indian platforms that have not made this decoupling — most of them — cannot adopt the practice even if the leadership wants to, because the engineers will not cooperate. The decoupling is the prerequisite, and it is harder than it sounds.
Static analysis of regex patterns — the Cloudflare remediation in detail
The specific remediation that prevented the recurrence was a static analyser run at WAF-rule deploy time that flagged regex patterns with nested unbounded quantifiers, alternation with overlap, or no anchor. The analyser was not novel — research on regex complexity classification (Davis, Coghlan, Servant 2018, "The impact of regular expression denial of service") had described the algorithms — but Cloudflare's contribution was integrating it into the deploy pipeline as a hard gate. The deeper point is that the post-mortem named the missing piece of static analysis; once named, the engineering work to add it was a sprint of work, not a research project. The post-mortem is the artefact that converts a vague concern ("we should be more careful with regex") into a specific deploy-time check. Without the post-mortem to focus the work, the static analyser would have been a "someday" item that competed with feature work and lost. With the post-mortem as a public commitment, the static analyser shipped within six weeks. The pattern generalises: post-mortems convert concerns into specifications, and specifications convert into shipped code.
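To show the shape of such a gate, here is a toy analyser that flags the structural smells this section names (an unbounded quantifier over a group containing alternation or further quantifiers, a missing anchor, adjacent .* quantifiers). It is a heuristic sketch only, nowhere near the rigour of the published ReDoS analysers.

# Toy deploy-time gate for WAF-style regex rules. Heuristic only: it flags the
# structural smells named above; it does not prove a pattern safe the way a
# real ReDoS analyser would.
import re

def audit(pattern: str) -> list[str]:
    findings = []
    # An unbounded quantifier applied to a group that itself contains '|', '*'
    # or '+': the classic catastrophic-backtracking shape.
    if re.search(r"\([^()]*[|*+][^()]*\)[*+]", pattern):
        findings.append("unbounded quantifier over a group containing '|', '*' or '+'")
    if not (pattern.startswith("^") or pattern.endswith("$")):
        findings.append("unanchored pattern: the engine retries the match at every offset")
    if re.search(r"\.\*.*?\.\*", pattern):
        findings.append("multiple .* quantifiers: ambiguous splits of the input")
    return findings

for rule in [r"(a|a)*b", r"^user_id=\d{1,10}$", r".*(?:.*=.*)"]:
    problems = audit(rule)
    verdict = "REJECTED" if problems else "OK"
    print(f"{verdict:8} {rule!r}  {problems}")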
The 9-minute kill-switch propagation — coupling between control and data planes
A subtle structural lesson from the 2 July incident: the kill-switch mechanism shared infrastructure with the saturated WAF. The flag-propagation pipeline used the same edge-node management daemon that the WAF lived inside, and that daemon was CPU-starved. The remediation was to move the kill-switch into a dedicated propagation channel that does not share resources with the data plane. This is the control-plane / data-plane separation principle, and it generalises: any mechanism whose purpose is to recover from a data-plane failure must not depend on the data plane being healthy. Indian platforms that run blue-green deployments have a similar trap when the deploy controller depends on the cluster's normal request path; if the cluster goes down, the deploy controller cannot ship the rollback. The fix is the same — separate channel for control. The Cloudflare post-mortem named this lesson explicitly; without the public post, the lesson would have stayed inside Cloudflare and other platforms would have rediscovered it (expensively) over their next decade.
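A minimal sketch of the separation principle, assuming a hypothetical setup in which the kill switch is a local file kept fresh by an out-of-band channel (a dedicated config daemon, rsync, a separate management network) rather than fetched through the same path the data plane uses; the file path and poll interval are illustrative.

# Control-plane / data-plane separation in miniature: checking the kill switch
# costs one stat() of a local file and never depends on the (possibly saturated)
# request path being healthy.
import os
import threading
import time

KILL_SWITCH_PATH = "/etc/waf/kill_switch"   # hypothetical path, delivered out-of-band
waf_enabled = True

def watch_kill_switch(poll_interval: float = 1.0) -> None:
    global waf_enabled
    while True:
        # Presence of the file disables the WAF: no RPC, no shared thread pool,
        # no dependency on the data plane being able to serve requests.
        waf_enabled = not os.path.exists(KILL_SWITCH_PATH)
        time.sleep(poll_interval)

threading.Thread(target=watch_kill_switch, daemon=True).start()

def handle_request(request: bytes) -> str:
    if waf_enabled:
        pass  # evaluate WAF rules against the request here
    return "200 OK"  # the request is served whether or not the WAF is enabled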
Time-to-publication — why ten days is the right number
Cloudflare published on 12 July, ten days after the 2 July incident. This is not arbitrary. The constraint is that the engineering investigation must converge enough that the post-mortem is accurate (premature publication invites corrections, which damages credibility), but the audience that cares is still attentive (a 60-day delay loses the Hacker News thread, the customer email cycle, and the press cycle). Ten days is roughly the minimum for a rigorous investigation of a global incident and roughly the maximum before public attention dissipates. Indian platforms that publish 60-90 days after the fact lose most of the attention value. The 10-day target is hard to hit if the investigation involves multiple teams, but it is achievable if the post-mortem is treated as a deliverable on the incident's critical path rather than as a follow-up. Razorpay's practice is to assign a post-mortem author at the start of the incident (often the incident commander) so the document is being drafted in parallel with the investigation; this is the operational discipline that makes 10-day publication feasible.
What goes wrong when post-mortems become marketing
The failure mode worth naming is the post-mortem that is technically present but functionally a marketing exercise. The shape: a company publishes a document called "post-mortem" after an incident, but the document contains no verbatim cause, no specific remediation, no quantified impact, and no named author. The document apologises, narrates the incident in vague terms ("we experienced an unexpected condition"), and assures the reader that the team is "working hard to prevent recurrence". Engineering audiences read this and discount it to zero. Worse, the existence of the document forecloses on a real one — the company has "published a post-mortem" and can claim the credit without paying the cost. Several Indian platforms have settled into this anti-pattern; the documents tick the publication checkbox but produce no learning, no trust, and no recruiting benefit. The diagnostic test is simple: does the document let an external engineer reproduce the failure, or at least audit their own system for the same class of failure? If not, the document is marketing, and the engineering culture has been corrupted by the appearance of the practice without the substance. Razorpay's published RCA on the 2022 UPI double-debit issue includes the specific SQL INSERT ... ON CONFLICT that lacked a uniqueness constraint on the (transaction_id, retry_count) tuple — that level of specificity is the test. A reader who works on a similar payments system can audit their own schema for the same gap. A reader of the marketing-shaped post-mortem cannot.
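To make the "audit your own schema" test concrete, here is a toy illustration of the class of gap described above, using sqlite3 from the standard library; the table and column names are hypothetical and stand in for whatever a real payments schema calls them.

# Toy illustration of the schema gap: without a uniqueness constraint on
# (transaction_id, retry_count), a retried debit can insert a second row.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE debits (
        transaction_id TEXT NOT NULL,
        retry_count    INTEGER NOT NULL,
        amount_paise   INTEGER NOT NULL,
        UNIQUE (transaction_id, retry_count)  -- the constraint whose absence is the bug
    )
""")

def record_debit(txn_id: str, retry: int, amount_paise: int) -> bool:
    try:
        conn.execute(
            "INSERT INTO debits (transaction_id, retry_count, amount_paise) VALUES (?, ?, ?)",
            (txn_id, retry, amount_paise),
        )
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate attempt: the idempotency layer rejects it

print(record_debit("TXN123", 0, 50_000))   # True: first attempt recorded
print(record_debit("TXN123", 0, 50_000))   # False: retry of the same attempt is a no-op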
The post-mortem corpus as engineering interview material
A side-effect of Cloudflare's published post-mortem corpus that most companies underestimate: the documents are excellent interview material. Cloudflare's hiring pipeline reportedly assigns candidates a published post-mortem and asks them to identify the structural mistakes (e.g. the kill-switch sharing infrastructure with the failing component) and propose alternative remediations. Candidates who can read a real incident and reason about its mechanisms are the candidates the company wants to hire. Companies without published post-mortems lose access to this hiring signal — they have to use synthetic case studies, which are weaker because the candidate knows they are synthetic and reasons accordingly. The post-mortem corpus is therefore an asset that compounds in two directions: it attracts candidates (the recruiting effect named earlier) and it filters them (the interviewing effect named here). Indian platforms running technical interviews on system design would benefit substantially from publishing a post-mortem corpus, even setting aside the trust and learning effects, purely because it produces stronger interview signal. The cost is the same as for the other benefits; the return adds another channel to the multi-channel ROI of the practice.
Reproduce this on your laptop
# Reproduce the regex-backtracking pathology
python3 -m venv .venv && source .venv/bin/activate
# stdlib only — no extra packages needed for the demo itself (nothing to pip install)
python3 regex_backtracking_demo.py
# Expected: PATHOLOGICAL row times grow exponentially; SAFE row times stay flat.
# Audit your own codebase for the same pattern (Python regex flavour)
pip install regex
python3 -c "
import regex
# regex.compile() with flags=regex.V1 + a timeout protects against ReDoS
p = regex.compile(r'(a|a)*b', flags=regex.V1)
try:
    p.match('a' * 30, timeout=0.5)
except TimeoutError:
    print('caught ReDoS — bounded execution time saved you')
"
The point of running this on your own machine is to internalise the asymmetry: the bad regex's wall-clock time grows by orders of magnitude as the input grows by tens of bytes. A reader who has watched the seconds-column climb in their own terminal does not need to be persuaded that bounded execution time on regex is a deploy-pipeline requirement; the asymmetry is self-evident. This is why post-mortems include reproducible examples — the reader who reproduces the failure on their own machine becomes an evangelist for the fix in their own organisation, at no marginal cost to the publishing company.
Where this leads next
The next chapter (/wiki/google-the-tail-at-scale-paper-revisited) shifts from a single incident to a foundational paper that codified the patterns Cloudflare's post-mortem culture exemplifies. The chapter after that (/wiki/amazon-why-cells-not-clusters) looks at how AWS structures its infrastructure to bound blast radius — a different remediation for the same class of problem (a single bad deploy taking down everything).
The natural next reads are:
- /wiki/discords-elixir-rust-rewrite — the previous chapter, which is the engineering counterpart to the cultural lesson here: Discord's published 2020 retrospective is itself a Cloudflare-style post-mortem of a slow-burn outage (the BEAM GC pause).
- /wiki/case-p99-spike-that-was-a-gc-tuning-flag — the production-debugging chapter that demonstrates the kind of artefact a Cloudflare-style post-mortem ought to ship as appendix.
- /wiki/coordinated-omission-and-hdr-histograms — the measurement discipline that makes incident timelines accurate, which post-mortems rely on.
- /wiki/wall-performance-engineering-is-culture — the curriculum-spanning chapter on culture, of which post-mortems are the most visible artefact.
A reader who has worked through Parts 7 (tail latency) and 15 (production debugging) should now be able to read Cloudflare's 2 July 2019 post-mortem in its entirety and recognise every mechanism named: the regex backtracking (front-end CPU bound), the kill-switch propagation lag (control-plane / data-plane coupling), the canary-deploy gap (offered-load asymmetry between staging and production). The post-mortem is not adding new mechanisms; it is the public record of mechanisms the curriculum has taught, applied to a single 27-minute window.
The deeper take-away is that culture is the precondition for engineering discipline at scale. A platform with strong individual engineers but weak post-mortem culture loses the organisational learning that makes the engineers' work compound. A platform with average individual engineers but strong post-mortem culture converges, over years, on a higher engineering standard than the individual engineer-quality would predict. The Cloudflare engineers in 2019 were not, individually, ten times better than the engineers at any other CDN. The culture they operated inside was. The same is true at every Indian platform that has gotten engineering culture right (Razorpay, Zerodha at the cultural level if not the publication level) and at every one that has not. The work of building the culture is slower than the work of hiring better engineers, but the compounding effect is larger and harder to lose.
Part 16 ends with one more chapter (/wiki/the-30-year-arc-of-systems-performance) that puts the whole curriculum in historical context. The case studies have built the recurring observation: every system that has held its behaviour at scale has done so because someone, at some point, refused the false economy of moving fast without writing down why things broke. The post-mortem is the cheapest and most leveraged form of that writing-down; the platforms that adopt it as practice compound the benefit, and the platforms that do not pay the tax indefinitely.
References
- Cloudflare Engineering — "Details of the Cloudflare outage on July 2, 2019" — the canonical post-mortem this chapter dissects.
- John Allspaw — "Blameless PostMortems and a Just Culture" (Etsy Code as Craft, 2012) — the foundational essay on the blameless post-mortem.
- Davis, Coghlan, Servant — "The impact of regular expression denial of service (ReDoS) in practice" (FSE 2018) — the academic treatment of regex catastrophic backtracking.
- Google SRE Workbook, Chapter 10 — "Postmortem Culture" — the canonical guide to post-mortem mechanics.
- Razorpay Engineering Blog — Incident post-mortems — the closest Indian platform analog to Cloudflare's practice.
- Brendan Gregg, Systems Performance (2nd ed., 2020), Ch. 13 (Case Studies) — the case-study chapter that informs Part 16's structure.
- /wiki/discords-elixir-rust-rewrite — the previous case study, on a slow-burn outage that drove a runtime rewrite.
- /wiki/wall-performance-engineering-is-culture — the curriculum-spanning argument on culture as engineering input.