Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

The observability maturity model

It is Wednesday afternoon at PaisaBridge and Vinay, the new head of platform, has been asked by the CTO a question every head of platform eventually faces: "are we mature on observability?" He scrolls through Grafana — 412 dashboards, mostly auto-generated by Helm charts. He scrolls through PagerDuty — 87 alerts fired last week, 71 auto-resolved before any human looked at them. He pulls up Tempo — sampling rate 1%, retention 24 hours, the last three production incidents all had a trace_id in the logs but no actual trace stored. He opens the most recent post-mortem — Action Item 4 from six weeks ago, "instrument the consumer-rebalance path", is still open, assigned to an engineer who left in February. Vinay's honest answer to the CTO is "we have a lot of observability." But the CTO's actual question was different: whether the observability the team has would let them detect, attribute, and recover from a Tatkal-hour-shaped outage at 09:17 IST tomorrow morning. The maturity model is the framework that answers that question — the capability question, not the volume question.

The observability maturity model is a five-level rubric — Reactive, Instrumented, Correlated, Proactive, Adaptive — where each level is defined by mechanical checks (alert-to-page ratio, percent-of-incidents-with-trace, MTTR breakdown, error-budget policy adherence) rather than by tooling inventory. A team is at the highest level whose checks all pass; one failing check drops the team a level regardless of what tools they bought. Maturity is not "did we install Prometheus and Tempo" — it is "when an incident fires at 09:17, do we know within ninety seconds which service caused it, and do our subsequent post-mortem action items actually ship". Use the model to identify the next missing check, not to brag about the level you've reached.

Why a maturity model — and why most are wrong

The default observability "maturity model" most platform teams encounter is a vendor's marketing slide: Level 1: Logs. Level 2: Logs + Metrics. Level 3: Logs + Metrics + Traces. Level 4: All three plus AI. This is tooling inventory pretending to be maturity — it tells you which products to buy, not whether your team can find the cause of an outage at 03:00. A team can be on Level 4 of the inventory model and still spend forty minutes debugging a SEV-1 because their traces have no customer_id baggage, their alerts fire on raw error count instead of burn rate, and their on-call has never opened Tempo before.

A real maturity model measures outcomes the team can verify, not artefacts they own. The five levels below are the version PaisaBridge-shape teams converge on after a few years of running on-call rotations honestly:

Figure: Five levels of observability maturity, with the mechanical check that gates each one. A staircase rises from L1 Reactive (page first, debug from scratch every time) through L2 Instrumented (three pillars present, but siloed), L3 Correlated (trace_id in logs, exemplars, cross-pillar pivot in under a minute), and L4 Proactive (SLOs, burn-rate alerts, error-budget policy) to L5 Adaptive (observability drives architecture, capacity plans, and on-call structure). The gating checks per step: alert-to-page ratio above 0.7; 90%+ of incidents with a trace; MTTR-attribution under three minutes for SEV-2; error-budget policy adhered to in the last quarter; an observability-driven architecture change shipped this quarter. A team is at the highest level whose check passes; one failure drops a level.
Illustrative — the staircase metaphor matters because levels are non-skippable. A team that buys an SLO tool without first achieving cross-pillar correlation will report SLO numbers but won't be able to debug the breaches.

The staircase is non-skippable for a reason. A team that installs SLOs (an L4 artefact) without first achieving cross-pillar correlation (L3) will report burn-rate numbers without the ability to debug why the budget is burning — they will see the number rise and stare at it. The level is not a permission tier; it is a prerequisite chain. The check that gates each level is therefore the check that says "the previous level is actually working". Skip it and the higher level is a Potemkin village.

Why outcome-checks beat tooling-inventory checks: a team can have Prometheus, Tempo, Loki, Pyroscope, and Grafana installed and configured, and still answer "no" to the question "when checkout-api went down at 09:17 last Friday, did we attribute it to the right downstream within three minutes?" The tooling is necessary but not sufficient. The outcome-check measures whether the tooling is being used — whether the on-call's muscle memory routes through it under stress, whether the dashboards are linked from the alerts, whether the trace ID actually appears in the log lines the on-call greps. Outcome-checks force the conversation past procurement into adoption.

The five levels — defined by what the team can do, not what they own

The level definitions below assume a team running a production system that takes customer traffic. A pre-production team or an internal-tools team has different gating criteria; the structure transfers but the thresholds soften.

Level 1 — Reactive. Alerts fire when something is on fire. The on-call investigates from scratch every time, opens kubectl, greps logs, asks in Slack "is anything else broken?". There is no shared instrumentation across services beyond what each service team chose to emit. The default tools are whatever ships with the framework — application logs to stdout, occasional CloudWatch metrics, a Datadog dashboard someone made in 2023. Gating check to leave L1: at least one shared metrics backend (Prometheus, Datadog, Cloud Monitoring) is scraping every service in production, and at least one shared log aggregator (Loki, ELK, CloudWatch Logs) is ingesting every service's logs. The check fails if any production service emits its telemetry only to local files or to a dev-team-private dashboard.

Level 2 — Instrumented. The three pillars exist. Every service emits metrics to a shared Prometheus, structured logs to a shared Loki, and (often, but not always) spans to a shared Tempo or Jaeger. The on-call can pull up a service's RED dashboard and check rate / errors / duration. But the pillars are siloed — a span has a trace_id that does not appear in the corresponding log line, or appears but is not searchable, or appears in some services but not others. The on-call who sees a slow request in the latency dashboard has to context-switch into Tempo and search by service+timestamp, hoping to find the trace that matches. Gating check to leave L2: alert-to-page ratio above 0.7 (i.e. >70% of alerts that fire actually result in a paged human looking at the system, not auto-resolution within five minutes). The check fails the moment your alert noise is high enough that humans are filtering pages — at which point new pages get filtered too, and the next real incident slips through.

Level 3 — Correlated. Trace-ID propagates everywhere. Every log line emitted in the context of a request carries the request's trace_id. Every histogram emitted has exemplars (a span ID attached to a sample) so a Grafana panel showing a p99 spike can be clicked through directly to the actual slow trace. The on-call who sees a spike picks any pillar and pivots to the others in under a minute. Gating check to leave L3: of the last twenty SEV-2-or-higher incidents, at least 18 (90%) had a trace captured for the impacted request, attached to the post-mortem. The check fails if your sampling rate is too aggressive (1% sampling means 99% of incidents have no trace), if your retention is too short (24h retention means an incident debugged the next morning has lost its evidence), or if your tail-based sampler isn't actually keeping the error traces it claims to.
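
What "correlated" looks like at the instrumentation layer: a minimal sketch, assuming an OpenTelemetry SDK plus prometheus_client 0.12+ with the /metrics endpoint serving the OpenMetrics format (exemplars are only exposed in that format). The service name, route, and handler are hypothetical.

# One handler, three pillars, one trace_id: the span, the log line, and the histogram
# exemplar all carry the same ID, so the on-call can pivot from any pillar to the others.
# pip install opentelemetry-sdk prometheus-client
import logging, time
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from prometheus_client import Histogram

trace.set_tracer_provider(TracerProvider())            # span exporters omitted for brevity
tracer = trace.get_tracer("checkout-api")
log = logging.getLogger("checkout-api")
logging.basicConfig(format="%(asctime)s %(levelname)s trace_id=%(trace_id)s %(message)s")

LATENCY = Histogram("http_request_duration_seconds", "Request latency", ["route"])

def handle_checkout(request):
    with tracer.start_as_current_span("POST /checkout") as span:
        trace_id = format(span.get_span_context().trace_id, "032x")
        log.info("checkout started", extra={"trace_id": trace_id})   # log line carries the trace_id
        start = time.perf_counter()
        ...  # business logic
        elapsed = time.perf_counter() - start
        # The exemplar attaches the trace_id to this latency sample, which is what lets a
        # Grafana p99 panel click through to the exact slow trace.
        LATENCY.labels(route="/checkout").observe(elapsed, exemplar={"trace_id": trace_id})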

Level 4 — Proactive. SLOs are defined for every customer-facing service, error budgets are tracked in a shared dashboard, burn-rate alerts (1h × 14.4 + 6h × 6.0 multi-window) have replaced symptom alerts (CPU > 80%, error count > 100). The team has an error budget policy — a written rule about what happens when budget is exhausted. PaisaBridge's: "When a P0 service exhausts its monthly error budget, all feature work pauses and the team works only on reliability until budget recovers." Gating check to leave L4: in the last calendar quarter, at least once a service exhausted its budget and the policy actually kicked in (feature freeze, reliability sprint, executive review). The check fails the moment the budget exhausts and nothing happens — at which point the SLO is theatre, and the team will discover this the day a real customer-facing outage outruns their non-existent reliability discipline.
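
The multi-window condition in parentheses, as a runnable sketch against Prometheus. The http_requests_total metric and its labels are assumptions, and the 5m/30m short windows follow the common SRE-workbook pairing rather than anything PaisaBridge-specific:

# burn_rate_check.py: is this service page-worthy right now under multi-window burn-rate rules?
import requests

PROM       = "http://prometheus:9090"
SLO_TARGET = 0.999                    # 99.9% success over the 28-day window
BUDGET     = 1 - SLO_TARGET           # 0.1% of requests may fail

def error_ratio(service: str, window: str) -> float:
    """Fraction of requests that failed over `window` (metric name and labels assumed)."""
    q = (f'sum(rate(http_requests_total{{service="{service}",code=~"5.."}}[{window}])) / '
         f'sum(rate(http_requests_total{{service="{service}"}}[{window}]))')
    r = requests.get(f"{PROM}/api/v1/query", params={"query": q}, timeout=10).json()
    return float(r["data"]["result"][0]["value"][1]) if r["data"]["result"] else 0.0

def page_worthy(service: str) -> bool:
    # Fast burn: 14.4x the budget over both 1h and 5m, i.e. the 28-day budget gone in roughly two days.
    fast = min(error_ratio(service, "1h"), error_ratio(service, "5m")) > 14.4 * BUDGET
    # Slow burn: 6x the budget over both 6h and 30m, i.e. budget gone in roughly five days.
    slow = min(error_ratio(service, "6h"), error_ratio(service, "30m")) > 6.0 * BUDGET
    return fast or slow

print("page checkout-api?", page_worthy("checkout-api"))

In production these conditions live as Prometheus alerting rules rather than a script, but the logic is the same: the alert pages only when both the long and the short window burn fast enough to threaten the budget.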

Level 5 — Adaptive. Observability data drives architectural decisions. The team has, in the last quarter, shipped a change caused by what observability told them — split a service because tracing showed it was on the critical path of three different request types, moved a workload off a region because latency CDFs showed sustained tail growth, retired a dashboard because nobody opened it during the last six incidents, restructured the on-call rotation because a heatmap showed all SEV-1s clustered between 02:00–05:00 IST and the duty engineers were systematically the most exhausted. Gating check to stay at L5: at least one such observability-driven architecture change shipped per quarter, with a written justification linking the metric/trace/log evidence to the change. The check fails when the team has the data and the dashboards but no longer makes decisions based on them — which is the modal failure mode of teams that reached L5 once and stopped exercising the muscle.

Measuring your level — a Python script that runs the checks

A maturity model that lives only in slides is not a maturity model; it is a poster. The corrective is to run the checks as code. The script below is what a platform team at PaisaBridge runs every Friday afternoon as part of the platform-health review — it pulls live numbers from Prometheus, Tempo, Loki, and the post-mortem tracker, computes each level's gating check, and prints the maturity score:

# maturity_check.py — automated observability maturity assessment
# pip install requests pandas pyyaml prometheus-client python-dateutil
import os, datetime as dt, requests, yaml, pandas as pd

PROM     = os.environ.get("PROM_URL",  "http://prometheus:9090")
TEMPO    = os.environ.get("TEMPO_URL", "http://tempo:3200")
LOKI     = os.environ.get("LOKI_URL",  "http://loki:3100")
PD_TOKEN = os.environ["PD_TOKEN"]   # PagerDuty REST API token
NOW      = dt.datetime.utcnow()

def prev_quarter() -> str:
    """Label of the previous calendar quarter, e.g. '2025Q1' (assumed to match the CSVs' 'quarter' column)."""
    q = (NOW.month - 1) // 3                                  # current quarter, 0-indexed
    year, q = (NOW.year, q - 1) if q > 0 else (NOW.year - 1, 3)
    return f"{year}Q{q + 1}"

# --- L1 → L2 gate: every prod service has metrics + logs in shared backends -----
def l1_gate() -> tuple[bool, str]:
    services = requests.get(f"{PROM}/api/v1/label/service/values", timeout=10).json()["data"]
    expected = set(yaml.safe_load(open("services.yaml"))["production"])  # 67 services declared
    metrics_have = set(services)
    missing_metrics = expected - metrics_have
    log_streams = requests.get(f"{LOKI}/loki/api/v1/label/service/values", timeout=10).json()["data"]
    missing_logs = expected - set(log_streams)
    ok = not (missing_metrics or missing_logs)
    return ok, f"metrics-missing={len(missing_metrics)}, logs-missing={len(missing_logs)}"

# --- L2 → L3 gate: alert-to-page ratio > 0.7 over last 30 days -----------------
def l2_gate() -> tuple[bool, str]:
    end = NOW; start = end - dt.timedelta(days=30)
    pd_headers = {"Authorization": f"Token token={PD_TOKEN}", "Accept": "application/vnd.pagerduty+json;version=2"}
    # Single page (limit=100) shown for brevity; paginate with `offset` if more than 100
    # incidents fired in the window, or the acknowledgement count will be undercounted.
    incidents = requests.get("https://api.pagerduty.com/incidents",
        headers=pd_headers, timeout=15,
        params={"since": start.isoformat()+"Z", "until": end.isoformat()+"Z", "limit": 100, "total": "true"}).json()
    fired      = incidents["total"]                                            # alert events that became PD incidents
    acked      = sum(1 for i in incidents["incidents"] if i.get("acknowledgements"))
    ratio      = acked / fired if fired else 0.0
    ok = ratio > 0.7
    return ok, f"fired={fired}, paged-and-acked={acked}, ratio={ratio:.2f}"

# --- L3 → L4 gate: 90%+ of SEV-2+ incidents have a trace captured ---------------
def l3_gate() -> tuple[bool, str]:
    pms = pd.read_csv("post_mortems_last_quarter.csv")    # pulled from internal tracker
    sev2 = pms[pms["severity"].isin(["sev1","sev2"])]
    with_trace = sev2[sev2["trace_id"].notna() & (sev2["trace_id"] != "")]
    # Verify the trace is still retrievable in Tempo (retention check, not just logged)
    retrievable = 0
    for tid in with_trace["trace_id"]:
        r = requests.get(f"{TEMPO}/api/traces/{tid}", timeout=5)
        if r.status_code == 200 and r.json().get("batches"):
            retrievable += 1
    pct = retrievable / len(sev2) if len(sev2) else 0.0
    ok = pct >= 0.9
    return ok, f"sev2+={len(sev2)}, with-retrievable-trace={retrievable} ({pct:.0%})"

# --- L4 → L5 gate: error-budget policy actually enforced last quarter -----------
def l4_gate() -> tuple[bool, str]:
    pol_log = pd.read_csv("error_budget_policy_log.csv")  # written by SLO bot when budget exhausts
    last_q = pol_log[pol_log["quarter"] == prev_quarter()]
    enforced = last_q[last_q["action_taken"] == "feature_freeze"]
    ok = len(enforced) >= 1                                 # at least one real enforcement
    return ok, f"exhaustion-events={len(last_q)}, freezes-actually-imposed={len(enforced)}"

# --- L5 maintenance gate: 1+ obs-driven architecture change last quarter --------
def l5_gate() -> tuple[bool, str]:
    arch = pd.read_csv("architecture_decisions.csv")        # one row per ADR, has 'driven_by' col
    last_q = arch[arch["quarter"] == prev_quarter()]
    obs_driven = last_q[last_q["driven_by"].str.contains("observability", case=False, na=False)]
    ok = len(obs_driven) >= 1
    return ok, f"adrs-last-quarter={len(last_q)}, obs-driven={len(obs_driven)}"

gates = [("L1->L2", l1_gate), ("L2->L3", l2_gate), ("L3->L4", l3_gate),
         ("L4->L5", l4_gate), ("L5 hold", l5_gate)]
level = 1
for name, fn in gates:
    ok, detail = fn()
    mark = "PASS" if ok else "FAIL"
    print(f"  {name}: {mark}  ({detail})")
    if not ok:
        if name == "L5 hold":
            level -= 1          # failing the maintenance gate drops an L5 team back to L4
        break                   # non-skippable: first failure pins the level
    if name != "L5 hold":
        level += 1              # the hold gate maintains L5; it does not mint a sixth level
print(f"\n  paisabridge observability maturity = L{level}")

Sample run on a Friday at 16:00 IST:

  L1->L2: PASS  (metrics-missing=0, logs-missing=0)
  L2->L3: PASS  (fired=412, paged-and-acked=311, ratio=0.75)
  L3->L4: FAIL  (sev2+=23, with-retrievable-trace=17 (74%))

  paisabridge observability maturity = L3

The team's honest level is L3. They have the SLO dashboards (an L4 artefact), they have the error-budget policy document (an L4 artefact), they even imposed a feature freeze last quarter (an L4-passing event). But six of their last 23 SEV-2 incidents had no retrievable trace — sampling rate too aggressive, or the trace_id was logged but the trace itself aged out before the post-mortem author retrieved it. The level pins to L3, not L4, because the gating check failed. Walking the load-bearing lines:

  • l1_gate reads services.yaml as ground truth. The platform team owns this list — every production service must appear. The check fails the moment a team ships a new microservice without registering it; the failure is not the service's existence but the platform team's loss of visibility. Why a hand-maintained service registry beats auto-discovery: auto-discovery (scraping Kubernetes for pods, Consul for services) finds services that are running but not necessarily ones that are expected. The maturity check needs the inverse — services the org has committed to running with full observability — so it can flag the missing ones. A service that runs without telemetry is the failure mode the gate is designed to catch; auto-discovery would report it as healthy.
  • l2_gate's alert-to-page ratio. PagerDuty's acknowledgements field is the proxy for "a human actually engaged". Auto-resolved alerts (the page fires, then the underlying alert clears within 5 minutes before any human acks) score zero on the numerator. A ratio of 0.5 means half the pages are noise the on-call is filtering — and the next real incident slips into the same filter.
  • l3_gate verifies the trace is retrievable, not just logged. Most teams measure "trace_id present in post-mortem" — that is a weaker check, because the trace_id can be in the log without the trace itself surviving in Tempo. The hard check is GET /api/traces/{tid} returning a non-empty payload. This is how you catch retention-too-short and tail-sampling-misconfigured.
  • l4_gate reads from error_budget_policy_log.csv — a CSV the SLO bot writes to when budget exhausts. The check is that enforcement actually happened, not merely that the budget was exhausted. A team can exhaust budget every month and still pass L4 if the documented policy kicks in each time; a team whose single exhaustion in the quarter produced only a Slack message and no actual freeze fails.
  • l5_gate reads architecture_decisions.csv — the team's ADR (Architecture Decision Record) tracker. The driven_by field is filled in at ADR-write time. Self-reporting is the weakness — but a team that systematically lies on its own ADRs is failing in a more fundamental way the maturity model cannot fix.

The script's most important property is that it runs every week, not at quarterly reviews. Levels can drop. A team at L4 last quarter that fails the burn-rate enforcement check this quarter is at L3 today — and the script tells them that on Friday afternoon, before the next on-call rotation begins.

Where teams plateau — the L3 → L4 cliff

Most platform teams that adopt observability seriously climb L1 → L2 → L3 within their first 12–18 months. The cliff is L3 → L4 — and the cliff is not technical, it is organisational. Reaching L4 requires three things in rough succession that have nothing to do with telemetry:

Figure: The L3 to L4 cliff, and the three organisational prerequisites teams underestimate. From L3 (correlated trace, log, and metric) the climb runs through defining a customer-experience SLO rather than an internal proxy (typically 4 to 6 weeks of cross-functional work), winning product-engineering buy-in to share an error budget and accept feature freezes (the political battle), and getting the error-budget policy signed by the VP-eng (the executive commitment), arriving at L4 Proactive. Teams typically reach L3 by month 12; most cross to L4 between months 18 and 30, often catalysed by a SEV-1 that an error-budget freeze would have prevented. The cliff is people, not pipelines.
Illustrative — Razorpay's published reliability journey, Hotstar's IPL-scale platform writeups, and Zerodha's market-hour SLO discipline all describe variants of this cliff. The technology is months; the organisational alignment is years.

The first prerequisite — defining SLOs that capture customer experience — sounds easy and is not. The wrong SLO is "API p99 latency < 200ms" because the API can be 50ms while the customer experience is broken (the page loads but the "Pay" button doesn't respond). The right SLO at PaisaBridge is "of customer-initiated payment attempts in the last 28 days, ≥99.9% reach a final success or failure state within 8 seconds end-to-end". Defining that took the platform team six weeks of meetings with product, with mobile-engineering, with the data team — because the SLI requires instrumenting the customer journey, not the API. Most teams stop at API-latency SLOs and quietly fail the gate.

Why customer-experience SLOs are harder than API SLOs: an API SLO only requires instrumenting one process — the API service emits a histogram, you set a threshold. A customer-experience SLO requires joining events across the mobile client, the API gateway, the payment microservice, the bank's webhook callback, and the final notification. Each hop is a separate team's instrumentation, each clock has its own drift, and the "did the customer see success" event lives in the mobile client where your tracing usually doesn't reach. The six-week timeline is mostly negotiating with the mobile team to emit a payment_journey_complete event with the right trace_id baggage, and with the payments team to bridge their internal request_id to that trace_id. The technical work is small; the cross-team alignment is the cost.
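
Once the mobile and payments teams emit those events, the SLI computation itself is small. A minimal sketch, assuming two hypothetical event exports keyed on the shared trace_id (file and column names invented for illustration):

# journey_sli.py: end-to-end payment SLI from journey events (hypothetical feeds)
import pandas as pd

starts    = pd.read_csv("payment_journey_started.csv")     # columns: trace_id, started_at (epoch seconds)
completes = pd.read_csv("payment_journey_complete.csv")    # columns: trace_id, completed_at, final_state

journeys = starts.merge(completes, on="trace_id", how="left")   # left join: abandoned journeys count as bad
journeys["elapsed_s"] = journeys["completed_at"] - journeys["started_at"]

# Good event: a final success-or-failure state reached within 8 seconds end to end.
good = journeys[journeys["final_state"].isin(["success", "failure"]) & (journeys["elapsed_s"] <= 8.0)]

sli = len(good) / len(journeys) if len(journeys) else 1.0
print(f"28-day journey SLI = {sli:.4%}  (target >= 99.9%)")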

The second prerequisite — product buy-in for shared error budget — is the political battle. Feature teams have quarterly OKRs around shipping new features. An error budget policy that says "feature work pauses when budget exhausts" cuts directly into those OKRs. The platform team cannot impose this from below; it requires the VP-eng (or sometimes the CTO) to commit in writing that reliability outranks features when the budget is empty. Teams whose VP-eng will not put this in writing are L3 forever, regardless of what tools they install.

The third prerequisite — the written policy itself — is the document that converts SLO theatre into SLO discipline. Razorpay-shape published reliability programs include policy text such as: "When any P0 service exhausts its 28-day error budget, all feature deploys to that service are blocked at the CI level until 50% of budget is restored. The platform team has merge-block authority; the VP-eng is the only escalation path." That sentence is what makes L4 real. Without it, the budget is a number on a dashboard nobody enforces.
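
What "blocked at the CI level" can look like in practice: a hedged sketch of a pipeline step, where slo:error_budget_remaining_ratio is a hypothetical recording rule maintained by the SLO bot, and the "blocked until 50% restored" rule is simplified to a plain threshold (a real gate needs hysteresis so a half-spent budget alone does not block deploys):

# ci_budget_gate.py: runs as a pipeline step before deploy; a non-zero exit blocks the deploy
import os, sys, requests

PROM    = os.environ.get("PROM_URL", "http://prometheus:9090")
SERVICE = os.environ["SERVICE"]                    # e.g. checkout-api, injected by the pipeline

q = f'slo:error_budget_remaining_ratio{{service="{SERVICE}"}}'   # hypothetical recording rule
resp = requests.get(f"{PROM}/api/v1/query", params={"query": q}, timeout=10).json()
remaining = float(resp["data"]["result"][0]["value"][1]) if resp["data"]["result"] else 1.0

if remaining < 0.5:    # simplified stand-in for "blocked after exhaustion until 50% is restored"
    print(f"DEPLOY BLOCKED: {SERVICE} has {remaining:.0%} of its error budget left")
    sys.exit(1)
print(f"deploy allowed: {SERVICE} error budget remaining = {remaining:.0%}")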

Common confusions

  • "We bought Datadog so we are mature." Tooling is the L1 → L2 prerequisite, not the maturity itself. A team with Datadog whose alert-to-page ratio is 0.3 is L1 with a Datadog logo. A team with self-hosted Prometheus + Tempo + Loki whose ratio is 0.85 is L2 properly. Vendors sell tools; maturity is the team's use of the tools, not the procurement.
  • "You can skip levels by hiring an SRE." Levels are non-skippable because each level's check is the previous level being measurable. A new SRE who imposes an L4 SLO policy on a team that hasn't reached L3 (cross-pillar correlation) ends up with burn-rate alerts the team cannot debug — they see the budget burning, they cannot find why. The SRE has imposed L4 artefacts without L3 practice; the team falls back to L1 reactive debugging when budget burns, and the SLO becomes theatre.
  • "Reaching L5 is the goal." Reaching L5 is the natural top of the model, but maintaining L4 reliably for years is harder than reaching L5 once. Many teams hit L5 during a quarter when the architecture review catches an obs-driven insight, then drift back to L4 (or L3) the next quarter when the muscle isn't exercised. The maturity model is a position, not a destination — you can lose levels.
  • "The model is the same for every company." The thresholds soften for a 20-engineer startup with a single service: alert-to-page ratio > 0.7 is the same number, but achieving it is easier with one service than with 67. The structure transfers; the gradient of effort to climb each step depends on org size, traffic volume, and risk profile. A trading platform like Zerodha has stricter thresholds at L4 (burn-rate windows in seconds, not minutes) because the customer-impact-per-minute is two orders of magnitude higher than a content site.
  • "Maturity model is for managers, not engineers." The model is the engineer's tool for arguing for time to invest in observability. "We are L2 because our alert-to-page ratio is 0.4; the next thing I need to ship is alert pruning" is a sentence that gets time on the roadmap. "We need better observability" does not. The model is a vocabulary for the conversation, and the gating checks are the evidence.
  • "The platform team owns maturity." The platform team owns the infrastructure for maturity (the metrics backend, the trace backend, the SLO bot). The service teams own the practice — whether their alerts are tuned, whether their SLOs are customer-facing, whether their post-mortems generate action items that ship. A platform team can build L4 infrastructure into a service team that uses it at L1 — and that team is L1, regardless of what the platform team built.

Going deeper

How Google's Site Reliability Engineering vocabulary maps to the levels

Google's SRE program, articulated in the SRE Book and Workbook, popularised much of the L4 vocabulary — SLI, SLO, error budget, blameless post-mortem, multi-window burn-rate. But Google's writing assumes the reader is already at L3 — that traces are correlated, that pillars are integrated, that the team has telemetry hygiene. Many readers attempt to install L4 vocabulary on top of an L1 organisation and discover the SLO numbers don't lead to action because the underlying instrumentation is too noisy to trust. The maturity model in this article is the staircase the SRE Book skips. Read the SRE Book; then read it again with the maturity-level lens, and you will see which sentences are about climbing the cliff and which are about maintaining steady-state at L4 / L5.

Razorpay, Zerodha, Hotstar — published maturity stories

Razorpay's engineering blog has, over the last few years, documented their journey from per-service alerting (L1–L2) to a unified error-budget-policy framework (L4) — the most-cited inflection was when they replaced 1,200 raw threshold alerts with ~80 burn-rate alerts and saw their on-call escalation rate drop ~70%. Zerodha Kite has written about their market-open SLO at 09:15 IST — they hold a stricter SLO in the 15-minute window around market-open than during off-hours, with separate burn-rate budgets, because customer-impact in those minutes is concentrated. Hotstar's IPL-scale platform writeups describe an L5 pattern: their post-IPL-2024 architecture review explicitly cited tracing-derived hot-spot data as the reason for splitting one of their core services. These published stories are not perfect maps to the model in this article, but the shape — gradual climb, organisational cliff at L4, intermittent L5 — recurs.

The pre-product-launch case — when the model bends

A pre-launch team building toward a product release has no customer traffic, no incidents, no error budget worth speaking of. Applying this model to them would falsely classify them as L1. The corrective is to soften the customer-traffic dependent gates — alert-to-page ratio is not measurable without alerts firing, SLO compliance is not measurable without traffic — and substitute synthetic equivalents (load tests with injected failures, chaos-engineering exercises, dark-launch traffic). The pre-launch team is climbing the same staircase, but their gating checks are about infrastructure readiness for the launch rather than track record from production. After launch, the gates revert to the production-traffic-driven version.

Reproduce this on your laptop

docker run -d -p 9090:9090 prom/prometheus
# mount a tempo config if your image does not ship one at /etc/tempo.yaml
docker run -d -p 3200:3200 -v $PWD/tempo.yaml:/etc/tempo.yaml grafana/tempo:latest -config.file=/etc/tempo.yaml
docker run -d -p 3100:3100 grafana/loki:latest
python3 -m venv .venv && source .venv/bin/activate
pip install requests pandas pyyaml prometheus-client python-dateutil
# Populate services.yaml with your declared services; populate post_mortems_last_quarter.csv
# from your post-mortem tracker; populate error_budget_policy_log.csv from your SLO bot;
# populate architecture_decisions.csv from your ADR repo.
PD_TOKEN=<token> python3 maturity_check.py

You will get a line per gate, pass or fail with the underlying number, and a final level. The script is intentionally short — under 100 lines of Python — because the maturity model is meant to be auditable by anyone on the team, not a black box owned by the platform team.

Where this leads next

/wiki/the-30-year-arc places the maturity model in historical context — the levels themselves have shifted as the industry's expectations have. What was L4 in 2015 (you have an SLO) is L3 today; what was L5 in 2020 (you have eBPF in production) is L4 today. The arc article describes how the bar moves and why.

/wiki/playbooks-post-mortems-and-blameless-culture is the sister discipline to maturity — without functioning post-mortems and action-item follow-through, the maturity model has no engine to climb. The maturity model tells you where you are; the post-mortem ritual is how you climb.

/wiki/incident-response-tooling is the load-bearing tooling layer that L3 and L4 require. Without an /incident bot, a timeline bot, and severity-driven escalation, the team cannot consistently produce the post-mortem evidence the maturity-model checks read from.

References