Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Building the team

It is 03:14 on a Saturday at Yatrika, the rider-positioning service is paging on IngestionRateBreached, and there is no one to call. There is a #platform-help Slack channel, but it is empty at this hour. The on-call SRE for rider-positioning has acked the page and pinged @observability-team, which resolves to a four-person Google Group where two people are asleep in Bengaluru, one left the company eight weeks ago, and the fourth — Aditi — is at a wedding in Pune with her phone on Do Not Disturb. The bill is now growing at ₹1.4 lakh/day from a misconfigured customer_id label that shipped in last Wednesday's release. The cardinality CI gate from chapter 109 does not exist yet because there is no team to own it. Two days from now Aditi will hold a calm meeting with engineering leadership and quietly ask the only question that matters: who, exactly, is the observability team, what are they on call for, and what are they explicitly not on call for? Every answer is a hire, a charter line, or a contract — and at most companies, none of those have ever been written down.

An observability platform team is not three SREs with admin on Grafana — it is a charter, a skill matrix that includes time-series internals and trace-protocol fluency, an on-call rotation that is distinct from the application on-call, and a written contract with every product team about who-pages-whom for what. The team is sized by cardinality and tenant count, not engineering headcount; a 5-person team scales to about 8M active series and 20 product teams before the next hire becomes mandatory. Skip any of these — charter, skill mix, rotation, contract — and the team becomes a Grafana help-desk that everyone resents.

Why most observability teams are mis-shaped at birth

The default origin story is identical at almost every Indian platform org. A senior SRE — usually the one who shipped the first Prometheus instance — accidentally becomes "the observability person". A second SRE joins them when a Loki migration starts. A third joins when traces become non-optional after the first multi-team P1. By the time someone draws an org chart, there is an "Observability Team" box with three names in it, no charter, no on-call definition, and a Slack channel where every product team dumps every dashboard request, alert-tuning question, and "why is my metric missing" debugging session. The team is shaped by the order of crises, not by design — and the shape is wrong in three predictable ways.

Why crisis-shaped teams break: a team formed reactively absorbs every category of work that touched the trigger event. The first SRE got pulled in because Prometheus OOM'd, so now the team owns "metric reliability". The second got pulled in for Loki, so now the team owns "log ingestion". The third got pulled in for tracing, so now the team owns "trace correctness". Six months in, the team owns everything observable, which is unbounded — every PR in every product team can plausibly be routed to them. Without a written charter saying what we do not do, the team becomes a queue with no exit, and the senior engineers leave inside 18 months. The fix is not "hire more"; the fix is to draw the charter line first and grow within it.

The three default mis-shapes:

Mis-shape 1 — the "Grafana help-desk". Every dashboard panel question, every "why does my alert keep firing" debugging session, every Loki query someone forgot how to write — all flow to a single team channel. The team becomes reactive; nobody ships the platform features that would prevent the questions in the first place. A Razorpay-shape platform team trapped in this mode answered ~340 Slack questions/quarter and shipped two platform improvements. After charter rewrite they answered ~80 questions/quarter and shipped eleven improvements. The work was the same volume; the routing was wrong.

Mis-shape 2 — the "on-call for everyone's alerts". Because the team owns the alerting system, every product team's alert that fires also notifies the platform team — "in case it's an observability issue." The platform on-call gets paged ~6× more than any product on-call, burns out within 2 quarters, and the team's hiring funnel collapses because no senior SRE wants to interview for a rotation that pages them at 02:00 for someone else's checkout-API error rate.

Mis-shape 3 — the "skill monoculture". All three SREs are good at running observability tools (Helm, Kubernetes, Prometheus federation, S3-backed Loki). None of them are good at the actual hard parts: time-series internals (why histogram_quantile() is an interpolation), trace protocol fluency (OTLP gRPC retry storms), histogram-and-quantile mathematics (HdrHistogram vs Prometheus bucketed histograms), or telemetry pipeline performance (tuning the OTel Collector batch processor). When a team shows up with a real question — "why does our p99 from histogram look 4× worse than our p99 from raw events?" — the answer is "let me get back to you" forever, because nobody on the team has ever read the HdrHistogram paper. The team is operationally competent and intellectually outgunned.
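
That histogram question is worth making concrete. The minimal simulation below uses assumed latency shapes and deliberately coarse bucket boundaries (none of it is Yatrika's real data); it shows how the bucket-interpolated estimate that histogram_quantile() produces can land roughly 3× above the raw-event p99 once the tail falls into a wide bucket.

# histogram_p99.py — illustrative sketch of the mis-shape-3 question:
# why a bucketed-histogram p99 can sit far above the p99 of the raw events
# when the buckets are too coarse around the tail. Latency shapes and bucket
# boundaries here are assumptions for the demo, not real production numbers.
import numpy as np

rng = np.random.default_rng(7)
latencies_ms = np.concatenate([
    rng.lognormal(np.log(60),   0.3, 98_000),   # bulk of requests, ~40-110 ms
    rng.lognormal(np.log(1200), 0.3,  2_000),   # slow tail, ~0.7-2.2 s
])

raw_p99 = np.percentile(latencies_ms, 99)

# coarse cumulative buckets (upper bounds in ms) with a wide 1s-10s span
bounds = [50, 100, 250, 500, 1000, 10_000]
cumulative = [int((latencies_ms <= ub).sum()) for ub in bounds]
cumulative.append(len(latencies_ms))            # the +Inf bucket
bounds.append(float("inf"))

def bucketed_quantile(q, bounds, cumulative):
    """Linear interpolation inside the target bucket, the same assumption
    histogram_quantile() makes when estimating a quantile from buckets."""
    rank = q * cumulative[-1]
    i = next(j for j, c in enumerate(cumulative) if c >= rank)
    lo = 0 if i == 0 else bounds[i - 1]
    hi = bounds[i]
    if hi == float("inf"):
        return lo                               # Prometheus caps at the last finite bound
    prev = 0 if i == 0 else cumulative[i - 1]
    return lo + (hi - lo) * (rank - prev) / (cumulative[i] - prev)

est_p99 = bucketed_quantile(0.99, bounds, cumulative)
print(f"p99 from raw events       : {raw_p99:7.0f} ms")
print(f"p99 from coarse histogram : {est_p99:7.0f} ms")  # roughly 3x above the raw value

The fix is not a smarter query; it is bucket boundaries chosen by someone who knows where the tail actually lives.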

Figure: three mis-shapes of an observability team and what each one fails at. Panel one, the Grafana help-desk: ~340 inbound questions per quarter against two platform improvements shipped. Panel two, the on-call sink: one rotation paged by every product team's alerts, roughly 6× the typical paging rate. Panel three, the skill monoculture: three Helm-and-Kubernetes SREs with empty slots for TSDB internals, trace protocol, and histogram mathematics. Bottom row: the corrective artefact for each (charter, contract, skill matrix).
Illustrative — three failure modes that almost every observability platform team passes through, and the artefact that corrects each. The corrective is not "hire more SREs"; it is to write the missing artefact (charter, contract, skill matrix) before the next hire.

The corrective in every case is a written artefact, not a headcount request. Charter fixes mis-shape 1 — by enumerating what the team does not do. Contract fixes mis-shape 2 — by giving each product team a documented self-service surface, with a defined escalation path that does not page the platform on-call by default. Skill matrix fixes mis-shape 3 — by making it explicit that the team's hiring loop screens for time-series and trace-protocol fluency, not just operational chops.

The charter — what is, what is not

A platform-team charter is a one-page document. Anything longer is hiding the load-bearing decisions in prose. The shape that works, after the discipline loop of Part 16:

The Yatrika observability platform team's charter (one page, owned by the platform lead, reviewed quarterly).

We own: the metrics, logs, and traces ingestion path; the time-series database (Mimir) and log store (Loki) and trace store (Tempo); the OpenTelemetry Collector fleet; the alerting and notification routing infrastructure; the platform-level SLOs (ingestion latency, query latency, retention SLO); the cardinality CI gate; the cost-discipline review cadence.

We do not own: product-team alert thresholds; product-team dashboards (we provide templates and review them on request); product-team SLOs (we provide the math and review them); the application on-call rotation (we provide the runbook templates); root-cause analysis for product incidents (we provide the trace-and-log access).

We page on: ingestion-rate breach (>30% above 7-day median for 15 minutes), tenant cardinality breach (any team's active-series count exceeds its budget for 1 hour), query-path SLO violation (Mimir p99 query latency >5s for 30 minutes), notification pipeline failure (Alertmanager unreachable for 5 minutes), retention-policy unauthorised edit (any retention config changed without the cost-override label).

We do not page on: any product-team alert that fires; any dashboard rendering as expected; any backfill query someone is running ad-hoc.

Our SLOs to product teams: 99.5% of metrics scraped within 30 seconds of the scrape interval; 99.9% of OTLP-exported spans visible in Tempo within 60 seconds; 99% of LogQL queries < 4s; quarterly cost review delivered within 14 days of quarter-end.
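
The "we page on" list translates almost line-for-line into alert rules, which is the easiest way to keep the charter and the alert set in the same review. The sketch below is illustrative only: the metric names (cortex_*, distributor_*), the tenant_series_budget series, and the exact PromQL are assumptions, not Yatrika's actual rules.

# platform_pages.py — hypothetical sketch: the charter's "we page on" lines as alert rules.
# Metric names, label names, and thresholds are assumptions for illustration.
PLATFORM_PAGES = [
    {
        "alert": "IngestionRateBreached",        # >30% above the 7-day median for 15 minutes
        "expr": ("sum(rate(distributor_received_samples_total[5m])) > 1.3 * "
                 "quantile_over_time(0.5, sum(rate(distributor_received_samples_total[5m]))[7d:1h])"),
        "for": "15m",
        "labels": {"owner": "observability-platform"},
    },
    {
        "alert": "TenantCardinalityBreach",      # any team's active series over its budget for 1 hour
        "expr": "max by (tenant) (cortex_ingester_active_series) > on (tenant) tenant_series_budget",
        "for": "1h",
        "labels": {"owner": "observability-platform"},
    },
    {
        "alert": "QueryPathSLOViolation",        # Mimir p99 query latency above 5s for 30 minutes
        "expr": ("histogram_quantile(0.99, sum by (le) "
                 "(rate(cortex_query_frontend_query_duration_seconds_bucket[5m]))) > 5"),
        "for": "30m",
        "labels": {"owner": "observability-platform"},
    },
    {
        "alert": "NotificationPipelineDown",     # Alertmanager unreachable for 5 minutes
        "expr": 'up{job="alertmanager"} == 0',
        "for": "5m",
        "labels": {"owner": "observability-platform"},
    },
]

The owner label is what keeps these pages on the platform rotation only; the routing example later in this article keys on it.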

The charter does the work of every subsequent hiring decision — when a product team asks "can we have a meeting to debug our alert", the answer is the charter line: "we provide review of alert design on request, we do not own the threshold." When a senior engineer asks "what does this team do that I would want to be hired for", the charter line is the answer. When the company asks "should the observability team get pulled into the IPL final war room", the charter says yes for telemetry-pipeline issues, no for application-layer issues.

Why the "what we do not own" section is the most important: a team's effective load is determined by what it can refuse. Without a written do-not-own list, every product team's request is implicitly accepted, the queue grows monotonically, and the team's strategic work disappears under the operational load. The do-not-own list is what makes the do-own list real. The seniors on the team measure their fit by how often they get to do the do-own work — when that ratio drops below 60%, the seniors leave inside 18 months. Every retention dollar in this team comes from the discipline of the do-not-own list.

The skill matrix — what to actually hire for

A typical platform-team role description says "deep experience with Prometheus, Grafana, Loki, Tempo". That is necessary but not sufficient — it screens for operators, not engineers. The skill matrix below is the actual list, ordered by how rarely each skill appears on a CV. The right team is staffed across the whole matrix; not every engineer needs every skill, but every skill needs at least one engineer.

The hard-to-hire skills are the ones at the top — these are the engineers you usually need to grow internally because the open market does not produce them. The bottom skills are the ones any senior SRE has by their fifth year. The team's character is determined by the top three rows.

# skill_matrix.py — the observability platform team skill matrix
# pip install pandas tabulate
import pandas as pd
from tabulate import tabulate

# Each skill: (name, rarity_on_market, why_it_matters, typical_failure_if_missing)
SKILLS = [
    ("TSDB internals (block layout, WAL, compaction)",
     "very rare",
     "Mimir/VictoriaMetrics tuning, OOM diagnosis, query-path debugging",
     "team escalates to vendor support for issues a 1-day investigation would solve"),

    ("Histogram & quantile mathematics (HdrHistogram, bucketed)",
     "very rare",
     "knows when p99-from-histogram lies, designs bucket boundaries that don't",
     "product teams ship SLOs against interpolated quantiles and miss real regressions"),

    ("Trace protocol fluency (OTLP gRPC, B3, W3C traceparent)",
     "rare",
     "diagnoses retry storms, context-propagation gaps, baggage corruption",
     "missing-spans incidents take 4-6 weeks to root-cause, never fully fixed"),

    ("Sampling theory (head, tail, decision-rate skew)",
     "rare",
     "designs samplers that retain errors at 100% and OK traces at 1% with bounded error",
     "team ships head-based 1% sampling and loses every error trace > 0.99% incidence"),

    ("Time-series compression (Gorilla XOR, delta-of-delta)",
     "rare",
     "predicts storage cost from cardinality and churn rate, sizes clusters honestly",
     "cluster sizing is by guess; bills are 2-3x what they should be"),

    ("Telemetry pipeline performance (OTel Collector batch processor, queue tuning)",
     "moderate",
     "tunes the pipeline to absorb traffic spikes (IPL, BBD) without backpressure to apps",
     "telemetry pipeline becomes the bottleneck during peak; apps page on internal queues"),

    ("Kubernetes operator pattern (CRDs, controllers, admission webhooks)",
     "moderate",
     "ships the cardinality admission webhook, the SLO CRD, the dashboard generator",
     "all platform features are manual config; nothing scales past 20 product teams"),

    ("PromQL / LogQL / TraceQL deep fluency",
     "moderate",
     "writes the recording rules, debugs query plans, teaches product teams patterns",
     "queries are slow; product teams cargo-cult bad patterns; cost spikes from inefficient panels"),

    ("On-call discipline (runbook design, blameless postmortem)",
     "common-ish",
     "runs the rotation cleanly, produces postmortems that change the system",
     "on-call is chaotic; postmortems are blame sessions; the team burns out"),

    ("Helm / Terraform / GitOps operations",
     "common",
     "everyday cluster management; nothing distinctive but everyone needs it",
     "the team can't ship at all"),
]

df = pd.DataFrame(SKILLS, columns=["skill", "rarity", "why_it_matters", "if_missing"])
print(tabulate(df, headers="keys", tablefmt="github", showindex=False))

# Coverage check — does the team cover the matrix?
team = {
    "Aditi":   {"TSDB internals", "Histogram mathematics", "PromQL", "Helm", "Kubernetes operator"},
    "Karan":   {"Trace protocol (OTLP gRPC)", "Sampling theory", "PromQL", "Helm", "On-call"},
    "Dipti":   {"Telemetry pipeline performance", "Kubernetes operator", "Helm", "On-call"},
    "Jishant": {"Time-series compression", "TSDB internals", "PromQL", "Helm"},
    "Riya":    {"On-call", "Helm", "PromQL", "LogQL"},  # joining in 4 weeks
}

# A skill is "covered" if at least one person has the matching keyword in their set
def covered(skill_name, team):
    keys = skill_name.split()[0].lower()  # crude but fine for matrix demo
    for who, sk in team.items():
        if any(keys in s.lower() for s in sk):
            return who
    return None

print("\nCOVERAGE CHECK")
for s in SKILLS:
    holder = covered(s[0], team)
    flag = "ok " if holder else "GAP"
    print(f"  {flag}  {s[0][:55]:55s} -> {holder or '(no one)'}")

Sample run (output trimmed to fit the page):

| skill                                            | rarity      | why_it_matters                | if_missing                         |
|--------------------------------------------------|-------------|-------------------------------|------------------------------------|
| TSDB internals (block layout, WAL, compaction)   | very rare   | Mimir/VictoriaMetrics tuning  | escalates to vendor support       |
| Histogram & quantile mathematics                 | very rare   | knows when p99 lies           | SLOs against interpolated quants  |
| Trace protocol fluency (OTLP gRPC, B3, W3C)      | rare        | diagnoses retry storms        | missing-spans take weeks          |
| Sampling theory (head, tail, decision-rate skew) | rare        | designs error-100% samplers   | loses every error trace           |
| Time-series compression (Gorilla XOR)            | rare        | predicts cost from cardinality| clusters 2-3x oversized           |
| Telemetry pipeline performance                   | moderate    | tunes for IPL/BBD             | pipeline becomes bottleneck       |
| Kubernetes operator pattern                      | moderate    | ships CRDs, webhooks          | manual config; doesn't scale      |
| PromQL / LogQL / TraceQL deep fluency            | moderate    | recording rules, query plans  | slow queries, bad cargo-culting   |
| On-call discipline (runbook, postmortem)         | common-ish  | clean rotation, real PMs      | rotation chaotic, team burns out  |
| Helm / Terraform / GitOps operations             | common      | everyday cluster management   | can't ship anything               |

COVERAGE CHECK
  ok   TSDB internals (block layout, WAL, compaction)          -> Aditi
  ok   Histogram & quantile mathematics (HdrHistogram, bucketed)-> Aditi
  ok   Trace protocol fluency (OTLP gRPC, B3, W3C traceparent)  -> Karan
  ok   Sampling theory (head, tail, decision-rate skew)         -> Karan
  ok   Time-series compression (Gorilla XOR, delta-of-delta)    -> Jishant
  ok   Telemetry pipeline performance (OTel Collector batch)    -> Dipti
  ok   Kubernetes operator pattern (CRDs, controllers, webhooks)-> Aditi
  ok   PromQL / LogQL / TraceQL deep fluency                    -> Aditi
  ok   On-call discipline (runbook design, blameless postmortem)-> Karan
  ok   Helm / Terraform / GitOps operations                     -> Aditi

Walking the load-bearing lines:

  • SKILLS ordered by rarity, not by importance — every skill on this list matters; the ordering is about hiring difficulty. The top four rows are the engineers you grow internally over 12–18 months. The bottom four you can hire in a week. Reading top to bottom tells you the team's recruiting strategy: spend hiring effort on the rare skills, do not waste a senior round on a Helm interview. Why "very rare" skills cannot be hired on the open market: TSDB internals, histogram mathematics, sampling theory, and OTLP protocol fluency are not taught in CS degrees, are not the daily work at most jobs, and are typically learned only by deeply debugging the relevant system over 18+ months. The result is that ~30 engineers in India have all four — and they are not on the market. The team that needs them must either make a multi-quarter offer to one of those 30 (rare), or grow the skill internally by giving someone the time and the problems. The team that does neither ends up paying vendor consulting at ₹40,000/day to answer questions a senior internal engineer would answer for free.
  • covered() returning the person who covers a skill — this is the matrix's actual use. Not "do we cover this skill (yes/no)" but "who is the single point of failure for this skill." Aditi being the named holder for five of the ten rows, including both very-rare ones, means Aditi is the bus-factor risk for the whole team (the bus-factor sketch after this list prints exactly this view). The next hire is not "another SRE"; it is "someone who can grow into one of Aditi's coverages within two quarters." Without the matrix, the team thinks of itself as five generalists; with the matrix, it sees the actual structure.
  • The GAP flag — the matrix's main reason to exist. When a skill row prints GAP, that is a documented incident waiting to happen. The team's quarterly planning treats every GAP as either: (a) buy via consulting until we can grow it, (b) prioritise a hire that closes it, (c) deliberately accept the risk and add the failure mode to the runbook. The discipline is to write down the choice. Most teams skip the matrix and end up at (a) by accident, paying for the consulting nobody approved.
  • Riya's row, {"On-call", "Helm", "PromQL", "LogQL"}, is the new-hire shape: she covers nothing rare yet. Her growth path is set in the matrix: in 12 months she should cover one rare skill, by year two she should cover two. The matrix becomes the engineering-manager 1:1 framework: every quarter, every team member discusses which rare skill they're growing and what evidence would close it. Without the matrix, growth conversations drift to vague "leadership", "communication", "ownership" themes that do not move the team's actual capability.
  • covered keyed on the first word of the skill name — yes, this is crude. The point is to be embarrassingly explicit: if your matrix needs a fancy similarity-match to detect coverage, you have written your skills too vaguely. "TSDB internals" matches "TSDB internals", end of story. The crude implementation forces clean skill names. Why "vague skill names hide gaps": if "observability operations" is a skill on your matrix, everyone covers it, and the matrix tells you nothing. If "Mimir block-layout debugging from a metric-cardinality breach" is on your matrix, exactly one person covers it, and you've found your bus-factor. Skill matrix entries should be specific enough that the GAP cells are surprising and informative — which is exactly when the matrix becomes an engineering-management tool rather than a CV-style decoration.
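
A possible follow-on check, assuming the same SKILLS and team objects from skill_matrix.py are in scope: instead of asking whether a skill is covered at all, list the rows where coverage is exactly one person deep.

# bus_factor.py — hypothetical extension of skill_matrix.py: skills held by exactly one engineer.
def holders(skill_name, team):
    key = skill_name.split()[0].lower()           # same crude first-word match as covered()
    return [who for who, tags in team.items()
            if any(key in t.lower() for t in tags)]

print("\nBUS-FACTOR CHECK")
for name, *_ in SKILLS:
    owners = holders(name, team)
    if len(owners) == 1:
        print(f"  single point of failure: {name[:55]:55s} -> {owners[0]}")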

The on-call rotation — separate from application on-call

The third structural artefact is the rotation. The point is not that the team has on-call (every infrastructure team has on-call). The point is that the rotation is separated from the application on-call rotation, with a different page-set, different runbook, and different escalation path.

How the rotations differ — at a structural level

Figure: application on-call vs observability on-call — different page-sets, different runbooks. Left column, application on-call (per product team): customer-impact pages (payment-success-rate SLO, checkout p99, 5xx burn rate, pod saturation, NPCI-bridge latency), roughly 4–9 pages/week, product-owned runbook, primary/secondary/EM escalation. Right column, observability on-call (platform team): ingestion-rate breach, tenant cardinality breach, Mimir query-path SLO, Alertmanager unreachable, retention edits without the cost-override label, roughly 1–3 pages/week, platform-owned runbook, primary/secondary/lead escalation. A central arrow marks the one-way crossover: application on-call consults the platform rotation only when telemetry itself is broken.
Illustrative — the two rotations cover orthogonal failure surfaces. Application on-call covers customer-visible breakage; observability on-call covers telemetry-pipeline breakage. Crossover is one-way and explicit: when an application incident cannot be diagnosed because telemetry itself is unhealthy, application on-call consults observability on-call. The application incident still belongs to the product team.

The structural separation matters because the two rotations have different urgency profiles, different mean-time-to-resolve targets, and different escalation paths. An application on-call needs sub-5-minute response (a 5-minute outage is a real customer impact). An observability on-call has a 30-minute target — by the time ingestion-rate breaches matter to a real downstream alert, half an hour has passed. Conflating the two rotations forces both to inherit the application's tighter response time, which exhausts a team whose pages genuinely call for slower-tempo investigation.

The escalation path also differs. The application on-call escalates to the EM, then to the director — because customer impact is a leadership-aware event. The observability on-call escalates to the platform-team lead, then to the platform manager — because the resolution is internal-engineering work, not customer-communication work. Why a single shared rotation degrades both: when one rotation covers both surfaces, the on-call engineer optimises for the higher-tempo surface (application) because the page-frequency selects for those reflexes. The observability pages get triaged with application-grade urgency, leading to "the obs team responds in 4 minutes but doesn't fix anything fast" — because the actual fix needs deeper investigation than 4 minutes affords. Splitting the rotations gives observability the breathing room to investigate properly, while keeping application's reflex-tempo intact for customer-visible incidents.
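
One way to make the one-way crossover mechanical rather than aspirational is in the notification routing itself. The sketch below expresses the idea as a plain Python data structure for readability; in a real deployment this lives in Alertmanager route configuration, and the receiver names and the owner label (reused from the page-set sketch earlier) are assumptions.

# routing_sketch.py — hypothetical: two rotations, two routes, no default crossover.
ROUTING_TREE = {
    "receiver": "blackhole",                       # nothing pages anyone by accident
    "routes": [
        {   # the platform's own page-set (see the charter) -> observability rotation
            "match": {"owner": "observability-platform"},
            "receiver": "pagerduty-observability-oncall",
            "group_wait": "2m",                    # slower tempo: 30-minute response target
        },
        {   # every product team's alerts -> that team's rotation, never the platform's
            "match_re": {"owner": "team-.+"},
            "receiver": "pagerduty-product-oncall",
            "group_wait": "30s",                   # customer-impact tempo
        },
    ],
    # Deliberately absent: a route that copies product alerts to the platform rotation
    # "in case it's an observability issue". Crossover happens as a human consult,
    # and only when telemetry itself is unhealthy.
}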

How the team grows — from 3 to 12 over four years

The growth pattern is governed by load metrics, not by company headcount. The two metrics that matter are the active-series count and the number of tenant teams:

| Team size | Active series budget | Tenant teams supported | Charter scope |
|---|---|---|---|
| 1–2 (informal) | ≤500K | 1–3 | a Prometheus instance and a few dashboards |
| 3 (founding) | ≤2M | 4–8 | metrics + logs, single cluster, no SLOs yet |
| 5 (mature small) | ≤8M | 10–20 | metrics + logs + traces, charter written, on-call separate |
| 8 (scaled) | ≤25M | 25–40 | + cardinality CI gate, + cost discipline, + open-source contributions |
| 12 (large) | ≤80M | 50+ | + dedicated TSDB sub-team, dedicated trace sub-team, dedicated SDK/agent team |

The transitions are forced by load, not by ambition. A 3-person team reliably scales to ~2M series. At that point the on-call frequency forces the fourth hire. The fourth hire forces the charter to be written down (because three people can hold the charter in their heads; four cannot). The fifth hire forces the rotation to split (application on-call vs observability on-call). And so on. Skipping a transition does not save headcount; it saves headcount on paper while the existing team burns out and re-hires.
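
The table also works as a quarterly planning check. A small sketch, using the table's illustrative budgets rather than universal constants: given the current active-series count and tenant count, it returns the team size the load implies.

# team_sizing.py — the growth table above as a lookup for quarterly planning.
SIZE_TABLE = [
    # (max_active_series, max_tenant_teams, team_size)
    (500_000,               3,  2),
    (2_000_000,             8,  3),
    (8_000_000,            20,  5),
    (25_000_000,           40,  8),
    (80_000_000, float("inf"), 12),
]

def implied_team_size(active_series: int, tenant_teams: int) -> int:
    """Smallest team size from the table whose budgets cover the current load."""
    for max_series, max_tenants, size in SIZE_TABLE:
        if active_series <= max_series and tenant_teams <= max_tenants:
            return size
    return 12  # beyond the table: time for dedicated TSDB / trace / SDK sub-teams

print(implied_team_size(11_000_000, 28))   # 11M series across 28 tenants -> 8

If the implied size sits two heads above the actual size, the burn-out-and-re-hire cycle described above has usually already started.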

Common confusions

  • "The observability team is a subset of the SRE team." This conflates two functions. The SRE team is application reliability; the observability team is telemetry reliability. They overlap in tooling but the charters are orthogonal. Some companies make obs a subteam of SRE — that works at small scale (≤4M series) but breaks above ~10M series because the SRE charter does not include "build admission webhooks" or "operate a multi-tenant TSDB". When obs becomes its own team, the SRE org is its first customer.
  • "Hiring senior SREs is the same as hiring senior observability engineers." A senior SRE typically can run Prometheus, write PromQL, and configure Grafana — they have operated the stack but rarely have internalised it. A senior observability engineer can additionally tell you why a particular query is slow at the TSDB level, when a histogram quantile is lying, and how an OTLP gRPC retry storm starts. The CV signals overlap; the actual capability does not. Interview specifically for the rare skills.
  • "You can outsource observability to Datadog and avoid building this team." Vendor outsourcing reduces operational load (you don't run the TSDB) but does not eliminate the team. You still need engineers who own the cardinality budgets, the SLO designs, the alert thresholds, the per-team contracts. The team is smaller (3 instead of 8 at the same scale), but it cannot be zero. Companies that try to make it zero discover at the next bill cycle that nobody knows why the bill is what it is.
  • "The platform team should run all the dashboards." The platform team should own templates and review, not every dashboard. If a product team cannot create their own dashboard from a template within an hour, the platform team has built insufficient self-service. The volume of dashboard requests is a leading indicator of whether self-service is working — if the platform team is creating dashboards from scratch, they are doing the wrong work.
  • "The observability on-call is a junior rotation." This is a status hierarchy fallacy. The observability on-call needs senior judgement — the pages are about a multi-tenant system whose breakage affects every product team. A junior on the rotation will look at "Mimir p99 query latency >5s" and not know whether to escalate, restart, or wait. The rotation is expensive in seniority precisely because the page distribution is rare-and-high-stakes rather than frequent-and-routine.
  • "Once the team is staffed, the work stabilises." The work category stabilises; the load grows roughly with cardinality and tenant count, both of which compound. A team that staffed for "current scale" without forecasting six-month cardinality growth will be under-staffed by month four. Plan for the load curve, not the load level.

Going deeper

How big-tech observability teams are structured (and how the Indian-platform shape differs)

At Google's SRE org, the observability function lives inside the SRE team for a given product family — there is not a single "observability team" at company scale because the federated SRE shape pushes ownership down to product surfaces. At Meta, the observability team is split across "ODS" (the metric system), "Scuba" (the log/event system), and "Strobelight" (continuous profiling), each with its own engineering team in the dozens. At Razorpay, Flipkart, Zerodha-shape Indian platforms (≤25M active series, ≤50 tenants), the right shape is one cohesive team of 5–8 engineers with cross-coverage across metrics/logs/traces — not three separate teams. The Indian-platform shape does not benefit from the federation that big tech requires; the splitting overhead exceeds the specialisation benefit until you cross ~80M active series. The mistake to avoid: copying Meta's shape at Razorpay's scale, ending up with three sub-teams of two people each, none of whom can cover for each other on call.

The platform manager — what to hire for

The platform team's manager is a specific role distinct from "SRE manager who also does obs". The hire profile: someone who has shipped at least one production telemetry system end-to-end (not just operated one), who can read a TSDB block layout and a trace-protocol spec, and who can hold a written charter against organisational pressure to expand it. The most common failure mode of this hire is the reverse: a manager who came up through application reliability and accepts every product team's request because that was the SRE-org culture. The charter line "we do not own product-team alert thresholds" is exactly the line such a manager will quietly erode under pressure from a director asking "can the observability team please just fix this for us." The platform manager's job, more than any technical thing, is to hold the charter line.

Why the open-source-contribution metric matters

Mature observability teams (the 8-person shape and above) typically commit at least 5% of their time to upstream contributions — Mimir, Loki, Tempo, OpenTelemetry Collector, VictoriaMetrics. This is not generosity; it is recruiting and capability. Engineers who upstream-fix bugs in the systems they operate become the team's deep-experts on those systems within 12 months. Engineers who only consume vendor releases plateau at "operator" level forever. The teams that do not upstream contribute also struggle to hire senior observability engineers, because the senior pool sees a non-contributing team as a career dead-end. Treat the upstream-contribution time as a recruiting budget line, not a perk.

Reproduce this on your laptop

python3 -m venv .venv && source .venv/bin/activate
pip install pandas tabulate
python3 skill_matrix.py
# Replace the `team` dict with your actual team. GAP rows = hiring backlog.

The matrix becomes a 1:1 framework: each engineer claims growth toward one rare skill per year, and the GAP rows become the hiring backlog. Run it with your actual team next week.

Where this leads next

/wiki/playbooks-post-mortems-and-blameless-culture is the cultural counterpart of this article. Once the team exists with a charter and a rotation, the next question is what the team does when an incident happens — the postmortem culture, the blameless framing, the runbook discipline. The team-shape this article describes is the substrate; the postmortem culture is what runs on it.

/wiki/incident-response-tooling is the tooling layer — war-room channel templates, the incident-commander rotation, SEV-level definitions, comms scripts. /wiki/the-observability-maturity-model places the team-shape on a maturity scale alongside the discipline loop from Part 16: "has a written charter" and "has a separate observability on-call" are concrete maturity checkpoints, alongside a dozen others. Part 17 closes by asking which level your platform is at and which level the next year of work moves you toward.

References