Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Dashboard anti-patterns

It is 02:46 IST and Aditi, an SRE at a hypothetical Mumbai-based stock-broking platform we will call DalalKite, is on her third PagerDuty page of the night. The first two auto-resolved before she could kubectl get pods. The dashboard her team has been building for fourteen months — kite-trading-overview — is open on her laptop. It has 47 panels arranged across seven rows. Six of them are red. None of the six match the alert. The alert says OrderRouterErrorBudgetBurning{service="order-router", window="1h"} and the dashboard's only error-rate panel is in row four, position three, sandwiched between kafka_lag_seconds and redis_evictions_per_second. The error-rate panel is showing 3.2% — well above the SLO of 0.5% — and has been showing that for the last 28 minutes. Aditi did not see it because the panel is 180 pixels wide on her 27-inch monitor and her eye went to the loud red CPU panel three rows above instead. The dashboard is technically populated correctly. Every metric flows. Every alert is wired. The dashboard is also unreadable at 02:46, by a sleep-deprived engineer, in 10 seconds, which was the only job it had.

Dashboard anti-patterns are not bugs in the metrics — they are failures of layout, hierarchy, and reading-order. The seven recurring shapes are: panel-soup overcrowding, the wrong-floor decision (USE where RED belongs and vice versa), single-tenant blindness on multi-tenant services, gauge-only panels with no rate or burn-rate context, the always-green dashboard from over-aggregation, the alerts-without-panels gap, and the hero dashboard that grew 3-deep without ever being demoted to drill-down. Each fails the 10-second test. Each has a structural fix.

What an anti-pattern actually is — and why dashboards have so many

An anti-pattern is not a mistake; it is a recurring shape of mistake that survives despite repeated correction because the local incentives reward it. Dashboards collect anti-patterns at unusually high rates because dashboards have three properties that almost no other artefact in your system shares: they are additive (it is always easier to add a panel than to remove one), they are audience-ambiguous (the same Grafana URL is read by leadership in a status meeting and by on-call at 02:46), and they are trust-inheriting (the dashboard your predecessor built is presumed correct because it has been there for two years). Anti-patterns thrive in systems with these three properties. Code review, by contrast, has none of them — code is subtractive (removed lines are normal), audience-narrow (engineers read code, leadership does not), and trust-checked (every change touches a reviewer). Dashboards drift; code does not, on the same time scale.

The dashboard anti-patterns named in this chapter were observed across 40+ post-mortems from Indian fintech, OTT, and logistics teams over 2022-2026. The list is not exhaustive — every company adds at least one company-specific shape — but the seven below cover roughly 80% of the dashboard-as-contributing-factor outages in that sample. The other 20% are genuinely company-specific; if your team has built a dashboard culture that produces a novel anti-pattern, name it and add it to your internal style guide. The ones in this chapter are the cross-cutting ones that show up regardless of stack, language, or scale.

Figure: Seven dashboard anti-patterns and the failure mode each one produces — a 4-by-2 grid of cards, one per anti-pattern. (1) Panel soup: 40+ panels, no visual hierarchy; the eye cannot find the signal in 10 seconds. (2) Wrong floor: USE where RED belongs or vice versa; the dashboard stays green during the exact outage it should catch. (3) Single-tenant blindness: aggregates hide one tenant's pathology; the noisy-neighbour case. (4) Gauge only: no rate, burn-rate, or comparison line; slow drift is invisible. (5) Always green: regional or service-wide averages wash out per-pod failures; 1 broken pod in 200 is invisible. (6) Alerts-without-panels: the alert fires on a metric the dashboard does not display; on-call cannot triage. (7) Hero dashboard: grew three levels deep without ever being demoted; floor and drill-down collapse into one. A footer panel gives the 10-second test (can on-call find the floor signal in 10 seconds?) and the structural fix shared by all seven: a 3-tier hierarchy of tier-1 floor (≤6 panels), tier-2 service-shape (RED+USE), tier-3 drill-down (per-tenant, per-pod, per-region). The anti-patterns are mostly orthogonal; most teams have 2-3 simultaneously, and they are listed in reading order, not severity.
Illustrative — the seven recurring shapes of dashboard failure. Most production dashboards exhibit two or three simultaneously; the structural fix is the same in every case (a tier-1 floor that obeys the 10-second budget).

Before walking through the seven, here is the unifying test that separates a working dashboard from a broken one. The 10-second test: a sleep-deprived on-call engineer, just woken by PagerDuty, opens the dashboard. In 10 seconds — not 30, not 60 — can they answer the question "is the service healthy or not, and if not, where is the failure?". If yes, the dashboard works. If no, at least one anti-pattern lives in it. The 10-second budget is not arbitrary; it is the human-attention budget for a person who has been awake for 4 minutes after deep sleep and is pre-coffee. Above 10 seconds the engineer starts making decisions on partial reads, which is how you get post-mortems with the phrase "we initially misdiagnosed this as a regional capacity issue based on the dashboard's first impression". Every anti-pattern below is a failure mode of the 10-second test.

Why 10 seconds and not, say, 30: studies of incident-response cognition (Allspaw, Beyer, Cook) consistently find that the gap between page-fired and first-mitigation-action averages 7-15 minutes. Of those minutes, the engineer spends 30-60 seconds reading the dashboard before deciding the next action. Within that reading budget, the first 10 seconds determine the engineer's hypothesis — what they think is broken — and the remaining seconds are spent confirming or refining that hypothesis. A dashboard that takes 25 seconds to read is a dashboard that has already let a wrong hypothesis form by second 10. The 10-second test is the first-impression budget, and first impressions in incident response are sticky.

The seven anti-patterns, walked through

Anti-pattern 1: panel soup

The shape. The dashboard has 40+ panels arranged in 6-8 rows. Panels are 180-280 pixels wide; row heights are inconsistent. The dashboard scrolls vertically — sometimes for 3-4 viewports. There is no visual hierarchy. Every panel is the same size, the same border, the same colour scheme. The reader's eye has nowhere to rest first.

The pathology. The on-call engineer cannot answer the 10-second question because their eye spends the entire budget scanning for the relevant panel rather than reading any panel. The eye is drawn to colour intensity, not to layout — so the brightest red panel wins attention, regardless of whether that red panel is the floor signal or a drill-down detail. The DalalKite outage from the lead paragraph is panel soup: the floor error-rate panel was 3.2% (red and high-stakes) while a drill-down CPU panel was at 87% (red and visually loud), and the engineer's eye went to the louder one.

Where it comes from. Panel soup is the default outcome of a team that adds panels over time without ever removing them. Every incident produces a "we should have a panel for X" follow-up; every quarter the dashboard grows; the dashboard is never refactored because removing a panel is politically harder than adding one ("but I added that panel after the September outage, you can't remove it"). The Grafana cultural artefact of "version-controlled dashboards" makes this worse — the JSON is in git, the panels accumulate, deletes are rare.

The fix. Three structural moves. First, cap the floor at 6 panels — anything more is a drill-down. Second, demote every panel that does not directly answer "is the service healthy". Third, kill the dashboard if it has 40+ panels — split it into a floor dashboard (6 panels, link to drill-downs) and a per-component dashboard (one per component, each with 6-12 panels). The floor dashboard is the one PagerDuty links to. The drill-downs are reached by clicking a panel.

Anti-pattern 2: wrong-floor decision

The shape. The dashboard exists. It has clean panels. It uses one of the named methods (USE, RED, four golden signals). The problem is that the method chosen does not match the service shape — USE on a stateless web API, or RED on a GPU transcoding cluster, or RED on a database with no USE drill-down. We covered this in detail in the previous chapter; naming it as an anti-pattern here is about recognising it as a first-class layout failure, not a metrics-coverage one.

The pathology. The dashboard goes green during the exact class of outage it should catch. A USE-only dashboard for a payments-router stays green during every NPCI-slowness incident because the local resources are fine while every user request is timing out. A RED-only dashboard for a transcoding-pipeline stays green during every GPU-saturation incident because the rate of jobs completed is unchanged but the queue is growing. The signal is not "the metric is missing"; the signal is "the metric is in the wrong place in the visual hierarchy".

Where it comes from. Team-tradition house style — the team's dashboarding conventions came from a previous job and were applied uniformly. A team that came up on Brendan-Gregg-style USE dashboards for Sun servers will reflexively build USE dashboards for Kubernetes microservices; a team that came up on Tom-Wilkie-style RED dashboards for stateless APIs will reflexively build RED dashboards for stateful databases. Neither team is wrong about their first context; they are wrong about applying it uniformly.

The fix. Choose the floor by service shape, per the decision tree — RED at floor for user-facing services, USE at floor for hardware-bound and batch, both at floor for databases and storage. Audit existing dashboards quarterly; the right floor for a service at year zero (RED) often becomes wrong by year three when the service has grown a Kafka consumer (now needs RED-with-lag) and a Redis cache (now needs USE drill-down).
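As a concrete rendering of that decision, here is a minimal Python sketch of the floor-choice tree. The service-shape names and the floor_method helper are illustrative inventions for this chapter, not a standard taxonomy; the point is that the floor is a function of shape, not of house style.

# floor_method.py - illustrative sketch of the floor-choice decision tree.
# Shape names and categories are hypothetical; adapt them to your own service taxonomy.

def floor_method(shape: str) -> str:
    """Return which method belongs on the tier-1 floor for a given service shape."""
    user_facing = {"stateless-api", "payments-router", "web-frontend"}
    hardware_bound = {"gpu-transcoder", "batch-pipeline", "ml-training"}
    stateful = {"postgres", "kafka-consumer", "redis", "object-store"}

    if shape in user_facing:
        return "RED at floor, USE as drill-down"
    if shape in hardware_bound:
        return "USE at floor, RED as drill-down"
    if shape in stateful:
        return "RED + USE both at floor"
    return "unknown shape: decide explicitly, do not inherit house style"

if __name__ == "__main__":
    for s in ("payments-router", "gpu-transcoder", "postgres", "cron-job"):
        print(f"{s:18s} -> {floor_method(s)}")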

Anti-pattern 3: single-tenant blindness

The shape. The dashboard aggregates metrics across all tenants of a multi-tenant service. The error-rate panel is sum(rate(requests_total{status="500"}[5m])) / sum(rate(requests_total[5m])) — one global number. The latency panel is histogram_quantile(0.99, sum(rate(latency_bucket[5m])) by (le)) — one global p99. The dashboard looks healthy because the average tenant is healthy.

The pathology. One tenant's pathology is hidden by the average. A SaaS company running a Postgres-based analytics service for 200 tenants will have a global error-rate of 0.4% (well below SLO) while one specific tenant — say tenant-id=acme-corp — has an error-rate of 38% because their queries are full-table-scanning a table that does not fit in the buffer pool. The aggregated dashboard never goes red. The tenant escalates through customer success six hours later. The post-mortem identifies "no per-tenant breakdown on the dashboard" but the deeper failure is the aggregation choice — the dashboard's units are wrong for the service shape.

Where it comes from. Aggregation is the default in PromQL, in CloudWatch, in Datadog, in every metrics tool. Adding a by (tenant_id) clause feels like an "advanced" move; teams default to the aggregate-without-by because that is what the docs show first. Cardinality concerns reinforce the default — tenant_id as a label can produce 10K+ series and a careful team is reluctant to add it without thinking.

The fix. For multi-tenant services, the floor dashboard must include a worst-tenant panel, not just an aggregate. The query is topk(5, sum(rate(requests_total{status="500"}[5m])) by (tenant_id) / sum(rate(requests_total[5m])) by (tenant_id)) — the top 5 tenants by error rate, regardless of whether they are 1% or 50% of total traffic. This panel goes red when any tenant's error rate is bad, not just when the aggregate is bad. The display cost is bounded: 5 series shown even if 200 tenants exist, because topk collapses the result to 5. The storage-side cardinality cost of carrying tenant_id as a label is real but separate; see the common confusions section below. Pair it with a tenant-clickthrough that filters every other panel by tenant_id for drill-down.
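A sketch of how the worst-tenant panel can be generated rather than hand-typed, in the same Python style as the audit script later in this chapter. The metric and label names come from the paragraph above; the pared-down target dict follows Grafana's JSON model but is illustrative, not a complete panel definition.

# worst_tenant_panel.py - sketch: build the tier-1 worst-tenant panel target.
# Metric/label names are the chapter's examples; the target dict is a minimal
# illustrative subset of Grafana's panel JSON, not a full panel.

def worst_tenant_target(metric="requests_total", label="tenant_id", k=5, window="5m"):
    expr = (
        f'topk({k}, '
        f'sum(rate({metric}{{status="500"}}[{window}])) by ({label}) '
        f'/ sum(rate({metric}[{window}])) by ({label}))'
    )
    return {
        "expr": expr,
        "legendFormat": "{{" + label + "}}",  # one legend line per tenant, at most k lines
    }

if __name__ == "__main__":
    print(worst_tenant_target()["expr"])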

Anti-pattern 4: gauge-only panels with no rate or burn-rate context

The shape. Panels show point-in-time gauge values — current memory used, current connection count, current queue depth — without a comparison reference. No rate-of-change line, no historical baseline, no SLO threshold marker, no burn-rate annotation. The number is on the screen, but the number-in-context is not.

The pathology. Slow drift is invisible. A memory gauge that reads 12.4 GB at 14:00, 13.2 GB at 16:00, 14.1 GB at 18:00, 14.9 GB at 20:00, and 15.6 GB at 22:00 is heading toward an OOM at midnight, but each individual reading looks fine on its own. The gauge panel does not show the slope; it shows the last value. The on-call engineer who looks at the dashboard at 21:00 sees "memory: 15.2 GB / 16 GB" and thinks "fine". The OOM at 23:43 IST is a surprise that should not have been.

Where it comes from. Gauge panels are easier to build than rate-of-change panels — metric_value is a single PromQL expression, while a slope or day-over-day baseline such as delta(metric_value[1h]) or metric_value - metric_value offset 24h takes deliberate query-writing. Grafana's default panel type is the time-series line, which does show the slope, but engineers building dashboards often switch to the stat panel or single-stat for a "cleaner look", which throws away the slope. The aesthetic preference (cleaner) wins over the diagnostic value (slope-visible).

The fix. Every gauge panel pairs with at least one of: (a) a sparkline showing the last 24h of the value, (b) a baseline line showing the value at the same time-of-day the previous week, (c) a burn-rate computation showing how fast the gauge is approaching its limit, or (d) an SLO threshold marker. The cleanest pattern is (a) + (d) — sparkline for trend visibility, threshold for context. The single-stat-with-no-context panel is banned from floor dashboards; it can live in summary tiles only if the tile also shows a directional arrow (↑ rising, ↓ falling, → stable).
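The threshold half of that rule can be retrofitted mechanically. A sketch, writing the same fieldConfig.defaults.thresholds.steps path the audit script later in this chapter checks; the warn/crit values and the title-matching rule are illustrative placeholders, not a recommendation.

# add_gauge_context.py - sketch: give bare stat/gauge panels a threshold marker.
# Writes the same fieldConfig.defaults.thresholds.steps path the audit script reads.
# The warn/crit limits and the "memory"-in-title filter are illustrative only.
import json, sys

def add_thresholds(panel, warn, crit):
    defaults = panel.setdefault("fieldConfig", {}).setdefault("defaults", {})
    defaults["thresholds"] = {
        "mode": "absolute",
        "steps": [
            {"color": "green", "value": None},    # base colour below the warn step
            {"color": "yellow", "value": warn},
            {"color": "red", "value": crit},
        ],
    }
    return panel

if __name__ == "__main__":
    dash = json.load(open(sys.argv[1]))
    for p in dash.get("panels", []):
        if p.get("type") in ("stat", "singlestat", "gauge") and "memory" in p.get("title", "").lower():
            add_thresholds(p, warn=12 * 2**30, crit=15 * 2**30)  # bytes; hypothetical limits
    json.dump(dash, sys.stdout, indent=2)

The sparkline half is a panel-type choice rather than a single field to patch, which is why the floor ban on context-free single-stats stays a review-time rule even with a script like this in CI.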

Anti-pattern 5: the always-green dashboard from over-aggregation

The shape. The dashboard aggregates across pods, hosts, regions, or availability zones — one panel per service rather than one panel per pod-of-service. The error-rate is sum(rate(...)) across all 200 pods. The latency p99 is histogram_quantile(0.99, sum(rate(...)) by (le)) — one global p99 across the fleet.

The pathology. A single broken pod is invisible. A payments-router deployment with 200 replicas where one pod is at 100% error rate (because of a corrupted local config or a hung downstream connection on that one pod) has a fleet error-rate of 0.5% — below the SLO threshold of 1%. The dashboard stays green. The 0.5% of users hitting that one pod see 100% errors and escalate; on-call sees a healthy dashboard and doubts the escalation. The post-mortem identifies "no per-pod error-rate panel" — but the aggregation choice was a floor decision that was never re-examined.
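The masking arithmetic is worth writing out once, because the fleet-average number looks reassuring right up until you decompose it; a trivial sketch using the 200-pod figures from the paragraph above:

# fleet_masking.py - why one fully-broken pod disappears into a 200-pod average.
pods = 200
healthy_error_rate = 0.0   # 199 pods serving cleanly
broken_error_rate = 1.0    # 1 pod failing every request it receives
fleet_error_rate = (199 * healthy_error_rate + 1 * broken_error_rate) / pods
print(f"fleet error rate: {fleet_error_rate:.1%}")   # 0.5%, under a 1% SLO threshold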

Where it comes from. Aggregation is also the default for count-of-pods: as the fleet grows from 5 pods to 200 pods, the dashboard does not grow with it. A panel with 5 lines (one per pod) is readable; a panel with 200 lines is unreadable. The team aggregates because they have to — the alternative (200 lines on one panel) is worse. The error is not aggregating; it is only aggregating, with no parallel "worst pod" view.

The fix. Same shape as anti-pattern 3: the floor dashboard must include a worst-pod panel using topk(3, ...) or bottomk(3, ...). Three lines is readable; the worst pod is always visible. When a single pod's error-rate goes to 100%, the topk panel goes from showing three pods at 0.3-0.8% (background noise) to showing one pod at 100% (red, attention-grabbing). The aggregation panel still exists for the fleet-wide story; the topk panel is what catches single-pod pathology.

Anti-pattern 6: alerts-without-panels

The shape. PagerDuty fires an alert. The on-call engineer opens the dashboard. The metric the alert fires on does not appear anywhere on the dashboard. The engineer has to leave the dashboard, open Grafana's explore view, type the PromQL expression from the alert rule, and look at the raw query result in a one-off panel. They cannot do triage; they have to do exploration.

The pathology. The alert fires on kafka_consumergroup_lag_seconds{group="payments-cdc"} > 60 and the dashboard shows kafka_lag_messages (lag in messages, not seconds). The two are correlated but not identical — message-lag depends on message size and consume rate and can look modest while time-lag is already large; seconds-lag is what the SLO is denominated in. The on-call engineer looks at the dashboard's message-lag panel, sees it slightly elevated, and concludes "the alert is noise". 23 minutes later the actual outage (a ledger-write SLO breach) is detected through a different channel.

Where it comes from. Alerts are typically authored by SREs writing PromQL expressions; dashboards are typically authored by platform engineers building Grafana JSON. The two artefacts evolve independently — when an SRE adds a new alert, they rarely update the dashboard to display the metric the alert fires on. The dashboard becomes a snapshot of last-quarter's mental model while the alerts track this-quarter's incidents.

The fix. Two complementary moves. First, every alert that wakes a human must have a corresponding panel on the dashboard PagerDuty links to — the policy is enforced by a CI check that parses alert rules and dashboard JSON and fails if any alert references a metric not on the dashboard. Second, the alert's annotation should include a deep link to the specific panel — https://grafana.example/d/payments-overview?viewPanel=3&from=now-1h&to=now — so PagerDuty pages bring the engineer directly to the right panel rather than to the dashboard's top. The CI check is mechanical; the deep link is operational; both are needed because either alone fails partially.
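The deep-link half is small enough to generate rather than hand-edit. A sketch of building the panel URL in the shape shown above; the annotation key name (dashboard_panel_url) and the dashboard UID are conventions invented here for illustration, not anything Prometheus, Grafana, or PagerDuty mandates.

# alert_deep_link.py - sketch: build the panel deep link an alert annotation carries.
# The URL shape matches the example in the prose; the annotation key and the UID are
# illustrative conventions only.

def panel_deep_link(base, dashboard_uid, panel_id, window="1h"):
    return f"{base}/d/{dashboard_uid}?viewPanel={panel_id}&from=now-{window}&to=now"

annotation = {
    "summary": "order-router burning 1h error budget",
    "dashboard_panel_url": panel_deep_link("https://grafana.example", "payments-overview", 3),
}
print(annotation["dashboard_panel_url"])
# -> https://grafana.example/d/payments-overview?viewPanel=3&from=now-1h&to=now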

Anti-pattern 7: the hero dashboard that became a 3-deep tree

The shape. A single dashboard, the team's "main" view, has grown over 18-24 months from 6 panels to 30 panels to 60+ panels organised in collapsible sections. Section-1 is the floor; Section-2 is per-component; Section-3 is per-tenant; Section-4 is anomaly drill-down. The dashboard is now functionally three dashboards merged into one, but it is presented as a single dashboard with collapsible sections.

The pathology. The 10-second test is impossible because the floor signals are interleaved with drill-down panels in the visual layout, even when collapsible sections are used. Collapsible sections do not actually collapse for a sleep-deprived engineer who scrolls past them looking for a specific panel they remember from last week. The hero dashboard tries to be everything to everyone; it ends up being usable by no one at 02:46.

Where it comes from. Hero dashboards are the successful outcome of a team that started with a small dashboard and resisted splitting it. Every time the team should have split, someone said "but everything is together now, why split it?" — the integrationist instinct. The instinct is correct in code (cohesive modules are good); it is wrong in dashboards because dashboards are read by humans under stress, and human cognitive load does not benefit from cohesion the way code maintenance does.

The fix. The 3-tier hierarchy: tier-1 (floor) is one dashboard with ≤6 panels — RED for user-facing, USE for hardware-bound, both for databases. Tier-2 (service-shape) has one dashboard per component — payments-router-detail, payments-postgres-detail, payments-kafka-consumer-detail — each with 8-15 panels covering both methods. Tier-3 (drill-down) has per-tenant, per-pod, per-region views, parameterised by Grafana template variables. PagerDuty links to tier-1; clicks from tier-1 panels link to tier-2; clicks from tier-2 panels link to tier-3. The hero dashboard is demoted — its content is split across tier-1 and tier-2, with cross-links — and the original URL redirects to tier-1. Demotion is the hard political move; it is the move that matters.

Measuring whether your dashboard fails the 10-second test — a Python audit

The 10-second test is a human judgement, but the structural prerequisites for passing it can be measured mechanically. The script below parses a Grafana dashboard JSON export and emits a report card on the seven anti-patterns. The report does not pass or fail the dashboard — that judgement is yours — but it surfaces the structural symptoms that strongly correlate with each anti-pattern. Run it against your team's dashboards and the patterns become visible.

# dashboard_audit.py — score a Grafana dashboard JSON for the seven anti-patterns
# standard library only; no third-party packages required
import json, re, sys
from collections import defaultdict

def audit(dash):
    panels = dash.get("panels", [])
    flat = []
    for p in panels:
        if p.get("type") == "row":
            flat.extend(p.get("panels", []))
        else:
            flat.append(p)
    title = dash.get("title", "<untitled>")
    findings = defaultdict(list)

    # AP-1 panel soup: >25 panels = warning, >40 = critical
    n = len(flat)
    if n > 40: findings["panel_soup"].append(f"CRITICAL: {n} panels (cap floor at 6)")
    elif n > 25: findings["panel_soup"].append(f"WARN: {n} panels (consider split)")

    # AP-3 single-tenant blindness: any panel with sum() over no `by` clause on
    # a multi-tenant-shaped metric is suspicious
    tenant_words = ("tenant", "customer", "merchant", "account", "org")
    for p in flat:
        for t in p.get("targets", []):
            expr = t.get("expr", "")
            if re.search(r"\bsum\s*\(", expr) and not re.search(r"by\s*\(\s*\w*(tenant|customer|merchant)", expr):
                if any(w in (p.get("title","").lower()) for w in tenant_words):
                    findings["single_tenant_blindness"].append(
                        f"panel '{p.get('title')}' aggregates without by(tenant): {expr[:80]}")

    # AP-4 gauge-only: stat / single-stat / gauge panels without thresholds
    for p in flat:
        if p.get("type") in ("stat", "singlestat", "gauge"):
            thresholds_set = bool(p.get("fieldConfig",{}).get("defaults",{}).get("thresholds",{}).get("steps"))
            if not thresholds_set:
                findings["gauge_only_no_context"].append(
                    f"panel '{p.get('title')}' is {p.get('type')} with no threshold context")

    # AP-5 always-green over-aggregation: panels with sum() but no topk/bottomk
    # for fleet-shaped metrics (error rate, latency)
    for p in flat:
        for t in p.get("targets", []):
            expr = t.get("expr", "")
            looks_fleet = any(k in expr.lower() for k in ("error", "latency", "duration", "5xx"))
            if looks_fleet and "sum(" in expr and "topk" not in expr and "by (pod" not in expr:
                findings["over_aggregation"].append(
                    f"panel '{p.get('title')}' aggregates fleet metric without topk/by-pod")
                break

    # AP-6 alerts-without-panels: caller passes alert metric names; we check coverage
    return n, findings

def alerts_without_panels(alert_metric_names, dash):
    panels = dash.get("panels", [])
    flat = []
    for p in panels:
        flat.extend(p.get("panels", []) if p.get("type")=="row" else [p])
    referenced = set()
    for p in flat:
        for t in p.get("targets", []):
            for m in re.findall(r"\b([a-z_][a-z0-9_]*_(?:total|seconds|bytes|count))\b", t.get("expr","")):
                referenced.add(m)
    missing = [m for m in alert_metric_names if m not in referenced]
    return missing

if __name__ == "__main__":
    dash = json.load(open(sys.argv[1]))
    alert_metrics = sys.argv[2].split(",") if len(sys.argv) > 2 else []
    n, findings = audit(dash)
    print(f"DASHBOARD: {dash.get('title','?')}  PANELS: {n}")
    for k, items in findings.items():
        print(f"  [{k.upper()}]  {len(items)} finding(s)")
        for it in items[:5]:
            print(f"    - {it}")
    if alert_metrics:
        missing = alerts_without_panels(alert_metrics, dash)
        if missing:
            print(f"  [ALERTS_WITHOUT_PANELS]  {len(missing)} alert metric(s) not on dashboard:")
            for m in missing: print(f"    - {m}")

Running this against a real DalalKite-shaped dashboard JSON produces output like the following — the kind of report card that a team can act on:

$ python3 dashboard_audit.py kite-trading-overview.json \
    order_router_errors_total,kafka_consumergroup_lag_seconds

DASHBOARD: kite-trading-overview  PANELS: 47
  [PANEL_SOUP]  1 finding(s)
    - CRITICAL: 47 panels (cap floor at 6)
  [SINGLE_TENANT_BLINDNESS]  3 finding(s)
    - panel 'Customer order rate' aggregates without by(tenant): sum(rate(orders_total[5m]))...
    - panel 'Account-level fills' aggregates without by(tenant): sum(rate(fills_total[5m]))
    - panel 'Merchant settlements' aggregates without by(tenant): sum(rate(settlements_total[1h]))
  [GAUGE_ONLY_NO_CONTEXT]  4 finding(s)
    - panel 'Active connections' is stat with no threshold context
    - panel 'Memory used' is stat with no threshold context
    - panel 'Disk free' is gauge with no threshold context
    - panel 'WAL size' is stat with no threshold context
  [OVER_AGGREGATION]  6 finding(s)
    - panel 'Error rate' aggregates fleet metric without topk/by-pod
    - panel 'p99 latency' aggregates fleet metric without topk/by-pod
    ...
  [ALERTS_WITHOUT_PANELS]  1 alert metric(s) not on dashboard:
    - kafka_consumergroup_lag_seconds

Walking through the load-bearing lines: the loop that builds flat flattens collapsible-row panels into a single list because Grafana stores row-panels with their children nested — the audit cares about leaf panels regardless of which row they live in. The tenant_words heuristic catches the single-tenant-blindness case by looking for tenant-shaped vocabulary in the panel title; it has false positives, but the false-positive rate has been low in practice (titles use these words consistently or not at all). The re.search(r"\bsum\s*\(", expr) check catches any aggregation; the absence of a by (tenant|customer|merchant) clause on a tenant-shaped panel is the symptom. The thresholds_set check catches gauge-only panels that have no threshold steps configured — Grafana exposes these in fieldConfig.defaults.thresholds.steps, which is empty on a freshly-built single-stat panel. Why parsing JSON works at all when the dashboard is a UI artefact: Grafana's dashboards are stored as JSON in the backing database (or as YAML in dashboards-as-code repos), and the JSON model is stable enough across Grafana versions for structural checks like these. The audit is parsing the source of truth the UI renders from, not the rendered HTML — which is why static analysis catches structural issues that visual inspection misses (a 47-panel dashboard "looks" like 6 because the other 41 are below the fold).

The script does not catch anti-patterns 2 (wrong-floor decision) or 7 (hero dashboard 3-deep tree) — those require knowing the service shape and the organisational history, neither of which is in the JSON. They are caught by quarterly dashboard audits run by humans, not by mechanical CI. The mechanical audit catches 5 of 7; the human audit catches the remaining 2; together they catch the whole set.

Why a CI-blocking version of this script is more valuable than a once-quarterly audit: anti-patterns accumulate at the rate of 1-2 panels per sprint per team. A quarterly audit catches a quarter's worth of accumulation at once, by which time the team has already operated through 12 weeks of degraded dashboard quality. A CI-blocking script that runs on every dashboard PR catches the issue at the moment of introduction, when the cost to fix is one PR comment instead of one refactor sprint. The Razorpay platform team that adopted this pattern in 2025 reported a 60% drop in dashboard-as-contributing-factor postmortems within two quarters — not because the audit caught more anti-patterns, but because it caught them earlier in the lifecycle.
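A minimal sketch of that CI gate, reusing audit() and alerts_without_panels() from dashboard_audit.py above. Treating any CRITICAL finding or any uncovered alert metric as a blocking failure is a policy choice, and the non-zero exit code is simply what most CI systems interpret as failure.

# ci_gate.py - sketch: fail a dashboard PR when the audit finds blocking issues.
# Imports the functions from dashboard_audit.py above; the blocking policy is a choice.
import json, sys
from dashboard_audit import audit, alerts_without_panels

def main(dash_path, alert_metrics):
    dash = json.load(open(dash_path))
    _, findings = audit(dash)
    blocking = [f for items in findings.values() for f in items if f.startswith("CRITICAL")]
    blocking += [f"alert metric '{m}' has no panel on this dashboard"
                 for m in alerts_without_panels(alert_metrics, dash)]
    for b in blocking:
        print(f"BLOCKING: {b}")
    return 1 if blocking else 0   # non-zero exit fails the pipeline

if __name__ == "__main__":
    metrics = sys.argv[2].split(",") if len(sys.argv) > 2 else []
    sys.exit(main(sys.argv[1], metrics))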

What a fixed dashboard looks like — the 3-tier hierarchy in practice

The structural fix for all seven anti-patterns is a 3-tier dashboard hierarchy. The fix is general — it works for fintech, OTT, logistics, SaaS — because the hierarchy is grounded in the human-attention budget at incident time, not in any specific service shape. Here is what the three tiers look like for a hypothetical Bengaluru-based payments service we will call Setu UPI Router.

Figure: Three-tier dashboard hierarchy for Setu UPI Router, drawn as a pyramid. Tier 1 (floor, the PagerDuty target): six panels covering request rate, error rate, p99 latency, 1h error-budget burn, 6h error-budget burn, and worst-tenant error rate; this is the 10-second-test target. Tier 2 (service-shape drill-down): router-detail (RED plus USE on the connection pools, 8 panels), postgres-detail (RED and USE both at floor, 12 panels), kafka-consumer-detail (RED-with-lag plus USE, 10 panels). Tier 3 (drill-down, parameterised by template variables): per-pod (16 router pods), per-tenant ($tenant), per-region (ap-south-1). PagerDuty links only to tier 1; tier-1 panels link to tier 2; tier-2 panels link to tier 3; drill-downs are clicked into, never paged into directly. Illustrative, based on shapes from Indian fintech SRE teams.
Illustrative — the three-tier hierarchy. Tier 1 has a hard 6-panel cap; tier 2 mirrors the USE-vs-RED decision per component; tier 3 is parameterised drill-down. Each tier passes the 10-second test for its own audience (on-call, SRE, debugger).

Tier 1 (floor) — setu-upi-router-overview. Six panels. Top row of three large panels: request rate by gateway (NPCI/HDFC/SBI/ICICI as four stacked colours), error-rate as a fraction with the 0.5% SLO line drawn, p99 latency by endpoint with the 200ms SLO line. Bottom row of three medium panels: 1h error-budget burn-rate with multi-window threshold colour, 6h error-budget burn-rate with multi-window threshold colour, worst-tenant error-rate using topk(1, ...) to surface the single worst tenant. PagerDuty's runbook URLs all link here. Tier 1 is the only dashboard a leadership reader ever sees.

Tier 2 (service-shape) — three dashboards, one per component. setu-router-detail (the router pod itself, 8 panels: 3 RED + 5 USE on its connection pools to NPCI/HDFC/SBI/ICICI/internal). setu-postgres-detail (the Postgres backing the router, 12 panels: 4 RED for queries + 8 USE for buffer pool, WAL, replication, IOPS). setu-kafka-consumer-detail (the consumer of payment-event Kafka, 10 panels: RED-with-lag triplet + USE on consumer worker pool + per-partition rebalance health). Each tier-2 dashboard passes the 10-second test for SRE on-call who has already determined which component is broken from tier 1.

Tier 3 (drill-down) — parameterised by Grafana template variables. Per-pod drill-down ($pod variable, 16 router pods to choose from), per-tenant drill-down ($tenant variable, ~200 tenants), per-region drill-down ($region variable). Tier 3 dashboards are reached by clicking a panel in tier 1 or tier 2 — clicking the worst-tenant topk panel in tier 1, for example, sets the $tenant variable on tier 3 and lands the engineer on that tenant's drill-down view with the variable pre-filled. PagerDuty never links to tier 3 directly because tier 3 only makes sense after the tier-1 and tier-2 hypotheses have been formed.

The hierarchy is enforced by two operational policies: (a) PagerDuty alert annotations may only link to tier-1 dashboards, never tier-2 or tier-3; and (b) every panel on tier-1 must have a links field with a click-through URL to a tier-2 panel, and every panel on tier-2 must have a links field to either tier-3 or a query in Grafana's explore view. The two policies together produce the click-down user experience that the 3-tier hierarchy is designed for.
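Policy (b) is mechanically checkable because panels carry a links array in Grafana's JSON model. A sketch; identifying which dashboards are tier-1 (by naming convention, folder, or tag) is left to the caller, and nested row panels are skipped for brevity, which is fine for a 6-panel floor that has no rows.

# tier1_link_check.py - sketch: every tier-1 panel must carry a click-through link.
# Assumes the caller only feeds it tier-1 dashboards; the links field is part of
# Grafana's panel JSON model, as referenced in the policy above.
import json, sys

def panels_missing_links(dash):
    missing = []
    for p in dash.get("panels", []):
        if p.get("type") == "row":
            continue
        if not p.get("links"):
            missing.append(p.get("title", "<untitled>"))
    return missing

if __name__ == "__main__":
    dash = json.load(open(sys.argv[1]))
    missing = panels_missing_links(dash)
    if missing:
        print("tier-1 panels without a drill-down link:")
        for title in missing:
            print(f"  - {title}")
        sys.exit(1)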

A subtler discipline: tier-1 is hand-curated, never auto-generated. Tier-2 and tier-3 can be templated (one tier-2 dashboard per service, generated from a service_shape annotation in the service's deploy manifest), but tier-1 is bespoke per service-of-services and hand-tuned to that service's actual failure modes. The hand-curation is what makes tier-1 readable in 10 seconds — auto-generated dashboards drift toward "the union of every metric the service emits", which is the panel-soup anti-pattern with extra steps.

Common confusions

  • "More panels means a more comprehensive dashboard." The opposite. A dashboard with 47 panels is less readable than a dashboard with 6 panels because the human-attention budget is fixed at ~10 seconds; more panels means less time per panel, which means each individual panel is read less reliably. Comprehensiveness is achieved through hierarchy (3 tiers, click-through), not through density (40 panels on one screen). The right answer to "we need more panels" is almost never "add to the floor"; it is "add to a drill-down".
  • "Aggregation is always a cardinality saver." Aggregation hides one-tenant or one-pod pathology; the operational cost of missing a single-tenant outage is usually larger than the cardinality cost of carrying tenant_id as a label on a few key metrics. The right pattern is topk(5, ... by (tenant_id)) — display only the top 5 tenants, but have the per-tenant series exist. Cardinality concerns are about storage, not about display; the two are separable.
  • "A green dashboard means the service is healthy." A green dashboard means no panel on the dashboard is showing red, which is a strictly weaker statement than "the service is healthy". Dashboards have blind spots — the metrics not on the dashboard, the aggregation layers that wash out per-pod failures, the lag between metric ingestion and dashboard refresh. The right framing is "the dashboard has not detected an issue yet"; treat absence-of-red as evidence-of-uncertainty, not evidence-of-health, especially during a deploy or after a configuration change.
  • "Alerts and dashboards are independent artefacts." They must not be. An alert that wakes a human and references a metric not on the dashboard the alert links to is a bug — the on-call engineer cannot triage. The CI policy "every alert metric must appear on the linked dashboard" is the mechanical enforcement; without it, alerts and dashboards drift apart over quarters and the on-call experience degrades silently. Alerts are dashboards' contract with the human; the contract has to be visible.
  • "Dashboards are an SRE artefact." Dashboards have at least three audiences: on-call SREs (need floor + drill-down hierarchy), platform engineers (need per-component tier-2), and leadership (need tier-1 only, larger panels, fewer numbers). The same Grafana URL serving all three is the audience-ambiguity anti-pattern that produces hero dashboards. The fix is one tier-1 dashboard per audience-purpose, with consistent semantics but different layout — leadership's tier-1 might be 4 panels with bigger fonts, on-call's tier-1 might be 6 panels with denser sparklines, but both pass the 10-second test for their own readers.
  • "Removing a panel breaks compatibility with prior incident response." Removing a panel breaks recall of prior responses, but most teams find that 80% of removed panels are never missed because the underlying metric was either redundant with another panel or never read in the first place. The empirical test: before removing, log who looked at the panel in the last 90 days (Grafana annotations + access logs). If the answer is "no one", remove it without ceremony. If the answer is "two people, both during the same incident, both immediately switched to a different panel", remove it and link the new panel from the incident runbook.

Going deeper

The cognitive science of dashboard reading at 02:46

There is genuine cognitive-science research on how sleep-deprived experts read information dashboards under stress, and it produces concrete recommendations that observability culture has slowly absorbed. Klein's "naturalistic decision making" framework (1989, 1998) argues that experts do not read dashboards analytically under time pressure; they pattern-match. The first 5-10 seconds of dashboard reading is recognition — does this look like the dashboard at 14:00 yesterday (healthy) or like the dashboard at 03:17 last Tuesday (the deploy regression)? Pattern-matching is fast, accurate when the pattern is in the expert's library, and catastrophic when the pattern almost matches but actually represents a novel failure mode (Cook's "context conditioning" effect). The implication for dashboard design: keep the visual signature of the dashboard stable across normal-state weeks, so that any deviation pattern-matches as "abnormal" within the recognition budget. Adding a new panel disrupts the signature; rearranging panels disrupts the signature; even changing colour scales disrupts the signature. Dashboard stability is a feature, not a stagnation. The on-call engineer who has read the same dashboard 200 times in a normal state has a trained pattern that fires on deviation; rebuild the dashboard quarterly and you destroy that pattern. This is why the 3-tier hierarchy works — tier 1 is rarely changed, tier 2 changes occasionally, tier 3 changes constantly without disrupting the recognition pattern at the top.

Why Grafana's defaults bias toward anti-patterns

Several of Grafana's UX defaults nudge teams toward the anti-patterns named in this chapter. The default panel type for a new query is the time-series-line, which is good — but the default visual style has unlabelled axes, no threshold lines, and no SLO marker; teams must explicitly add these. The default for a new dashboard has no row-and-section structure; teams build vertically until they hit the panel-soup floor. The default for collapsible sections starts them all expanded, defeating the visual-hierarchy intent. The default for stat panels is no threshold colour-mapping, producing the gauge-only-no-context anti-pattern. None of these are bugs; they are the path-of-least-resistance choices that fit a "let users explore" philosophy. The implication: teams that adopt Grafana without an opinionated style guide will drift toward the anti-patterns, because the defaults pull that way. The fix is a style guide enforced by CI — a "pre-commit-for-dashboards" tool that rejects PRs adding stat panels without thresholds, dashboards with more than 25 panels in tier-1, or template variables that change which metrics appear (the latter destroys the visual-signature recognition pattern from the previous subsection).

The post-mortem that named the seven — the CleartripOne 2024 IPL booking outage

A detailed post-mortem from a hypothetical Bengaluru-based travel-tech firm we will call CleartripOne, published internally in May 2024 and excerpted at SREcon India later that year, named the dashboard-as-contributing-factor pattern explicitly for the first time in the Indian SRE community. The incident was a 38-minute booking-API outage during a CSK-vs-MI IPL final group-stage match in March 2024 — peak booking-traffic window for hotel bookings near the stadium city. The outage was caused by a single Postgres replica falling behind primary by 4 minutes due to a slow VACUUM FULL triggered by a misconfigured pg_cron job. The dashboard had a replication_lag_seconds panel — but it was on tier-3 (per-replica drill-down), not on tier-1 (booking-overview), and the alert that fired on it linked to tier-1 rather than tier-3. The on-call engineer spent 19 minutes on tier-1 looking for the cause before realising they had to navigate to tier-3 to see the lag panel. The post-mortem's "five whys" walked back through: (1) why was the alert link wrong → because tier-3 dashboards did not exist as direct PagerDuty targets at the time; (2) why was the replication-lag panel not on tier-1 → because tier-1 was already at 11 panels and the platform team had been avoiding adding more; (3) why was tier-1 already crowded → because the team had not split it when it grew past 6 panels; (4) why had they not split → because no one owned the dashboard hierarchy; (5) why no owner → because dashboards were collectively maintained, with each addition by-PR but no removals. The post-mortem assigned ownership of dashboard hierarchy to the platform team and instituted the 6-panel cap on tier-1. The post-mortem's appendix listed the seven anti-patterns essentially as named in this chapter — it was the first internal Indian-SRE document to name them as a set rather than as individual issues, and the framing has since spread to other teams' style guides.

The argument against fixed dashboards entirely

Charity Majors and the Honeycomb team have argued (Observability Engineering, ch. 4 and 12) that fixed dashboards are an artefact of the pre-high-cardinality era. The argument: when you can ask arbitrary questions of high-cardinality event data ("show me the p99 latency for requests where tenant_id=acme-corp AND feature_flag.new_router=true AND region=ap-south-1 AND error_class=5xx"), you no longer need a fixed three-or-five-number summary, because the right summary for this incident is composable on the fly. The critique has weight in cases where the team is fluent in writing ad-hoc queries during incidents and where the high-cardinality store is fast enough to support exploration under stress. It has less weight in cases where the on-call engineer is sleep-deprived, monolingual in Grafana, and not fluent in the high-cardinality query language. The right synthesis: tier-1 dashboards are the floor for fast diagnosis and for leadership-facing summary, and high-cardinality query consoles are the drill-down when the floor signal is insufficient. Both are useful; neither replaces the other. A team that has only fixed dashboards misses the high-cardinality case; a team that has only query consoles misses the 02:46-recognition-pattern case. The 3-tier hierarchy from earlier in this chapter accommodates both: tier-1 is fixed, tier-3 can be a query-console handoff with a pre-filled query template.

The tier-1 panel that catches every anti-pattern at once — the SLO burn-rate panel

A practical observation from teams that have implemented the 3-tier hierarchy: the single most diagnostic panel on tier-1 is the SLO burn-rate panel, not the error-rate or latency panels. Burn-rate is a contract-violation signal — it goes red when the service is consuming its error budget faster than the SLO permits, regardless of whether the failure mode is high error rate, slow latency, or partial unavailability. A burn-rate panel with the multi-window-multi-burn-rate logic from the Google SRE workbook (1h burn-rate ≥ 14.4 OR 6h burn-rate ≥ 6 — covered in /wiki/burn-rate-alerting) catches roughly 90% of real outages at the right time-window — fast enough to page within 2 minutes of breach, slow enough to ignore transient spikes that auto-recover. The burn-rate panel is also audience-friendly — leadership reads "burn-rate 14.4 over 1h, projected error budget exhaustion in 23 hours" as a single number with a single interpretation, whereas "p99 latency 280ms" reads as ambiguous (is that good or bad?). On tier-1, two of the six panels should be burn-rate panels (1h and 6h windows side by side); the gauge-only anti-pattern's fix (anti-pattern 4) happens largely by accident when these two panels are present, because burn-rate explicitly catches the slow-drift cases that gauge-only panels miss.
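Expressed as queries, the two tier-1 burn-rate panels reduce to one short formula. A minimal sketch, assuming a 99.5% availability SLO and the requests_total naming used earlier in the chapter; the 14.4 and 6 thresholds are the workbook values cited above.

# burn_rate_queries.py - sketch: the two tier-1 burn-rate panel queries and the
# combined paging condition, for an assumed 99.5% availability SLO (0.5% error budget).
SLO = 0.995
ERROR_BUDGET = round(1 - SLO, 4)   # 0.005

def burn_rate(window: str) -> str:
    # burn rate = observed error ratio over the window / error ratio the SLO allows
    return (
        f'(sum(rate(requests_total{{status="500"}}[{window}]))'
        f' / sum(rate(requests_total[{window}]))) / {ERROR_BUDGET}'
    )

panel_1h = burn_rate("1h")   # tier-1 panel; threshold colour flips red at 14.4
panel_6h = burn_rate("6h")   # tier-1 panel; threshold colour flips red at 6
paging_condition = f"({panel_1h}) >= 14.4 or ({panel_6h}) >= 6"
print(paging_condition)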

Where this leads next

The dashboard hierarchy from this chapter is the floor for the SLO-and-burn-rate parts that follow. /wiki/the-slo-as-a-contract takes the tier-1 burn-rate panels and walks through how their thresholds are derived from the SLO target. /wiki/multi-window-burn-rate-alerting covers the 1h+6h composition that tier-1 panels should display. The chapter on alert hygiene (/wiki/alert-fatigue-as-a-production-failure) closes the loop — dashboards are the passive observability surface, alerts are the active surface, and the alerts-without-panels anti-pattern from this chapter is the linkage point.

Within Part 12, the next chapter on panel arithmetic (/wiki/panel-arithmetic-precomputed-comparisons) addresses the gauge-only anti-pattern more deeply — every gauge needs a paired comparison (rate, baseline, burn-rate, threshold), and the chapter covers the discipline of pre-computing those comparisons in the metric pipeline rather than in the dashboard query, for both performance and clarity. The wall-dashboard chapter (/wiki/wall-dashboards-are-where-observability-touches-leadership) covers the leadership-tier-1 variant — same hierarchy principle, different audience tuning.

Across curricula, this chapter cross-links to the data-engineering pipeline-observability material (/wiki/observability-for-data-pipelines-not-just-services) — the seven anti-patterns reappear in slightly different shapes for batch and streaming dashboards, and the 3-tier hierarchy generalises to those as well, with tier-1 being SLA-on-job-completion rather than SLO-on-request-success.

# Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
# dashboard_audit.py uses only the Python standard library; nothing extra to pip install
# Export your Grafana dashboard JSON via the UI (Settings → JSON Model → Save to file)
# Or via API:
curl -H "Authorization: Bearer $GRAFANA_TOKEN" \
     https://grafana.example/api/dashboards/uid/<UID> > dashboard.json
python3 dashboard_audit.py dashboard.json metric_a,metric_b

References