datadog-review-dashboard

Review Datadog Dashboard

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "datadog-review-dashboard" with this command: npx skills add trogonstack/agentskills/trogonstack-agentskills-datadog-review-dashboard

Review Datadog Dashboard

Audit an existing Datadog dashboard against operational readiness principles. The core principles are: graphs should earn their place with alert thresholds, thresholds should sit close to normal traffic, a customer-facing section should exist, and the dashboard should be readable by someone with zero service knowledge.

These are guiding principles — not a rigid checklist. Apply judgment based on the product and business context. A context-providing metric (like deployment events) may earn its place without a threshold. A service with unusual traffic patterns may need different proximity rules.

Interview Phase

Skip interview if ALL of these are already specified:

  • Dashboard ID or URL

  • Service name or team context

Always interview if: No dashboard ID is provided or multiple dashboards may be relevant.

Questions

Dashboard — "Which dashboard should I review? Provide a dashboard ID, URL, or service name to search for."

  • Impact: Determines which dashboard definition to fetch

Business Context — "Can you tell me what this service does for customers? Are there codebases or docs I can read to understand the product?"

  • Impact: Understanding the domain lets the review focus on whether the right metrics are being tracked, not just whether generic rules are followed

Focus — "Is there anything specific you want me to focus on? (A) Full review, (B) Alert thresholds only, (C) Customer-facing section, (D) Layout and readability"

  • Impact: Determines review scope — default to full review if unspecified

Workflow

  1. Fetch Dashboard Definition

If given a service name, search for matching dashboards

pup dashboards list --filter="<service-name>"

If given a URL, extract the dashboard ID from the path (e.g., /dashboard/abc-def-ghi/...)

Get the full dashboard definition

pup dashboards get <dashboard-id>

Get the dashboard URL for reference

pup dashboards url <dashboard-id>

Parse the response to build an inventory of all widgets, groups, and their configurations.

  1. Build Widget Inventory

Catalog every widget in the dashboard:

Widget Title Prefix Type Group Has Alert Threshold Threshold Value Notes

... I0/P1/D0/— ... ... ... ... ...

Check that every widget title uses the layer-priority prefix system:

  • I0-N: for infrastructure (load balancers, databases, networks)

  • P0-N: for platform (service-specific components from the codebase)

  • D0-N: for domain (business metrics)

  • The number indicates priority within the layer (0 = most critical)

Focus on timeseries and query value widgets — these are the primary candidates for alert threshold markers.

  1. Audit Alert Thresholds

Principle: Timeseries graphs should generally have an alert threshold (red line). If a metric doesn't warrant an alert, question whether it belongs — but use judgment. Some metrics provide valuable context (deployment markers, dependency traffic patterns) without needing a threshold.

For each timeseries widget, check:

  • Does it have a marker/threshold line configured?

  • Is the marker colored red for visibility?

  • Does the threshold correspond to an actual monitor/alert?

Findings format:

Alert Threshold Audit

WidgetGroupStatusFinding
Requests/sRateMISSINGNo threshold marker — add alert line or remove widget
Error rateErrorsOKRed line at 5%
CPU usageInfraMISSINGNo threshold — is this metric alertable?
  1. Audit Threshold Proximity

Principle: Alert thresholds must be close to normal traffic. Large gaps between normal values and the alert line create blind spots where anomalies go unnoticed.

For each widget with a threshold:

  • What is the typical (normal) value range?

  • Where is the threshold set?

  • Is there excessive whitespace between the normal line and the alert line?

  • Is the Y-axis auto-scaled or explicitly set? Auto-scaled Y-axes compress normal traffic into a flat band when the threshold is far above normal — the Y-axis max should be set to slightly above the alert threshold

Bad example: Normal CPU is 20%, alert threshold at 95% — the graph is mostly empty space and a slow climb from 20% to 80% looks flat.

Good example: Normal CPU is 20%, alert threshold at 45% — anomalies visually stand out immediately.

Findings format:

Threshold Proximity Audit

WidgetNormal RangeThresholdGapY-AxisStatus
CPU usage~20%95%75%autoTOO FAR — lower to 40-50%, set Y-max to 55%
Error rate~0.1%5%~5%autoOK gap — but set Y-max to 6%
p99 latency~50ms500ms10xautoTOO FAR — lower to 100-150ms, set Y-max to 175ms
  1. Audit Customer-Facing Section

Principle: A dedicated "Customer-Facing" group should exist at the top of the dashboard with 5-8 key metrics for immediate outage identification. The specific metrics should reflect the product's business — not just generic traffic and error rates.

Check:

  • Does a "Customer-Facing" group exist?

  • Is it the first group on the dashboard?

  • Does it contain 5-8 metrics covering: traffic volume, API latency, error rates, key business transactions, and database health?

  • Can someone determine "are customers affected?" within 5 seconds of opening the dashboard?

Findings format:

Customer-Facing Section Audit

Status: MISSING / INCOMPLETE / OK

Current state: [Description of what exists]

Recommended metrics (if missing or incomplete):

  1. Total request rate (are we receiving traffic?)

  2. Customer-facing error rate (are requests failing?)

  3. API p99 latency (are responses slow?)

  4. Key transaction success rate (are critical flows working?)

  5. Database connection pool usage (is the data layer healthy?)

  6. Queue depth or processing lag (is async work backing up?)

  7. Apply Zero-Knowledge Viewer Test

Principle: Someone with zero knowledge of the service should be able to spot problems by looking for red indicators.

Evaluate:

  • Can you identify a problem in under 10 seconds without reading widget titles?

  • Are thresholds visible as red lines on every graph?

  • Is conditional formatting applied to query value widgets (green/yellow/red)?

  • Are group names self-explanatory?

  • Is there a note widget with runbook links or team ownership?

Findings format:

Zero-Knowledge Readability Audit

CheckStatusFinding
Problems visible in <10sFAILNo red lines on 8 of 12 graphs
Conditional formatting on QV widgetsPARTIAL2 of 4 QV widgets have thresholds
Group names self-explanatoryOKAll groups use clear names
Runbook/ownership noteMISSINGNo note widget with team info
  1. Generate Review Report

Compile all findings into a structured report:

Dashboard Review: [Dashboard Title]

Dashboard ID: [id] URL: [url] Review date: [date]

Summary

[2-3 sentence summary: overall health of the dashboard, critical issues count]

Critical Issues

[List issues that must be fixed before the dashboard is production-ready]

Alert Threshold Audit

[From step 3]

Threshold Proximity Audit

[From step 4]

Customer-Facing Section Audit

[From step 5]

Zero-Knowledge Readability Audit

[From step 6]

Recommended Actions

Must Fix

  1. [Action item with specific widget and group reference]

Should Fix

  1. [Action item]

Nice to Have

  1. [Action item]

Quality Checklist

  • Every widget title uses the layer-priority prefix (I0: , P1: , D0: , etc.)

  • Every timeseries widget audited for alert threshold markers

  • Threshold proximity checked (no large gaps between normal values and alert lines)

  • Customer-Facing group exists with 5-8 key metrics at the top

  • Zero-knowledge viewer test applied (red indicators visible without context)

  • Query Value widgets checked for conditional formatting (green/yellow/red)

  • All findings include specific widget names and group references

  • Recommended actions categorized by priority (must/should/nice-to-have)

  • Dashboard URL included in report for easy reference

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

Automation

diataxis-organize-docs

No summary provided by upstream source.

Repository SourceNeeds Review
Automation

diataxis-gen-readme

No summary provided by upstream source.

Repository SourceNeeds Review
Automation

nats-design-subject

No summary provided by upstream source.

Repository SourceNeeds Review
Automation

gh-enrich-pr-description

No summary provided by upstream source.

Repository SourceNeeds Review