Design Datadog Dashboard
Design a dashboard layout that tells a clear story — from high-level health signals down to granular diagnostics — using proper widget types, group organization, and template variables for reusability.
Important: Always check for existing dashboards first with pup dashboards list . Do not create a new dashboard if one already exists for the same service or purpose — update the existing one instead. Only create a new dashboard when no relevant one exists or the user explicitly asks for a new one.
Philosophy: The frameworks, layouts, and widget guides in this skill are starting points — not rigid rules. Every product and business is different. Understand the domain first, then adapt the frameworks to fit. The best dashboards reflect how the business actually works, not how a generic template says they should.
Domain Discovery
Before designing, understand what you are building observability for. The metrics that matter depend entirely on the product and business context.
Ask the user:
-
"Can you tell me about the product and what this service does for the business? What does a customer experience when they interact with it?"
-
"What does a bad day look like for this service? What breaks, and how do customers feel it?"
-
"Are there codebases, architecture docs, or README files I can read to understand the service and its dependencies?"
If the user points you to a codebase: Read it. Look at the entry points, the API routes, the database models, the queue consumers, the external service calls. Understanding the code gives you the context to choose metrics that actually matter — not just generic RED/USE signals.
If the user describes the business: Use that context to tailor the Customer-Facing section. An e-commerce service cares about checkout success rates. A messaging service cares about delivery latency. A payment service cares about transaction completion. Generic "request rate" and "error rate" are a starting point, but the real value comes from metrics that map to business outcomes.
Skip domain discovery if: You already have deep context about the service from prior conversations or the user has provided detailed specifications.
Interview
Skip if ALL of these are already specified: dashboard purpose, target audience, data sources, template variable needs, dashboard strategy.
Always interview if: Auditing or redesigning an existing dashboard (needs current state review first).
-
Purpose — "What is this dashboard for? Service overview, infrastructure, executive KPIs, debugging, or SLO tracking?"
-
Audience — "Who will use this? On-call engineers, platform team, leadership, or mixed?"
-
Data Sources — "Which Datadog products are involved? Metrics only, APM + Metrics, Logs + Metrics, or full stack?"
-
Scope — "Is this for a single service, a group of services, or infrastructure-wide?"
-
Dashboard Strategy — "One dashboard per service, or a consolidated view?" — share the trade-offs from references/layouts.md to help them decide. If unsure, ask: "During an outage, does your team investigate one service at a time, or do they need to see all services simultaneously?"
-
Existing Dashboard — "Is there an existing dashboard to audit or redesign?" If yes, fetch with pup dashboards get <id> before designing.
Workflow
- Gather existing context
pup dashboards list pup dashboards get <dashboard-id> pup dashboards url <dashboard-id>
If auditing an existing dashboard, fetch its definition first and analyze its current structure before redesigning.
- Choose a framework
Match the dashboard purpose to a framework. Read references/frameworks.md for detailed metric mappings and group structures.
Purpose Framework
Service overview RED (Rate, Errors, Duration)
Infrastructure USE (Utilization, Saturation, Errors)
Executive/business Golden Signals
SLO tracking SLI/SLO
Debugging Drill-down
- Design the layout
Using your domain understanding and the chosen framework, design the group structure and select widgets. Read these references as needed:
-
layouts.md — Template variable conventions, group structure patterns, dashboard strategy trade-offs, grid sizing, anti-patterns
-
widgets.md — Widget selection guide, display options, sizing, naming conventions
-
thresholds.md — Alert threshold markers, threshold proximity, Y-axis configuration
Key principles (not rigid rules — use judgment):
-
Prefix every widget title with its layer and priority: I0: (most critical infra), P0: (most critical platform), D0: (most critical domain). See widgets.md for the full prefix system.
-
Start with a Customer-Facing group (5-8 metrics) so someone with zero service knowledge can tell if customers are affected within 5 seconds. Tailor the metrics to the domain.
-
Timeseries widgets should have alert threshold markers (red lines) with thresholds close to normal traffic. If a metric doesn't warrant an alert, question whether it belongs — but context-providing metrics can earn their place.
-
Set Y-axis max explicitly near the threshold — don't let auto-scaling compress the normal range.
-
Order groups macro-to-micro: customer-facing → overview → domain-specific → infrastructure.
- Write the output
Present the design using this template:
Dashboard Design: [Dashboard Title]
Purpose
[1-2 sentences: what this monitors, who uses it]
Template Variables
| Variable | Tag | Default |
|---|---|---|
| ... | ... | * |
Layout
Group: [Group Title]
| Widget | Type | Query/Metric | Width | Alert Threshold |
|---|---|---|---|---|
| ... | ... | ... | ... | ... |
[Repeat for each group]
Quality Validation
[Run quality principles below]
- Validate
pup dashboards list pup dashboards get <dashboard-id> pup metrics list --filter="trace.http.request.*"
Quality Principles
-
Dashboard reflects the actual product and business — metrics tailored to the domain
-
Dashboard title is concise (no environment, region, or version)
-
Template variables defined for env, service, and relevant scopes (default * )
-
Customer-Facing group with 5-8 metrics tailored to the service's business impact
-
Groups ordered macro-to-micro (customer-facing → overview → details)
-
Timeseries widgets have alert threshold markers (red lines) where the metric is alertable
-
Thresholds close to normal traffic — no excessive whitespace
-
Zero-knowledge readability — someone with no service knowledge can spot problems via red indicators
-
Widget titles prefixed with layer and priority (I0: , P1: , D0: , etc.)
-
Widget titles use sentence case, don't repeat group name
-
Every metric earns its place — if it spikes, someone can act on it
References
-
Observability Frameworks — RED, USE, Golden Signals, SLI/SLO with metric mappings
-
Layout & Structure — Template variables, group patterns, dashboard strategy, grid sizing, anti-patterns
-
Widgets — Widget types, display options, sizing, naming conventions
-
Alert Thresholds — Threshold markers, proximity guide, Y-axis configuration