BMAD Observability Readiness Skill
When to Invoke
Use this skill when the user:
- Mentions missing or low-quality logging, metrics, or tracing.
- Requests monitoring/alerting setup before a launch or major release.
- Needs SLOs, dashboards, or on-call runbooks.
- Reports alert fatigue or noise that needs rationalization.
- Wants to ensure performance and reliability work has data coverage.
If instrumentation already exists and only specific bug fixes are required, hand over to bmad-development-execution with the backlog produced here.
Mission
Deliver a comprehensive observability plan that enables diagnosis, alerting, and measurement across the system. Ensure downstream performance, reliability, and security work has trustworthy telemetry.
Inputs Required
- Architecture diagrams and component inventory.
- Existing logging/monitoring/tracing configuration (if any).
- Current incidents, outages, or blind spots experienced by the team.
- SLAs/SLOs, business KPIs, or compliance reporting requirements.
Outputs
- Observability plan detailing metrics, logs, traces, dashboards, and retention policies.
- Instrumentation backlog with implementation tasks, owners, and acceptance criteria.
- SLO dashboard specification covering golden signals, alert thresholds, and runbook links.
- Updated runbook or escalation paths if gaps were discovered.
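One way to make the SLO dashboard specification checkable is to capture it as data. The sketch below is illustrative only: `AlertRule`, `JourneySpec`, the `checkout` journey, the thresholds, and the runbook URL are hypothetical names and values, not part of this skill. It assumes Python 3.9 or later.

```python
from dataclasses import dataclass, field

# The four golden signals: latency, errors, saturation, traffic.
GOLDEN_SIGNALS = ("latency", "errors", "saturation", "traffic")


@dataclass
class AlertRule:
    signal: str            # expected to be one of GOLDEN_SIGNALS
    threshold: str         # human-readable condition (hypothetical format)
    runbook_url: str = ""  # empty means the runbook still has to be written


@dataclass
class JourneySpec:
    name: str
    alerts: list[AlertRule] = field(default_factory=list)

    def gaps(self) -> list[str]:
        """List uncovered golden signals, then alerts that lack a runbook link."""
        covered = {alert.signal for alert in self.alerts}
        problems = [
            f"{self.name}: no alert for {signal}"
            for signal in GOLDEN_SIGNALS
            if signal not in covered
        ]
        problems += [
            f"{self.name}: {alert.signal} alert has no runbook"
            for alert in self.alerts
            if not alert.runbook_url
        ]
        return problems


# Hypothetical journey with two of the four signals covered.
checkout = JourneySpec(
    name="checkout",
    alerts=[
        AlertRule("latency", "p95 > 800ms for 5m",
                  "https://runbooks.example.com/checkout-latency"),
        AlertRule("errors", "5xx ratio > 1% for 5m"),
    ],
)

for gap in checkout.gaps():
    print(gap)
```

Each reported gap could become an instrumentation backlog task or a runbook flagged for creation.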
Process
- Audit current telemetry coverage, tooling, and data retention. Document gaps.
- Define observability objectives aligned with user journeys and business KPIs.
- Design instrumentation strategy: metrics taxonomy, structured logging, trace spans, event schemas.
- Establish SLOs, SLIs, and alerting strategy with on-call expectations and noise controls.
- Produce dashboards/reporting requirements and data governance notes.
- Create backlog with prioritized instrumentation tasks and verification approach.
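For the SLO and alerting step, burn-rate arithmetic is one common way to tie alert thresholds to an error budget and keep noise down. The sketch below is a minimal illustration under assumptions: the function names, the 99.9% target, the 30-day (720-hour) window, and the 14.4 paging threshold are illustrative choices, not values this skill prescribes.

```python
def burn_rate(slo_target: float, total: int, failed: int) -> float:
    """Observed error ratio divided by the error ratio the SLO allows.

    A burn rate of 1.0 spends the error budget exactly over the SLO window.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total <= 0:
        return 0.0  # no traffic in the window, so nothing was burned
    return (failed / total) / (1.0 - slo_target)


def budget_consumed(rate: float, elapsed_hours: float, window_hours: float) -> float:
    """Fraction of the error budget spent by burning at `rate` for `elapsed_hours`."""
    if window_hours <= 0:
        raise ValueError("window_hours must be positive")
    return rate * elapsed_hours / window_hours


# Assumed paging threshold: a burn rate of 14.4 sustained for one hour
# spends about 2% of a 30-day budget.
PAGE_BURN_RATE = 14.4

rate = burn_rate(slo_target=0.999, total=100_000, failed=1_440)
spent = budget_consumed(rate, elapsed_hours=1, window_hours=720)
print(f"burn rate {rate:.1f}, budget spent {spent:.1%}")
```

Alerting on burn rate rather than on raw error counts scales the threshold with traffic, which is one way to address the noise controls mentioned above.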
Quality Gates
- Every critical user journey has metrics and alerts defined (latency, errors, saturation, traffic).
- Logging standards specify structure, PII handling, and retention.
- Alert runbooks are documented or flagged for creation.
- The observability plan references integration with performance, security, and incident workflows.
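The logging gate can be made concrete with a small formatter built on Python's standard `logging` module. This is a sketch under assumptions: the `PII_FIELDS` set, the `context` attribute name, and the `[REDACTED]` marker are hypothetical conventions chosen for illustration. Retention is a storage-side policy and is not handled here.

```python
import json
import logging

# Hypothetical PII field names; the real list belongs in the logging standard.
PII_FIELDS = {"email", "phone", "ip_address"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object, masking fields listed in PII_FIELDS."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed as extra={"context": {...}} become a record attribute.
        context = getattr(record, "context", {})
        for key, value in context.items():
            payload[key] = "[REDACTED]" if key in PII_FIELDS else value
        return json.dumps(payload, sort_keys=True)


def build_logger(name: str) -> logging.Logger:
    """Return a logger that writes structured JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


log = build_logger("checkout")
log.info("payment failed",
         extra={"context": {"email": "user@example.com", "order_id": 42}})
```

Masking at the formatter keeps call sites simple, but it only covers fields passed through `context`. PII embedded in the message string itself would need a separate control.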
Error Handling
- If telemetry tooling is undecided, present comparative options with trade-offs.
- Highlight dependencies on platform teams or infrastructure before finalizing timeline.
- Escalate when observability requirements conflict with compliance or privacy constraints.