alerting-dashboard-builder

Alerting & Dashboard Builder

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "alerting-dashboard-builder" with this command: npx skills add monkey1sai/openai-cli/monkey1sai-openai-cli-alerting-dashboard-builder

Build effective alerts and dashboards based on SLOs.

SLO Definition

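slos:
  - name: api_availability
    objective: 99.9%
    window: 30d
    # Share of requests that did not return a 5xx status
    sli: |
      sum(rate(http_requests_total{status_code!~"5.."}[5m]))
      /
      sum(rate(http_requests_total[5m]))

  - name: api_latency
    objective: 95%  # 95% of requests under 500ms
    window: 30d
    # Buckets are summed by `le` so the quantile covers the whole service
    # rather than each individual series
    sli: |
      histogram_quantile(
        0.95,
        sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
      ) < 0.5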
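To make the availability objective concrete, it can help to convert it into an error budget expressed as time. The following is a minimal sketch, not part of the upstream skill; the function name `error_budget` is hypothetical, and it assumes the budget is spent as full downtime.

```python
from datetime import timedelta


def error_budget(objective_percent: float, window: timedelta) -> timedelta:
    """Return how long the service may be fully unavailable within the window."""
    budget_fraction = 1.0 - objective_percent / 100.0
    return window * budget_fraction


# Values taken from the api_availability SLO: 99.9% over 30 days
budget = error_budget(99.9, timedelta(days=30))
print(f"Allowed downtime: {budget.total_seconds() / 60:.1f} minutes")
```

For the 99.9% objective over 30 days this works out to roughly 43 minutes of full downtime, which is the budget the burn alerts draw against.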

Alert Rules

groups:
  - name: slo_alerts
    rules:
      # Fast burn (1% budget in 1h)
      # Burn rate = 0.01 * 720h / 1h = 7.2, so the error-ratio
      # threshold for a 99.9% SLO is 7.2 * 0.001 = 0.0072
      - alert: AvailabilitySLOFastBurn
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{status_code!~"5.."}[1h]))
              /
              sum(rate(http_requests_total[1h]))
            )
          ) > 0.0072
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Burning 1% error budget per hour"
          runbook: "https://runbooks.example.com/availability-fast-burn"

      # Slow burn (10% budget in 24h)
      # Burn rate = 0.10 * 720h / 24h = 3, so the error-ratio
      # threshold for a 99.9% SLO is 3 * 0.001 = 0.003
      - alert: AvailabilitySLOSlowBurn
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{status_code!~"5.."}[24h]))
              /
              sum(rate(http_requests_total[24h]))
            )
          ) > 0.003
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Burning error budget slowly"
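The budget fractions named in the rule comments (1% in 1h, 10% in 24h) can be converted into error-ratio thresholds with a little arithmetic. The sketch below assumes uniform traffic over the 30d window; the helper names `burn_rate` and `error_ratio_threshold` are hypothetical and do not come from the upstream skill.

```python
SLO_WINDOW_HOURS = 30 * 24  # the 30d window from the SLO definition


def burn_rate(budget_fraction: float, alert_window_hours: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    return budget_fraction * SLO_WINDOW_HOURS / alert_window_hours


def error_ratio_threshold(objective: float, budget_fraction: float,
                          alert_window_hours: float) -> float:
    """Error ratio over the alert window that consumes budget_fraction of the budget."""
    return burn_rate(budget_fraction, alert_window_hours) * (1.0 - objective)


fast = error_ratio_threshold(0.999, 0.01, 1)    # 1% of budget in 1h
slow = error_ratio_threshold(0.999, 0.10, 24)   # 10% of budget in 24h
print(f"fast burn: rate={burn_rate(0.01, 1):.1f}, threshold={fast:.4f}")
print(f"slow burn: rate={burn_rate(0.10, 24):.1f}, threshold={slow:.4f}")
```

Under these assumptions a 99.9% SLO gives thresholds of about 0.0072 for the fast burn and 0.003 for the slow burn. Changing the objective or the window changes both values, so it may be worth recomputing them whenever the SLO is revised.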

Dashboard Template

{
  "title": "Service Health Dashboard",
  "rows": [
    {
      "title": "Golden Signals",
      "panels": [
        {
          "title": "Request Rate",
          "query": "sum(rate(http_requests_total[5m]))",
          "type": "graph"
        },
        {
          "title": "Error Rate",
          "query": "sum(rate(http_requests_total{status_code=~\"5..\"}[5m]))",
          "type": "graph"
        },
        {
          "title": "Latency (p50, p95, p99)",
          "queries": [
            "histogram_quantile(0.50, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))",
            "histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))",
            "histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))"
          ]
        },
        {
          "title": "Saturation (CPU, Memory)",
          "queries": [
            "rate(process_cpu_seconds_total[5m])",
            "process_resident_memory_bytes"
          ]
        }
      ]
    },
    {
      "title": "SLO Tracking",
      "panels": [
        {
          "title": "Error Budget Remaining",
          "query": "1 - ((1 - slo_availability) / (1 - 0.999))"
        }
      ]
    }
  ]
}
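Error budget remaining is commonly defined as the share of allowed failures not yet used: 1 - (1 - availability) / (1 - objective). The sketch below works through that arithmetic. It assumes `slo_availability` is a measured availability ratio over the SLO window (for example from a recording rule, which this document does not define), and the function name is hypothetical.

```python
def error_budget_remaining(availability: float, objective: float = 0.999) -> float:
    """Fraction of the error budget still unspent (1.0 = untouched, 0.0 = exhausted)."""
    budget = 1.0 - objective                    # allowed failure ratio
    consumed = (1.0 - availability) / budget    # share of the budget already used
    return 1.0 - consumed


for availability in (1.0, 0.9995, 0.999, 0.998):
    remaining = error_budget_remaining(availability)
    print(f"availability={availability:.4f} -> budget remaining={remaining:.0%}")
```

A negative result means the budget is overspent, which a dashboard panel may want to highlight rather than clamp to zero.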

What to Do When an Alert Fires

Alert Response Guide

HighErrorRate

What it means: More than 5% of requests are failing

First steps:

  1. Check recent deployments (rollback if needed)
  2. Review error logs for patterns
  3. Check dependent services health
  4. Verify database connectivity

Escalation: If not resolved within 15 minutes, page the on-call lead

HighLatency

What it means: p95 latency above 2 seconds

First steps:

  1. Check database query performance
  2. Review recent code changes
  3. Check cache hit rates
  4. Look for slow external API calls

Temporary mitigation:

  • Scale up instances
  • Enable aggressive caching

LowAvailability

What it means: Availability below 99.5%

First steps:

  1. Check infrastructure (AWS status page)
  2. Review load balancer health checks
  3. Check for DDoS activity
  4. Verify auto-scaling is functioning
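The three alerts named in this guide (HighErrorRate, HighLatency, LowAvailability) have no matching rules in the Alert Rules section. The sketch below shows one way such rules might be generated from the thresholds stated in the guide (5%, 2 seconds, 99.5%). The metric names are reused from the SLO definitions; the 5m rate window, the group name `response_guide_alerts`, and the function name are assumptions.

```python
import json

# Expressions reused from the SLO definitions; the 5m window is an assumption
ERROR_RATIO = (
    'sum(rate(http_requests_total{status_code=~"5.."}[5m]))'
    " / sum(rate(http_requests_total[5m]))"
)
AVAILABILITY = (
    'sum(rate(http_requests_total{status_code!~"5.."}[5m]))'
    " / sum(rate(http_requests_total[5m]))"
)
P95_LATENCY = (
    "histogram_quantile(0.95, "
    "sum by (le) (rate(http_request_duration_seconds_bucket[5m])))"
)


def response_guide_rules() -> list[dict]:
    """Build one rule per alert in the guide, using the guide's thresholds."""
    return [
        {"alert": "HighErrorRate", "expr": f"{ERROR_RATIO} > 0.05"},
        {"alert": "HighLatency", "expr": f"{P95_LATENCY} > 2"},
        {"alert": "LowAvailability", "expr": f"{AVAILABILITY} < 0.995"},
    ]


rules = response_guide_rules()
group = {"groups": [{"name": "response_guide_alerts", "rules": rules}]}
print(json.dumps(group, indent=2))
```

The output is JSON rather than hand-written YAML. Before loading anything like this into Prometheus, checking it with `promtool check rules` would be advisable, and `for`, `labels`, and `annotations` fields would still need to be chosen.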

Output Checklist

  • SLOs defined

  • Alert rules configured

  • Dashboards created

  • Runbooks linked

  • Response guides documented

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
