monitoring-guidelines

Monitoring guidelines for applications and infrastructure, including metrics collection, alerting strategies, and SLO-based monitoring.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "monitoring-guidelines" with this command: npx skills add mindrally/skills/mindrally-skills-monitoring-guidelines

Monitoring Guidelines

Apply these monitoring principles to ensure system reliability, performance visibility, and proactive issue detection.

Core Monitoring Principles

Monitor the four golden signals: latency, traffic, errors, and saturation
Implement monitoring as code for reproducibility
Design monitoring around user experience and business impact
Use SLOs (Service Level Objectives) to guide alerting decisions
Balance comprehensive coverage with actionable insights

Key Metrics to Monitor

Application Metrics

Request rate (requests per second)
Error rate (percentage of failed requests)
Response time (p50, p90, p95, p99 latencies)
Active connections and concurrent users
Queue depths and processing times
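As an illustration of how the application metrics above relate to raw request data, here is a minimal sketch that derives request rate, error rate, and latency percentiles from one observation window. The function names, the `(latency_ms, succeeded)` tuple shape, and the nearest-rank percentile method are assumptions made for this example; a real metrics library would typically use histograms instead of keeping every sample.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def summarize(requests, window_seconds):
    """Summarize a window of requests given as (latency_ms, succeeded) tuples."""
    if not requests:
        raise ValueError("no requests observed in the window")
    latencies = [latency for latency, _ in requests]
    failures = sum(1 for _, succeeded in requests if not succeeded)
    return {
        "request_rate": len(requests) / window_seconds,   # requests per second
        "error_rate_pct": 100 * failures / len(requests),  # percentage of failed requests
        "p50_ms": percentile(latencies, 50),
        "p90_ms": percentile(latencies, 90),
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
    }
```

Reporting several percentiles rather than a mean matters because a small number of very slow requests can hide behind a healthy-looking average.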

Infrastructure Metrics

CPU utilization and load average
Memory usage and available memory
Disk I/O and available storage
Network throughput and error rates
Container and pod health (for Kubernetes)

Business Metrics

Transaction volumes and values
User signups and conversions
Feature usage and adoption rates
Revenue-impacting events
Customer satisfaction indicators

Alerting Strategy

Alert Design Principles

Alert on symptoms, not causes
Make alerts actionable with clear remediation steps
Set appropriate severity levels (critical, warning, info)
Avoid alert fatigue through proper threshold tuning
Include runbook links in alert notifications
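One way to enforce these principles is to define alerts as code and reject definitions that lack a valid severity or a runbook link. The sketch below is hypothetical: the `AlertRule` class, its fields, and the example.com URL convention are assumptions for illustration, not the format of any particular alerting tool.

```python
from dataclasses import dataclass

ALLOWED_SEVERITIES = {"critical", "warning", "info"}


@dataclass(frozen=True)
class AlertRule:
    name: str
    symptom: str       # user-visible symptom, not an internal cause
    severity: str
    runbook_url: str   # remediation steps live behind this link

    def __post_init__(self):
        # Validate at definition time so a bad rule never reaches production.
        if self.severity not in ALLOWED_SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity!r}")
        if not self.runbook_url.startswith("https://"):
            raise ValueError("every alert needs a runbook link")

    def notification(self):
        """Render the text sent to the on-call engineer."""
        return (
            f"[{self.severity.upper()}] {self.name}: {self.symptom}\n"
            f"Runbook: {self.runbook_url}"
        )
```

Validating in the constructor means a rule without a runbook fails in code review or CI rather than during an incident.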

SLO-Based Alerting

Define SLOs for critical user journeys
Calculate error budgets and burn rates
Alert when the error budget is being consumed faster than the SLO window can sustain
Use multi-window, multi-burn-rate alerts
Review and adjust SLOs quarterly
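The burn-rate idea above can be sketched as follows. A burn rate of 1 means the error budget would be spent exactly at the end of the SLO window; higher values mean it runs out sooner. The function names and the window labels are assumptions for this example. The thresholds of 14.4 (1h/5m) and 6 (6h/30m) are commonly cited starting points for a 30-day window and should be tuned to your own service.

```python
# Each rule pairs a long window with a short window and a burn-rate threshold.
# Requiring both windows to exceed the threshold lets alerts reset quickly
# once the problem stops, while still ignoring brief spikes.
DEFAULT_RULES = (
    ("1h", "5m", 14.4),
    ("6h", "30m", 6.0),
)


def burn_rate(error_ratio, slo_target):
    """How many times faster than sustainable the error budget is being spent."""
    budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(error_ratios, slo_target, rules=DEFAULT_RULES):
    """error_ratios maps a window label (e.g. "1h") to the observed error ratio."""
    for long_window, short_window, threshold in rules:
        long_burn = burn_rate(error_ratios[long_window], slo_target)
        short_burn = burn_rate(error_ratios[short_window], slo_target)
        if long_burn >= threshold and short_burn >= threshold:
            return True
    return False
```

For example, with a 99.9% SLO, a 2% error ratio sustained over both the 1h and 5m windows is a burn rate of about 20 and would page under the first rule.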

Alert Configuration

Set meaningful thresholds based on baseline data
Use hysteresis to prevent flapping alerts
Implement alert dependencies to reduce noise
Route alerts to appropriate teams
Configure escalation policies
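Hysteresis, mentioned above, means using a higher threshold to fire an alert than to clear it, so a value hovering around a single threshold does not toggle the alert on every sample. A minimal sketch, with a hypothetical `HysteresisAlert` class and example thresholds of 90 and 80:

```python
class HysteresisAlert:
    """Fires above one threshold and clears only below a lower one."""

    def __init__(self, fire_above, clear_below):
        if clear_below >= fire_above:
            raise ValueError("clear_below must be lower than fire_above")
        self.fire_above = fire_above
        self.clear_below = clear_below
        self.firing = False

    def update(self, value):
        """Feed one sample and return whether the alert is firing afterwards."""
        if not self.firing and value > self.fire_above:
            self.firing = True
        elif self.firing and value < self.clear_below:
            self.firing = False
        # Values between the two thresholds keep the previous state.
        return self.firing
```

With `HysteresisAlert(fire_above=90, clear_below=80)`, a CPU series of 85, 91, 88, 89, 79, 85 fires at 91, stays firing through 88 and 89, and clears at 79.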

Dashboard Design

Effective Dashboards

Create overview dashboards for service health
Build detailed dashboards for debugging
Use consistent layouts and naming conventions
Include time range selectors and drill-down capabilities
Display SLO status prominently

Dashboard Content

Show current state and recent trends
Include comparison to baseline or previous periods
Display deployment markers for correlation
Add annotations for significant events
Include links to related dashboards and logs

Monitoring Tools Integration

Data Collection

Use agents or sidecars for metric collection
Implement service discovery for dynamic environments
Configure appropriate scrape intervals
Choose push-based or pull-based collection based on the use case
Ensure metric cardinality is manageable
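To make the cardinality point concrete: each distinct combination of label values on a metric becomes its own time series, so the worst case is the product of the number of values per label. The sketch below computes that upper bound; the function name and the label sets are made up for illustration.

```python
import math


def max_series(label_values):
    """Upper bound on time series for one metric.

    label_values maps each label name to the collection of values it can take.
    """
    return math.prod(len(set(values)) for values in label_values.values())


bounded = {
    "method": ["GET", "POST"],
    "status": ["200", "404", "500"],
    "region": ["eu", "us"],
}

# Adding an unbounded label such as a user ID multiplies the series count.
unbounded = dict(bounded, user_id=[str(i) for i in range(10_000)])
```

Here `max_series(bounded)` is 12, while `max_series(unbounded)` is 120000, which is why identifiers such as user IDs or request IDs usually belong in logs or traces rather than metric labels.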

Data Storage and Retention

Set retention periods based on use case
Implement downsampling for long-term storage
Use appropriate storage backends for scale
Plan for disaster recovery of monitoring data
Monitor your monitoring infrastructure
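Downsampling for long-term storage can be sketched as grouping raw points into fixed time buckets and keeping a few aggregates per bucket. The function name, the `(timestamp, value)` input shape, and the choice of min, max, and mean are assumptions for this example; real time series databases usually provide their own downsampling or rollup mechanism.

```python
from collections import defaultdict


def downsample(points, bucket_seconds):
    """Aggregate (unix_timestamp, value) points into fixed-size buckets.

    Returns a list of (bucket_start, min, max, mean) tuples sorted by time.
    """
    if bucket_seconds <= 0:
        raise ValueError("bucket_seconds must be positive")
    buckets = defaultdict(list)
    for timestamp, value in points:
        bucket_start = int(timestamp // bucket_seconds) * bucket_seconds
        buckets[bucket_start].append(value)
    return [
        (start, min(values), max(values), sum(values) / len(values))
        for start, values in sorted(buckets.items())
    ]
```

Keeping the minimum and maximum alongside the mean preserves spikes that an average alone would smooth away.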

Health Checks and Probes

Implement liveness probes for crash detection
Use readiness probes for traffic management
Create deep health checks that verify dependencies
Expose health endpoints in a standard format
Monitor health check latency as a metric
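A deep health check of the kind described above might run one probe per dependency, time each probe, and return a single structured report. The sketch below is framework-independent; the function name, the report fields (`status`, `checks`, `latency_ms`, `detail`), and the `pass`/`fail` values are assumptions for this example, not a required format.

```python
import time


def run_health_checks(checks):
    """Run each dependency check and build a structured report.

    checks maps a dependency name to a zero-argument callable that
    raises an exception when the dependency is unhealthy.
    """
    results = {}
    for name, check in checks.items():
        started = time.perf_counter()
        try:
            check()
            status, detail = "pass", None
        except Exception as exc:  # any failure marks the dependency unhealthy
            status, detail = "fail", str(exc)
        latency_ms = (time.perf_counter() - started) * 1000
        results[name] = {"status": status, "latency_ms": latency_ms, "detail": detail}

    overall = "pass" if all(r["status"] == "pass" for r in results.values()) else "fail"
    return {"status": overall, "checks": results}
```

The per-check `latency_ms` values can be exported as metrics, so a dependency that is slowing down becomes visible before it starts failing outright. Deep checks like this are generally better suited to readiness than to liveness, since restarting a process rarely fixes an unavailable dependency.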

Incident Response

Use monitoring data to detect incidents early
Correlate metrics, logs, and traces during investigation
Document findings and update monitoring post-incident
Track MTTR (Mean Time to Recovery) metrics
Conduct regular monitoring reviews and improvements
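MTTR, as tracked above, is the average time between detecting an incident and resolving it. A minimal sketch, assuming incidents are recorded as `(detected_at, resolved_at)` datetime pairs; the function name and record shape are hypothetical:

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average recovery duration over (detected_at, resolved_at) datetime pairs."""
    if not incidents:
        raise ValueError("at least one incident is required")
    durations = []
    for detected_at, resolved_at in incidents:
        if resolved_at < detected_at:
            raise ValueError("incident resolved before it was detected")
        durations.append(resolved_at - detected_at)
    return sum(durations, timedelta()) / len(durations)
```

Because a single long outage can dominate the mean, it may be worth reviewing the individual durations alongside the average.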

Capacity Planning

Track resource utilization trends
Set alerts for approaching capacity limits
Use forecasting for proactive scaling
Document capacity requirements and headroom
Review capacity quarterly
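As a rough illustration of forecasting for capacity, the sketch below fits a straight line to daily usage samples with ordinary least squares and estimates how many days remain until a capacity limit is reached. The function name and the assumption of linear growth are choices made for this example; seasonal or bursty workloads need a more careful model.

```python
def days_until_full(daily_usage, capacity):
    """Estimate days from the latest sample until usage reaches capacity.

    daily_usage holds one measurement per day, oldest first.
    Returns None when the fitted trend is flat or decreasing.
    """
    n = len(daily_usage)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(daily_usage) / n
    # Ordinary least-squares slope and intercept for y = intercept + slope * day.
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_usage))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    slope = numerator / denominator
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    day_full = (capacity - intercept) / slope
    return max(0.0, day_full - (n - 1))
```

For disk usage of 50, 52, 54, 56, 58 GB on consecutive days and a 100 GB limit, the fitted growth is 2 GB per day, so the estimate is 21 days from the last sample. An alert on this value dropping below the provisioning lead time is one way to act on approaching limits.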
