
Deploying Monitoring Stacks

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

To install this skill, copy the following command and send it to your AI assistant:

Install skill "deploying-monitoring-stacks" with this command: npx skills add jeremylongshore/claude-code-plugins-plus-skills/jeremylongshore-claude-code-plugins-plus-skills-deploying-monitoring-stacks


Overview

Deploy production monitoring stacks (Prometheus + Grafana, Datadog, or VictoriaMetrics) with metric collection, custom dashboards, and alerting rules. Configure exporters, scrape targets, recording rules, and notification channels for comprehensive infrastructure and application observability.

Prerequisites

  • Target infrastructure identified: Kubernetes cluster, Docker hosts, or bare-metal servers

  • Metric endpoints accessible from the monitoring platform (application /metrics endpoints, node exporters)

  • Storage backend capacity planned for time-series data (Prometheus TSDB, Thanos, or Cortex for long-term)

  • Alert notification channels defined: Slack webhook, PagerDuty integration key, or email SMTP

  • Helm 3+ for Kubernetes deployments using kube-prometheus-stack or similar charts

Instructions

  • Select the monitoring platform: Prometheus + Grafana for open-source self-hosted, Datadog for managed SaaS, VictoriaMetrics for high-cardinality workloads

  • Deploy the monitoring stack: helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack or Docker Compose for non-Kubernetes

  • Install exporters on monitored systems: node-exporter for host metrics, kube-state-metrics for Kubernetes object states, application-specific exporters

  • Configure scrape targets in prometheus.yml: define job names, scrape intervals, and relabeling rules for service discovery

  • Create recording rules for frequently queried aggregations to reduce dashboard query load

  • Define alerting rules with meaningful thresholds: high CPU (>80% for 5m), high memory (>90%), error rate (>1%), latency P99 (>500ms)

  • Configure Alertmanager with routing, grouping, and notification channels (Slack, PagerDuty, email)

  • Build Grafana dashboards: RED metrics (Rate, Errors, Duration) for services, USE metrics (Utilization, Saturation, Errors) for resources

  • Set up data retention: configure TSDB retention period (15-30 days local), set up Thanos/Cortex for long-term storage if needed

  • Test the full pipeline: trigger a test alert and verify notification delivery
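For the Helm deployment step above, a minimal values file might look like the following sketch. The release name, resource sizes, and retention are illustrative assumptions, not values prescribed by this skill:

```yaml
# values.yaml -- illustrative overrides for kube-prometheus-stack
# (all names and sizes are assumptions; adjust to your cluster)
prometheus:
  prometheusSpec:
    retention: 15d              # local TSDB retention; 15-30 days is typical
    scrapeInterval: 30s
    resources:
      requests:
        memory: 2Gi
      limits:
        memory: 4Gi             # raise this if Prometheus gets OOMKilled
grafana:
  adminPassword: change-me      # store in a Secret for real deployments
alertmanager:
  enabled: true
```

Applied with, for example, helm install monitoring prometheus-community/kube-prometheus-stack -f values.yaml.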
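The scrape-target step can be sketched in prometheus.yml as below. The job names, target addresses, and the annotation-based relabeling rule are hypothetical examples of a common convention, not part of this skill's upstream content:

```yaml
scrape_configs:
  - job_name: node
    scrape_interval: 15s
    static_configs:
      - targets: ['node-exporter:9100']   # assumed exporter address
  - job_name: app
    metrics_path: /metrics
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # keep only pods annotated prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
```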
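The recording-rule and alerting-rule steps, using the CPU and error-rate thresholds listed above, might look like this (the rule names and the http_requests_total metric are assumptions about the monitored application):

```yaml
groups:
  - name: recording
    rules:
      # pre-aggregate per-instance CPU so dashboards query one cheap series
      - record: instance:node_cpu_utilisation:rate5m
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
  - name: alerts
    rules:
      - alert: HighCPU
        expr: instance:node_cpu_utilisation:rate5m > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU above 80% on {{ $labels.instance }} for 5 minutes"
      - alert: HighErrorRate
        expr: >
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
```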
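The Alertmanager routing and grouping step can be sketched as follows; the Slack webhook URL and PagerDuty key are placeholders you must supply, and the grouping intervals are typical defaults rather than requirements:

```yaml
route:
  receiver: slack-default
  group_by: [alertname, cluster]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity = "critical"
      receiver: pagerduty
receivers:
  - name: slack-default
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX   # placeholder webhook
        channel: '#alerts'
  - name: pagerduty
    pagerduty_configs:
      - service_key: <pagerduty-integration-key>        # placeholder key
```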

Output

  • Helm values file or Docker Compose for the monitoring stack

  • Prometheus configuration with scrape targets, recording rules, and alerting rules

  • Alertmanager configuration with routing tree and notification receivers

  • Grafana dashboard JSON files for infrastructure and application metrics

  • Exporter deployment manifests (node-exporter DaemonSet, application ServiceMonitor)
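For the ServiceMonitor output listed above, a sketch could look like the following. The application name, labels, and port name are assumptions; the release label must match the Prometheus operator's serviceMonitorSelector:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app                 # hypothetical application name
  labels:
    release: monitoring        # must match the Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: http-metrics       # named Service port exposing /metrics
      interval: 30s
      path: /metrics
```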

Error Handling

  • No data points in dashboard — Cause: scrape target not reachable, or wrong metric name. Solution: check the Targets page in the Prometheus UI; verify service discovery and the metric name.

  • Too many time series (high cardinality) — Cause: labels with unbounded values (user IDs, request IDs). Solution: remove high-cardinality labels with metric_relabel_configs; use recording rules for aggregation.

  • Alert condition met but no notification — Cause: Alertmanager routing or receiver misconfigured. Solution: validate the config with amtool check-config; exercise the routing tree with amtool config routes test.

  • Prometheus OOMKilled — Cause: insufficient memory for the series count. Solution: increase memory limits; reduce scrape targets or retention; enable WAL compression.

  • Grafana datasource connection failed — Cause: wrong Prometheus URL or a network policy blocking access. Solution: verify the datasource URL in Grafana; check the Kubernetes service name and port; review network policies.
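For the high-cardinality problem above, dropping an unbounded label with metric_relabel_configs might look like this; the label name user_id and the target address are hypothetical examples:

```yaml
scrape_configs:
  - job_name: app
    static_configs:
      - targets: ['app:8080']   # assumed application address
    metric_relabel_configs:
      # drop the unbounded user_id label so each metric keeps a bounded series set
      - action: labeldrop
        regex: user_id
```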

Examples

  • "Deploy kube-prometheus-stack on Kubernetes with alerts for node CPU > 80%, pod restart count > 5, and API error rate > 1%, sending to Slack."

  • "Set up Prometheus + Grafana on Docker Compose for monitoring 10 application servers with node-exporter and custom application metrics."

  • "Create Grafana dashboards for the four golden signals (latency, traffic, errors, saturation) for a microservices application."

Resources

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
