monitoring-expert

Expert guidance for monitoring, observability, and alerting using Prometheus, Grafana, logging systems, and distributed tracing.

Install skill "monitoring-expert" with this command: npx skills add personamanagmentlayer/pcl/personamanagmentlayer-pcl-monitoring-expert

Monitoring Expert

Core Concepts

The Three Pillars of Observability

  • Metrics - Numerical measurements over time (Prometheus)

  • Logs - Discrete events (ELK, Loki)

  • Traces - Request flow through distributed systems (Jaeger, Tempo)

Monitoring Fundamentals

  • Golden Signals (Latency, Traffic, Errors, Saturation)

  • RED Method (Rate, Errors, Duration)

  • USE Method (Utilization, Saturation, Errors)

  • Service Level Indicators (SLIs)

  • Service Level Objectives (SLOs)

  • Service Level Agreements (SLAs)
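The relationship between an SLI, an SLO, and the resulting error budget can be made concrete with a little arithmetic. The following is a minimal sketch, assuming a request-based availability SLO; the function name and the returned field names are illustrative and not part of any library.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarise error-budget consumption for a request-based availability SLO.

    slo_target is a fraction such as 0.999 (99.9% of requests must succeed).
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    # SLI: the measured fraction of good requests.
    sli = 1 - failed_requests / total_requests
    # Error budget: how many failures the SLO tolerates over the same window.
    allowed_failures = (1 - slo_target) * total_requests
    return {
        "sli": sli,
        "allowed_failures": allowed_failures,
        "budget_consumed": failed_requests / allowed_failures,
        "slo_met": sli >= slo_target,
    }


report = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
print(f"SLI={report['sli']:.4%}, budget consumed={report['budget_consumed']:.0%}")
```

An SLA would typically be a looser, contractual version of the same target, so the same calculation applies with a different `slo_target`.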

Key Components

  • Metric collection (exporters, agents)

  • Time-series database

  • Visualization (dashboards)

  • Alerting (rules, receivers)

  • Log aggregation

  • Distributed tracing

Prometheus

Installation (Docker)

docker-compose.yml

version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alerts.yml:/etc/prometheus/alerts.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.enable-lifecycle'
      - '--storage.tsdb.retention.time=30d'

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning

  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"
    command:
      - '--path.rootfs=/host'
    volumes:
      - '/:/host:ro,rslave'

  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - alertmanager-data:/alertmanager

volumes:
  prometheus-data:
  grafana-data:
  alertmanager-data:

Prometheus Configuration

prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production'
    region: 'us-east-1'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Load alert rules
rule_files:
  - 'alerts.yml'

# Scrape configurations
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node exporter (system metrics)
  - job_name: 'node'
    static_configs:
      - targets:
          - 'node-exporter:9100'
        labels:
          instance: 'server-1'
          env: 'production'

  # Application metrics
  - job_name: 'app'
    static_configs:
      - targets:
          - 'app-1:8080'
          - 'app-2:8080'
          - 'app-3:8080'
    metrics_path: '/metrics'

  # Kubernetes service discovery
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use the prometheus.io/path annotation as the metrics path
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Rewrite the scrape address to use the prometheus.io/port annotation
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__

  # Blackbox exporter (endpoint monitoring)
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets: []  # list the URLs to probe here
    relabel_configs:
      # Pass the original target to the exporter as the ?target= parameter
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      # Scrape the blackbox exporter itself instead of the probed URL
      - target_label: __address__
        replacement: blackbox-exporter:9115
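The address-rewriting rule in the kubernetes-pods job is easier to follow when run against sample values. The sketch below emulates a `replace` relabel action in Python, assuming the documented Prometheus behaviour: source label values are joined with `;`, the regex must match the whole joined string, and `$1`-style references are expanded in the replacement. The function `relabel_replace` is hypothetical, and Python's `re` module is only a stand-in for the RE2 engine Prometheus uses.

```python
import re


def relabel_replace(source_values, regex, replacement, separator=";"):
    """Emulate a Prometheus `replace` relabel action.

    Returns the new target label value, or None when the regex does not
    match (Prometheus then leaves the target label unchanged).
    """
    joined = separator.join(source_values)
    match = re.fullmatch(regex, joined)  # Prometheus anchors relabel regexes
    if match is None:
        return None
    # Convert $1 / ${1} references into Python's \g<1> form before expanding.
    py_replacement = re.sub(r"\$\{?(\d+)\}?", r"\\g<\1>", replacement)
    return match.expand(py_replacement)


ADDRESS_RULE = r"([^:]+)(?::\d+)?;(\d+)"

# Pod address with a port, plus a prometheus.io/port annotation of 9102
print(relabel_replace(["10.0.0.5:8080", "9102"], ADDRESS_RULE, "$1:$2"))
# Pod without the port annotation: no match, so the address is kept as is
print(relabel_replace(["10.0.0.5:8080", ""], ADDRESS_RULE, "$1:$2"))
```

The first call prints `10.0.0.5:9102` and the second prints `None`, which shows why pods without the port annotation keep their discovered address.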

Alert Rules

alerts.yml

groups:
  - name: app_alerts
    interval: 30s
    rules:
      # High error rate
      # Both sides are aggregated by instance; dividing the raw series would
      # match 5xx series only against themselves and always yield 1.
      - alert: HighErrorRate
        expr: |
          sum by (instance) (rate(http_requests_total{status=~"5.."}[5m]))
            / sum by (instance) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.instance }}"
          description: "Error rate is {{ $value | humanizePercentage }} for 5 minutes"

      # API latency
      - alert: HighAPILatency
        expr: |
          histogram_quantile(0.95,
            rate(http_request_duration_seconds_bucket[5m])
          ) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High API latency on {{ $labels.instance }}"
          description: "95th percentile latency is {{ $value }}s"

      # Service down
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} down"
          description: "{{ $labels.instance }} has been down for 1 minute"

      # High memory usage
      - alert: HighMemoryUsage
        expr: |
          (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
            / node_memory_MemTotal_bytes > 0.90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | humanizePercentage }}"

      # High CPU usage
      - alert: HighCPUUsage
        expr: |
          100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}%"

      # Disk space
      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/"}
            / node_filesystem_size_bytes{mountpoint="/"}) < 0.10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Only {{ $value | humanizePercentage }} disk space remaining"

      # Pod restarts
      # increase() returns a restart count, which is what the description reports;
      # rate() would return restarts per second.
      - alert: PodRestarting
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} is restarting"
          description: "Pod has restarted {{ $value }} times in the last 15 minutes"
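Every rule above carries a `for:` clause, which delays firing until the expression has been true continuously for that long. The following sketch simulates that state machine over a series of rule evaluations. It is a simplification that ignores details such as `keep_firing_for` and staleness, and `alert_states` is a hypothetical helper, not a Prometheus API.

```python
def alert_states(condition_results, eval_interval_s, for_s):
    """Return the alert state after each rule evaluation.

    condition_results: one boolean per evaluation (True = expression returned a result).
    eval_interval_s:   seconds between evaluations (evaluation_interval).
    for_s:             the rule's `for:` duration in seconds.
    """
    states = []
    active_since = None  # time at which the condition first became true
    for i, is_true in enumerate(condition_results):
        now = i * eval_interval_s
        if not is_true:
            # Any false evaluation resets the pending timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_for = now - active_since
        states.append("firing" if held_for >= for_s else "pending")
    return states


# ServiceDown-style rule: evaluated every 30s with for: 1m
print(alert_states([True, True, True, False, True], eval_interval_s=30, for_s=60))
```

The single false evaluation in the middle sends the alert back to inactive, so a flapping target may never reach the firing state.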

PromQL Queries

Request rate

rate(http_requests_total[5m])

Error rate

rate(http_requests_total{status=~"5.."}[5m])

Success rate

sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

P95 latency

histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]) )
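`histogram_quantile` estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The sketch below reproduces that idea for classic histograms in a simplified form. It assumes the lowest bucket starts at 0 and skips several edge cases the real function handles, so treat it as an illustration rather than a reimplementation.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    The bucket list must include the +Inf bucket (math.inf).
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if buckets[-1][0] != math.inf or total == 0:
        return math.nan

    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == math.inf or count == lower_count:
                # No finite upper bound to interpolate towards.
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
print(histogram_quantile(0.95, latency_buckets))
```

Because the result is interpolated, its accuracy depends on how closely the bucket boundaries bracket the latencies that matter, which is worth keeping in mind when choosing buckets.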

Average latency

rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])

CPU usage per pod

rate(container_cpu_usage_seconds_total{pod!=""}[5m])

Memory usage percentage

(container_memory_usage_bytes / container_spec_memory_limit_bytes) * 100

QPS per endpoint

sum by(endpoint) (rate(http_requests_total[5m]))

Top 5 slowest endpoints

topk(5, histogram_quantile(0.95, sum by(endpoint, le) (rate(http_request_duration_seconds_bucket[5m])) ))

Predict disk full in 4 hours

predict_linear(node_filesystem_free_bytes[1h], 4*3600) < 0
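`predict_linear` fits a least-squares line through the samples in the range and extrapolates it. A minimal sketch of that calculation follows. It treats the last sample's timestamp as "now", whereas Prometheus extrapolates from the evaluation time, and the sample data is made up for illustration.

```python
def predict_linear(samples, seconds_ahead):
    """Extrapolate a gauge with simple linear regression.

    samples: list of (timestamp_seconds, value) pairs, oldest first.
    Returns the predicted value seconds_ahead after the last sample.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance == 0:
        raise ValueError("samples must span more than one timestamp")
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)

    slope = covariance / variance          # change in value per second
    intercept = mean_v - slope * mean_t
    return slope * (samples[-1][0] + seconds_ahead) + intercept


# Free space in GiB, sampled every 10 minutes for one hour (hypothetical data)
free_gib = [(i * 600, 100.0 - 10.0 * i) for i in range(7)]
predicted = predict_linear(free_gib, 4 * 3600)
print(predicted, "-> disk predicted to fill" if predicted < 0 else "-> ok")
```

A prediction below zero is what the `< 0` comparison in the query tests for.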

Network I/O

rate(node_network_receive_bytes_total[5m])

rate(node_network_transmit_bytes_total[5m])

Application Instrumentation

Node.js (Express)

// Install: npm install express prom-client express-prom-bundle
import express from 'express';
import promBundle from 'express-prom-bundle';
import { register, Counter, Histogram, Gauge } from 'prom-client';

const app = express();

// Parse JSON request bodies so req.body is populated
app.use(express.json());

// Automatic metrics for all endpoints
const metricsMiddleware = promBundle({
  includeMethod: true,
  includePath: true,
  includeStatusCode: true,
  includeUp: true,
  customLabels: { app: 'myapp' },
  promClient: { collectDefaultMetrics: {} },
});

app.use(metricsMiddleware);

// Custom metrics
const ordersTotal = new Counter({
  name: 'orders_total',
  help: 'Total number of orders',
  labelNames: ['status', 'payment_method'],
});

const orderValue = new Histogram({
  name: 'order_value_dollars',
  help: 'Order value in dollars',
  buckets: [10, 50, 100, 500, 1000, 5000],
});

const activeUsers = new Gauge({
  name: 'active_users',
  help: 'Number of active users',
});

// Use metrics in your code
// createOrder is your own application function
app.post('/orders', async (req, res) => {
  const order = await createOrder(req.body);

  ordersTotal.inc({ status: 'created', payment_method: order.paymentMethod });
  orderValue.observe(order.total);

  res.json(order);
});

// Expose metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

app.listen(8080, () => {
  console.log('Server running on :8080');
  console.log('Metrics available at http://localhost:8080/metrics');
});

Python (Flask)

# Install: pip install flask prometheus-flask-exporter

from flask import Flask, jsonify, request
from prometheus_flask_exporter import PrometheusMetrics
from prometheus_client import Counter, Histogram, Gauge

app = Flask(__name__)
metrics = PrometheusMetrics(app)

# Custom metrics
orders_total = Counter(
    'orders_total',
    'Total number of orders',
    ['status', 'payment_method']
)

order_value = Histogram(
    'order_value_dollars',
    'Order value in dollars',
    buckets=[10, 50, 100, 500, 1000, 5000]
)

active_users = Gauge(
    'active_users',
    'Number of active users'
)

@app.route('/orders', methods=['POST'])
def create_order():
    # process_order is your own application function
    order = process_order(request.json)

    orders_total.labels(
        status='created',
        payment_method=order['payment_method']
    ).inc()

    order_value.observe(order['total'])

    return jsonify(order)

@app.route('/health')
def health():
    return {'status': 'healthy'}

if __name__ == '__main__':
    # Metrics available at /metrics
    app.run(host='0.0.0.0', port=8080)

Go

package main

import (
    "encoding/json"
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    ordersTotal = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "orders_total",
            Help: "Total number of orders",
        },
        []string{"status", "payment_method"},
    )

    orderValue = promauto.NewHistogram(
        prometheus.HistogramOpts{
            Name:    "order_value_dollars",
            Help:    "Order value in dollars",
            Buckets: []float64{10, 50, 100, 500, 1000, 5000},
        },
    )

    activeUsers = promauto.NewGauge(
        prometheus.GaugeOpts{
            Name: "active_users",
            Help: "Number of active users",
        },
    )
)

func createOrderHandler(w http.ResponseWriter, r *http.Request) {
    // processOrder is your own application function
    order := processOrder(r.Body)

    ordersTotal.WithLabelValues(
        "created",
        order.PaymentMethod,
    ).Inc()

    orderValue.Observe(order.Total)

    json.NewEncoder(w).Encode(order)
}

func main() {
    http.HandleFunc("/orders", createOrderHandler)
    http.Handle("/metrics", promhttp.Handler())

    log.Fatal(http.ListenAndServe(":8080", nil))
}

Alertmanager

Configuration

alertmanager.yml

global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'

route:
  receiver: 'default'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    # Critical alerts to PagerDuty
    - match:
        severity: critical
      receiver: pagerduty
      continue: true

    # Warning alerts to Slack
    - match:
        severity: warning
      receiver: slack

    # Database alerts
    - match_re:
        service: database
      receiver: database-team

receivers:
  - name: 'default'
    email_configs: []  # add a recipient ("to:") entry here

  - name: 'slack'
    slack_configs:
      - channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
        send_resolved: true

  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'
        description: '{{ .GroupLabels.alertname }}'

  - name: 'database-team'
    slack_configs: []  # add the team's channel entry here

inhibit_rules:
  # Suppress warning if critical alert is firing
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
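The inhibition rule is easier to reason about as a predicate over label sets. The sketch below mirrors the documented semantics: a target alert is muted when some firing source alert matches `source_match` and both alerts carry the same values for every label listed in `equal`, with a missing label treated the same as an empty one. The function `is_inhibited` is a hypothetical helper, not part of Alertmanager.

```python
def is_inhibited(target, firing_alerts, source_match, target_match, equal):
    """Return True when `target` would be muted by an inhibit rule.

    Alerts are plain dicts of label name -> label value.
    """
    def matches(labels, wanted):
        return all(labels.get(name) == value for name, value in wanted.items())

    if not matches(target, target_match):
        return False
    for source in firing_alerts:
        if source is target:
            continue  # an alert never inhibits itself
        same_labels = all(source.get(name, "") == target.get(name, "") for name in equal)
        if matches(source, source_match) and same_labels:
            return True
    return False


critical = {"alertname": "HighErrorRate", "instance": "app-1", "severity": "critical"}
warning = {"alertname": "HighErrorRate", "instance": "app-1", "severity": "warning"}
other = {"alertname": "HighErrorRate", "instance": "app-2", "severity": "warning"}

rule = dict(
    source_match={"severity": "critical"},
    target_match={"severity": "warning"},
    equal=["alertname", "instance"],
)
print(is_inhibited(warning, [critical, warning, other], **rule))  # same instance
print(is_inhibited(other, [critical, warning, other], **rule))    # different instance
```

Because `alertname` is in `equal`, this rule only mutes a warning when a critical alert with the same name is firing on the same instance. A warning-level HighCPUUsage is not muted by a critical HighErrorRate.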

Grafana

Dashboard Configuration (JSON)

{
  "dashboard": {
    "title": "Application Metrics",
    "tags": ["app", "production"],
    "timezone": "browser",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "gridPos": { "x": 0, "y": 0, "w": 12, "h": 8 },
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (status)",
            "legendFormat": "{{ status }}"
          }
        ]
      },
      {
        "title": "P95 Latency",
        "type": "graph",
        "gridPos": { "x": 12, "y": 0, "w": 12, "h": 8 },
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
            "legendFormat": "p95"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "stat",
        "gridPos": { "x": 0, "y": 8, "w": 6, "h": 4 },
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percentunit",
            "thresholds": {
              "steps": [
                { "value": 0, "color": "green" },
                { "value": 0.01, "color": "yellow" },
                { "value": 0.05, "color": "red" }
              ]
            }
          }
        }
      }
    ]
  }
}

Provisioning Data Sources

grafana/provisioning/datasources/prometheus.yml

apiVersion: 1

datasources:

Logging with Loki

Loki Configuration

loki-config.yml

auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
  chunk_idle_period: 5m
  chunk_retain_period: 30s

schema_config:
  configs:
    - from: 2020-05-15
      store: boltdb
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 168h

storage_config:
  boltdb:
    directory: /tmp/loki/index
  filesystem:
    directory: /tmp/loki/chunks

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h

Promtail Configuration

promtail-config.yml

server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients: []  # add a client entry with the Loki push url here

scrape_configs:
  # Application logs
  - job_name: app
    static_configs:
      - targets:
          - localhost
        labels:
          job: app
          __path__: /var/log/app/*.log

  # Docker logs
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
    relabel_configs:
      - source_labels: ['__meta_docker_container_name']
        target_label: 'container'

  # Kubernetes logs
  - job_name: kubernetes
    kubernetes_sd_configs:
      - role: pod
    pipeline_stages:
      - docker: {}
    relabel_configs:
      - source_labels:
          - __meta_kubernetes_pod_name
        target_label: pod
      - source_labels:
          - __meta_kubernetes_namespace
        target_label: namespace

LogQL Queries

All logs for a job

{job="app"}

Filter by level

{job="app"} |= "error"

JSON parsing

{job="app"} | json | level="error"

Rate of errors

rate({job="app"} |= "error" [5m])

Count by pod

sum by (pod) (count_over_time({namespace="production"}[5m]))

Extract and filter

{job="app"} | json | line_format "{{.timestamp}} {{.level}} {{.message}}" | level="error"

Metrics from logs

sum(rate({job="app"} |= "status=500" [5m])) by (endpoint)

Distributed Tracing

Jaeger Setup

docker-compose.yml

services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "5775:5775/udp"
      - "6831:6831/udp"
      - "6832:6832/udp"
      - "5778:5778"
      - "16686:16686"  # UI
      - "14268:14268"  # Collector
      - "9411:9411"    # Zipkin compatible
    environment:
      - COLLECTOR_ZIPKIN_HTTP_PORT=9411

Application Instrumentation (Node.js)

// Install: npm install jaeger-client opentracing
import { initTracer } from 'jaeger-client';

const config = {
  serviceName: 'my-app',
  sampler: {
    type: 'probabilistic',
    param: 1.0, // Sample 100% of traces
  },
  reporter: {
    logSpans: true,
    agentHost: 'localhost',
    agentPort: 6831,
  },
};

const tracer = initTracer(config);

// Trace HTTP request
// app, db, and fetchUserProfile come from your own application
app.get('/api/users/:id', async (req, res) => {
  const span = tracer.startSpan('get_user');
  span.setTag('user_id', req.params.id);

  try {
    // Database query
    const dbSpan = tracer.startSpan('db_query', { childOf: span });
    const user = await db.user.findById(req.params.id);
    dbSpan.finish();

    // External API call
    const apiSpan = tracer.startSpan('external_api', { childOf: span });
    const profile = await fetchUserProfile(user.id);
    apiSpan.finish();

    span.setTag('http.status_code', 200);
    res.json({ user, profile });
  } catch (error) {
    span.setTag('error', true);
    span.setTag('http.status_code', 500);
    span.log({ event: 'error', message: error.message });
    res.status(500).json({ error: error.message });
  } finally {
    span.finish();
  }
});
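The parent/child relationship behind `childOf` can be illustrated without any tracing backend. The following toy tracer is a sketch only: `ToyTracer` and its record fields are hypothetical and unrelated to the Jaeger client API. It shows how a context manager links child spans to their parent and guarantees that every span is finished and tagged, even when the wrapped code raises.

```python
import time
import uuid
from contextlib import contextmanager


class ToyTracer:
    """Collects finished spans in memory; a stand-in for a real tracer."""

    def __init__(self):
        self.finished = []

    @contextmanager
    def span(self, name, child_of=None):
        record = {
            "name": name,
            # Children inherit the trace ID and point at the parent's span ID.
            "trace_id": child_of["trace_id"] if child_of else uuid.uuid4().hex,
            "span_id": uuid.uuid4().hex[:16],
            "parent_id": child_of["span_id"] if child_of else None,
            "tags": {},
        }
        start = time.perf_counter()
        try:
            yield record
        except Exception as exc:
            record["tags"]["error"] = True
            record["tags"]["error.message"] = str(exc)
            raise
        finally:
            # Runs on success and failure alike, so no span is left open.
            record["duration_s"] = time.perf_counter() - start
            self.finished.append(record)


tracer = ToyTracer()
with tracer.span("get_user") as root:
    root["tags"]["user_id"] = "42"
    with tracer.span("db_query", child_of=root):
        time.sleep(0.01)  # pretend to query the database

for s in tracer.finished:
    print(s["name"], "parent:", s["parent_id"], f"{s['duration_s'] * 1000:.1f} ms")
```

Child spans finish before their parent, so `db_query` is reported first and its `parent_id` equals the `span_id` of `get_user`.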

Best Practices

Metric Naming

  • Use descriptive names: http_requests_total not requests

  • Use units in name: duration_seconds, bytes_total

  • Use _total suffix for counters

  • Do not add _bucket yourself: histograms expose _bucket, _sum, and _count series automatically

  • Use consistent label names
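Naming conventions like these are simple to check mechanically, for example in a CI step. The sketch below is a hypothetical linter, not an official tool. It checks the classic Prometheus metric-name character set, the `_total` convention for counters, and the suffixes that histogram client libraries generate themselves.

```python
import re

# Classic metric-name character set from the Prometheus data model.
METRIC_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")
GENERATED_SUFFIXES = ("_bucket", "_sum", "_count")


def lint_metric_name(name: str, metric_type: str) -> list[str]:
    """Return a list of naming problems (empty when the name looks fine)."""
    problems = []
    if not METRIC_NAME_RE.fullmatch(name):
        problems.append("name contains characters outside [a-zA-Z0-9_:]")
    if metric_type == "counter" and not name.endswith("_total"):
        problems.append("counters should end in _total")
    if metric_type != "counter" and name.endswith("_total"):
        problems.append("_total is reserved for counters")
    if metric_type == "histogram" and name.endswith(GENERATED_SUFFIXES):
        problems.append("histogram suffixes are generated by the client library")
    return problems


for name, kind in [
    ("http_requests_total", "counter"),
    ("requests", "counter"),
    ("http_request_duration_seconds_bucket", "histogram"),
]:
    print(name, "->", lint_metric_name(name, kind) or "ok")
```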

Cardinality

  • Avoid high-cardinality labels (user IDs, emails)

  • Use bounded label values

  • Aggregate when possible

  • Monitor metric count
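The cost of a high-cardinality label follows from multiplication: the worst-case number of time series for one metric is the product of the distinct values of each label. A rough estimator, assuming every label combination actually occurs (a deliberately pessimistic upper bound) and using a hypothetical function name:

```python
from math import prod


def estimate_series(label_cardinalities, finite_buckets=None):
    """Worst-case number of time series for a single metric.

    label_cardinalities: dict of label name -> number of distinct values.
    finite_buckets:      for histograms, the number of configured buckets;
                         leave as None for counters and gauges.
    """
    combinations = prod(label_cardinalities.values())  # empty dict -> 1
    if finite_buckets is None:
        return combinations
    # Each combination yields the finite buckets, +Inf, _sum, and _count.
    return combinations * (finite_buckets + 3)


bounded = {"status": 5, "method": 4, "endpoint": 20}
print(estimate_series(bounded))                          # 400
print(estimate_series({**bounded, "user_id": 100_000}))  # 40000000
print(estimate_series({"endpoint": 20}, finite_buckets=6))  # 180
```

Adding a single `user_id` label turns 400 series into 40 million, which is why unbounded values belong in logs or traces rather than in metric labels.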

Alert Design

  • Alert on symptoms, not causes

  • Set appropriate thresholds

  • Include actionable annotations

  • Group related alerts

  • Use inhibition rules

Dashboard Design

  • One purpose per dashboard

  • Use consistent time ranges

  • Include SLOs/SLIs

  • Add context with annotations

  • Use appropriate visualization types

Anti-Patterns to Avoid

❌ No SLOs: Define service level objectives

❌ Alert fatigue: Too many non-actionable alerts

❌ High cardinality: Labels with unbounded values

❌ Missing instrumentation: Instrument all critical paths

❌ No runbooks: Alerts should have clear remediation steps

❌ Ignoring trends: Monitor trends, not just current values

❌ No log structure: Use structured logging (JSON)

❌ Missing context: Include relevant labels and tags
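For the structured logging point, one JSON object per line is what lets a LogQL pipeline such as `{job="app"} | json | level="error"` filter on fields. A minimal sketch using only the Python standard library follows. The `context` key passed through `extra` is a convention invented for this example, and the field names are chosen to match the `timestamp`, `level`, and `message` fields used in the LogQL queries.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname.lower(),
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Hypothetical convention: structured fields arrive via extra={"context": {...}}
        payload.update(getattr(record, "context", {}))
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(name: str, stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = build_logger("app", buffer)
log.error("payment failed", extra={"context": {"order_id": "o-123", "status": 500}})
print(buffer.getvalue(), end="")
```

In a real service the stream would be stdout or a file that Promtail tails. Keep high-cardinality values such as order IDs in the log body rather than promoting them to Loki labels.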
