
Observability & Monitoring

A comprehensive skill for implementing production-grade observability and monitoring using Prometheus, Grafana, and the wider cloud-native monitoring ecosystem. This skill covers metrics collection, time-series analysis, alerting, visualization, and operational excellence patterns.

When to Use This Skill

Use this skill when:

  • Setting up monitoring for production systems and applications

  • Implementing metrics collection and observability for microservices

  • Creating dashboards and visualizations for system health monitoring

  • Defining alerting rules and incident response automation

  • Analyzing system performance and capacity using time-series data

  • Implementing SLIs, SLOs, and SLAs for service reliability

  • Debugging production issues using metrics and traces

  • Building custom exporters for application-specific metrics

  • Setting up federation for multi-cluster monitoring

  • Migrating from legacy monitoring to cloud-native solutions

  • Implementing cost monitoring and optimization tracking

  • Creating real-time operational dashboards for DevOps teams

Core Concepts

The Four Pillars of Observability

Modern observability is built on four fundamental pillars:

Metrics: Numerical measurements of system behavior over time

  • Counter: Monotonically increasing values (requests served, errors)

  • Gauge: Point-in-time values that go up and down (memory usage, temperature)

  • Histogram: Distribution of values (request duration buckets)

  • Summary: Similar to histogram, but quantiles are calculated on the client side (all four types are sketched in code below)

Logs: Discrete events with contextual information

  • Structured logging (JSON, key-value pairs)

  • Centralized log aggregation (ELK, Loki)

  • Correlation with metrics and traces

Traces: Request flow through distributed systems

  • Span: Single unit of work with start/end time

  • Trace: Collection of spans representing end-to-end request

  • OpenTelemetry for distributed tracing

Events: Significant occurrences in system lifecycle

  • Deployments, configuration changes

  • Scaling events, incidents

  • Business events and user actions
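
The four metric types map one-to-one onto client-library primitives. A minimal sketch with the official Python client, prometheus_client (metric names here are illustrative):

from prometheus_client import Counter, Gauge, Histogram, Summary

# Counter: only ever goes up
REQUESTS = Counter('http_requests_total', 'Total HTTP requests', ['method'])
REQUESTS.labels(method='GET').inc()

# Gauge: can go up and down
IN_FLIGHT = Gauge('http_in_flight_requests', 'Requests currently in flight')
IN_FLIGHT.inc()
IN_FLIGHT.dec()

# Histogram: buckets observations server-side; quantiles computed later in PromQL
LATENCY = Histogram('http_request_duration_seconds', 'Request latency')
LATENCY.observe(0.42)

# Summary: tracks count and sum (the Python client does not expose quantiles)
PAYLOAD = Summary('http_request_size_bytes', 'Request payload size')
PAYLOAD.observe(512)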

Prometheus Architecture

Prometheus is a pull-based monitoring system with key components:

Time-Series Database (TSDB)

  • Stores metrics as time-series data

  • Efficient compression and retention policies

  • Local storage with optional remote storage

Scrape Targets

  • Service discovery (Kubernetes, Consul, EC2, etc.)

  • Static configuration

  • Relabeling for flexible target selection

PromQL Query Engine

  • Powerful query language for metrics analysis

  • Aggregation, filtering, and mathematical operations

  • Range vectors and instant vectors

Alertmanager

  • Alert rule evaluation

  • Grouping, silencing, and routing

  • Integration with PagerDuty, Slack, email, webhooks

Exporters

  • Bridge between Prometheus and systems

  • Node exporter, cAdvisor, custom exporters

  • Third-party exporters for databases, services

Metric Labels and Cardinality

Labels are key-value pairs attached to metrics:

http_requests_total{method="GET", endpoint="/api/users", status="200"}

Label Best Practices:

  • Use labels for dimensions you query/aggregate by

  • Avoid high-cardinality labels (user IDs, timestamps)

  • Keep label names consistent across metrics

  • Use relabeling to normalize external labels

Cardinality Considerations:

  • Each unique label combination = new time-series

  • High cardinality = increased memory and storage

  • Monitor cardinality with prometheus_tsdb_symbol_table_size_bytes (see the inspection queries after this list)

  • Use recording rules to pre-aggregate high-cardinality metrics
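
A quick way to find cardinality offenders is to query the TSDB itself:

# Ten metric names with the most time-series
topk(10, count by (__name__)({__name__=~".+"}))

# Total number of active series
count({__name__=~".+"})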

Recording Rules

Pre-compute frequently-used or expensive queries:

groups:
  - name: api_performance
    interval: 30s
    rules:
      - record: api:http_request_duration_seconds:p99
        expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
      - record: api:http_requests:rate5m
        expr: rate(http_requests_total[5m])

Benefits:

  • Faster dashboard loading

  • Reduced query load on Prometheus

  • Consistent metric naming conventions

  • Enable complex aggregations

Service Level Objectives (SLOs)

Define and track reliability targets:

SLI (Service Level Indicator): Metric measuring service quality

  • Availability: % of successful requests

  • Latency: % of requests under threshold

  • Throughput: Requests per second

SLO (Service Level Objective): Target for SLI

  • 99.9% availability (43.8 minutes downtime/month)

  • 95% of requests < 200ms

  • 1000 RPS sustained

SLA (Service Level Agreement): Contract with consequences

  • External commitments to customers

  • Financial penalties for SLO violations

Error Budget: Acceptable failure rate

  • Error budget = 100% - SLO

  • 99.9% SLO = 0.1% error budget (worked example below)

  • Use budget for innovation vs. reliability tradeoff
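
Worked example: a 30-day window contains 30 × 24 × 60 = 43,200 minutes, so a 99.9% SLO leaves 0.001 × 43,200 = 43.2 minutes of full-outage budget (the 43.8-minute figure above uses the 30.44-day average month).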

Prometheus Setup and Configuration

Basic Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s          # Default scrape interval
  evaluation_interval: 15s      # Alert rule evaluation interval
  external_labels:
    cluster: 'production'
    region: 'us-west-2'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Load rules
rule_files:
  - 'rules/*.yml'
  - 'alerts/*.yml'

# Scrape configurations
scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node exporter for system metrics
  - job_name: 'node'
    static_configs:
      - targets:
          - 'node1:9100'
          - 'node2:9100'
          - 'node3:9100'
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: '([^:]+):.*'
        replacement: '${1}'

  # Application metrics
  - job_name: 'api'
    static_configs:
      - targets: ['api-1:8080', 'api-2:8080', 'api-3:8080']
        labels:
          env: 'production'
          tier: 'backend'

Kubernetes Service Discovery

scrape_configs:
  # Kubernetes API server
  - job_name: 'kubernetes-apiservers'
    kubernetes_sd_configs:
      - role: endpoints
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https

  # Kubernetes pods with prometheus.io annotations
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods with prometheus.io/scrape: "true" annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use the port from prometheus.io/port annotation
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        target_label: __address__
        replacement: ${1}:${2}
      # Use the path from prometheus.io/path annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
      # Add namespace label
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      # Add pod name label
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name

  # Kubernetes services (probed via blackbox exporter)
  - job_name: 'kubernetes-services'
    kubernetes_sd_configs:
      - role: service
    metrics_path: /probe
    params:
      module: [http_2xx]
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
        action: keep
        regex: true
      - source_labels: [__address__]
        target_label: __param_target
      - target_label: __address__
        replacement: blackbox-exporter:9115
      - source_labels: [__param_target]
        target_label: instance
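
The kubernetes-pods job above keys on a per-pod annotation convention; a workload opts in through its pod template metadata:

# Pod template metadata matched by the relabel rules above
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"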

Storage and Retention

# Storage configuration (retention maps to the --storage.tsdb.retention.*
# command-line flags; it is not settable from prometheus.yml itself)
storage:
  tsdb:
    path: /prometheus/data
    retention.time: 15d
    retention.size: 50GB

# Remote write for long-term storage
remote_write:
  - url: "https://prometheus-remote-storage.example.com/api/v1/write"
    basic_auth:
      username: prometheus
      password_file: /etc/prometheus/remote_storage_password
    queue_config:
      capacity: 10000
      max_shards: 50
      max_samples_per_send: 5000
    write_relabel_configs:
      # Drop high-cardinality metrics
      - source_labels: [__name__]
        regex: 'container_network_.*'
        action: drop

# Remote read for querying historical data
remote_read:
  - url: "https://prometheus-remote-storage.example.com/api/v1/read"   # read endpoint assumed to mirror the write URL

PromQL: The Prometheus Query Language

Instant Vectors and Selectors

Basic metric selection

http_requests_total

Filter by label

http_requests_total{job="api", status="200"}

Regex matching

http_requests_total{status=~"2..|3.."}

Negative matching

http_requests_total{status!="500"}

Multiple label matchers

http_requests_total{job="api", method="GET", status=~"2.."}

Range Vectors and Aggregations

5-minute range vector

http_requests_total[5m]

Rate of increase per second

rate(http_requests_total[5m])

Increase over time window

increase(http_requests_total[1h])

Average over time

avg_over_time(cpu_usage[5m])

Max/Min over time

max_over_time(response_time_seconds[10m])
min_over_time(response_time_seconds[10m])

Standard deviation

stddev_over_time(response_time_seconds[5m])

Aggregation Operators

Sum across all instances

sum(rate(http_requests_total[5m]))

Sum grouped by job

sum by (job) (rate(http_requests_total[5m]))

Average grouped by multiple labels

avg by (job, instance) (cpu_usage)

Count number of series

count(up == 1)

Topk and bottomk

topk(5, rate(http_requests_total[5m]))
bottomk(3, node_memory_MemAvailable_bytes)

Quantile across instances

quantile(0.95, http_request_duration_seconds)

Mathematical Operations

Arithmetic operations

(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100

Comparison operators

http_request_duration_seconds > 0.5

Logical operators

up == 1 and rate(http_requests_total[5m]) > 100

Vector matching

rate(http_requests_total[5m]) / on(instance) group_left rate(http_responses_total[5m])
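
group_left is easiest to see with an info-style metric: the left-hand (many) side keeps its series and copies the requested labels from the right-hand (one) side. A sketch using node exporter's constant-1 build-info metric:

avg by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))
  * on(instance) group_left(version)
node_exporter_build_info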

Advanced PromQL Patterns

Request success rate

sum(rate(http_requests_total{status=~"2.."}[5m])) / sum(rate(http_requests_total[5m])) * 100

Error rate percentage

sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100

Latency percentiles (histogram)

histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])) )

Predict linear growth

predict_linear(node_filesystem_free_bytes[1h], 4 * 3600)

Detect anomalies with standard deviation

abs(cpu_usage - avg_over_time(cpu_usage[1h])) > 3 * stddev_over_time(cpu_usage[1h])

Calculate per-core CPU utilization (USE method)

sum(rate(node_cpu_seconds_total{mode!="idle"}[5m])) by (instance) / count(node_cpu_seconds_total{mode="idle"}) by (instance)

Burn rate for SLO

(
  sum(rate(http_requests_total{status=~"5.."}[1h]))
  /
  sum(rate(http_requests_total[1h]))
)
> (14.4 * (1 - 0.999))  # For 99.9% SLO

Alerting with Prometheus and Alertmanager

Alert Rule Definitions

# alerts/api_alerts.yml
groups:
  - name: api_alerts
    interval: 30s
    rules:
      # High error rate alert
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)
          > 0.05
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }} on {{ $labels.service }}"
          runbook_url: "https://runbooks.example.com/HighErrorRate"

      # High latency alert
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 1
        for: 10m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "High latency on {{ $labels.service }}"
          description: "P99 latency is {{ $value }}s on {{ $labels.service }}"

      # Service down alert
      - alert: ServiceDown
        expr: up{job="api"} == 0
        for: 2m
        labels:
          severity: critical
          team: sre
        annotations:
          summary: "Service {{ $labels.instance }} is down"
          description: "{{ $labels.job }} on {{ $labels.instance }} has been down for more than 2 minutes"

      # Disk space alert
      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10
        for: 5m
        labels:
          severity: warning
          team: sre
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk space is {{ $value | humanize }}% on {{ $labels.instance }}"

      # Memory pressure alert
      - alert: HighMemoryUsage
        expr: |
          (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 10m
        labels:
          severity: warning
          team: sre
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | humanize }}% on {{ $labels.instance }}"

      # CPU saturation alert
      - alert: HighCPUUsage
        expr: |
          100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 15m
        labels:
          severity: warning
          team: sre
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value | humanize }}% on {{ $labels.instance }}"

Alertmanager Configuration

# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

# Templates for notifications
templates:
  - '/etc/alertmanager/templates/*.tmpl'

# Route tree for alert distribution
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'team-default'

  routes:
    # Critical alerts go to PagerDuty
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      continue: true

    # Critical alerts also go to Slack
    - match:
        severity: critical
      receiver: 'slack-critical'
      group_wait: 0s

    # Warning alerts to Slack only
    - match:
        severity: warning
      receiver: 'slack-warnings'

    # Team-specific routing
    - match:
        team: backend
      receiver: 'team-backend'

    - match:
        team: frontend
      receiver: 'team-frontend'

    # Database alerts to DBA team
    - match_re:
        service: 'postgres|mysql|mongodb'
      receiver: 'team-dba'

# Alert receivers/integrations
receivers:
  - name: 'team-default'
    slack_configs:
      - channel: '#alerts'
        title: 'Alert: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
        send_resolved: true

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'
        description: '{{ .GroupLabels.alertname }}: {{ .GroupLabels.service }}'
        severity: '{{ .CommonLabels.severity }}'

  - name: 'slack-critical'
    slack_configs:
      - channel: '#incidents'
        title: 'CRITICAL: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}\n{{ .Annotations.description }}{{ end }}'
        color: 'danger'
        send_resolved: true

  - name: 'slack-warnings'
    slack_configs:
      - channel: '#monitoring'
        title: 'Warning: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
        color: 'warning'
        send_resolved: true

  - name: 'team-backend'
    slack_configs:
      - channel: '#team-backend'
        send_resolved: true

  - name: 'team-frontend'
    slack_configs:
      - channel: '#team-frontend'
        send_resolved: true

  - name: 'team-dba'
    slack_configs:
      - channel: '#team-dba'
        send_resolved: true
    pagerduty_configs:
      - service_key: 'DBA_PAGERDUTY_KEY'

# Inhibition rules (suppress alerts)
inhibit_rules:
  # Inhibit warnings if a critical alert is firing
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']

  # Don't alert on instance down if the cluster is down
  - source_match:
      alertname: 'ClusterDown'
    target_match_re:
      alertname: 'InstanceDown|ServiceDown'
    equal: ['cluster']

Multi-Window Multi-Burn-Rate Alerts for SLOs

# SLO-based alerting using burn rate
groups:
  - name: slo_alerts
    interval: 30s
    rules:
      # Fast burn (1h window, 5m burn)
      - alert: ErrorBudgetBurnFast
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          ) > (14.4 * (1 - 0.999))
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > (14.4 * (1 - 0.999))
        for: 2m
        labels:
          severity: critical
          slo: "99.9%"
        annotations:
          summary: "Fast error budget burn detected"
          description: "Error rate is burning through the 99.9% SLO budget 14.4x faster than normal"

      # Slow burn (6h window, 30m burn)
      - alert: ErrorBudgetBurnSlow
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[6h]))
            /
            sum(rate(http_requests_total[6h]))
          ) > (6 * (1 - 0.999))
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[30m]))
            /
            sum(rate(http_requests_total[30m]))
          ) > (6 * (1 - 0.999))
        for: 15m
        labels:
          severity: warning
          slo: "99.9%"
        annotations:
          summary: "Slow error budget burn detected"
          description: "Error rate is burning through the 99.9% SLO budget 6x faster than normal"
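
The multipliers follow the multiwindow, multi-burn-rate recipe from the Google SRE Workbook: a burn rate of 14.4 consumes 2% of a 30-day error budget in one hour (0.02 × 30 × 24 = 14.4), while a burn rate of 6 consumes 5% of the budget in six hours (0.05 × 30 × 24 / 6 = 6). Pairing a long window with a short one keeps alerts both fast and resistant to brief spikes.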

Grafana Dashboards and Visualization

Dashboard JSON Structure

{ "dashboard": { "title": "API Performance Dashboard", "tags": ["api", "performance", "production"], "timezone": "browser", "editable": true, "graphTooltip": 1, "time": { "from": "now-6h", "to": "now" }, "timepicker": { "refresh_intervals": ["5s", "10s", "30s", "1m", "5m", "15m"], "time_options": ["5m", "15m", "1h", "6h", "12h", "24h", "7d"] }, "templating": { "list": [ { "name": "cluster", "type": "query", "datasource": "Prometheus", "query": "label_values(up, cluster)", "refresh": 1, "multi": false, "includeAll": false }, { "name": "service", "type": "query", "datasource": "Prometheus", "query": "label_values(up{cluster="$cluster"}, service)", "refresh": 1, "multi": true, "includeAll": true }, { "name": "interval", "type": "interval", "query": "1m,5m,10m,30m,1h", "auto": true, "auto_count": 30, "auto_min": "10s" } ] }, "panels": [ { "id": 1, "title": "Request Rate", "type": "graph", "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0}, "targets": [ { "expr": "sum(rate(http_requests_total{service="$service"}[$interval])) by (service)", "legendFormat": "{{ service }}", "refId": "A" } ], "yaxes": [ {"format": "reqps", "label": "Requests/sec"}, {"format": "short"} ], "legend": { "show": true, "values": true, "current": true, "avg": true, "max": true } }, { "id": 2, "title": "Error Rate", "type": "graph", "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0}, "targets": [ { "expr": "sum(rate(http_requests_total{service="$service",status="5.."}[$interval])) by (service) / sum(rate(http_requests_total{service="$service"}[$interval])) by (service) * 100", "legendFormat": "{{ service }} error %", "refId": "A" } ], "yaxes": [ {"format": "percent", "label": "Error Rate"}, {"format": "short"} ], "alert": { "conditions": [ { "evaluator": {"params": [5], "type": "gt"}, "operator": {"type": "and"}, "query": {"params": ["A", "5m", "now"]}, "reducer": {"params": [], "type": "avg"}, "type": "query" } ], "executionErrorState": "alerting", "frequency": "1m", "handler": 1, "name": "High Error Rate", "noDataState": "no_data", "notifications": [] } }, { "id": 3, "title": "Latency Percentiles", "type": "graph", "gridPos": {"h": 8, "w": 24, "x": 0, "y": 8}, "targets": [ { "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service="$service"}[$interval])) by (service, le))", "legendFormat": "{{ service }} p99", "refId": "A" }, { "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service="$service"}[$interval])) by (service, le))", "legendFormat": "{{ service }} p95", "refId": "B" }, { "expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{service=~"$service"}[$interval])) by (service, le))", "legendFormat": "{{ service }} p50", "refId": "C" } ], "yaxes": [ {"format": "s", "label": "Duration"}, {"format": "short"} ] } ] } }

RED Method Dashboard

The RED method focuses on Request rate, Error rate, and Duration:

{ "panels": [ { "title": "Request Rate (per service)", "targets": [ { "expr": "sum(rate(http_requests_total[$__rate_interval])) by (service)" } ] }, { "title": "Error Rate % (per service)", "targets": [ { "expr": "sum(rate(http_requests_total{status=~"5.."}[$__rate_interval])) by (service) / sum(rate(http_requests_total[$__rate_interval])) by (service) * 100" } ] }, { "title": "Duration p99 (per service)", "targets": [ { "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[$__rate_interval])) by (service, le))" } ] } ] }

USE Method Dashboard

The USE method monitors Utilization, Saturation, and Errors:

{ "panels": [ { "title": "CPU Utilization %", "targets": [ { "expr": "100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[$__rate_interval])) * 100)" } ] }, { "title": "CPU Saturation (Load Average)", "targets": [ { "expr": "node_load1" } ] }, { "title": "Memory Utilization %", "targets": [ { "expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100" } ] }, { "title": "Disk I/O Utilization %", "targets": [ { "expr": "rate(node_disk_io_time_seconds_total[$__rate_interval]) * 100" } ] }, { "title": "Network Errors", "targets": [ { "expr": "rate(node_network_receive_errs_total[$__rate_interval]) + rate(node_network_transmit_errs_total[$__rate_interval])" } ] } ] }

Exporters and Metric Collection

Node Exporter for System Metrics

# Install node_exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xvfz node_exporter-1.6.1.linux-amd64.tar.gz
cd node_exporter-1.6.1.linux-amd64
./node_exporter --web.listen-address=":9100" \
  --collector.filesystem.mount-points-exclude="^/(dev|proc|sys|var/lib/docker/.+)($|/)" \
  --collector.netclass.ignored-devices="^(veth.*|br.*|docker.*|lo)$"

Key Metrics from Node Exporter:

  • node_cpu_seconds_total: CPU usage by mode

  • node_memory_MemTotal_bytes, node_memory_MemAvailable_bytes: Memory

  • node_disk_io_time_seconds_total: Disk I/O

  • node_network_receive_bytes_total, node_network_transmit_bytes_total: Network

  • node_filesystem_size_bytes, node_filesystem_avail_bytes: Disk space

Custom Application Exporter (Python)

# app_exporter.py
from prometheus_client import start_http_server, Counter, Gauge, Histogram, Summary
import time
import random

Define metrics

REQUEST_COUNT = Counter(
    'app_requests_total',
    'Total app requests',
    ['method', 'endpoint', 'status']
)

REQUEST_DURATION = Histogram(
    'app_request_duration_seconds',
    'Request duration in seconds',
    ['method', 'endpoint'],
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 2.5, 5.0, 10.0]
)

ACTIVE_USERS = Gauge('app_active_users', 'Number of active users')

QUEUE_SIZE = Gauge('app_queue_size', 'Current queue size', ['queue_name'])

DATABASE_CONNECTIONS = Gauge(
    'app_database_connections',
    'Number of database connections',
    ['pool', 'state']
)

CACHE_HITS = Counter('app_cache_hits_total', 'Total cache hits', ['cache_name'])

CACHE_MISSES = Counter('app_cache_misses_total', 'Total cache misses', ['cache_name'])

def simulate_metrics():
    """Simulate application metrics"""
    while True:
        # Simulate requests
        method = random.choice(['GET', 'POST', 'PUT', 'DELETE'])
        endpoint = random.choice(['/api/users', '/api/products', '/api/orders'])
        status = random.choice(['200', '200', '200', '400', '500'])

        REQUEST_COUNT.labels(method=method, endpoint=endpoint, status=status).inc()

        # Simulate request duration
        duration = random.uniform(0.01, 2.0)
        REQUEST_DURATION.labels(method=method, endpoint=endpoint).observe(duration)

        # Update gauges
        ACTIVE_USERS.set(random.randint(100, 1000))
        QUEUE_SIZE.labels(queue_name='jobs').set(random.randint(0, 50))
        QUEUE_SIZE.labels(queue_name='emails').set(random.randint(0, 20))

        # Database connection pool
        DATABASE_CONNECTIONS.labels(pool='main', state='active').set(random.randint(5, 20))
        DATABASE_CONNECTIONS.labels(pool='main', state='idle').set(random.randint(10, 30))

        # Cache metrics
        if random.random() > 0.3:
            CACHE_HITS.labels(cache_name='redis').inc()
        else:
            CACHE_MISSES.labels(cache_name='redis').inc()

        time.sleep(1)

if __name__ == '__main__':
    # Start metrics server on port 8000
    start_http_server(8000)
    print("Metrics server started on port 8000")
    simulate_metrics()
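
A matching scrape job for this exporter (target address assumed):

scrape_configs:
  - job_name: 'app'
    static_configs:
      - targets: ['localhost:8000']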

Custom Exporter (Go)

package main

import ( "log" "math/rand" "net/http" "time"

"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promhttp"

)

var (
    requestsTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "app_requests_total",
            Help: "Total number of requests",
        },
        []string{"method", "endpoint", "status"},
    )

requestDuration = prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
        Name:    "app_request_duration_seconds",
        Help:    "Request duration in seconds",
        Buckets: prometheus.ExponentialBuckets(0.01, 2, 10),
    },
    []string{"method", "endpoint"},
)

activeUsers = prometheus.NewGauge(
    prometheus.GaugeOpts{
        Name: "app_active_users",
        Help: "Number of active users",
    },
)

databaseConnections = prometheus.NewGaugeVec(
    prometheus.GaugeOpts{
        Name: "app_database_connections",
        Help: "Number of database connections",
    },
    []string{"pool", "state"},
)

)

func init() {
    prometheus.MustRegister(requestsTotal)
    prometheus.MustRegister(requestDuration)
    prometheus.MustRegister(activeUsers)
    prometheus.MustRegister(databaseConnections)
}

func simulateMetrics() {
    ticker := time.NewTicker(1 * time.Second)
    defer ticker.Stop()

    for range ticker.C {
        // Simulate requests
        methods := []string{"GET", "POST", "PUT", "DELETE"}
        endpoints := []string{"/api/users", "/api/products", "/api/orders"}
        statuses := []string{"200", "200", "200", "400", "500"}

        method := methods[rand.Intn(len(methods))]
        endpoint := endpoints[rand.Intn(len(endpoints))]
        status := statuses[rand.Intn(len(statuses))]

        requestsTotal.WithLabelValues(method, endpoint, status).Inc()
        requestDuration.WithLabelValues(method, endpoint).Observe(rand.Float64() * 2)

        // Update gauges
        activeUsers.Set(float64(rand.Intn(900) + 100))
        databaseConnections.WithLabelValues("main", "active").Set(float64(rand.Intn(15) + 5))
        databaseConnections.WithLabelValues("main", "idle").Set(float64(rand.Intn(20) + 10))
    }
}

func main() {
    go simulateMetrics()

    http.Handle("/metrics", promhttp.Handler())
    log.Println("Starting metrics server on :8000")
    log.Fatal(http.ListenAndServe(":8000", nil))
}

PostgreSQL Exporter

# docker-compose.yml for postgres_exporter
version: '3.8'
services:
  postgres-exporter:
    image: prometheuscommunity/postgres-exporter
    environment:
      DATA_SOURCE_NAME: "postgresql://user:password@postgres:5432/dbname?sslmode=disable"
    ports:
      - "9187:9187"
    command:
      - '--collector.stat_statements'
      - '--collector.stat_database'
      - '--collector.replication'

Key PostgreSQL Metrics:

  • pg_up: Database reachability

  • pg_stat_database_tup_returned: Rows read

  • pg_stat_database_tup_inserted: Rows inserted

  • pg_stat_database_deadlocks: Deadlock count

  • pg_stat_replication_lag: Replication lag in seconds

  • pg_locks_count: Active locks

Blackbox Exporter for Probing

# blackbox.yml
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: [200]
      method: GET
      preferred_ip_protocol: "ip4"

  http_post_json:
    prober: http
    http:
      method: POST
      headers:
        Content-Type: application/json
      body: '{"key":"value"}'
      valid_status_codes: [200, 201]

  tcp_connect:
    prober: tcp
    timeout: 5s

  icmp:
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: "ip4"

# Prometheus config for blackbox exporter
scrape_configs:
  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          # example probe targets
          - https://example.com
          - https://api.example.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

Best Practices

Metric Naming Conventions

Follow Prometheus naming best practices:

Format: <namespace>_<subsystem>_<metric>_<unit>

Good examples

http_requests_total              # Counter
http_request_duration_seconds    # Histogram
database_connections_active      # Gauge
cache_hits_total                 # Counter
memory_usage_bytes               # Gauge

Include unit suffixes

_seconds, _bytes, _total, _ratio, _percentage

Avoid

RequestCount     # Use snake_case
http_requests    # Missing _total for counter
request_time     # Missing unit (should be _seconds)

Label Guidelines

Good: Low cardinality labels

http_requests_total{method="GET", endpoint="/api/users", status="200"}

Bad: High cardinality labels (avoid)

http_requests_total{user_id="12345", session_id="abc-def-ghi"}

Good: Use bounded label values

http_requests_total{status_class="2xx"}

Bad: Unbounded label values

http_requests_total{response_size="1234567"}
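
One common way to keep label values bounded is to derive them in the instrumentation layer; a small Python sketch (helper name is hypothetical):

from prometheus_client import Counter

REQUESTS = Counter('http_requests_total', 'Total HTTP requests', ['status_class'])

def status_class(code: int) -> str:
    """Collapse an HTTP status code into a bounded label value."""
    return f"{code // 100}xx"   # 200 -> "2xx", 404 -> "4xx", 503 -> "5xx"

REQUESTS.labels(status_class=status_class(204)).inc()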

Recording Rule Patterns

groups:
  - name: performance_rules
    interval: 30s
    rules:
      # Pre-aggregate expensive queries
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)

      # Namespace aggregations
      - record: namespace:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (namespace)

      # SLI calculations
      - record: job:http_requests_success:rate5m
        expr: sum(rate(http_requests_total{status=~"2.."}[5m])) by (job)

      - record: job:http_requests_error_rate:ratio
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
          /
          sum(rate(http_requests_total[5m])) by (job)

Alert Design Principles

  • Alert on symptoms, not causes: Alert on user-facing issues

  • Make alerts actionable: Include runbook links

  • Use appropriate severity levels: Critical, warning, info

  • Set proper thresholds: Based on historical data

  • Include context in annotations: Help on-call engineers

  • Group related alerts: Reduce alert fatigue

  • Use inhibition rules: Suppress redundant alerts

  • Test alert rules: Verify they fire when expected (see the promtool sketch below)
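
Alert rules can be unit-tested with promtool test rules; a sketch against the HighErrorRate rule defined earlier (file paths and series values are illustrative):

# api_alerts_test.yml — run with: promtool test rules api_alerts_test.yml
rule_files:
  - alerts/api_alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # 1 error/sec against 9 successes/sec -> 10% error rate
      - series: 'http_requests_total{service="api", status="500"}'
        values: '0+60x15'
      - series: 'http_requests_total{service="api", status="200"}'
        values: '0+540x15'
    alert_rule_test:
      - eval_time: 15m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              severity: critical
              team: backend
              service: api
            exp_annotations:
              summary: "High error rate on api"
              description: "Error rate is 10% on api"
              runbook_url: "https://runbooks.example.com/HighErrorRate"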

Dashboard Best Practices

  • One dashboard per audience: SRE, developers, business

  • Use consistent time ranges: Make comparisons easier

  • Include SLI/SLO metrics: Show business impact

  • Add annotations for deploys: Correlate changes with metrics

  • Use template variables: Make dashboards reusable (see the provisioning sketch below)

  • Show trends and aggregates: Not just raw metrics

  • Include links to runbooks: Enable quick response

  • Use appropriate visualizations: Graphs, gauges, tables
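
Reusable, consistent dashboards are easier to enforce when they are provisioned from files rather than hand-edited in the UI; a minimal Grafana provisioning sketch (paths assumed):

# /etc/grafana/provisioning/dashboards/default.yml
apiVersion: 1
providers:
  - name: 'default'
    folder: 'Production'
    type: file
    options:
      path: /var/lib/grafana/dashboards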

High Availability Setup

Prometheus HA with Thanos

# Deploy multiple Prometheus instances with the same config;
# use Thanos to deduplicate and provide a global view.

# prometheus-1.yml
global:
  external_labels:
    cluster: 'prod'
    replica: '1'

# prometheus-2.yml
global:
  external_labels:
    cluster: 'prod'
    replica: '2'

# Thanos sidecar configuration:
# - uploads blocks to object storage
# - provides StoreAPI for querying
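
A sketch of the corresponding Thanos processes (standard flags; the object-storage config file name is assumed):

# Sidecar next to each replica: uploads TSDB blocks, exposes StoreAPI
thanos sidecar \
  --tsdb.path=/prometheus/data \
  --prometheus.url=http://localhost:9090 \
  --objstore.config-file=bucket.yml

# Querier: fans out to both replicas and deduplicates on the replica label
thanos query \
  --endpoint=prometheus-1:10901 \
  --endpoint=prometheus-2:10901 \
  --query.replica-label=replica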

Capacity Planning Queries

Disk space exhaustion prediction

predict_linear(node_filesystem_free_bytes[1h], 4 * 3600) < 0

Memory growth trend

predict_linear(node_memory_MemAvailable_bytes[1h], 24 * 3600)

Request rate growth

predict_linear(sum(rate(http_requests_total[1h]))[24h:1h], 7 * 24 * 3600)

Storage capacity planning

prometheus_tsdb_storage_blocks_bytes / (30 * 24 * 3600)

Advanced Patterns

Federation for Multi-Cluster Monitoring

# Global Prometheus federating from cluster Prometheus instances
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}'   # All recording rules
    static_configs:
      - targets:
          - 'prometheus-us-west:9090'
          - 'prometheus-us-east:9090'
          - 'prometheus-eu-central:9090'

Cost Monitoring Pattern

# Track cloud costs with custom metrics
groups:
  - name: cost_tracking
    rules:
      - record: cloud:cost:hourly_rate
        expr: |
          (
            sum(kube_pod_container_resource_requests{resource="cpu"}) * 0.03                                        # CPU cost/hour
            +
            sum(kube_pod_container_resource_requests{resource="memory"} / 1024 / 1024 / 1024) * 0.005               # Memory cost/hour
          )

      - record: cloud:cost:monthly_estimate
        expr: cloud:cost:hourly_rate * 730   # Hours in an average month

Custom SLO Implementation

# SLO: 99.9% availability for API
groups:
  - name: api_slo
    interval: 30s
    rules:
      # Success rate SLI
      - record: api:sli:success_rate
        expr: |
          sum(rate(http_requests_total{job="api",status=~"2.."}[5m]))
          /
          sum(rate(http_requests_total{job="api"}[5m]))

      # Error budget remaining (30 days)
      - record: api:error_budget:remaining
        expr: |
          1 - (
            (1 - api:sli:success_rate)
            /
            (1 - 0.999)
          )

      # Latency SLI (p99 < 500ms)
      - record: api:sli:latency_success_rate
        expr: |
          (
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{job="api"}[5m])) by (le)
            ) < 0.5
          )

Examples Summary

This skill includes 20+ comprehensive examples covering:

  • Prometheus configuration (basic, Kubernetes SD, storage)

  • PromQL queries (instant vectors, range vectors, aggregations)

  • Mathematical operations and advanced patterns

  • Alert rule definitions (error rate, latency, resource usage)

  • Alertmanager configuration (routing, receivers, inhibition)

  • Multi-window multi-burn-rate SLO alerts

  • Grafana dashboard JSON (full dashboard, RED method, USE method)

  • Custom exporters (Python, Go)

  • Third-party exporters (PostgreSQL, Blackbox)

  • Recording rules for performance

  • Federation for multi-cluster monitoring

  • Cost monitoring and SLO implementation

  • High availability patterns

  • Capacity planning queries

Skill Version: 1.0.0
Last Updated: October 2025
Skill Category: Observability, Monitoring, SRE, DevOps
Compatible With: Prometheus, Grafana, Alertmanager, OpenTelemetry, Kubernetes
