Observability & Monitoring

Structured logging, metrics, distributed tracing, and alerting strategies

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "Observability & Monitoring" with this command: npx skills add ariegoldkin/ai-agent-hub/ariegoldkin-ai-agent-hub-observability-monitoring

Observability & Monitoring Skill

Comprehensive frameworks for implementing observability including structured logging, metrics, distributed tracing, and alerting.

When to Use

  • Setting up application monitoring
  • Implementing structured logging
  • Adding metrics and dashboards
  • Configuring distributed tracing
  • Creating alerting rules
  • Debugging production issues

Three Pillars of Observability

┌─────────────────┬─────────────────┬─────────────────┐
│     LOGS        │     METRICS     │     TRACES      │
├─────────────────┼─────────────────┼─────────────────┤
│ What happened   │ How is system   │ How do requests │
│ at specific     │ performing      │ flow through    │
│ point in time   │ over time       │ services        │
└─────────────────┴─────────────────┴─────────────────┘

Structured Logging

Log Levels

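┌───────┬─────────────────────────────────────────┐
│ Level │ Use Case                                │
├───────┼─────────────────────────────────────────┤
│ ERROR │ Unhandled exceptions, failed operations │
│ WARN  │ Deprecated API, retry attempts          │
│ INFO  │ Business events, successful operations  │
│ DEBUG │ Development troubleshooting             │
└───────┴─────────────────────────────────────────┘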

Best Practice

// Good: Structured with context
logger.info('User action completed', {
  action: 'purchase',
  userId: user.id,
  orderId: order.id,
  duration_ms: 150
});

// Bad: String interpolation buries the context in free text that is hard to query
logger.info(`User ${user.id} completed purchase`);

See templates/structured-logging.ts for Winston setup and request middleware
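
The structured call shown above can be illustrated without any logging library. The following is a minimal sketch, not the contents of templates/structured-logging.ts and not the Winston API; `createJsonLogger`, `LEVEL_RANK`, and the `timestamp` field name are hypothetical. It shows the two ideas from this section: entries are emitted as one JSON object per line, and entries below the configured level are dropped.

```typescript
// Minimal JSON logger sketch; all names here are illustrative.
type LogLevel = 'error' | 'warn' | 'info' | 'debug';
type LogFields = Record<string, unknown>;

// Lower number = more severe, following the order of the level table.
const LEVEL_RANK: Record<LogLevel, number> = { error: 0, warn: 1, info: 2, debug: 3 };

function createJsonLogger(
  minLevel: LogLevel,
  write: (line: string) => void = (line) => console.log(line),
) {
  function log(level: LogLevel, message: string, fields: LogFields = {}): void {
    // Drop entries that are less severe than the configured minimum.
    if (LEVEL_RANK[level] > LEVEL_RANK[minLevel]) return;
    // Spread caller fields first so they cannot overwrite the core keys.
    write(JSON.stringify({ ...fields, timestamp: new Date().toISOString(), level, message }));
  }
  return {
    error: (m: string, f?: LogFields) => log('error', m, f),
    warn: (m: string, f?: LogFields) => log('warn', m, f),
    info: (m: string, f?: LogFields) => log('info', m, f),
    debug: (m: string, f?: LogFields) => log('debug', m, f),
  };
}

const appLogger = createJsonLogger('info');
appLogger.info('User action completed', {
  action: 'purchase',
  userId: 'u-1',
  orderId: 'o-9',
  duration_ms: 150,
});
appLogger.debug('Not emitted while the minimum level is info');
```

Because every field is a separate JSON key, a log backend can filter on `action` or `orderId` directly instead of parsing the message text.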

Metrics Collection

RED Method (Rate, Errors, Duration)

Essential metrics for any service:

  • Rate - Requests per second
  • Errors - Failed requests per second
  • Duration - Request latency distribution
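
As a rough illustration of what the three RED numbers mean, the sketch below derives them from a batch of completed requests. It is not how Prometheus computes them (Prometheus works from counters and histograms); `RequestSample`, `summarizeRed`, and the rule that only 5xx responses count as failures are assumptions made for the example.

```typescript
// One completed request; field names are illustrative.
interface RequestSample {
  durationMs: number;
  statusCode: number;
}

interface RedSummary {
  ratePerSec: number;   // Rate: requests per second
  errorsPerSec: number; // Errors: failed requests per second
  p50Ms: number;        // Duration: median latency
  p95Ms: number;        // Duration: tail latency
}

// Nearest-rank percentile over an ascending-sorted array.
function nearestRankPercentile(sortedAsc: number[], p: number): number {
  if (sortedAsc.length === 0) return 0;
  const rank = Math.ceil((p / 100) * sortedAsc.length);
  return sortedAsc[Math.max(rank, 1) - 1];
}

function summarizeRed(samples: RequestSample[], windowSeconds: number): RedSummary {
  const durations = samples.map((s) => s.durationMs).sort((a, b) => a - b);
  // Assumption: only 5xx responses count as failed requests.
  const failures = samples.filter((s) => s.statusCode >= 500).length;
  return {
    ratePerSec: samples.length / windowSeconds,
    errorsPerSec: failures / windowSeconds,
    p50Ms: nearestRankPercentile(durations, 50),
    p95Ms: nearestRankPercentile(durations, 95),
  };
}
```

Duration is reported as percentiles rather than a mean because a small share of slow requests can hide behind a healthy-looking average.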

Prometheus Buckets

// HTTP request latency (seconds)
buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5]

// Database query latency (seconds)
buckets: [0.001, 0.01, 0.05, 0.1, 0.5, 1]

See templates/prometheus-metrics.ts for full metrics configuration
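
To show why bucket boundaries matter, here is a small sketch of how a cumulative, Prometheus-style histogram counts observations: each bucket counts every value less than or equal to its upper bound, and an implicit +Inf bucket counts everything. `LatencyHistogram` is a hypothetical class written for this example, not part of any Prometheus client library.

```typescript
// Cumulative histogram sketch: each bucket counts observations that are
// less than or equal to its upper bound.
class LatencyHistogram {
  private readonly upperBounds: number[];
  private readonly counts: number[];
  private total = 0;

  constructor(upperBounds: number[]) {
    this.upperBounds = upperBounds.slice().sort((a, b) => a - b);
    this.counts = this.upperBounds.map(() => 0);
  }

  observe(seconds: number): void {
    this.total += 1;
    // Cumulative: increment every bucket whose bound covers the value.
    this.upperBounds.forEach((bound, i) => {
      if (seconds <= bound) this.counts[i] += 1;
    });
  }

  // Returns [upperBound, cumulativeCount] pairs plus the implicit +Inf bucket.
  snapshot(): Array<[number, number]> {
    const pairs = this.upperBounds.map(
      (bound, i): [number, number] => [bound, this.counts[i]],
    );
    pairs.push([Infinity, this.total]);
    return pairs;
  }
}

// HTTP latency buckets from this section, fed a few made-up observations.
const httpLatency = new LatencyHistogram([0.01, 0.05, 0.1, 0.5, 1, 2, 5]);
[0.02, 0.3, 0.3, 1.5, 7].forEach((s) => httpLatency.observe(s));
console.log(httpLatency.snapshot());
```

In this run the 7 s observation is counted only by the +Inf bucket. If a service regularly sees latencies above the largest bound, percentile estimates for the tail lose resolution, which is one reason to pick boundaries around the latencies you expect and alert on.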

Distributed Tracing

OpenTelemetry Setup

Auto-instrument common libraries:

  • Express/HTTP
  • PostgreSQL
  • Redis

Manual Spans

tracer.startActiveSpan('processOrder', async (span) => {
  try {
    span.setAttribute('order.id', orderId);
    // ... work
  } finally {
    // End the span even if the work throws, so it is always exported
    span.end();
  }
});

See templates/opentelemetry-tracing.ts for full setup
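
Manual spans are easy to leak when the wrapped work throws, so it can help to centralize the pattern. The sketch below is one possible helper, not code from templates/opentelemetry-tracing.ts. `SpanLike` and `TracerLike` are local types modeled on a small subset of the OpenTelemetry `Span` and `Tracer` interfaces so the example has no dependencies; `withSpan` is a hypothetical name, and whether the real `Tracer` type can be passed without a cast is not verified here.

```typescript
// Local stand-ins modeled on a subset of the OpenTelemetry interfaces.
type SpanAttribute = string | number | boolean;

interface SpanLike {
  setAttribute(key: string, value: SpanAttribute): unknown;
  recordException(error: Error): void;
  end(): void;
}

interface TracerLike {
  startActiveSpan<T>(name: string, fn: (span: SpanLike) => T): T;
}

// Hypothetical helper: runs `work` inside a span, records any thrown
// error on the span, rethrows it, and always ends the span.
async function withSpan<T>(
  tracer: TracerLike,
  name: string,
  attributes: Record<string, SpanAttribute>,
  work: (span: SpanLike) => Promise<T>,
): Promise<T> {
  return tracer.startActiveSpan(name, async (span) => {
    for (const key of Object.keys(attributes)) {
      span.setAttribute(key, attributes[key]);
    }
    try {
      return await work(span);
    } catch (err) {
      span.recordException(err instanceof Error ? err : new Error(String(err)));
      throw err;
    } finally {
      span.end();
    }
  });
}
```

A call site would then look like `await withSpan(tracer, 'processOrder', { 'order.id': orderId }, async () => { /* work */ })`, with no `span.end()` to forget.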

Alerting Strategy

Severity Levels

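┌───────────────┬───────────────┬─────────────────────────┐
│ Level         │ Response Time │ Examples                │
├───────────────┼───────────────┼─────────────────────────┤
│ Critical (P1) │ < 15 min      │ Service down, data loss │
│ High (P2)     │ < 1 hour      │ Major feature broken    │
│ Medium (P3)   │ < 4 hours     │ Increased error rate    │
│ Low (P4)      │ Next day      │ Warnings                │
└───────────────┴───────────────┴─────────────────────────┘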

Key Alerts

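┌─────────────────┬─────────────────┬──────────┐
│ Alert           │ Condition       │ Severity │
├─────────────────┼─────────────────┼──────────┤
│ ServiceDown     │ up == 0 for 1m  │ Critical │
│ HighErrorRate   │ 5xx > 5% for 5m │ Critical │
│ HighLatency     │ p95 > 2s for 5m │ High     │
│ LowCacheHitRate │ < 70% for 10m   │ Medium   │
└─────────────────┴─────────────────┴──────────┘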

See templates/alerting-rules.yml for Prometheus alerting rules
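
Each condition in the Key Alerts table has a "for" duration: the alert should fire only after the condition has held continuously for that long, which filters out brief spikes. The sketch below approximates that behavior in application code so the logic is easy to follow. It is not the Prometheus implementation; `SustainedCondition` and `errorRatio` are hypothetical names, and the state names follow Prometheus's inactive, pending, and firing.

```typescript
type AlertState = 'inactive' | 'pending' | 'firing';

// Tracks how long a condition has held without interruption.
class SustainedCondition {
  private readonly holdMs: number;
  private breachedSince: number | null = null;

  constructor(holdMs: number) {
    this.holdMs = holdMs;
  }

  // Call once per evaluation with the condition result and a timestamp.
  evaluate(conditionMet: boolean, nowMs: number): AlertState {
    if (!conditionMet) {
      this.breachedSince = null; // any healthy evaluation resets the timer
      return 'inactive';
    }
    if (this.breachedSince === null) this.breachedSince = nowMs;
    return nowMs - this.breachedSince >= this.holdMs ? 'firing' : 'pending';
  }
}

// Share of failed requests, guarding against division by zero.
function errorRatio(errors: number, total: number): number {
  return total === 0 ? 0 : errors / total;
}

// HighErrorRate from the table: 5xx ratio above 5% for 5 minutes.
const highErrorRate = new SustainedCondition(5 * 60_000);
console.log(highErrorRate.evaluate(errorRatio(8, 100) > 0.05, 0));       // pending
console.log(highErrorRate.evaluate(errorRatio(9, 100) > 0.05, 300_000)); // firing
```

Lengthening the hold duration trades detection speed for fewer false pages, which is the main lever when tuning noisy alerts.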

Health Checks

Kubernetes Probes

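┌───────────┬────────────────────┬──────────┐
│ Probe     │ Purpose            │ Endpoint │
├───────────┼────────────────────┼──────────┤
│ Liveness  │ Is app running?    │ /health  │
│ Readiness │ Ready for traffic? │ /ready   │
│ Startup   │ Finished starting? │ /startup │
└───────────┴────────────────────┴──────────┘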

Readiness Response

{
  "status": "healthy|degraded|unhealthy",
  "checks": {
    "database": { "status": "pass", "latency_ms": 5 },
    "redis": { "status": "pass", "latency_ms": 2 }
  },
  "version": "1.0.0",
  "uptime": 3600
}

See templates/health-checks.ts for implementation
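
The readiness payload above combines per-dependency checks into one overall status. The sketch below shows one way that roll-up could work; it is not the contents of templates/health-checks.ts. The source only shows the check status `pass`, so the `warn` and `fail` values, the aggregation policy, the HTTP status mapping, and the reading of `uptime` as seconds are all assumptions.

```typescript
// Only 'pass' appears in the sample payload; 'warn' and 'fail' are assumed.
type CheckStatus = 'pass' | 'warn' | 'fail';
type OverallStatus = 'healthy' | 'degraded' | 'unhealthy';

interface CheckResult {
  status: CheckStatus;
  latency_ms: number;
}

interface ReadinessReport {
  status: OverallStatus;
  checks: Record<string, CheckResult>;
  version: string;
  uptime: number; // assumed to be seconds
}

// Assumed policy: any failing check makes the service unhealthy,
// any warning makes it degraded, otherwise it is healthy.
function rollUpStatus(checks: Record<string, CheckResult>): OverallStatus {
  const statuses = Object.keys(checks).map((name) => checks[name].status);
  if (statuses.some((s) => s === 'fail')) return 'unhealthy';
  if (statuses.some((s) => s === 'warn')) return 'degraded';
  return 'healthy';
}

function buildReadinessReport(
  checks: Record<string, CheckResult>,
  version: string,
  uptimeSeconds: number,
): ReadinessReport {
  return { status: rollUpStatus(checks), checks, version, uptime: uptimeSeconds };
}

// Assumed mapping: only 'unhealthy' returns 503 and fails the probe.
function httpStatusFor(report: ReadinessReport): number {
  return report.status === 'unhealthy' ? 503 : 200;
}

const sampleReport = buildReadinessReport(
  {
    database: { status: 'pass', latency_ms: 5 },
    redis: { status: 'pass', latency_ms: 2 },
  },
  '1.0.0',
  3600,
);
console.log(httpStatusFor(sampleReport), JSON.stringify(sampleReport));
```

Returning 200 for a degraded service keeps the pod in rotation while still surfacing the problem in the response body; teams that prefer to shed traffic earlier could map `degraded` to 503 instead.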

Observability Checklist

Implementation

  • JSON structured logging
  • Request correlation IDs
  • RED metrics (Rate, Errors, Duration)
  • Business metrics
  • Distributed tracing
  • Health check endpoints

Alerting

  • Service outage alerts
  • Error rate thresholds
  • Latency thresholds
  • Resource utilization alerts

Dashboards

  • Service overview
  • Error analysis
  • Performance metrics

Extended Thinking Triggers

Use Opus 4.5 extended thinking for:

  • Incident investigation - Correlating logs, metrics, traces
  • Alert tuning - Reducing noise, catching real issues
  • Architecture decisions - Choosing monitoring solutions
  • Performance debugging - Cross-service latency analysis

Templates Reference

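┌──────────────────────────┬─────────────────────────────────────────┐
│ Template                 │ Purpose                                 │
├──────────────────────────┼─────────────────────────────────────────┤
│ structured-logging.ts    │ Winston logger with request middleware  │
│ prometheus-metrics.ts    │ HTTP, DB, cache metrics with middleware │
│ opentelemetry-tracing.ts │ Distributed tracing setup               │
│ alerting-rules.yml       │ Prometheus alerting rules               │
│ health-checks.ts         │ Liveness, readiness, startup probes     │
└──────────────────────────┴─────────────────────────────────────────┘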

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
