monitoring-observability

OpenTelemetry, structured logging, distributed tracing, alerting, and dashboards

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "monitoring-observability" with this command: npx skills add travisjneuman/.claude/travisjneuman-claude-monitoring-observability

Monitoring & Observability

Overview

This skill covers the three pillars of observability -- traces, metrics, and logs -- along with alerting, dashboards, and health check patterns. It focuses on OpenTelemetry as the vendor-neutral standard, structured logging for queryability, distributed tracing for microservice debugging, and SLO-based alerting to reduce noise.

Use this skill when instrumenting applications for production visibility, setting up monitoring infrastructure, debugging distributed systems, configuring alerts that matter, or building dashboards for operations and product teams.


Core Principles

  1. Instrument at boundaries - Trace every external call (HTTP, database, queue, cache). Internal function tracing adds noise; boundary tracing reveals system behavior.
  2. Structured over unstructured - Every log entry must be JSON with correlation IDs, service name, and context. Unstructured logs are unsearchable at scale.
  3. Alert on symptoms, not causes - Alert when users are affected (error rate, latency SLO breach), not when a specific server metric spikes. Symptom-based alerting cuts noise because many cause-level spikes never reach users.
  4. Correlate across signals - A trace ID should connect logs, traces, and metrics for a single request. Without correlation, debugging distributed issues requires guesswork.
  5. Budget your error rate - Define SLOs (99.9% availability = 43 minutes/month downtime budget). Alert when the error budget burn rate is too fast, not on individual errors.
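
The arithmetic behind principle 5 can be sketched in a few lines. This is a minimal illustration, assuming a 30-day SLO window; the function names are hypothetical and not part of any library.

```typescript
// Minutes of allowed downtime for an availability SLO over the window.
function errorBudgetMinutes(slo: number, windowDays = 30): number {
  return (1 - slo) * windowDays * 24 * 60;
}

// Hours until the whole budget is spent at a constant burn rate.
// A burn rate of 1 spends the budget exactly over the SLO window.
function hoursToExhaustion(burnRate: number, windowDays = 30): number {
  return (windowDays * 24) / burnRate;
}

// Fraction of the budget consumed after `hours` at a given burn rate.
function budgetConsumed(burnRate: number, hours: number, windowDays = 30): number {
  return (burnRate * hours) / (windowDays * 24);
}

console.log(errorBudgetMinutes(0.999).toFixed(1)); // "43.2"
console.log(hoursToExhaustion(14.4).toFixed(1)); // "50.0"
console.log((budgetConsumed(14.4, 1) * 100).toFixed(1)); // "2.0"
```

A 14.4x burn rate therefore consumes about 2% of a 30-day budget per hour, which is why it is a common paging threshold.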

Key Patterns

Pattern 1: OpenTelemetry Instrumentation (Node.js)

When to use: Any production Node.js service that needs traces, metrics, and log correlation.

Implementation:

// tracing.ts - Must be imported BEFORE any other module
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { OTLPMetricExporter } from "@opentelemetry/exporter-metrics-otlp-http";
import { PeriodicExportingMetricReader } from "@opentelemetry/sdk-metrics";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { Resource } from "@opentelemetry/resources";
import {
  ATTR_SERVICE_NAME,
  ATTR_SERVICE_VERSION,
} from "@opentelemetry/semantic-conventions";

// OTEL_EXPORTER_OTLP_ENDPOINT is a base URL without a signal path. A `url`
// option is used verbatim by the exporter, so append the per-signal path here.
const otlpEndpoint = (
  process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? "http://localhost:4318"
).replace(/\/+$/, "");

const sdk = new NodeSDK({
  // Note: SDK 2.x replaces `new Resource(...)` with `resourceFromAttributes(...)`.
  resource: new Resource({
    [ATTR_SERVICE_NAME]: process.env.SERVICE_NAME ?? "my-service",
    [ATTR_SERVICE_VERSION]: process.env.SERVICE_VERSION ?? "0.0.0",
    // deployment.environment is not exported as a stable constant in recent
    // releases of the semantic-conventions package, so use the literal key.
    "deployment.environment": process.env.NODE_ENV ?? "development",
  }),
  traceExporter: new OTLPTraceExporter({
    url: `${otlpEndpoint}/v1/traces`,
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: `${otlpEndpoint}/v1/metrics`,
    }),
    exportIntervalMillis: 30_000,
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      // Disable noisy fs instrumentation
      "@opentelemetry/instrumentation-fs": { enabled: false },
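      // Configure HTTP to capture the x-request-id request header
      "@opentelemetry/instrumentation-http": {
        requestHook: (span, request) => {
          // The hook fires for incoming (IncomingMessage) and outgoing
          // (ClientRequest) requests; only incoming ones expose `headers`.
          if ("headers" in request) {
            span.setAttribute(
              "http.request.header.x-request-id",
              request.headers["x-request-id"] ?? "unknown"
            );
          }
        },
      },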
    }),
  ],
});

sdk.start();

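process.on("SIGTERM", () => {
  // Flush pending telemetry, then exit even if the flush fails
  sdk
    .shutdown()
    .catch((err) => console.error("OpenTelemetry shutdown failed", err))
    .finally(() => process.exit(0));
});

// Custom span creation for business logic (separate module).
// `db`, `paymentService`, `Order`, and `OrderNotFoundError` are assumed to be defined elsewhere.
import { trace, SpanStatusCode } from "@opentelemetry/api";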

const tracer = trace.getTracer("order-service");

async function processOrder(orderId: string): Promise<Order> {
  return tracer.startActiveSpan("process_order", async (span) => {
    try {
      span.setAttribute("order.id", orderId);

      const order = await tracer.startActiveSpan("fetch_order", async (fetchSpan) => {
        try {
          const result = await db.orders.findUnique({ where: { id: orderId } });
          fetchSpan.setAttribute("order.total", result?.total ?? 0);
          return result;
        } finally {
          // End the child span even if the query throws
          fetchSpan.end();
        }
      });

      if (!order) {
        // The catch block below marks the span as failed and records the exception
        throw new OrderNotFoundError(orderId);
      }

      await tracer.startActiveSpan("charge_payment", async (paymentSpan) => {
        try {
          paymentSpan.setAttribute("payment.amount", order.total);
          await paymentService.charge(order);
        } finally {
          // End the child span even if the charge fails
          paymentSpan.end();
        }
      });

      span.setStatus({ code: SpanStatusCode.OK });
      return order;
    } catch (error) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: String(error) });
      span.recordException(error as Error);
      throw error;
    } finally {
      span.end();
    }
  });
}

Why: OpenTelemetry provides vendor-neutral instrumentation. Auto-instrumentation captures HTTP, database, and gRPC calls without manual code. Custom spans add business context (order IDs, payment amounts) that makes traces actionable for debugging.


Pattern 2: Structured Logging with Correlation

When to use: Every application that produces logs (which is every application).

Implementation:

import pino from "pino";
import { context, trace } from "@opentelemetry/api";

// Create logger with trace correlation
const logger = pino({
  level: process.env.LOG_LEVEL ?? "info",
  formatters: {
    level: (label) => ({ level: label }),
  },
  mixin() {
    // Automatically inject trace context into every log line
    const span = trace.getSpan(context.active());
    if (span) {
      const spanContext = span.spanContext();
      return {
        traceId: spanContext.traceId,
        spanId: spanContext.spanId,
        traceFlags: spanContext.traceFlags,
      };
    }
    return {};
  },
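  // Redact sensitive fields. Paths match exactly: "password" covers only a
  // top-level key; use a wildcard such as "*.password" for one level of nesting.
  redact: ["req.headers.authorization", "password", "token", "apiKey"],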
});

// Usage - always log structured data, not string interpolation
logger.info({ orderId, userId, total: order.total }, "Order processed successfully");

// NOT this:
// logger.info(`Order ${orderId} processed for user ${userId} with total ${order.total}`);

// Error logging with context
logger.error(
  {
    err: error,
    orderId,
    operation: "payment_charge",
    paymentProvider: "stripe",
  },
  "Payment processing failed"
);

// Child loggers for request-scoped context
function createRequestLogger(req: Request) {
  return logger.child({
    requestId: req.headers.get("x-request-id") ?? crypto.randomUUID(),
    path: req.url,
    method: req.method,
    userAgent: req.headers.get("user-agent"),
  });
}

Why: Structured logs are queryable. You can filter by orderId, correlate with traces via traceId, and aggregate error rates by operation. String-interpolated logs require regex to extract fields, which breaks at scale.
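
To make the queryability point concrete, here is a minimal sketch that filters and aggregates JSON log lines with plain property access. The log lines, IDs, and field values are hypothetical sample data shaped like the logger output above, not real output.

```typescript
// Hypothetical log lines, one JSON object per line.
const lines: string[] = [
  '{"level":"info","traceId":"abc","orderId":"o-1","operation":"order_create","msg":"Order processed successfully"}',
  '{"level":"error","traceId":"def","orderId":"o-2","operation":"payment_charge","msg":"Payment processing failed"}',
  '{"level":"error","traceId":"ghi","orderId":"o-3","operation":"payment_charge","msg":"Payment processing failed"}',
];

interface LogEntry {
  level: string;
  traceId: string;
  orderId: string;
  operation: string;
  msg: string;
}

const entries: LogEntry[] = lines.map((line) => JSON.parse(line) as LogEntry);

// Filter by a field: a property comparison, no regex needed.
const forOrder = entries.filter((e) => e.orderId === "o-2");

// Aggregate error counts by operation.
const errorsByOperation = new Map<string, number>();
for (const e of entries) {
  if (e.level !== "error") continue;
  errorsByOperation.set(e.operation, (errorsByOperation.get(e.operation) ?? 0) + 1);
}

console.log(forOrder[0]?.traceId); // "def"
console.log(errorsByOperation.get("payment_charge")); // 2
```

A log aggregation backend runs the same kind of field-based query at scale; the traceId found this way is the key for opening the matching trace.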


Pattern 3: SLO-Based Alerting

When to use: Setting up production alerting that reduces noise and focuses on user impact.

Implementation:

# Prometheus alerting rules based on SLOs
groups:
  - name: slo-alerts
    rules:
      # Availability SLO: 99.9% success rate
      # Alert when burning through error budget too fast
      - alert: HighErrorBudgetBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > (1 - 0.999) * 14.4
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error budget burn rate above 14.4x (2% of a 30-day budget per hour, exhausted in about 50 hours)"
          dashboard: "https://grafana.internal/d/slo-overview"

      # Latency SLO: 99% of requests under 500ms
      - alert: HighLatencyBudgetBurn
        expr: |
          (
            sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
            /
            sum(rate(http_request_duration_seconds_count[5m]))
          ) < 0.99
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Latency SLO breach: >1% of requests exceeding 500ms"

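      # Multi-window multi-burn-rate alert (Google SRE Workbook pattern)
      # Assumes recording rules error_ratio:rate5m, :rate30m, :rate1h, and :rate6h are defined elsewhere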
      - alert: SLOBreach_MultiWindow
        expr: |
          (
            error_ratio:rate1h > (14.4 * 0.001)
            and
            error_ratio:rate5m > (14.4 * 0.001)
          )
          or
          (
            error_ratio:rate6h > (6 * 0.001)
            and
            error_ratio:rate30m > (6 * 0.001)
          )
        labels:
          severity: critical

// Application-level SLO tracking with custom metrics
import type { NextFunction, Request, Response } from "express";
import { metrics } from "@opentelemetry/api";

const meter = metrics.getMeter("slo-metrics");

const requestCounter = meter.createCounter("http.requests.total", {
  description: "Total HTTP requests",
});

const requestDuration = meter.createHistogram("http.request.duration", {
  description: "HTTP request duration in seconds",
  unit: "s",
});

// Middleware to track SLO metrics
function sloMiddleware(req: Request, res: Response, next: NextFunction) {
  const start = performance.now();

  res.on("finish", () => {
    const duration = (performance.now() - start) / 1000;
    const attributes = {
      method: req.method,
      route: req.route?.path ?? "unknown",
      status: String(res.statusCode),
      success: String(res.statusCode < 500),
    };

    requestCounter.add(1, attributes);
    requestDuration.record(duration, attributes);
  });

  next();
}

Why: Traditional threshold-based alerts (CPU > 80%, errors > 10) generate noise. SLO-based alerting asks "are users affected?" and "how fast are we burning our error budget?" This approach pages only when action is needed.
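
The multi-window rule above can be read as a small decision function. The sketch below mirrors its thresholds for a 99.9% SLO; the `WindowRatios` shape and the `shouldPage` name are hypothetical, and the inputs stand in for the values of the recording rules.

```typescript
// Error ratios (failed / total requests) over each lookback window.
interface WindowRatios {
  rate5m: number;
  rate30m: number;
  rate1h: number;
  rate6h: number;
}

const SLO = 0.999;
const ERROR_BUDGET = 1 - SLO; // allowed error ratio

function shouldPage(r: WindowRatios, budget = ERROR_BUDGET): boolean {
  // Fast burn: the long window shows real budget loss, and the short
  // window confirms the problem is still happening right now.
  const fastBurn = r.rate1h > 14.4 * budget && r.rate5m > 14.4 * budget;
  // Slow burn: a lower rate sustained over a longer period.
  const slowBurn = r.rate6h > 6 * budget && r.rate30m > 6 * budget;
  return fastBurn || slowBurn;
}

// A spike that has already recovered: the 1h window is still hot, 5m is not.
console.log(shouldPage({ rate5m: 0.0005, rate30m: 0.004, rate1h: 0.02, rate6h: 0.004 })); // false

// An ongoing incident: both fast-burn windows exceed 1.44%.
console.log(shouldPage({ rate5m: 0.03, rate30m: 0.025, rate1h: 0.02, rate6h: 0.004 })); // true

// A slow leak: under the fast threshold, over the 0.6% slow threshold.
console.log(shouldPage({ rate5m: 0.008, rate30m: 0.008, rate1h: 0.008, rate6h: 0.007 })); // true
```

Requiring the short window to agree is what lets the alert clear soon after recovery instead of firing for the rest of the hour.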


Pattern 4: Health Checks and Readiness Probes

When to use: Any service deployed to Kubernetes or behind a load balancer.

Implementation:

import { Router } from "express";

// `db` (a Prisma client) and `redis` are assumed to be initialized elsewhere.

interface HealthCheckResult {
  status: "healthy" | "degraded" | "unhealthy";
  checks: Record<string, {
    status: "pass" | "fail" | "warn";
    latencyMs: number;
    message?: string;
  }>;
  uptime: number;
  version: string;
}

const healthRouter = Router();

// Liveness probe - is the process alive?
// Should NEVER check dependencies. Only checks if the process can respond.
healthRouter.get("/healthz", (req, res) => {
  res.status(200).json({ status: "alive" });
});

// Readiness probe - can this instance serve traffic?
// Checks critical dependencies.
healthRouter.get("/readyz", async (req, res) => {
  const checks: HealthCheckResult["checks"] = {};

  // Check database
  const dbStart = performance.now();
  try {
    await db.$queryRaw`SELECT 1`;
    checks.database = { status: "pass", latencyMs: performance.now() - dbStart };
  } catch (err) {
    checks.database = {
      status: "fail",
      latencyMs: performance.now() - dbStart,
      message: (err as Error).message,
    };
  }

  // Check Redis
  const redisStart = performance.now();
  try {
    await redis.ping();
    checks.redis = { status: "pass", latencyMs: performance.now() - redisStart };
  } catch (err) {
    checks.redis = {
      status: "fail",
      latencyMs: performance.now() - redisStart,
      message: (err as Error).message,
    };
  }

  const allPassing = Object.values(checks).every((c) => c.status === "pass");
  const anyFailing = Object.values(checks).some((c) => c.status === "fail");

  const result: HealthCheckResult = {
    status: anyFailing ? "unhealthy" : allPassing ? "healthy" : "degraded",
    checks,
    uptime: process.uptime(),
    version: process.env.SERVICE_VERSION ?? "unknown",
  };

  res.status(anyFailing ? 503 : 200).json(result);
});

Why: Kubernetes uses liveness probes to restart stuck processes and readiness probes to stop sending traffic to unready instances. Getting these wrong causes cascading failures: a liveness probe that checks the database will restart healthy pods during a database outage, making things worse.
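
The readiness handler above awaits each dependency with no upper bound, so a hung connection could stall the probe itself. One possible refinement, not part of the original pattern, is to race each check against a timer. The `withTimeout` and `runCheck` helpers below are hypothetical names, and the 1000 ms default is an arbitrary assumption.

```typescript
// Reject if `work` does not settle within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer));
}

type CheckResult = { status: "pass" | "fail"; latencyMs: number; message?: string };

// Run one dependency check and convert any error or timeout into a "fail" result.
async function runCheck(
  label: string,
  check: () => Promise<unknown>,
  timeoutMs = 1000
): Promise<CheckResult> {
  const start = Date.now();
  try {
    await withTimeout(check(), timeoutMs, label);
    return { status: "pass", latencyMs: Date.now() - start };
  } catch (err) {
    return { status: "fail", latencyMs: Date.now() - start, message: (err as Error).message };
  }
}
```

In the handler, a call such as `checks.redis = await runCheck("redis", () => redis.ping())` would then replace the inline try/catch, keeping the probe response time bounded.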


Grafana Dashboard Quick Reference

{
  "panels": [
    {
      "title": "Request Rate",
      "type": "timeseries",
      "targets": [{ "expr": "sum(rate(http_requests_total[5m])) by (status)" }]
    },
    {
      "title": "Error Rate (%)",
      "type": "stat",
      "targets": [{
        "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100"
      }],
      "thresholds": [
        { "value": 0, "color": "green" },
        { "value": 0.1, "color": "yellow" },
        { "value": 1, "color": "red" }
      ]
    },
    {
      "title": "P99 Latency",
      "type": "timeseries",
      "targets": [{
        "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route))"
      }]
    }
  ]
}

Anti-Patterns

| Anti-Pattern | Why It's Bad | Better Approach |
| --- | --- | --- |
| Logging with console.log in production | No structure, no correlation, no levels | Use pino or winston with JSON output |
| Alerting on every 5xx error | Alert fatigue, team ignores alerts | Alert on SLO breach / error budget burn rate |
| Liveness probe checks database | Restarts pods during DB outage (cascade) | Liveness checks process only; readiness checks deps |
| No trace context propagation | Cannot follow requests across services | Use W3C traceparent header, inject in all clients |
| Sampling 100% of traces | Storage costs explode at scale | Head-based sampling (10-20%) or tail-based for errors |
| Logging PII (emails, IPs) | GDPR/privacy violation | Redact sensitive fields in logger config |
| Dashboard with 50 panels | Information overload, slow to load | Four golden signals: rate, errors, duration, saturation |
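
For the trace context propagation row, the following is a minimal sketch of what the W3C traceparent header carries. It handles only the version 00 layout and is meant as an illustration; in practice the OpenTelemetry propagators do this for you. The helper names are hypothetical.

```typescript
interface TraceParent {
  version: string;
  traceId: string; // 32 lowercase hex characters
  parentId: string; // 16 lowercase hex characters (the caller's span ID)
  sampled: boolean;
}

// version "-" trace-id "-" parent-id "-" trace-flags
const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceParent(header: string): TraceParent | null {
  const match = TRACEPARENT_RE.exec(header.trim());
  if (!match) return null;
  const [, version, traceId, parentId, flags] = match;
  // Version ff and all-zero IDs are invalid per the specification.
  if (version === "ff" || /^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 0x01) === 0x01 };
}

// Build the header for an outgoing call: same trace ID, this service's span ID.
function formatTraceParent(traceId: string, spanId: string, sampled: boolean): string {
  return `00-${traceId}-${spanId}-${sampled ? "01" : "00"}`;
}

const incoming = parseTraceParent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
console.log(incoming?.traceId); // "4bf92f3577b34da6a3ce929d0e0e4736"
console.log(incoming?.sampled); // true
console.log(parseTraceParent("not-a-header")); // null
```

Every hop keeps the trace ID and swaps in its own span ID as the parent ID, which is what lets a backend stitch spans from different services into one trace.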

Checklist

  • OpenTelemetry SDK initialized before all other imports
  • Auto-instrumentation enabled for HTTP, database, and queue clients
  • Custom spans added at business-logic boundaries with relevant attributes
  • Structured JSON logging with trace ID correlation
  • Sensitive fields redacted in logger configuration
  • Liveness probe: checks process only (no dependency checks)
  • Readiness probe: checks all critical dependencies
  • SLOs defined for availability and latency
  • Alerts based on error budget burn rate, not raw thresholds
  • Dashboard with four golden signals per service
  • Trace sampling strategy configured for production scale
  • Log aggregation pipeline shipping to centralized store

Related Resources

  • Skills: performance-engineering (latency optimization), application-security (security logging)
  • Rules: docs/reference/stacks/fullstack-nextjs-nestjs.md (NestJS instrumentation patterns)
  • Rules: docs/reference/tooling/troubleshooting.md (debugging with logs)

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
