Logging & Observability Skill
Activate when working with logging systems, distributed tracing, debugging, monitoring, or any observability-related tasks across applications.
- Logging Best Practices
Log Levels
Use appropriate log levels for different severities:

| Level | Severity | When to Use |
|-------|----------|-------------|
| DEBUG | Low | Development only - detailed info, variable states, control flow. Use sparingly in production. |
| INFO | Low | Important application lifecycle events - startup, shutdown, config loaded, user actions, key state changes. |
| WARN | Medium | Recoverable issues - deprecated usage, resource constraints, unexpected but handled conditions. Investigate later. |
| ERROR | High | Unrecoverable problems - exceptions, failed operations, missing required data. Requires immediate attention. |
| FATAL | Critical | System-level failures - abort conditions, out of memory, unrecoverable state. System may crash. |
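As a concrete reference point, Python's stdlib `logging` maps these names to numeric levels, and a logger only emits records at or above its configured level. Note that Python has no separate FATAL level - `logging.FATAL` is an alias for `CRITICAL`:

```python
import logging

# Numeric values behind each level in Python's stdlib logging;
# FATAL is an alias for CRITICAL (both are 50)
for name in ("DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"):
    print(name, getattr(logging, name))

# A logger only emits records at or above its configured level
logger = logging.getLogger("demo")
logger.setLevel(logging.WARNING)
print(logger.isEnabledFor(logging.INFO))   # False
print(logger.isEnabledFor(logging.ERROR))  # True
```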
General Principles
- Actionable: Logs should help diagnose problems, not just record events
- Contextual: Include enough context to understand what happened without code inspection
- Consistent: Use same terminology across codebase for same events
- Sparse: Don't log everything - unnecessary noise obscures real issues
- Sampling: In high-volume scenarios, sample logs (10%, 1%, etc.) rather than logging everything
- Structured: Always use structured format (JSON) for programmatic parsing
- Structured Logging Format
Standard Fields
Every log entry should include:
```json
{
  "timestamp": "2025-11-17T10:30:45.123Z",
  "level": "ERROR",
  "message": "Failed to process user request",
  "service": "auth-service",
  "version": "1.2.3",
  "environment": "production",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "parent_span_id": "0af7651916cd43dd",
  "user_id": "user-12345",
  "request_id": "req-98765",
  "path": "/api/users/authenticate",
  "method": "POST",
  "status_code": 500,
  "error": {
    "type": "InvalidCredentialsError",
    "message": "Provided credentials do not match",
    "stack": "Error: InvalidCredentialsError...",
    "code": "AUTH_INVALID_CREDS"
  },
  "context": {
    "ip_address": "192.168.1.100",
    "user_agent": "Mozilla/5.0...",
    "attempt_number": 3,
    "rate_limit_remaining": 2
  },
  "duration_ms": 245,
  "custom_field": "custom_value"
}
```
Required vs Optional Fields
Always include:
- timestamp
- level
- message
- trace_id
- service
- environment
When applicable:
- span_id / parent_span_id (distributed tracing)
- user_id (any user action)
- request_id (any request)
- error (on ERROR/FATAL)
- duration_ms (operations)
- context (relevant metadata)
- What to Log
Application Lifecycle
```
// Startup
{"timestamp": "...", "level": "INFO", "message": "Service starting", "service": "auth-service", "version": "1.2.3"}

// Configuration loaded
{"timestamp": "...", "level": "INFO", "message": "Configuration loaded", "config_source": "environment", "environment": "production"}

// Database connection established
{"timestamp": "...", "level": "INFO", "message": "Database connected", "host": "db.internal", "pool_size": 20}

// Shutdown
{"timestamp": "...", "level": "INFO", "message": "Service shutting down", "reason": "SIGTERM", "uptime_seconds": 3600}
```
User Actions
```
// Login attempt
{"timestamp": "...", "level": "INFO", "message": "User login attempt", "user_id": "user-123", "method": "password"}

// Data modification
{"timestamp": "...", "level": "INFO", "message": "User updated profile", "user_id": "user-123", "fields_changed": ["email", "name"]}

// Permission check
{"timestamp": "...", "level": "INFO", "message": "Permission check", "user_id": "user-123", "resource": "report-456", "permission": "read", "granted": true}
```
External API Calls
```
// API call started
{"timestamp": "...", "level": "DEBUG", "message": "External API call", "service": "my-service", "api": "stripe", "endpoint": "/charges", "method": "POST"}

// API response
{"timestamp": "...", "level": "DEBUG", "message": "API response received", "api": "stripe", "endpoint": "/charges", "status_code": 200, "duration_ms": 145}

// API error
{"timestamp": "...", "level": "WARN", "message": "External API error", "api": "stripe", "status_code": 429, "error": "rate_limit_exceeded", "retry_after_seconds": 60}
```
Errors and Exceptions
```json
{
  "timestamp": "...",
  "level": "ERROR",
  "message": "Payment processing failed",
  "service": "payment-service",
  "user_id": "user-456",
  "error": {
    "type": "PaymentGatewayError",
    "message": "Connection timeout",
    "code": "GATEWAY_TIMEOUT",
    "stack": "PaymentGatewayError: Connection timeout\n    at processPayment (payment.ts:45)\n    at ..."
  },
  "context": {
    "amount": 9999,
    "currency": "USD",
    "gateway": "stripe"
  }
}
```
Performance Metrics
```
// Slow operation
{"timestamp": "...", "level": "WARN", "message": "Slow query detected", "duration_ms": 5234, "threshold_ms": 1000, "query": "SELECT * FROM orders WHERE..."}

// Resource usage
{"timestamp": "...", "level": "INFO", "message": "Memory usage high", "memory_used_mb": 2048, "memory_limit_mb": 2560, "percentage": 80}

// Cache statistics
{"timestamp": "...", "level": "DEBUG", "message": "Cache stats", "cache_hits": 4521, "cache_misses": 234, "hit_rate": 0.95}
```
- What NOT to Log
NEVER log:
- Passwords or authentication tokens
- API keys or secrets
- Private keys or certificates
- Database credentials
- OAuth tokens or refresh tokens
- Credit card numbers
- Social security numbers
- Email addresses (without redaction in logs)
- Personal identification numbers
- Medical records
- Raw HTTP request/response bodies (especially with auth headers)
Be careful with:
- PII in general (name, phone, address) - redact or use anonymized IDs
- Query parameters (may contain secrets)
- Request/response headers (often contain authorization)
- User input (may contain sensitive data)
Security rule: When in doubt, DON'T log it
```python
# BAD - logging credentials
logger.info(f"Login attempt for {username} with password {password}")

# GOOD - logging action without sensitive data
logger.info("Login attempt", extra={"username": username, "method": "password"})

# BAD - logging full request with auth header
logger.debug(f"Request: {request.headers}")

# GOOD - logging request metadata
logger.debug("Incoming request", extra={
    "method": request.method,
    "path": request.path,
    "user_agent": request.headers.get("user-agent"),
})
```
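Redaction can also be enforced centrally instead of relying on every call site. A minimal sketch using a stdlib `logging.Filter` that scrubs sensitive keys attached via `extra` before any handler sees them - the key list is illustrative and should be extended to match your own payloads:

```python
import logging

# Illustrative key list - extend for your own payloads
SENSITIVE_KEYS = {"password", "api_key", "token", "authorization", "credit_card", "ssn"}

class RedactingFilter(logging.Filter):
    """Scrub sensitive values attached via `extra` before any handler sees them."""
    def filter(self, record):
        for key in list(vars(record)):
            if key.lower() in SENSITIVE_KEYS:
                setattr(record, key, "[REDACTED]")
        return True  # never drop the record, only scrub it

logger = logging.getLogger("secure")
logger.addFilter(RedactingFilter())
# The password value is replaced with "[REDACTED]" before emission
logger.warning("login failed", extra={"username": "alice", "password": "hunter2"})
```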
- Distributed Tracing
Trace IDs and Span IDs
- Trace ID: Unique identifier for entire request flow across services
- Span ID: Unique identifier for single operation/service call
- Parent Span ID: Span that initiated current span (for tracing parent-child relationships)
Generated once at entry point, propagated through all downstream calls:
```
Request → [Service A, Trace: abc123]
 ├─ [Span: span1] Database query
 ├─ [Span: span2] → Service B (spans created in B use parent: span2)
 ├─ [Span: span3] Cache lookup
 └─ [Span: span4] External API call
```
Implementation
Python example with trace context:

```python
import uuid

import requests


class RequestContext:
    def __init__(self, trace_id=None, span_id=None, parent_span_id=None):
        self.trace_id = trace_id or str(uuid.uuid4())
        self.span_id = span_id or str(uuid.uuid4())
        self.parent_span_id = parent_span_id


# Middleware/decorator
def trace_request(func):
    def wrapper(*args, **kwargs):
        ctx = RequestContext()
        return func(*args, context=ctx, **kwargs)
    return wrapper


# Propagate to downstream services
def call_downstream_service(service_url, data, context):
    headers = {
        'X-Trace-ID': context.trace_id,
        'X-Parent-Span-ID': context.span_id,  # the current span becomes the parent downstream
    }
    return requests.post(service_url, json=data, headers=headers)
```
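The receiving side mirrors this: reuse the caller's trace ID, record the caller's span as the parent, and mint a fresh span ID for this service's own work. A sketch under the same assumptions as above - the `X-Trace-ID`/`X-Parent-Span-ID` header names are this document's convention, not a standard (production systems usually use the W3C `traceparent` header):

```python
import uuid

# Same RequestContext as the propagation sketch, repeated for self-containment
class RequestContext:
    def __init__(self, trace_id=None, span_id=None, parent_span_id=None):
        self.trace_id = trace_id or str(uuid.uuid4())
        self.span_id = span_id or str(uuid.uuid4())
        self.parent_span_id = parent_span_id

def context_from_headers(headers):
    """Continue the caller's trace: same trace_id, caller's span as parent, fresh span_id."""
    return RequestContext(
        trace_id=headers.get("X-Trace-ID"),
        parent_span_id=headers.get("X-Parent-Span-ID"),
    )

ctx = context_from_headers({"X-Trace-ID": "abc123", "X-Parent-Span-ID": "span2"})
print(ctx.trace_id, ctx.parent_span_id)  # abc123 span2
```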
Sampling Strategies
- No sampling: Log all traces (high-volume services may be expensive)
- Rate sampling: Log every Nth request (e.g., 1 in 100)
- Adaptive sampling: Sample based on error rate, latency, or traffic volume
- Tail sampling: Sample based on trace outcome (errors always sampled)
```python
import random

# Adaptive sampling example
def should_sample(trace):
    # Always sample errors
    if trace.has_error:
        return True
    # Sample slow requests (>1s)
    if trace.duration_ms > 1000:
        return True
    # Sample 1% of normal requests
    return random.random() < 0.01
```
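For rate sampling specifically, hashing the trace ID instead of calling `random()` makes the decision deterministic, so every service in the request path keeps or drops the same traces. A minimal sketch (the function name is illustrative):

```python
import hashlib

def should_sample_deterministic(trace_id, rate=0.01):
    """Keep a fixed fraction of traces, consistently across services.

    Hashing the trace ID means every service in the request path
    makes the same keep/drop decision for a given trace.
    """
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate
```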
- Performance Logging
Execution Time
```python
import time

def log_execution_time(func):
    def wrapper(*args, **kwargs):
        start = time.time()
        try:
            result = func(*args, **kwargs)
            duration_ms = (time.time() - start) * 1000
            logger.info(f"{func.__name__} completed", extra={
                "duration_ms": duration_ms,
                "status": "success",
            })
            return result
        except Exception as e:
            duration_ms = (time.time() - start) * 1000
            logger.error(f"{func.__name__} failed", extra={
                "duration_ms": duration_ms,
                "error": str(e),
            })
            raise
    return wrapper
```
Resource Usage
```python
import os

import psutil

def log_resource_usage():
    process = psutil.Process(os.getpid())
    memory = process.memory_info()

    logger.info("Resource usage", extra={
        "memory_rss_mb": memory.rss / 1024 / 1024,
        "memory_vms_mb": memory.vms / 1024 / 1024,
        "cpu_percent": process.cpu_percent(interval=1),
        "num_threads": process.num_threads(),
    })
```
Slow Query Logs
Track database query performance:

```python
SLOW_QUERY_THRESHOLD_MS = 1000

def execute_query(query, params):
    start = time.time()
    cursor.execute(query, params)  # cursor: an open DB-API cursor
    duration_ms = (time.time() - start) * 1000

    if duration_ms > SLOW_QUERY_THRESHOLD_MS:
        logger.warning("Slow query detected", extra={
            "query": query,
            "params_count": len(params),
            "duration_ms": duration_ms,
            "threshold_ms": SLOW_QUERY_THRESHOLD_MS,
        })
    return cursor.fetchall()
```
- Debugging Patterns
Debug Logging
Use DEBUG level for development/troubleshooting only:
```python
logger.debug("Function entry", extra={
    "function": "process_payment",
    "args": {"amount": 100, "currency": "USD"},
})

logger.debug("Intermediate state", extra={
    "processing_step": "validation",
    "validation_passed": True,
    "timestamp": time.time(),
})

logger.debug("Function exit", extra={
    "function": "process_payment",
    "return_value": {"transaction_id": "txn-123", "status": "pending"},
})
```
Conditional Breakpoints
In an IDE debugger (VS Code, PyCharm, etc.), set a breakpoint with a condition; the debugger pauses only when the condition is true:

```python
if user_id == "debug-user-123":
    processor.process(order)  # Breakpoint here with condition: amount > 1000
```
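The same idea works without an IDE by guarding a `breakpoint()` call in code. A sketch assuming a hypothetical `DEBUG_USER_ID` environment variable - leave it unset in production and the breakpoint is never reached:

```python
import os

def maybe_break(user_id, amount):
    """Pause in pdb only for one specific user and large amounts.

    DEBUG_USER_ID is a hypothetical env var used as the guard;
    unset it and this function never stops execution.
    """
    if os.environ.get("DEBUG_USER_ID") == user_id and amount > 1000:
        breakpoint()  # set PYTHONBREAKPOINT=0 to disable globally
        return True
    return False
```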
Remote Debugging
Python example:

```python
# Start remote debugger (debugpy)
import debugpy

debugpy.listen(("0.0.0.0", 5678))
print("Waiting for debugger to attach...")
debugpy.wait_for_client()

# Then connect from your IDE on the same port
```
Log Aggregation for Debugging
```python
# Retrieve logs for a specific trace
def get_trace_logs(trace_id):
    # Pseudo-SQL - use your log store's client and parameterized queries,
    # never string interpolation, to avoid injection
    query = "SELECT * FROM logs WHERE trace_id = ? ORDER BY timestamp"
    # Execute against log storage (ELK, Loki, etc.)
    return results

# Filter by user for debugging user issues
def get_user_logs(user_id, hours=1):
    query = "SELECT * FROM logs WHERE user_id = ? AND timestamp > now() - ?h"
    return results
```
- Log Management
Log Rotation
Prevent unbounded disk usage:
```python
# Python logging with rotation
from logging.handlers import RotatingFileHandler

handler = RotatingFileHandler(
    filename='app.log',
    maxBytes=10485760,  # 10MB
    backupCount=5,      # Keep 5 rotated files
)

# Backup naming: app.log, app.log.1, app.log.2, etc.
```
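Rotation by time instead of size is often a better fit for retention policies. A sketch using the stdlib `TimedRotatingFileHandler`; it writes to a temp dir only so the snippet runs anywhere - use a real log path in practice:

```python
import logging
import os
import tempfile
from logging.handlers import TimedRotatingFileHandler

# Temp dir so the snippet is runnable anywhere; use a real path in practice
log_path = os.path.join(tempfile.mkdtemp(), "app.log")

handler = TimedRotatingFileHandler(
    filename=log_path,
    when="midnight",  # also "S", "M", "H", "D", or "W0"-"W6" (weekday)
    backupCount=14,   # keep roughly two weeks of rotated files
    utc=True,         # rotate on UTC boundaries, consistent across hosts
)
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.warning("service started")
# Rotated files get a date suffix: app.log.2025-11-17
```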
Retention Policies
```json
{
  "retention_policy": {
    "DEBUG": "7 days",
    "INFO": "30 days",
    "WARN": "90 days",
    "ERROR": "1 year",
    "FATAL": "indefinite"
  }
}
```
Log Aggregation Tools
| Tool | Best For | Strengths |
|------|----------|-----------|
| ELK Stack (Elasticsearch, Logstash, Kibana) | On-premise, complex queries | Powerful search, rich dashboards, customizable |
| Grafana Loki | Simple log aggregation, cost-effective | Low overhead, integrates with Prometheus |
| Datadog | Cloud-first, all-in-one | Agent-based, excellent integrations |
| Splunk | Enterprise, security focus | Powerful search, alerting, compliance reports |
| CloudWatch | AWS native | Seamless AWS integration, log groups |
| Cloud Logging (formerly Stackdriver) | GCP native | Google Cloud integration |
| Azure Monitor Logs | Azure native | Microsoft ecosystem |
- Metrics and Monitoring
Application Metrics
```python
from prometheus_client import Counter, Histogram, Gauge

# Counter: monotonically increasing
login_attempts = Counter('login_attempts_total', 'Total login attempts', ['status'])
login_attempts.labels(status='success').inc()

# Histogram: observe value distribution
request_duration = Histogram('request_duration_seconds', 'Request duration')
request_duration.observe(0.5)

# Gauge: can go up or down
active_connections = Gauge('active_connections', 'Current active connections')
active_connections.set(42)
```
System Metrics
```python
# CPU, memory, disk usage
cpu_percent = psutil.cpu_percent(interval=1)
memory = psutil.virtual_memory()
disk = psutil.disk_usage('/')

logger.info("System metrics", extra={
    "cpu_percent": cpu_percent,
    "memory_percent": memory.percent,
    "disk_percent": disk.percent,
})
```
Alerting Rules
```yaml
# Prometheus alert rules
groups:
  - name: app-alerts
    rules:
      - alert: HighErrorRate
        expr: rate(requests_total{status="500"}[5m]) > 0.05
        for: 5m
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.service }}"

      - alert: SlowRequestLatency
        expr: histogram_quantile(0.95, rate(request_duration_seconds_bucket[5m])) > 1
        for: 10m
        annotations:
          summary: "Slow requests detected (p95 > 1s)"
```
- Common Libraries by Language
Python
```python
# Standard library logging
import logging

# Structured logging with structlog
import structlog

logger = structlog.get_logger()
logger.info("user_created", user_id="u123", email_domain="example.com")

# For distributed tracing
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
```
Libraries:
- logging - Built-in, basic structured support
- structlog - Structured logging, cleaner API
- python-json-logger - JSON formatter for standard logging
- OpenTelemetry - Distributed tracing standard
- Jaeger - Distributed tracing backend
Node.js / TypeScript
```javascript
// Winston
const winston = require('winston');

const logger = winston.createLogger({
  format: winston.format.json(),
  transports: [new winston.transports.Console()]
});

logger.info('User logged in', { userId: 'u123' });

// Pino (lightweight)
const pino = require('pino');
const pinoLogger = pino();
pinoLogger.info({ userId: 'u123' }, 'User logged in');
```
Libraries:
- winston - Full-featured, very popular
- pino - Lightweight, high performance
- bunyan - JSON logging, stream-based
- morgan - HTTP request logger for Express
- OpenTelemetry - Distributed tracing
- @opentelemetry/api - Standard tracing API
Go
```go
// Structured logging with zap
import "go.uber.org/zap"

logger, _ := zap.NewProduction()
defer logger.Sync()

logger.Info("user login",
    zap.String("user_id", "u123"),
    zap.Duration("duration", time.Second),
)

// Or logrus (JSON support)
import "github.com/sirupsen/logrus"

log := logrus.New()
log.SetFormatter(&logrus.JSONFormatter{})
log.WithFields(logrus.Fields{"user_id": "u123"}).Info("Login")
```
Libraries:
- zap - High performance, structured
- logrus - Popular, JSON output
- slog - Standard library (Go 1.21+)
- OpenTelemetry - Distributed tracing
Java / Kotlin
```java
// Logback with SLF4J
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import net.logstash.logback.marker.Markers;

Logger logger = LoggerFactory.getLogger(MyClass.class);

// Structured output via Logstash Logback Encoder
logger.info(Markers.append("user_id", "u123"), "User logged in");

// Spring Boot with Logback (built-in)
@RestController
public class UserController {
    private static final Logger logger = LoggerFactory.getLogger(UserController.class);
}
```
Libraries:
- SLF4J + Logback - Standard combo
- Log4j2 - Enterprise, feature-rich
- Logstash Logback Encoder - Structured output
- OpenTelemetry - Distributed tracing
C# / .NET
```csharp
// Serilog (structured)
using Serilog;

Log.Logger = new LoggerConfiguration()
    .WriteTo.Console()
    .CreateLogger();

Log.Information("User {UserId} logged in", "u123");

// Built-in ILogger with dependency injection
public class UserService
{
    private readonly ILogger<UserService> _logger;

    public UserService(ILogger<UserService> logger)
    {
        _logger = logger;
    }
}
```
Libraries:
- Serilog - Excellent structured support
- NLog - Enterprise logging
- log4net - Classic Apache Log4j port
- Microsoft.Extensions.Logging - Built-in DI support
- OpenTelemetry.Exporter.Console - Tracing
- Example Patterns
Complete Request Logging Pipeline (Python)
```python
import sys
import time
from uuid import uuid4

import structlog

# Configure structlog to emit JSON lines to stdout
structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso", key="timestamp"),
        structlog.processors.add_log_level,
        structlog.processors.JSONRenderer(),
    ],
    context_class=dict,
    logger_factory=structlog.PrintLoggerFactory(file=sys.stdout),
)


class RequestLogger:
    def __init__(self):
        self.logger = structlog.get_logger()

    def log_request_start(self, request):
        trace_id = request.headers.get('X-Trace-ID') or str(uuid4())
        span_id = str(uuid4())
        self.logger.info(
            "request_started",
            trace_id=trace_id,
            span_id=span_id,
            method=request.method,
            path=request.path,
            user_id=getattr(request, 'user_id', None),
        )
        return trace_id, span_id

    def log_request_complete(self, trace_id, span_id, status, duration_ms):
        level = "info" if status < 400 else "warning" if status < 500 else "error"
        getattr(self.logger, level)(
            "request_completed",
            trace_id=trace_id,
            span_id=span_id,
            status_code=status,
            duration_ms=duration_ms,
        )

    def log_error(self, trace_id, span_id, error, context=None):
        self.logger.error(
            "request_error",
            trace_id=trace_id,
            span_id=span_id,
            error_type=type(error).__name__,
            error_message=str(error),
            error_context=context or {},
        )
```
```python
from flask import Flask, request

# Flask integration
app = Flask(__name__)
req_logger = RequestLogger()


@app.before_request
def before_request():
    request.trace_id, request.span_id = req_logger.log_request_start(request)
    request.start_time = time.time()


@app.after_request
def after_request(response):
    duration_ms = (time.time() - request.start_time) * 1000
    req_logger.log_request_complete(
        request.trace_id, request.span_id, response.status_code, duration_ms
    )
    return response


@app.errorhandler(Exception)
def handle_error(error):
    req_logger.log_error(
        request.trace_id, request.span_id, error, context={"path": request.path}
    )
    return {"error": "Internal server error"}, 500
```
Distributed Tracing Example (Node.js)
```typescript
import { trace, context, SpanStatusCode } from '@opentelemetry/api';
import { NodeSDK } from '@opentelemetry/sdk-node';
import { JaegerExporter } from '@opentelemetry/exporter-jaeger';

const sdk = new NodeSDK({
  traceExporter: new JaegerExporter({
    host: process.env.JAEGER_HOST || 'localhost',
    port: parseInt(process.env.JAEGER_PORT || '6831'),
  }),
});

sdk.start();

const tracer = trace.getTracer('my-service');

async function processPayment(userId: string, amount: number) {
  const span = tracer.startSpan('processPayment', {
    attributes: {
      'user_id': userId,
      'amount': amount,
      'currency': 'USD',
    }
  });

  return context.with(trace.setSpan(context.active(), span), async () => {
    try {
      // Nested span
      const validationSpan = tracer.startSpan('validatePayment');
      try {
        await validatePayment(userId, amount);
        validationSpan.setStatus({ code: SpanStatusCode.OK });
      } catch (error) {
        validationSpan.recordException(error);
        validationSpan.setStatus({ code: SpanStatusCode.ERROR });
        throw error;
      } finally {
        validationSpan.end();
      }

      // Call external service with trace propagation
      const result = await callPaymentGateway(amount);
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (error) {
      span.recordException(error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw error;
    } finally {
      span.end();
    }
  });
}
```
Security-Conscious Logging (Go)
```go
package main

import (
    "net/http"

    "go.uber.org/zap"
)

// RedactSensitive removes sensitive fields from log data
func RedactSensitive(data map[string]interface{}) map[string]interface{} {
    sensitiveKeys := []string{"password", "api_key", "token", "credit_card", "ssn"}

    for _, key := range sensitiveKeys {
        if _, exists := data[key]; exists {
            data[key] = "[REDACTED]"
        }
    }
    return data
}

func LogRequest(logger *zap.Logger, r *http.Request) {
    // Extract safe headers only
    safeHeaders := map[string]string{
        "user-agent":   r.Header.Get("User-Agent"),
        "content-type": r.Header.Get("Content-Type"),
    }

    logger.Info("incoming request",
        zap.String("method", r.Method),
        zap.String("path", r.URL.Path),
        zap.Any("headers", safeHeaders),
        zap.String("remote_addr", r.RemoteAddr),
    )
}

func LogError(logger *zap.Logger, err error, context map[string]interface{}) {
    logger.Error("operation failed",
        zap.Error(err),
        zap.Any("context", RedactSensitive(context)),
    )
}
```
- Quick Reference Checklist
When implementing logging/observability:
- Use structured JSON logging
- Include trace_id and span_id in all logs
- Set appropriate log levels (don't over-log)
- Never log passwords, keys, tokens, PII
- Add contextual fields (user_id, request_id, etc.)
- Implement log rotation to prevent disk overflow
- Include stack traces for errors
- Log entry/exit for important functions
- Track execution time for performance monitoring
- Sample high-volume logs to prevent storage/bandwidth issues
- Use existing libraries (structlog, pino, zap, etc.)
- Set up log aggregation (ELK, Loki, Datadog, etc.)
- Create alerting rules for critical errors
- Document logging patterns in team guidelines
- Review logs regularly to spot issues early
Activate this skill when: working with logging systems, distributed tracing, debugging, monitoring, performance analysis, or observability-related tasks.
Combine with: development-philosophy (fail-fast debugging), security-first-design (never log secrets), testing-workflow (use logs to verify behavior).