logging-best-practices

Use before implementing logs in a medium to large scale production system.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "logging-best-practices" with this command: npx skills add nbbaier/agent-skills/nbbaier-agent-skills-logging-best-practices

This skill is adpated from "Logging sucks. And here's how to make it better. by Boris Tane.

When helping with logging, observability, or debugging strategies, follow these principles:

Core Philosophy

  • Logs are optimized for querying, not writing — always design with debugging in mind
  • Context is everything — a log without correlation IDs is useless in distributed systems
  • Logs are for humans during incidents, not just for compliance or "just in case"
  • If you can't filter and search your logs effectively, they provide zero value

Structured Logging Requirements

  • Always use key-value pairs (JSON) instead of string interpolation
  • Bad: "Payment failed for user 123"
  • Good: {"event": "payment_failed", "user_id": "123", "reason": "insufficient_funds", "amount": 99.99}
  • Structured logs are machine-parseable, enabling aggregation, alerting, and dashboards

Required Fields for Every Log Event

  • timestamp — ISO 8601 with timezone (e.g., 2025-01-24T20:00:00Z)
  • level — debug, info, warn, error (be consistent, don't invent new levels)
  • event — machine-readable event name, snake_case (e.g., user_login_success)
  • request_id or trace_id — for correlating logs across a single request
  • service — which service/application emitted this log
  • environment — prod, staging, dev

Examples of High-Cardinality Fields (always include when available):

  • user_id, org_id, account_id — who is affected
  • request_id, trace_id, span_id — for distributed tracing
  • order_id, transaction_id, job_id — domain-specific identifiers

These fields are what make logs actually queryable during incidents. Without them, you're grepping through millions of lines blindly.

Look for opportunities for high-cardinality fields that can help you identify the root cause of an issue quickly.

Context Propagation

  • Pass trace/request IDs through all service boundaries (HTTP headers, message queues, etc.)
  • Downstream services must inherit correlation IDs from upstream
  • Use middleware or interceptors to automatically inject context into every log
  • For async jobs, store and restore the original request context

Log Levels — Use Them Correctly:

  • debug — Verbose details for local development, usually disabled in production
  • info — Normal operations worth recording (user actions, job completions, deploys)
  • warn — Something unexpected happened but the system handled it (retries, fallbacks)
  • error — Something failed and likely needs human attention (exceptions, failed requests)

Don't log errors for expected conditions (e.g., user enters wrong password)

What to Log:

  • Request entry and exit points (with duration)
  • State transitions (order created → paid → shipped)
  • External service calls (with latency and response codes)
  • Authentication and authorization events
  • Background job starts, completions, and failures
  • Retry attempts and circuit breaker state changes

What NOT to Log:

  • Sensitive data (passwords, tokens, PII, credit card numbers)
  • Logs inside tight loops (will generate millions of useless entries)
  • Success cases that provide no debugging value
  • Redundant information already captured by infrastructure (load balancer logs, etc.)

Naming Conventions:

  • Be consistent across all services — agree on field names as a team
  • Use snake_case for field names: user_id, not userId or user-id
  • Use past-tense verbs for events: payment_completed, not complete_payment
  • Prefix events by domain when helpful: auth.login_failed, billing.invoice_created

Performance Considerations:

  • Use sampling for high-volume debug logs in production
  • Avoid logging inside hot paths unless absolutely necessary
  • Buffer and batch log writes to reduce I/O overhead
  • Consider log levels that can be changed at runtime without redeploying

During Incidents:

  • Your logs should answer: Who was affected? What failed? When? Why?
  • If you can't answer these within 5 minutes of querying, your logging strategy needs work
  • Post-incident: add the logs you wished you had

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

Automation

logging-best-practices

No summary provided by upstream source.

Repository SourceNeeds Review
Automation

ideation

No summary provided by upstream source.

Repository SourceNeeds Review
Automation

react-effect-patterns

No summary provided by upstream source.

Repository SourceNeeds Review
Automation

implementation-guide

No summary provided by upstream source.

Repository SourceNeeds Review