# Observability Patterns

Patterns for implementing comprehensive observability: logs, metrics, traces, and the correlation between them.
## When to Use This Skill

- Designing an observability strategy
- Implementing the three pillars
- Correlating signals across systems
- Choosing observability tools
- Building monitoring dashboards
## What Is Observability?

Observability is the ability to understand a system's internal state from its external outputs.

It is not just monitoring (known unknowns) but understanding (unknown unknowns).

Traditional monitoring asks: "Is CPU > 80%?"
Observability asks: "Why are users experiencing latency?"
## The Three Pillars

### Overview

```
┌─────────────────────────────────────────────────────────┐
│                      OBSERVABILITY                      │
│                                                         │
│   ┌──────────┐     ┌──────────┐     ┌──────────┐        │
│   │   LOGS   │     │ METRICS  │     │  TRACES  │        │
│   │          │     │          │     │          │        │
│   │  Events  │     │ Counters │     │ Requests │        │
│   │ Details  │     │  Gauges  │     │  Spans   │        │
│   │ Context  │     │  Trends  │     │   Flow   │        │
│   └──────────┘     └──────────┘     └──────────┘        │
│        │                │                │              │
│        └────────────────┼────────────────┘              │
│                         │                               │
│                ┌────────┴────────┐                      │
│                │   CORRELATION   │                      │
│                │   (trace_id)    │                      │
│                └─────────────────┘                      │
└─────────────────────────────────────────────────────────┘
```
Each pillar answers different questions:
- Logs: What happened? (events)
- Metrics: How much/many? (aggregates)
- Traces: Where? (request flow)
### Logs

Purpose: Discrete events with context

Structure:

```json
{
  "timestamp": "2024-01-15T10:30:00.123Z",
  "level": "ERROR",
  "service": "order-service",
  "message": "Payment failed",
  "trace_id": "abc123",
  "span_id": "def456",
  "user_id": "12345",
  "order_id": "ORD-789",
  "error": {
    "code": "CARD_DECLINED",
    "message": "Insufficient funds"
  }
}
```
Best for:
- Debugging specific issues
- Audit trails
- Error details
- Business events
Challenges:
- High volume → storage costs
- Often unstructured → hard to query
- No aggregation → not for trends
### Metrics

Purpose: Numeric measurements over time

Types:

```
┌─────────────────────────────────────────────────────────┐
│ Counter: Cumulative, only increases                     │
│   - http_requests_total                                 │
│   - errors_total                                        │
│   - bytes_transferred                                   │
├─────────────────────────────────────────────────────────┤
│ Gauge: Point-in-time value, can go up/down              │
│   - current_connections                                 │
│   - queue_depth                                         │
│   - temperature                                         │
├─────────────────────────────────────────────────────────┤
│ Histogram: Distribution of values                       │
│   - request_duration_seconds                            │
│   - response_size_bytes                                 │
│   Provides: count, sum, buckets                         │
├─────────────────────────────────────────────────────────┤
│ Summary: Similar to histogram, calculates quantiles     │
│   - request_latency_seconds (p50, p90, p99)             │
└─────────────────────────────────────────────────────────┘
```
Best for:
- Trends and patterns
- Alerting on thresholds
- Dashboards
- Capacity planning
Challenges:
- No event details
- Cardinality limits
- Not request-level
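The four metric types can be sketched with a minimal in-process implementation. This is a toy to illustrate the semantics only; production code would use a client library such as prometheus_client, and all class and variable names here are made up.

```python
import bisect

class Counter:
    """Cumulative, only increases."""
    def __init__(self):
        self.value = 0.0
    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount

class Gauge:
    """Point-in-time value, can go up or down."""
    def __init__(self):
        self.value = 0.0
    def set(self, value):
        self.value = value
    def inc(self, amount=1.0):
        self.value += amount
    def dec(self, amount=1.0):
        self.value -= amount

class Histogram:
    """Distribution: per-bucket counts plus a total count and sum.
    (A Summary keeps the same count/sum but computes quantiles client-side.)"""
    def __init__(self, buckets):
        self.buckets = sorted(buckets)                # bucket upper bounds
        self.counts = [0] * (len(self.buckets) + 1)   # last slot is +Inf
        self.count = 0
        self.sum = 0.0
    def observe(self, value):
        # bisect_left finds the first bucket whose bound is >= value,
        # matching Prometheus "le" (less-or-equal) bucket semantics
        self.counts[bisect.bisect_left(self.buckets, value)] += 1
        self.count += 1
        self.sum += value

requests_total = Counter()
requests_total.inc()

queue_depth = Gauge()
queue_depth.set(7)
queue_depth.dec()

latency = Histogram(buckets=[0.1, 0.5, 1.0])
latency.observe(0.42)   # lands in the le=0.5 bucket
```

The histogram trades precision for aggregatability: buckets can be summed across instances, which is why histograms are usually preferred over summaries in multi-instance services.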
### Traces

Purpose: Request flow across services

Structure:

```
Trace (end-to-end request)
├── Span (API Gateway) - 200ms
│   ├── Span (Auth) - 20ms
│   └── Span (OrderService) - 150ms
│       ├── Span (Database) - 50ms
│       └── Span (PaymentService) - 80ms
│           └── Span (External API) - 60ms
```
Best for:
- Understanding request flow
- Finding bottlenecks
- Debugging distributed issues
- Service dependencies
Challenges:
- Storage intensive
- Requires sampling
- Complex to implement
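The span tree above can be modeled as nested timed spans sharing one trace_id. This is a toy sketch, not the OpenTelemetry API; all names are illustrative.

```python
import time
import uuid
from contextlib import contextmanager

class Span:
    def __init__(self, name, trace_id, parent=None):
        self.name = name
        self.trace_id = trace_id          # shared by every span in the trace
        self.span_id = uuid.uuid4().hex[:16]
        self.parent = parent
        self.children = []
        self.start = self.end = None

@contextmanager
def start_span(name, parent=None):
    # Child spans inherit the trace_id; a root span starts a new trace
    trace_id = parent.trace_id if parent else uuid.uuid4().hex
    span = Span(name, trace_id, parent)
    if parent is not None:
        parent.children.append(span)
    span.start = time.monotonic()
    try:
        yield span
    finally:
        span.end = time.monotonic()

# One request: gateway -> auth, then order service -> database
with start_span("api-gateway") as root:
    with start_span("auth", parent=root):
        pass
    with start_span("order-service", parent=root) as order:
        with start_span("db-query", parent=order):
            pass

# Every span carries the root's trace_id, which is what lets
# logs and metrics be joined back to this trace
assert all(child.trace_id == root.trace_id for child in root.children)
```

Real tracing SDKs add context propagation across process boundaries (e.g. HTTP headers), which is the hard part this sketch omits.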
## Signal Correlation

### Why Correlate?

Without correlation:

- Metrics: "Error rate is high"
- Logs: "Error logs from somewhere"
- Traces: "Some traces show errors"

→ Hard to connect the dots

With correlation:

```
Metrics: "Error rate spike at 10:30"
 └── Click to see: exemplar trace
      └── Click to see: related logs
```

→ Full picture in seconds
### Correlation Methods

1. Trace ID injection: all signals carry the same trace_id (in metrics this is typically attached via exemplars rather than labels, to avoid cardinality blow-up)

   ```
   Log:    {"trace_id": "abc123", "message": "..."}
   Metric: http_requests{trace_id="abc123"}
   Trace:  TraceID = abc123
   ```

2. Exemplars: metrics point to sample traces

   ```
   request_latency = 2.5s
    └── exemplar: trace_id=abc123 → "Show me a slow request"
   ```

3. Time correlation: align signals by timestamp

   ```
   Metric spike at 10:30 → query logs around 10:30 → query traces around 10:30
   ```
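Trace ID injection can be sketched with the standard logging module: a filter stamps the current trace_id onto every record, and a JSON formatter emits it. `current_trace_id`, `TraceIdFilter`, and `JsonFormatter` are illustrative names; the context variable stands in for whatever your tracing SDK actually exposes.

```python
import json
import logging
from contextvars import ContextVar

# Stand-in for the tracing SDK's notion of the "current trace";
# real middleware would set this from incoming request headers.
current_trace_id: ContextVar[str] = ContextVar("trace_id", default="-")

class TraceIdFilter(logging.Filter):
    """Stamp the active trace_id onto every log record."""
    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True

class JsonFormatter(logging.Formatter):
    """Render records as structured JSON including the trace_id."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": record.trace_id,
        })

logger = logging.getLogger("order-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
handler.addFilter(TraceIdFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

current_trace_id.set("abc123")   # per-request, set by tracing middleware
logger.info("Payment failed")    # emitted with "trace_id": "abc123"
```

In a web service the middleware sets `current_trace_id` once per request, and every log line in that request is then joinable with its trace.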
### Unified Query Example

Investigation flow:

1. Dashboard shows a latency spike: `http_request_duration_p99 = 3s`
2. Click on the spike → exemplar trace: `trace_id: abc123`
3. View the trace → slow database span: `db.query: SELECT * FROM orders... (2.5s)`
4. Query logs with the trace_id: `{"trace_id":"abc123","query":"SELECT...","rows":50000}`
5. Root cause identified: missing index causing a full table scan
## OpenTelemetry Unified Approach

OpenTelemetry provides a unified API for all three signals:

```
              Application Code
                       │
                       ▼
┌─────────────────────────────────────────────────────┐
│                 OpenTelemetry SDK                   │
│   ┌─────────┐    ┌─────────┐    ┌─────────┐         │
│   │ Tracer  │    │  Meter  │    │ Logger  │         │
│   │Provider │    │Provider │    │Provider │         │
│   └────┬────┘    └────┬────┘    └────┬────┘         │
│        │              │              │              │
│        └──────────────┼──────────────┘              │
│                       │                             │
│              ┌────────┴────────┐                    │
│              │    Exporters    │                    │
│              └─────────────────┘                    │
└─────────────────────────────────────────────────────┘
                        │
        ┌───────────────┼───────────────┐
        ▼               ▼               ▼
  ┌──────────┐    ┌──────────┐    ┌──────────┐
  │  Tempo   │    │Prometheus│    │   Loki   │
  │ (Traces) │    │(Metrics) │    │  (Logs)  │
  └──────────┘    └──────────┘    └──────────┘
```
## Logging Patterns

### Structured Logging

Unstructured (bad):

```
"User 12345 failed to login: invalid password"
```

Structured (good):

```json
{
  "event": "login_failed",
  "user_id": "12345",
  "reason": "invalid_password",
  "timestamp": "2024-01-15T10:30:00Z",
  "trace_id": "abc123"
}
```
Benefits:
- Queryable: user_id:12345 AND event:login_failed
- Parseable: Automated analysis
- Correlatable: trace_id links to traces
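The "queryable/parseable" benefit is concrete: a structured line is one `json.loads` away from field-level filtering, while the unstructured version needs a hand-written regex per message shape (example strings taken from above).

```python
import json
import re

structured = '{"event": "login_failed", "user_id": "12345", "reason": "invalid_password"}'
unstructured = "User 12345 failed to login: invalid password"

# Structured: parse once, then query by field name
record = json.loads(structured)
assert record["event"] == "login_failed"
assert record["user_id"] == "12345"

# Unstructured: one regex per message shape,
# silently broken by any wording change
match = re.match(r"User (\d+) failed to login: (.+)", unstructured)
assert match is not None
assert match.group(1) == "12345"
```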
### Log Levels
| Level | When to use |
|---|---|
| TRACE | Very detailed, development only |
| DEBUG | Development, verbose |
| INFO | Normal operations, audit events |
| WARN | Degraded, recoverable issues |
| ERROR | Failures requiring attention |
| FATAL | Application cannot continue |
Production typically runs at INFO and above; debug mode at DEBUG and above.
### Log Aggregation Architecture

```
┌─────────────────────────────────────────────────────────┐
│                    Application Pods                     │
│   ┌──────┐    ┌──────┐    ┌──────┐                      │
│   │ App  │    │ App  │    │ App  │  → stdout/stderr     │
│   └──────┘    └──────┘    └──────┘                      │
└─────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────┐
│       Log Collector (Fluentd/Vector/Fluent Bit)         │
│       - Parse logs                                      │
│       - Add metadata (pod, namespace, etc.)             │
│       - Transform/filter                                │
└─────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────┐
│       Storage (Elasticsearch/Loki/CloudWatch)           │
│       - Index for search                                │
│       - Retention policies                              │
└─────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────┐
│       Query Interface (Kibana/Grafana)                  │
│       - Search and filter                               │
│       - Dashboards                                      │
└─────────────────────────────────────────────────────────┘
```
## Metrics Patterns

### Naming Conventions

Format: `[namespace]_[subsystem]_[name]_[unit]`

Examples:

```
http_requests_total
http_request_duration_seconds
http_response_size_bytes
process_cpu_seconds_total
db_connections_current
```
Guidelines:
- Use snake_case
- Include unit suffix (_seconds, _bytes, _total)
- Use base units (seconds not milliseconds)
- Be consistent across services
### Labels/Dimensions

Metrics with labels:

```
http_requests_total{
  method="GET",
  path="/api/users",
  status="200"
}
```

Cardinality warning:

```
http_requests_total{user_id="..."}   // BAD: High cardinality
```
Keep labels low cardinality:
- status: ~5 values (200, 4xx, 5xx...)
- method: ~10 values
- service: ~100 values
- user_id: millions → TOO MANY
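The cost is multiplicative: the number of time series is the product of each label's value count, so one unbounded label ruins the metric. The counts below are illustrative.

```python
# Total time series = product of each label's value count
methods = 10        # GET, POST, PUT, ...
statuses = 5        # 2xx, 3xx, 4xx, 5xx, other
paths = 50          # templated routes, never raw URLs

series = methods * statuses * paths
assert series == 2_500            # manageable

# Add user_id and the same metric explodes
users = 1_000_000
assert series * users == 2_500_000_000   # not viable for a metrics backend
```

High-cardinality identifiers like user_id belong in logs and traces, which are built for per-event detail, not in metric labels.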
### RED Method

For request-based services:

- R - Rate: requests per second (`http_requests_total`)
- E - Errors: failed requests per second (`http_requests_total{status=~"5.."}`)
- D - Duration: latency distribution (`http_request_duration_seconds`)
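One way to make the three signals concrete is to derive them from a window of raw request records. This sketch is illustrative only: the record shape, helper name, and crude p99 are made up for the example.

```python
def red(requests, window_seconds=60):
    """Compute RED signals from (status_code, duration_seconds) records."""
    n = len(requests)
    errors = sum(1 for status, _ in requests if status >= 500)
    durations = sorted(d for _, d in requests)
    # crude p99: value at the 99th-percentile index of the sorted sample
    p99 = durations[min(n - 1, int(n * 0.99))] if durations else 0.0
    return {
        "rate": n / window_seconds,              # R: requests per second
        "error_rate": errors / window_seconds,   # E: errors per second
        "duration_p99": p99,                     # D: latency distribution
    }

sample = [(200, 0.05), (200, 0.07), (500, 2.1), (200, 0.06)]
print(red(sample))
```

In practice a monitoring backend computes these from counters and histograms; this only shows what the three numbers mean.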
### USE Method

For resources (CPU, memory, disk):

- U - Utilization: % of resource used (`cpu_usage_percent`)
- S - Saturation: queued work (`thread_pool_queued_tasks`)
- E - Errors: error count (`disk_errors_total`)
## Dashboards and Alerts

### Dashboard Design

Dashboard hierarchy:

1. Overview (executive level)
   - Key SLOs
   - Error rates
   - Traffic trends
2. Service dashboards
   - RED metrics
   - Dependencies
   - Resource usage
3. Debug dashboards
   - Detailed metrics
   - Component breakdown
   - Query performance
### Alert Design

Good alerts are:

- Actionable: someone can do something about them
- Meaningful: they reflect user impact
- Urgent: they need attention now

Bad alerts:

- CPU > 80% (maybe fine)
- Disk > 90% (too late?)
- Any single error (noise)

Better approach: SLO-based alerting

- "Error budget burning too fast"
- Directly tied to user impact
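The error-budget idea can be sketched numerically: burn rate is the observed error ratio divided by the error ratio the SLO allows, and a page fires when it exceeds a threshold. The 14.4x threshold is a commonly cited value for a 1-hour window against a 30-day budget; treat all numbers and function names as illustrative.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than sustainable the error budget is burning."""
    budget = 1.0 - slo_target        # allowed error fraction, e.g. 0.001
    return error_ratio / budget

def should_page(error_ratio, slo_target=0.999, threshold=14.4):
    # At 14.4x burn, a 30-day budget is gone in about 2 days (30 / 14.4)
    return burn_rate(error_ratio, slo_target) >= threshold

# 2% errors against a 99.9% SLO burns the budget 20x too fast: page
assert should_page(0.020)
# 0.05% errors is only half the sustainable burn rate: no alert
assert not should_page(0.0005)
```

Production setups usually combine multiple windows (e.g. fast 1h and slow 6h burn rates) so that brief blips do not page but sustained burns do.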
## Tool Selection

### Open Source Stack

- Metrics: Prometheus + Grafana
- Logs: Loki + Grafana
- Traces: Jaeger/Tempo + Grafana

Alternative:

- Metrics: VictoriaMetrics + Grafana
- Logs: Elasticsearch + Kibana
- Traces: Zipkin
### Cloud Native
AWS:
- CloudWatch (metrics, logs)
- X-Ray (traces)
GCP:
- Cloud Monitoring (metrics)
- Cloud Logging (logs)
- Cloud Trace (traces)
Azure:
- Azure Monitor (metrics, logs)
- Application Insights (traces)
### Commercial Platforms

Full stack:
- Datadog
- New Relic
- Dynatrace
- Splunk
Benefits: unified experience, managed service, rich features.
Costs: price, vendor lock-in.
## Best Practices

1. Structured logging from day one: don't retrofit later
2. Consistent trace context: propagate trace_id everywhere
3. Metric cardinality awareness: monitor and limit label values
4. Correlation by default: trace_id in logs, exemplars in metrics
5. Alert on symptoms, not causes: "users affected", not "CPU high"
6. Regular observability review: are we seeing what we need?
## Related Skills

- distributed-tracing: deep dive on traces
- slo-sli-error-budget: SLO-based observability
- incident-response: using observability in incidents