# Log Analysis Methodology

## Core Philosophy: Partition-First

**NEVER start by reading raw log samples.**
Logs can be overwhelming. The partition-first approach prevents:

- Missing the forest for the trees
- Wasting time on irrelevant data
- Overwhelming context with noise
## The 4-Step Process

### Step 1: Get Statistics

Before ANY log search, understand the landscape:
**CloudWatch Insights:**

```
# How many errors?
filter @message like /ERROR/ | stats count(*) as total

# Error rate over time
filter @message like /ERROR/ | stats count(*) by bin(5m)

# What types of errors?
filter @message like /ERROR/ | parse @message /(?<error_type>[\w.]+Exception)/ | stats count(*) by error_type | sort count desc
```
**Datadog:**

```
# Error distribution by service
service:* status:error | stats count by service

# Error types
service:myapp status:error | stats count by @error.kind
```
**Questions to answer:**

- What's the total error volume?
- Is it increasing, stable, or decreasing?
- What are the unique error types?
- Which services/hosts are affected?
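
These questions can usually be answered from the aggregated query results alone, without reading a single raw message. A minimal sketch, assuming the stats rows have already been fetched into plain Python lists (the field names and sample values below are illustrative, not a specific platform's schema):

```python
from collections import Counter

# Hypothetical query output: (time_bucket, error_type) pairs from a
# stats-by-bin query; in practice these come from CloudWatch or Datadog.
rows = [
    ("10:00", "TimeoutException"),
    ("10:05", "TimeoutException"),
    ("10:05", "NullPointerException"),
    ("10:10", "TimeoutException"),
]

total_errors = len(rows)
by_type = Counter(error_type for _, error_type in rows)
by_bucket = Counter(bucket for bucket, _ in rows)

# Crude trend check: compare the first and second halves of the window.
buckets = sorted(by_bucket)
half = len(buckets) // 2 or 1
early = sum(by_bucket[b] for b in buckets[:half])
late = sum(by_bucket[b] for b in buckets[half:])
trend = "increasing" if late > early else "decreasing" if late < early else "stable"

print(f"Total errors: {total_errors} ({trend})")
for error_type, count in by_type.most_common():
    print(f"  {error_type}: {count}")
```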
### Step 2: Identify Patterns

Look for correlations:

**Temporal patterns:**

- Did errors start at a specific time?
- Is there periodicity (every hour, every day)?
- Correlation with deployments or traffic spikes?

**Service patterns:**

- Is one service the source?
- Is the error propagating across services?

**Error patterns:**

- What's the most frequent error?
- Are errors clustered or distributed?
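
A quick way to pin down the temporal pattern is to compare each time bucket against a baseline. A rough sketch, assuming binned error counts from a Step 1 stats query (the threshold heuristic is arbitrary and illustrative):

```python
from statistics import median

# Hypothetical count-by-5-minute-bin results: (bucket, error_count).
bins = [("10:00", 3), ("10:05", 2), ("10:10", 4), ("10:15", 41), ("10:20", 37)]

baseline = median(count for _, count in bins)
threshold = max(baseline * 3, baseline + 10)  # arbitrary spike heuristic

spikes = [(bucket, count) for bucket, count in bins if count > threshold]
if spikes:
    print(f"Errors spiked starting at {spikes[0][0]} (baseline ~{baseline}/bin)")
for bucket, count in spikes:
    print(f"  {bucket}: {count} errors")
```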
### Step 3: Sample Strategically

Only NOW read actual log samples:

**Sample from anomalies:**

- Get logs from the peak error time
- Get logs from a normal period for comparison

**Sample by error type:**

- Get examples of each distinct error type
- Limit to 5-10 per type

**Sample around events:**

- Logs before/after a deployment
- Logs around a specific incident timestamp
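
A sketch of per-type sampling, assuming log records have already been fetched as dictionaries with an `error_type` field (illustrative schema, not tied to a specific platform):

```python
from collections import defaultdict

MAX_PER_TYPE = 5  # 5-10 examples per type is usually enough

def sample_by_error_type(records, max_per_type=MAX_PER_TYPE):
    """Keep at most max_per_type examples of each distinct error type."""
    samples = defaultdict(list)
    for record in records:
        error_type = record.get("error_type", "unknown")
        if len(samples[error_type]) < max_per_type:
            samples[error_type].append(record)
    return samples

# Example: records pulled from a narrow window around the error peak.
records = [
    {"error_type": "TimeoutException", "message": "upstream call timed out"},
    {"error_type": "TimeoutException", "message": "db query timed out"},
    {"error_type": "ValidationError", "message": "missing field user_id"},
]
for error_type, examples in sample_by_error_type(records).items():
    print(error_type, len(examples))
```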
### Step 4: Correlate with Events

Connect logs to system changes:

```
# Use git_log to find recent deployments
git_log --since="2 hours ago"

# Use get_deployment_history for K8s
get_deployment_history deployment=api-server
```

Compare log patterns before/after changes.
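
`git_log` and `get_deployment_history` above are tool names; the same correlation can be sketched with the plain `git` CLI, assuming the spike start time is already known from Step 2 (the timestamp and window below are illustrative):

```python
import subprocess
from datetime import datetime, timedelta, timezone

# Known from Step 2 (illustrative value).
spike_start = datetime(2024, 5, 1, 10, 15, tzinfo=timezone.utc)
window = timedelta(minutes=30)

# Commits from the last two hours, one per line: "<iso-timestamp> <subject>".
out = subprocess.run(
    ["git", "log", "--since=2 hours ago", "--format=%cI %s"],
    capture_output=True, text=True, check=True,
).stdout

for line in out.splitlines():
    timestamp, _, subject = line.partition(" ")
    committed = datetime.fromisoformat(timestamp)
    if abs(committed - spike_start) <= window:
        print(f"Candidate change near spike: {subject} ({timestamp})")
```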
## Platform-Specific Tips

### CloudWatch Insights

**Best practices:**

```
# Always include time filter
filter @timestamp > ago(1h)

# Use parse for structured extraction
parse @message /status=(?<status>\d+)/

# Aggregate before displaying
stats count(*) by status | sort count desc | limit 10
```
**Common queries:**

```
# Latency distribution
filter @type = "REPORT" | stats avg(@duration) as avg, pct(@duration, 95) as p95, pct(@duration, 99) as p99

# Error messages with context
filter @message like /ERROR/ | fields @timestamp, @message | sort @timestamp desc | limit 20
```
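
A sketch of running one of these queries from Python with boto3's `start_query`/`get_query_results` calls, with an explicit bounded time window (the log group name is a placeholder):

```python
import time
from datetime import datetime, timedelta, timezone

import boto3

logs = boto3.client("logs")

# Bounded time window: always pass explicit start/end epochs.
end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)

query_id = logs.start_query(
    logGroupName="/aws/lambda/my-function",  # placeholder log group
    startTime=int(start.timestamp()),
    endTime=int(end.timestamp()),
    queryString=(
        "filter @message like /ERROR/ "
        "| parse @message /(?<error_type>[\\w.]+Exception)/ "
        "| stats count(*) by error_type | sort count desc | limit 10"
    ),
)["queryId"]

# Poll until the query finishes, then print aggregated rows only.
while True:
    resp = logs.get_query_results(queryId=query_id)
    if resp["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)

for row in resp.get("results", []):
    print({field["field"]: field["value"] for field in row})
```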
### Datadog Logs

**Query syntax:**

```
# Filter by service and status
service:api-gateway status:error

# Field queries
@http.status_code:>=500

# Wildcard
@error.message:*timeout*

# Time comparison
service:api (now-1h TO now) vs (now-25h TO now-24h)
```
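
A sketch of the same kind of query through the Datadog Logs Search API (v2), assuming `DD_API_KEY` and `DD_APP_KEY` are set in the environment; the host may differ for EU or other sites:

```python
import os

import requests

# Datadog Logs Search API (v2).
url = "https://api.datadoghq.com/api/v2/logs/events/search"
headers = {
    "DD-API-KEY": os.environ["DD_API_KEY"],
    "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
    "Content-Type": "application/json",
}
body = {
    "filter": {
        "query": "service:api-gateway status:error",
        "from": "now-1h",
        "to": "now",
    },
    "page": {"limit": 25},  # never pull unbounded result sets
    "sort": "-timestamp",
}

resp = requests.post(url, headers=headers, json=body, timeout=30)
resp.raise_for_status()
for event in resp.json().get("data", []):
    attrs = event.get("attributes", {})
    print(attrs.get("timestamp"), str(attrs.get("message", ""))[:120])
```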
### Kubernetes Logs

**Use get_pod_logs wisely:**

- Always specify tail_lines (default: 100)
- Filter to specific containers in multi-container pods
- Use get_pod_events first for crashes/restarts
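
`get_pod_logs` and `get_pod_events` are tool names; the equivalent kubectl calls look roughly like this (pod, namespace, and container names are placeholders):

```python
import subprocess

pod, namespace, container = "api-server-7d9c4", "prod", "app"  # placeholders

def kubectl(*args):
    """Run a kubectl command and return its stdout."""
    return subprocess.run(["kubectl", *args], capture_output=True, text=True).stdout

# Check events first: crash loops and OOM kills show up here, not in the logs.
print(kubectl("get", "events", "-n", namespace,
              "--field-selector", f"involvedObject.name={pod}"))

# Then pull a bounded log window from one container only.
print(kubectl("logs", pod, "-n", namespace, "-c", container,
              "--tail=100", "--since=30m"))
```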
## Anti-Patterns to Avoid

- **Dumping all logs** - Never request unbounded log queries
- **Starting with samples** - Always get statistics first
- **Ignoring time windows** - Narrow to the incident window
- **Missing correlation** - Always connect to deployments/changes
- **Single-service focus** - Check upstream/downstream services
## Investigation Template

```markdown
## Log Analysis Report

### Statistics
- Time window: [start] to [end]
- Total log volume: X events
- Error count: Y events (Z%)
- Error rate trend: [increasing/stable/decreasing]

### Top Error Types
- [ErrorType1]: N occurrences - [description]
- [ErrorType2]: M occurrences - [description]

### Temporal Pattern
- Errors started at: [timestamp]
- Correlation: [deployment X / traffic spike / external event]

### Sample Errors
[Quote 2-3 representative error messages]

### Root Cause Hypothesis
[Based on patterns, what's the likely cause?]
```