log-analysis

Log Analysis Methodology

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "log-analysis" with this command: npx skills add incidentfox/incidentfox/incidentfox-incidentfox-log-analysis

Core Philosophy: Partition-First

NEVER start by reading raw log samples.

Logs can be overwhelming. The partition-first approach prevents:

  • Missing the forest for the trees

  • Wasting time on irrelevant data

  • Overwhelming context with noise

The 4-Step Process

Step 1: Get Statistics

Before ANY log search, understand the landscape:

CloudWatch Insights:

How many errors?

filter @message like /ERROR/ | stats count(*) as total

Error rate over time

filter @message like /ERROR/ | stats count(*) by bin(5m)

What types of errors?

filter @message like /ERROR/ | parse @message /(?<error_type>[\w.]+Exception)/ | stats count(*) as error_count by error_type | sort error_count desc

Datadog:

Error distribution by service

service:* status:error | stats count by service

Error types

service:myapp status:error | stats count by @error.kind

Questions to answer:

  • What's the total error volume?

  • Is it increasing, stable, or decreasing?

  • What are the unique error types?

  • Which services/hosts are affected?
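When the logs are only available as plain text, the same four questions can be answered locally before reading any individual line. The following is a minimal Python sketch, not part of the skill itself. It assumes a hypothetical line layout of "<ISO timestamp> <LEVEL> <service> <message>", and the sample lines are invented for illustration.

```python
import re
from collections import Counter
from datetime import datetime, timedelta

# Assumed (hypothetical) line layout: "<ISO timestamp> <LEVEL> <service> <message>"
LINE_RE = re.compile(r"^(\S+)\s+(\w+)\s+(\S+)\s+(.*)$")
# Same exception pattern as the CloudWatch parse example
ERROR_TYPE_RE = re.compile(r"[\w.]+Exception")


def summarize(lines, bin_minutes=5):
    """Count errors in total, per time bin, per error type, and per service."""
    total = 0
    by_bin, by_type, by_service = Counter(), Counter(), Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match is None or match.group(2) != "ERROR":
            continue
        ts = datetime.fromisoformat(match.group(1))
        # Floor the timestamp to the start of its bin (bins restart every hour)
        bin_start = ts - timedelta(
            minutes=ts.minute % bin_minutes,
            seconds=ts.second,
            microseconds=ts.microsecond,
        )
        total += 1
        by_bin[bin_start] += 1
        by_service[match.group(3)] += 1
        found = ERROR_TYPE_RE.search(match.group(4))
        by_type[found.group(0) if found else "unknown"] += 1
    return {"total": total, "by_bin": by_bin, "by_type": by_type, "by_service": by_service}


# Invented sample data, for illustration only
SAMPLE = [
    "2024-05-01T10:02:30 ERROR api java.net.SocketTimeoutException: read timed out",
    "2024-05-01T10:03:10 ERROR api java.net.SocketTimeoutException: read timed out",
    "2024-05-01T10:07:45 ERROR db java.sql.SQLException: connection refused",
    "2024-05-01T10:08:00 INFO api request ok",
]

stats = summarize(SAMPLE)
print(stats["total"], stats["by_type"].most_common())
```

Each counter maps to one question: total gives the volume, by_bin the trend, by_type the unique error types, and by_service the affected services.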

Step 2: Identify Patterns

Look for correlations:

Temporal patterns:

  • Did errors start at a specific time?

  • Is there periodicity (every hour, every day)?

  • Correlation with deployments or traffic spikes?

Service patterns:

  • Is one service the source?

  • Is the error propagating across services?

Error patterns:

  • What's the most frequent error?

  • Are errors clustered or distributed?
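Two of these checks are easy to make mechanical once Step 1 has produced counts: finding when errors started, and measuring how concentrated they are. This is a rough sketch under stated assumptions. The function names, the "3x the baseline" threshold, and the sample numbers are illustrative choices, not values prescribed by the methodology.

```python
from collections import Counter


def find_onset(bin_counts, baseline_bins=3, factor=3.0):
    """Return the label of the first bin that exceeds factor x the baseline mean.

    bin_counts is a time-ordered list of (label, count) pairs; the first
    baseline_bins entries are assumed to represent normal behavior.
    """
    baseline = sum(count for _, count in bin_counts[:baseline_bins]) / baseline_bins
    threshold = max(baseline * factor, 1.0)
    for label, count in bin_counts[baseline_bins:]:
        if count > threshold:
            return label
    return None


def dominant_share(counts):
    """Return (key, share of total) for the largest contributor in a Counter."""
    total = sum(counts.values())
    if total == 0:
        return None, 0.0
    key, n = counts.most_common(1)[0]
    return key, n / total


# Invented numbers, for illustration only
bins = [("10:00", 2), ("10:05", 3), ("10:10", 1), ("10:15", 4), ("10:20", 25), ("10:25", 30)]
onset = find_onset(bins)
source, share = dominant_share(Counter({"api": 45, "db": 5}))
print(onset, source, share)
```

A high share for a single service or error type suggests a clustered failure with one source. A flat distribution points toward a shared dependency or a propagating error.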

Step 3: Sample Strategically

Only NOW read actual log samples:

Sample from anomalies:

  • Get logs from the peak error time

  • Get logs from normal time for comparison

Sample by error type:

  • Get examples of each distinct error type

  • Limit to 5-10 per type

Sample around events:

  • Logs before/after a deployment

  • Logs around a specific incident timestamp
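The per-type cap can be enforced in code so that one noisy error type cannot crowd out the rest. A minimal sketch, assuming events have already been reduced to (error_type, message) pairs; the function name and the sample events are hypothetical.

```python
from collections import defaultdict


def sample_by_type(events, per_type=5):
    """Keep at most per_type messages for each distinct error type.

    events is an iterable of (error_type, message) pairs. The first messages
    seen for each type are kept, so the result is deterministic.
    """
    samples = defaultdict(list)
    for error_type, message in events:
        if len(samples[error_type]) < per_type:
            samples[error_type].append(message)
    return dict(samples)


# Invented events, for illustration only
events = [("TimeoutException", f"timeout on request {i}") for i in range(8)]
events.append(("SQLException", "connection refused"))

samples = sample_by_type(events, per_type=5)
for error_type, messages in samples.items():
    print(error_type, len(messages))
```

Keeping the first occurrences rather than a random draw is a design choice made here for reproducibility. Random sampling within each type is an equally reasonable variant.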

Step 4: Correlate with Events

Connect logs to system changes:

Use git_log to find recent commits (candidate deployments)

git_log --since="2 hours ago"

Use get_deployment_history for K8s

get_deployment_history deployment=api-server

Compare log patterns before/after changes
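The before/after comparison reduces to counting errors in two equal windows on either side of the change. The sketch below assumes error timestamps and the deployment time are already known. The function name, the 30-minute window, and the sample data are illustrative only.

```python
from datetime import datetime, timedelta


def compare_around(error_times, event_time, window=timedelta(minutes=30)):
    """Count errors in the window before and the window after event_time."""
    before = sum(1 for t in error_times if event_time - window <= t < event_time)
    after = sum(1 for t in error_times if event_time <= t < event_time + window)
    return before, after


# Invented deployment time and error timestamps, for illustration only
deploy = datetime(2024, 5, 1, 12, 0)
errors = [deploy - timedelta(minutes=m) for m in (25, 10)]
errors += [deploy + timedelta(minutes=m) for m in (1, 2, 3, 5, 8, 13)]

before, after = compare_around(errors, deploy)
print(f"before={before} after={after}")
```

A jump right after the change supports a deployment-related hypothesis but does not prove it. A traffic spike in the same window would produce a similar picture.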

Platform-Specific Tips

CloudWatch Insights

Best practices:

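Always bound the time range

Set it with the console time picker, or with startTime and endTime on the StartQuery API (Logs Insights has no ago() function)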

Use parse for structured extraction

parse @message /status=(?<status>\d+)/

Aggregate before displaying

stats count(*) as request_count by status | sort request_count desc | limit 10

Common queries:

Latency distribution

filter @type = "REPORT" | stats avg(@duration) as avg, pct(@duration, 95) as p95, pct(@duration, 99) as p99

Error messages with context

filter @message like /ERROR/ | fields @timestamp, @message | sort @timestamp desc | limit 20
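When these queries are run programmatically, the time range travels alongside the query string rather than inside it. The sketch below only builds the keyword arguments that boto3's CloudWatch Logs start_query call accepts (logGroupName, startTime, endTime, queryString, limit), with times as epoch seconds. The helper name and the log group are hypothetical, and no AWS call is made.

```python
from datetime import datetime, timedelta, timezone


def build_insights_query(log_group, query, end, lookback=timedelta(hours=1), limit=20):
    """Build start_query keyword arguments with an explicit, bounded time range."""
    if lookback <= timedelta(0):
        raise ValueError("lookback must be positive so the query stays bounded")
    start = end - lookback
    return {
        "logGroupName": log_group,
        "startTime": int(start.timestamp()),  # epoch seconds
        "endTime": int(end.timestamp()),
        "queryString": query,
        "limit": limit,
    }


# Hypothetical log group name, for illustration only
params = build_insights_query(
    "/aws/lambda/my-function",
    "filter @message like /ERROR/ | stats count(*) as total",
    end=datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc),
)
print(params["startTime"], params["endTime"])
# With AWS credentials configured, this could be passed on as:
#   boto3.client("logs").start_query(**params)
```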

Datadog Logs

Query syntax:

Filter by service and status

service:api-gateway status:error

Field queries

@http.status_code:>=500

Wildcard

@error.message:*timeout*

Time comparison (run the same query over two windows and compare the counts)

service:api over [now-1h, now] vs. [now-25h, now-24h]
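The time comparison amounts to counting matching events in the last hour and in the same hour one day earlier. A minimal sketch of that arithmetic, assuming the event timestamps have already been exported from the log platform; the function name and sample data are hypothetical.

```python
from datetime import datetime, timedelta


def window_ratio(event_times, now, window=timedelta(hours=1), offset=timedelta(hours=24)):
    """Compare the event count in the latest window with the same window one offset earlier."""
    current = sum(1 for t in event_times if now - window <= t < now)
    baseline = sum(1 for t in event_times if now - offset - window <= t < now - offset)
    ratio = current / baseline if baseline else None  # None when there is no baseline
    return current, baseline, ratio


# Invented timestamps, for illustration only
now = datetime(2024, 5, 2, 12, 0)
events = [now - timedelta(minutes=m) for m in (5, 10, 20, 30, 40, 50)]
events += [now - timedelta(hours=24, minutes=m) for m in (15, 45)]

current, baseline, ratio = window_ratio(events, now)
print(current, baseline, ratio)
```

Comparing against the same hour on the previous day controls for daily traffic cycles, which a comparison against the preceding hour would not.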

Kubernetes Logs

Use get_pod_logs wisely:

  • Always specify tail_lines (default: 100)

  • Filter to specific containers in multi-container pods

  • Use get_pod_events first for crashes/restarts
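The signature of get_pod_logs is not shown here, so the sketch below expresses the same discipline with plain kubectl instead. It builds a bounded "kubectl logs" command using the real --tail, --container, --namespace, and --previous flags. The helper name and the pod and container names are hypothetical.

```python
def kubectl_logs_args(pod, namespace="default", container=None, tail_lines=100, previous=False):
    """Build a bounded `kubectl logs` argument list; never returns an untailed query."""
    if tail_lines <= 0:
        raise ValueError("tail_lines must be positive to keep the query bounded")
    args = ["kubectl", "logs", pod, "--namespace", namespace, f"--tail={tail_lines}"]
    if container:
        # Needed for multi-container pods to select one container
        args += ["--container", container]
    if previous:
        # Logs from the previous container instance, useful after a crash or restart
        args.append("--previous")
    return args


# Hypothetical pod and container names, for illustration only
cmd = kubectl_logs_args("api-server-abc123", container="app", previous=True)
print(" ".join(cmd))
# The list can be passed to subprocess.run(cmd, capture_output=True, text=True)
```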

Anti-Patterns to Avoid

  • Dumping all logs - Never request unbounded log queries

  • Starting with samples - Always get statistics first

  • Ignoring time windows - Narrow to incident window

  • Missing correlation - Always connect to deployments/changes

  • Single-service focus - Check upstream/downstream services

Investigation Template

Log Analysis Report

Statistics

  • Time window: [start] to [end]
  • Total log volume: X events
  • Error count: Y events (Z%)
  • Error rate trend: [increasing/stable/decreasing]

Top Error Types

  1. [ErrorType1]: N occurrences - [description]
  2. [ErrorType2]: M occurrences - [description]

Temporal Pattern

  • Errors started at: [timestamp]
  • Correlation: [deployment X / traffic spike / external event]

Sample Errors

[Quote 2-3 representative error messages]

Root Cause Hypothesis

[Based on patterns, what's the likely cause?]
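The statistics portion of the template can be filled in directly from the numbers gathered in Step 1. This is a small sketch, with a hypothetical function name and invented figures; the sample quotes and the root cause hypothesis are left for the analyst to write.

```python
def render_report(start, end, total_events, error_count, trend, top_types):
    """Render the Statistics and Top Error Types sections of the report.

    top_types is a list of (error_type, occurrences), most frequent first.
    """
    pct = 100.0 * error_count / total_events if total_events else 0.0
    lines = [
        "Log Analysis Report",
        "",
        "Statistics",
        f"  • Time window: {start} to {end}",
        f"  • Total log volume: {total_events} events",
        f"  • Error count: {error_count} events ({pct:.1f}%)",
        f"  • Error rate trend: {trend}",
        "",
        "Top Error Types",
    ]
    for rank, (error_type, occurrences) in enumerate(top_types, start=1):
        lines.append(f"  {rank}. {error_type}: {occurrences} occurrences")
    return "\n".join(lines)


# Invented figures, for illustration only
report = render_report(
    "2024-05-01 10:00", "2024-05-01 11:00", 1000, 50, "increasing",
    [("TimeoutException", 40), ("SQLException", 10)],
)
print(report)
```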

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
