Root Cause Analysis with Kopai
Guide for debugging production issues using telemetry data (traces, logs, metrics) via Kopai CLI.
Prerequisites
Ensure you have access to the Kopai app backend and that your services are configured to send their OpenTelemetry data to Kopai. See the otel-instrumentation skill for setup.
RCA Workflow Summary
- Find error traces
- Get full trace context
- Correlate logs with trace
- Check related metrics
- Identify root cause
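The steps above can be sketched offline in Python, assuming the trace and log search results have already been exported as JSON and loaded into lists of dicts. The sample records are hypothetical, and every field name other than TraceId and Duration (StatusCode, SpanName, SpanId, Body) is an assumption about the output shape, not confirmed Kopai output.

```python
from collections import defaultdict

# Hypothetical sample records. Only TraceId and Duration are named in this
# guide; StatusCode, SpanName, SpanId and Body are assumed field names.
spans = [
    {"TraceId": "t1", "SpanId": "a", "SpanName": "GET /checkout", "Duration": 900, "StatusCode": "Error"},
    {"TraceId": "t1", "SpanId": "b", "SpanName": "db.query", "Duration": 850, "StatusCode": "Error"},
    {"TraceId": "t2", "SpanId": "c", "SpanName": "GET /health", "Duration": 5, "StatusCode": "Ok"},
]
logs = [
    {"TraceId": "t1", "Body": "connection pool exhausted"},
    {"TraceId": "t2", "Body": "health check ok"},
]


def summarize_error_traces(spans, logs):
    """Group spans and logs by TraceId and summarize traces containing an error span."""
    spans_by_trace = defaultdict(list)
    for span in spans:
        spans_by_trace[span["TraceId"]].append(span)

    logs_by_trace = defaultdict(list)
    for log in logs:
        logs_by_trace[log["TraceId"]].append(log["Body"])

    report = {}
    for trace_id, trace_spans in spans_by_trace.items():
        # Step 1: keep only traces that contain at least one error span.
        if not any(s["StatusCode"] == "Error" for s in trace_spans):
            continue
        # Steps 2-3: full trace context plus the logs sharing the TraceId.
        longest = max(trace_spans, key=lambda s: s["Duration"])
        report[trace_id] = {
            "longest_span": longest["SpanName"],
            "span_count": len(trace_spans),
            "logs": logs_by_trace.get(trace_id, []),
        }
    return report


report = summarize_error_traces(spans, logs)
```

The longest span is often the root span, because its duration includes its children; comparing child durations against it narrows down where the time actually went.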
Rules
1. Workflow (CRITICAL)
- workflow-find-errors - Find Error Traces
- workflow-get-context - Get Full Trace Context
- workflow-correlate-logs - Correlate Logs with Trace
- workflow-check-metrics - Check Related Metrics
- workflow-identify-cause - Identify Root Cause & Present Findings
2. Patterns (HIGH)
- pattern-http-errors - HTTP Error Debugging
- pattern-slow-requests - Slow Request Analysis
- pattern-distributed - Distributed Failure Tracing
- pattern-log-driven - Log-Driven Investigation
Read rules/<rule-name>.md for details.
Tips
- Always use --json for programmatic analysis
- Pipe to jq for filtering/aggregation
- Start with errors, then trace backwards
- Check span Duration to find bottlenecks
- Correlate TraceId across traces, logs, metrics
- Use --severity-min 17 instead of --severity-text ERROR to catch all error-level logs regardless of text casing. Fall back to --body "error" for errors logged at INFO or with no severity.
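The severity tip can be illustrated with a small Python filter over exported log records. This is a sketch under stated assumptions: the field names SeverityNumber, SeverityText and Body are assumed, the sample records are hypothetical, and the case-insensitive body match is a choice made here rather than documented CLI behavior. In the OpenTelemetry log data model, severity numbers 17 and above cover ERROR and FATAL.

```python
ERROR_SEVERITY_MIN = 17  # lowest ERROR severity number in the OpenTelemetry log data model


def is_error_log(record, min_severity=ERROR_SEVERITY_MIN):
    """Return True if a log record looks like an error.

    Prefers the numeric severity, which does not depend on how the text
    label is cased, and falls back to searching the body for "error".
    """
    severity = record.get("SeverityNumber") or 0  # missing or None counts as 0
    if severity >= min_severity:
        return True
    # Fallback for errors logged at INFO or with no severity at all.
    return "error" in str(record.get("Body", "")).lower()


# Hypothetical records; field names are assumptions, not confirmed Kopai output.
sample_logs = [
    {"SeverityNumber": 17, "SeverityText": "error", "Body": "payment declined"},
    {"SeverityNumber": 9, "SeverityText": "INFO", "Body": "Error: upstream timeout"},
    {"SeverityNumber": 9, "SeverityText": "INFO", "Body": "request completed"},
    {"Body": "unhandled error in worker"},
]

error_bodies = [log["Body"] for log in sample_logs if is_error_log(log)]
```

Matching on the number first avoids missing records whose text label is "error", "Error" or "ERROR"; the body fallback is broader and can produce false positives, so its results are worth reviewing separately.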
References
- trace-filters - Trace search filter options
- log-filters - Log search filter options
- metric-filters - Metric search filter options