error-monitoring-agent
Catch errors before users report them. Group similar issues, alert on spikes, and auto-resolve known problems, with no configuration required to get started.
What It Does
- Real-time detection: Monitor logs, APIs, and workers for errors
- Smart grouping: Merge similar stack traces to reduce noise by 90%
- Rate alerts: Alert when the error rate spikes or new error types appear
- Root cause: Correlate errors with deploys and config changes
- Auto-resolve: Apply known fixes automatically (restart, retry, rollback)
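To make the smart grouping idea concrete, here is a minimal sketch of one common approach: reduce each error to a fingerprint built from its type, a normalized message, and its top stack frames, then count errors per fingerprint. The function names, the `{ type, message, frames }` error shape, and the choice of three frames are assumptions for illustration, not the actual internals of `monitor.js`.

```javascript
// Hypothetical sketch of stack-trace grouping; not the monitor.js implementation.

// Build a grouping key that ignores details which vary between occurrences
// of the same bug (numeric IDs, hex addresses, line and column numbers).
function fingerprint({ type, message, frames }) {
  const normalizedMessage = message
    .replace(/0x[0-9a-f]+/gi, "<hex>")
    .replace(/\d+/g, "<n>");
  const topFrames = frames
    .slice(0, 3) // assumption: the top three frames identify the failure site
    .map((frame) => frame.replace(/:\d+(?::\d+)?\)?$/, ""));
  return [type, normalizedMessage, ...topFrames].join("|");
}

// Merge errors that share a fingerprint and return groups, largest first.
function groupErrors(errors) {
  const groups = new Map();
  for (const err of errors) {
    const key = fingerprint(err);
    const group = groups.get(key) ?? { key, count: 0, sample: err };
    group.count += 1;
    groups.set(key, group);
  }
  return [...groups.values()].sort((a, b) => b.count - a.count);
}

const sampleErrors = [
  { type: "TypeError", message: "User 1234 not found", frames: ["getUser (src/db.js:42:13)"] },
  { type: "TypeError", message: "User 987 not found", frames: ["getUser (src/db.js:42:13)"] },
  { type: "TimeoutError", message: "upstream timed out", frames: ["fetchOrders (src/orders.js:7:3)"] },
];
console.log(groupErrors(sampleErrors).map((g) => `${g.sample.type} x${g.count}`));
// prints [ 'TypeError x2', 'TimeoutError x1' ]
```

The two `TypeError` entries differ only in the user ID, so they collapse into one group. How aggressively to normalize is a trade-off: too little leaves duplicates, too much merges unrelated failures.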
Quick Start
# 1. Start monitoring
node monitor.js watch --source logs,api
# 2. Check current errors
node monitor.js status
# 3. Set up an alert
node monitor.js alert --rule "error_rate > 10/min" --channel slack
# 4. View top errors
node monitor.js aggregate --top 10
Common Use Cases
🚨 Alert on Error Spikes
# Alert when error rate exceeds threshold
node monitor.js alert --rule "error_rate > 10/min" --channel slack
# Alert on new error types
node monitor.js alert --rule "new_error_type" --channel pagerduty
# Alert on spike vs baseline
node monitor.js alert --rule "error_spike > 3x_baseline" --channel email
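The rate and spike rules above can be read as simple window comparisons. The sketch below shows one plausible interpretation: `error_rate > 10/min` compares the last minute's error count to a fixed threshold, and `error_spike > 3x_baseline` compares it to a multiple of the average per-minute count over a preceding period. The function names and the 60-minute baseline are assumptions; the source does not specify how `monitor.js` computes its baseline.

```javascript
// Hypothetical sketch of rule evaluation; not the monitor.js rule engine.

// Count timestamps (in ms) that fall in the half-open window (now - windowMs, now].
function countInWindow(timestamps, now, windowMs) {
  return timestamps.filter((t) => t > now - windowMs && t <= now).length;
}

// "error_rate > N/min": errors in the last minute exceed a fixed threshold.
function rateExceeded(timestamps, now, perMinuteThreshold) {
  return countInWindow(timestamps, now, 60_000) > perMinuteThreshold;
}

// "error_spike > Kx_baseline": the last minute's count exceeds K times the
// average per-minute count over the baseline period that precedes it.
function spikeDetected(timestamps, now, factor, baselineMinutes = 60) {
  const current = countInWindow(timestamps, now, 60_000);
  const baselineTotal = countInWindow(
    timestamps,
    now - 60_000, // baseline ends where the current minute begins
    baselineMinutes * 60_000
  );
  const baselinePerMinute = baselineTotal / baselineMinutes;
  // With an empty baseline, any error in the current minute counts as a
  // spike; a production rule would probably also require a minimum count.
  return current > factor * baselinePerMinute;
}
```

A fixed threshold suits services with steady traffic, while the baseline comparison adapts to services whose normal error volume varies.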
🔍 Investigate Incident
# Find all errors in time window
node monitor.js aggregate --time-window 1h --top 20
# Analyze specific error
node monitor.js analyze --error-id err_abc123 --depth 5
# Correlate with recent changes
node monitor.js analyze --correlate deploy-log,config-change
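As a rough illustration of what correlating with `deploy-log,config-change` could mean, the sketch below lists the changes that landed shortly before an error was first seen, most recent first. The `{ id, kind, at }` change records, the `suspectChanges` name, and the 30-minute lookback are all hypothetical; the actual correlation logic in `monitor.js` is not described in this document.

```javascript
// Hypothetical sketch of change correlation; not the monitor.js implementation.

// Return changes that happened at or before `firstSeen` (ms timestamp) and
// within `lookbackMs` of it, ordered from most to least recent.
function suspectChanges(firstSeen, changes, lookbackMs = 30 * 60_000) {
  return changes
    .filter((c) => c.at <= firstSeen && firstSeen - c.at <= lookbackMs)
    .sort((a, b) => b.at - a.at); // filter() returned a copy, so the input is untouched
}

const changeLog = [
  { id: "deploy-41", kind: "deploy", at: 1_000 },
  { id: "cfg-7", kind: "config", at: 1_000_000 },
  { id: "deploy-42", kind: "deploy", at: 1_500_000 },
  { id: "deploy-43", kind: "deploy", at: 2_100_000 },
];
console.log(suspectChanges(2_000_000, changeLog).map((c) => c.id));
// prints [ 'deploy-42', 'cfg-7' ]
```

Proximity in time is only a hint. A change that precedes the first occurrence is a suspect, not a confirmed cause.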
🤖 Auto-Resolve Known Issues
# Enable auto-resolution
node monitor.js auto-resolve --strategy restart,retry,rollback
# Apply approved fixes only
node monitor.js auto-resolve --known-fixes db --apply-approved
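One way to think about applying approved fixes only is as a gate between the fixes known for an error and the strategies an operator has approved. The sketch below is an assumption about how such a gate might work: the `knownFixes` lookup keyed by error fingerprint, the `planResolution` name, and the `escalate` fallback are hypothetical and not taken from `monitor.js`.

```javascript
// Hypothetical sketch of an approval gate; not the monitor.js implementation.

// Pick the first known fix for this error that is also approved.
// If none qualifies, hand the error to a human instead of acting.
function planResolution(errorKey, knownFixes, approvedStrategies) {
  const candidates = knownFixes[errorKey] ?? [];
  const approved = new Set(approvedStrategies);
  const strategy = candidates.find((s) => approved.has(s));
  if (strategy) {
    return { action: "apply", strategy };
  }
  return {
    action: "escalate",
    reason: candidates.length > 0 ? "no approved strategy" : "unknown error",
  };
}

const knownFixes = {
  "db-connection-reset": ["retry-request", "restart-service"],
  "bad-release": ["rollback"],
};
const approvedList = ["restart-service", "retry-request"];

console.log(planResolution("db-connection-reset", knownFixes, approvedList));
// prints { action: 'apply', strategy: 'retry-request' }
console.log(planResolution("bad-release", knownFixes, approvedList));
// prints { action: 'escalate', reason: 'no approved strategy' }
```

Ordering the candidates from least to most disruptive means a retry is tried before a restart, and a rollback happens only if it has been explicitly approved.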
📊 Track Error Budget
# Check error rate vs SLO
node monitor.js budget --slo 99.9% --window 30d
# View error budget remaining
node monitor.js budget --remaining
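The arithmetic behind an error budget is short enough to sketch. With a 99.9% SLO, 0.1% of requests in the window may fail; the budget remaining is that allowance minus the failures already observed. The `errorBudget` function and its request-based definition are assumptions for illustration. `monitor.js` may measure the budget differently, for example in minutes of downtime.

```javascript
// Hypothetical sketch of request-based error budget math.

function errorBudget(sloPercent, totalRequests, failedRequests) {
  // Failures the SLO tolerates over the window, e.g. 0.1% of requests at 99.9%.
  const allowedFailures = totalRequests * (1 - sloPercent / 100);
  const remaining = allowedFailures - failedRequests;
  return {
    allowedFailures,
    remaining,
    // Fraction of the budget already spent (can exceed 1 when overspent).
    consumed: allowedFailures > 0 ? failedRequests / allowedFailures : Infinity,
    exhausted: remaining < 0,
  };
}

// 1,000,000 requests in the window with 250 failures against a 99.9% SLO.
const budget = errorBudget(99.9, 1_000_000, 250);
console.log(Math.round(budget.allowedFailures), Math.round(budget.remaining), budget.exhausted);
// prints 1000 750 false
```

The values are rounded for display because floating-point arithmetic leaves tiny residues in `allowedFailures`.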
All Commands
| Command | Purpose |
|---|---|
| `watch --source <src>` | Start monitoring |
| `status` | Current error summary |
| `aggregate --top <n>` | Group similar errors |
| `alert --rule <rule>` | Create alert rule |
| `analyze --error-id <id>` | Root cause analysis |
| `auto-resolve --strategy <s>` | Enable auto-fix |
| `budget --slo <target>` | Check error budget |
Configuration
{
  "monitoring": {
    "sources": ["application", "infrastructure", "api"],
    "sampling": 1.0,
    "retention": "30d",
    "alertRules": [
      { "condition": "error_rate > 10/min", "action": "page-oncall" },
      { "condition": "new_error_type", "action": "notify-channel" }
    ],
    "autoResolve": {
      "enabled": true,
      "approvedStrategies": ["restart-service", "retry-request"]
    }
  }
}
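Before loading a configuration like the one above, it can help to sanity-check it. The sketch below is a hypothetical validator, not part of `monitor.js`. It assumes that `sampling` is a fraction between 0 and 1, that `retention` is a number followed by `d`, `h`, or `m`, and that every alert rule needs both a `condition` and an `action`. None of these constraints are stated in the source, so adjust them to the real schema.

```javascript
// Hypothetical config validator; the constraints below are assumptions.

function validateConfig(config) {
  const problems = [];
  const m = config.monitoring ?? {};

  if (!Array.isArray(m.sources) || m.sources.length === 0) {
    problems.push("sources must be a non-empty array");
  }
  if (typeof m.sampling !== "number" || m.sampling < 0 || m.sampling > 1) {
    problems.push("sampling must be a number between 0 and 1");
  }
  if (!/^\d+[dhm]$/.test(m.retention ?? "")) {
    problems.push('retention must look like "30d", "12h", or "90m"');
  }
  for (const [i, rule] of (m.alertRules ?? []).entries()) {
    if (!rule.condition || !rule.action) {
      problems.push(`alertRules[${i}] needs both condition and action`);
    }
  }
  return problems; // an empty array means no problems were found
}

const exampleConfig = {
  monitoring: {
    sources: ["application", "infrastructure", "api"],
    sampling: 1.0,
    retention: "30d",
    alertRules: [{ condition: "error_rate > 10/min", action: "page-oncall" }],
  },
};
console.log(validateConfig(exampleConfig)); // prints []
```

Returning a list of problems, rather than throwing on the first one, lets a user fix every issue in a single pass.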