Resources
- scripts/validate-error-recovery.sh
- references/common-errors.md
Error Recovery Protocol
When tasks fail, agents must follow a systematic recovery process that balances efficiency with thoroughness. This skill defines how to categorize errors, leverage institutional memory, apply multi-source recovery strategies, and know when to escalate.
Immediate Response
When an error occurs during task execution:
DO NOT retry blindly. Read the full error message, stack trace, and any diagnostic output.
Categorize the error into one of six types:
- TOOL_FAILURE -- Precision tool or MCP tool returned an error
- BUILD_ERROR -- Build/compile command failed (npm, tsc, vite, etc.)
- TEST_FAILURE -- Test suite failed (vitest, jest, etc.)
- TYPE_ERROR -- TypeScript type checking failed
- RUNTIME_ERROR -- Code crashed during execution
- EXTERNAL_ERROR -- Third-party service or API failure
Check .goodvibes/memory/failures.json for matching keywords using precision_read:
- Search for keywords from the error message
- If a matching failure is found, apply the documented resolution
- If the resolution doesn't work, note it and proceed to recovery phases
- If not found, proceed directly to recovery phases
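The keyword lookup above can be sketched as follows. This is a minimal illustration, not the actual GoodVibes implementation: the `FailureEntry` type and `findKnownFailures` function are hypothetical names, and the sketch assumes the entries have already been read from failures.json (for example with precision_read) and parsed.

```typescript
// Hypothetical subset of a failures.json entry; field names follow the
// entry format described later in this document.
interface FailureEntry {
  id: string;
  error: string;
  resolution: string;
  keywords: string[];
}

// Return known failures whose keywords appear in the error message,
// ordered by how many keywords matched (best match first).
function findKnownFailures(
  errorMessage: string,
  entries: FailureEntry[],
): FailureEntry[] {
  const haystack = errorMessage.toLowerCase();
  return entries
    .map((entry) => ({
      entry,
      hits: entry.keywords.filter((k) => haystack.includes(k.toLowerCase()))
        .length,
    }))
    .filter((scored) => scored.hits > 0)
    .sort((a, b) => b.hits - a.hits)
    .map((scored) => scored.entry);
}
```

An empty result corresponds to the "not found" branch: proceed directly to the recovery phases.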
Error Categories: Detailed Guidance
TOOL_FAILURE
Common Causes:
- Wrong file path (absolute vs relative, typo, file doesn't exist)
- Bad syntax in tool parameters (malformed JSON, incorrect regex)
- Sandbox blocking external paths
- Missing required parameters
- Tool used incorrectly (wrong extract mode, wrong output format)
Recovery Steps:
- Re-read the tool's schema and parameter descriptions
- Check if the file/path exists using precision_glob
- Verify sandbox settings with precision_config get (check sandbox.enabled)
- For precision tools, check if you're using the right extract/output mode
- Try the operation with minimal parameters first, then add complexity
- If a precision tool fails repeatedly, determine whether it is a user error or an actual tool bug:
  - User error: wrong params, bad path, misunderstood tool behavior
  - Tool bug: correct params but tool crashes or returns wrong result
- For user errors, fix and retry with precision tools
- For tool bugs, use native tool fallback ONLY for that specific operation, then return to precision tools
Common Patterns from Memory:
- Format/mode mismatch: MCP schema sends output.format, handlers read output.mode. Always check both with ??.
- Ripgrep glob failures: Patterns like src/**/*.ts fail silently with ripgrep. Use regex to detect literal prefixes and force fast-glob.
- Path normalization: Ripgrep returns absolute OR relative paths. Always use path.isAbsolute() before path.resolve().
- Sandbox opt-in: Only enable when sandbox === true || sandbox === 'true'. Never use === false checks.
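Three of these patterns are small enough to show directly. The sketch below is illustrative only: the `OutputOptions` shape, the function names, and the `"standard"` default are assumptions, not the real handler code.

```typescript
import * as path from "node:path";

// Hypothetical shape: the schema sends `format`, handlers read `mode`.
interface OutputOptions {
  format?: string;
  mode?: string;
}

// Check both fields with ??, falling back to an assumed default.
function resolveOutputMode(output: OutputOptions, fallback = "standard"): string {
  return output.format ?? output.mode ?? fallback;
}

// Search results may be absolute or relative; only resolve the relative ones.
function normalizeResultPath(resultPath: string, baseDir: string): string {
  return path.isAbsolute(resultPath)
    ? resultPath
    : path.resolve(baseDir, resultPath);
}

// Sandbox is opt-in: anything other than true or "true" leaves it disabled.
function isSandboxEnabled(sandbox: unknown): boolean {
  return sandbox === true || sandbox === "true";
}
```

Testing for the positive values means that `undefined`, `null`, or a missing setting all leave the sandbox off, which a `=== false` check would not guarantee.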
BUILD_ERROR
Common Causes:
- Missing dependencies (not installed, wrong version)
- Type errors not caught during editing
- Import errors (wrong path, circular dependency)
- Configuration errors (tsconfig, vite.config, etc.)
- Environment variable missing
Recovery Steps:
- Read the full build output (use precision_exec with expect.exit_code to capture stderr)
- Identify the first error in the chain (subsequent errors are often cascading)
- For missing deps: Check package.json, run npm install
- For type errors: See TYPE_ERROR section below
- For import errors: Verify file exists, check import path, look for circular deps
- For config errors: Compare with working examples in the codebase
- Search first-party docs for the framework/build tool being used
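As one way to find the first error in the chain, the sketch below scans captured build output for the first TypeScript diagnostic. It assumes the plain (non `--pretty`) `tsc` line format `file(line,col): error TSxxxx: message`; other build tools need their own pattern. The `BuildError` type and `firstBuildError` function are hypothetical names.

```typescript
interface BuildError {
  file: string;
  line: number;
  column: number;
  code: string;
  message: string;
}

// Matches the plain tsc diagnostic format, for example:
//   src/index.ts(10,5): error TS2322: Type 'string' is not assignable ...
const TSC_ERROR = /^(.+?)\((\d+),(\d+)\): error (TS\d+): (.*)$/;

// Return the first diagnostic in the output; later ones are often cascades.
function firstBuildError(output: string): BuildError | undefined {
  for (const rawLine of output.split(/\r?\n/)) {
    const match = TSC_ERROR.exec(rawLine.trim());
    if (match) {
      const [, file, line, column, code, message] = match;
      return { file, line: Number(line), column: Number(column), code, message };
    }
  }
  return undefined;
}
```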
TEST_FAILURE
Common Causes:
- Wrong test assertions (expected value changed)
- Missing mocks or stubs
- Changed API contract (function signature, return type)
- Test environment not set up correctly
- Async timing issues
Recovery Steps:
- Read the test failure output (assertion error, expected vs actual)
- Identify which test file and specific test case failed
- Read the test file to understand what it's testing
- Read the implementation to see if it matches the test expectations
- Determine if the test is wrong or the implementation is wrong:
  - If implementation changed intentionally -> update the test
  - If implementation broke -> fix the implementation
- For async issues, check for missing await, improper use of done(), or race conditions
- For environment issues, check test setup files (vitest.config, jest.setup.js)
TYPE_ERROR
Common Causes:
- Unsafe member access (obj.prop where obj could be null/undefined)
- Type mismatch in assignments (string assigned to number)
- Wrong function call signature (missing params, wrong types)
- Unsafe any usage escaping type system
- Generic constraints violated
Recovery Steps:
- Read the TypeScript error message (it includes file, line, and specific type issue)
- Use the explain_type_error analysis-engine tool if available
- For member access: Add optional chaining (?.) or null check
- For assignments: Fix the type at the source or use proper type assertion
- For function calls: Check the function signature and provide correct arguments
- For any escapes: Replace any with proper types
- For generics: Ensure type parameters satisfy constraints
Common Patterns:
- Add type guards: if (obj && 'prop' in obj)
- Use optional chaining: obj?.prop?.nestedProp
- Narrow types with discriminated unions
- Use type predicates for custom guards
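The patterns above look roughly like this in practice. All types and function names here (`User`, `Shape`, `hasMessage`, and so on) are invented for illustration.

```typescript
interface User {
  profile?: { address?: { city?: string } };
}

// Optional chaining: yields undefined instead of throwing on a missing link.
function cityOf(user: User | null): string | undefined {
  return user?.profile?.address?.city;
}

// Discriminated union: the `kind` field lets the compiler narrow each branch.
type Shape =
  | { kind: "circle"; radius: number }
  | { kind: "rect"; width: number; height: number };

function area(shape: Shape): number {
  switch (shape.kind) {
    case "circle":
      return Math.PI * shape.radius ** 2;
    case "rect":
      return shape.width * shape.height;
  }
}

// Type predicate: a custom guard that narrows `unknown` to a known shape.
function hasMessage(value: unknown): value is { message: string } {
  return (
    typeof value === "object" &&
    value !== null &&
    "message" in value &&
    typeof (value as { message: unknown }).message === "string"
  );
}

// Safe alternative to reading err.message on a value typed as `any`.
function describeError(err: unknown): string {
  return hasMessage(err) ? err.message : String(err);
}
```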
RUNTIME_ERROR
Common Causes:
- Null/undefined access (property of null, calling undefined as function)
- Unhandled promise rejections
- Uncaught exceptions
- Missing environment variables
- Network failures (fetch, API calls)
- File system errors (ENOENT, EACCES)
Recovery Steps:
- Read the stack trace to find the exact line that failed
- Identify the root cause (null access, missing env var, network, etc.)
- For null/undefined: Add runtime checks or use optional chaining
- For promises: Add .catch() handlers or try/catch around await
- For env vars: Check .env files, verify they're loaded
- For network: Add retry logic, check credentials, verify URL
- For file system: Verify paths exist, check permissions (EACCES and EPERM are escalated immediately rather than recovered; see Immediate Escalation)
Common Patterns:
- Add error boundaries in React components
- Use try/catch around all await expressions
- Validate env vars at startup
- Add defensive checks: if (!value) throw new Error('...')
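Validating env vars at startup can be as small as the sketch below. The `requireEnv` helper is hypothetical; it takes the environment as a parameter (in Node this would typically be `process.env`) so the check can be exercised without touching the real environment.

```typescript
// Fail fast if any required variable is missing or empty.
// Returns an object whose values are typed as string, not string | undefined.
function requireEnv<K extends string>(
  names: readonly K[],
  env: Record<string, string | undefined>,
): Record<K, string> {
  const missing = names.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(
      `Missing required environment variables: ${missing.join(", ")}`,
    );
  }
  const values = {} as Record<K, string>;
  for (const name of names) {
    values[name] = env[name] as string;
  }
  return values;
}
```

Reporting every missing name in one error avoids a fix-one, rerun, find-the-next cycle.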
EXTERNAL_ERROR
Common Causes:
- Authentication expired or invalid (401)
- Rate limit exceeded (429)
- Service down or unreachable (503, ECONNREFUSED)
- Invalid API request (400)
- Quota exceeded
Recovery Steps:
- Check the error status code or message
- For auth errors (401): Check credentials in .goodvibes/secrets/, verify token hasn't expired; if the credential is missing or cannot be refreshed, escalate immediately (see Immediate Escalation)
- For rate limits (429): Implement exponential backoff, check if quota can be increased
- For service down (503): Retry with backoff, check service status page
- For bad requests (400): Read API docs, verify request format
- For quota: Check usage dashboard, request increase, or wait for reset
Common Patterns:
- Exponential backoff: 1s, 2s, 4s, 8s
- Check service status via precision_fetch to status endpoint
- Rotate API keys if multiple are available
- Cache responses to reduce API calls
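A minimal sketch of the 1s, 2s, 4s, 8s backoff schedule follows. The names `backoffDelays` and `withBackoff` are hypothetical, and the injectable `wait` parameter is an assumption made so the retry logic can be tested without real delays.

```typescript
// Delay schedule for exponential backoff: base, 2x, 4x, 8x, ...
function backoffDelays(retries: number, baseMs = 1000): number[] {
  return Array.from({ length: retries }, (_, i) => baseMs * 2 ** i);
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Call `operation`, waiting for the next delay in the schedule after each
// failure. Rethrows the last error once the schedule is exhausted.
async function withBackoff<T>(
  operation: () => Promise<T>,
  delays: number[] = backoffDelays(4),
  wait: (ms: number) => Promise<void> = sleep,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= delays.length; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
      if (attempt < delays.length) {
        await wait(delays[attempt]);
      }
    }
  }
  throw lastError;
}
```

This sketch retries on any error. In practice it should wrap only retryable failures such as 429 and 503, since a 400 or 401 will fail the same way on every attempt.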
Recovery Strategy: One-Shot Multi-Source
After categorizing the error and checking failures.json, use a one-shot strategy where you consult ALL knowledge sources simultaneously (not sequentially) and apply the best solution:
- Internal knowledge -- your training data, codebase patterns (discover, precision_grep, precision_read), GoodVibes memory
- First-party docs -- official documentation, API references, changelogs, migration guides
- Community knowledge -- Stack Overflow, GitHub Issues, forums
- Open internet -- broader web search for edge cases
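Consulting sources simultaneously rather than sequentially maps naturally onto concurrent promises. The sketch below is only an illustration of that idea: `Source`, `Finding`, and `consultAllSources` are hypothetical, and each `lookup` stands in for whatever tool call actually queries that source.

```typescript
interface Finding {
  source: string;
  summary: string;
}

// Each source is a hypothetical async lookup (codebase search, docs, etc.).
interface Source {
  name: string;
  lookup: (error: string) => Promise<string[]>;
}

// Query every source at once; a source that fails does not block the others.
async function consultAllSources(
  error: string,
  sources: Source[],
): Promise<Finding[]> {
  const settled = await Promise.allSettled(
    sources.map((s) => s.lookup(error)),
  );
  return settled.flatMap((result, i) =>
    result.status === "fulfilled"
      ? result.value.map((summary) => ({ source: sources[i].name, summary }))
      : [],
  );
}
```

`Promise.allSettled` is used instead of `Promise.all` so that one unreachable source (for example, a web search that times out) does not discard the findings from the rest.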
Applying the Best Solution
Once you've consulted all sources:
Evaluate solutions based on:
- Recency (prefer solutions from similar versions)
- Authority (official docs > community > random blog)
- Specificity (exact error match > general guidance)
- Project fit (matches your stack and patterns)
Apply the solution completely:
- Don't half-apply a fix
- Make all related changes at once
- Use precision_edit to apply the changes in one pass rather than as a series of incremental tweaks
Validate the fix:
- Re-run the failing operation
- Run related tests with precision_exec
- Check for new errors introduced
After Resolution
Once you've successfully resolved the error:
- Log to .goodvibes/memory/failures.json
Use precision_edit to append a new entry with:
{
  "id": "fail_YYYYMMDD_HHMMSS",
  "date": "ISO-8601 timestamp",
  "error": "Brief description of the error",
  "context": "What was being attempted when error occurred",
  "root_cause": "Why it happened (technical explanation)",
  "resolution": "How it was fixed (specific steps taken)",
  "prevention": "How to avoid this in the future",
  "keywords": ["relevant", "search", "terms"]
}
Example:
{
  "id": "fail_20260215_143000",
  "date": "2026-02-15T14:30:00Z",
  "error": "precision_read returns 'file not found' for valid path",
  "context": "Reading component file at src/components/Button.tsx",
  "root_cause": "Path was relative, but precision tools require absolute paths. Sandbox was enabled, blocking CWD resolution.",
  "resolution": "RESOLVED - Built an absolute path from process.cwd() with path.resolve() before calling precision_read.",
  "prevention": "Always use absolute paths with precision tools. Use path.resolve() to convert relative to absolute.",
  "keywords": ["precision_read", "path", "absolute", "relative", "sandbox", "file-not-found"]
}
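For agents that build these entries programmatically, the sketch below shows one way to produce the `fail_YYYYMMDD_HHMMSS` id and the ISO-8601 date from a single timestamp. The `FailureRecord` type and both helpers are hypothetical, and deriving the id from UTC is an assumption; the format above does not specify a time zone.

```typescript
interface FailureRecord {
  id: string;
  date: string;
  error: string;
  context: string;
  root_cause: string;
  resolution: string;
  prevention: string;
  keywords: string[];
}

// Build the id in the fail_YYYYMMDD_HHMMSS form from a UTC timestamp.
function failureId(when: Date): string {
  const iso = when.toISOString(); // e.g. 2026-02-15T14:30:00.000Z
  const day = iso.slice(0, 10).replace(/-/g, "");
  const time = iso.slice(11, 19).replace(/:/g, "");
  return `fail_${day}_${time}`;
}

// Fill in id and date so they always agree with each other.
function newFailureRecord(
  details: Omit<FailureRecord, "id" | "date">,
  when: Date = new Date(),
): FailureRecord {
  return { id: failureId(when), date: when.toISOString(), ...details };
}
```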
- Log to .goodvibes/logs/errors.md
Append a markdown entry using precision_edit:
[YYYY-MM-DD HH:MM] ERROR_CATEGORY: Brief Description
Error: Full error message
Context: What was being done
Resolution: How it was fixed
Time to resolve: X minutes
- Continue with the task
Once logged, return to the original task. Don't wait for confirmation.
After Max Attempts (3 attempts)
If you've tried to fix the error 3 times and it's still failing:
- Log the unresolved failure
Add to failures.json with "resolution": "UNRESOLVED - <what was tried>" :
{
  "id": "fail_20260215_143500",
  "date": "2026-02-15T14:35:00Z",
  "error": "Vitest tests hang indefinitely on CI",
  "context": "Running npm run test in CI pipeline",
  "root_cause": "Unknown - tests pass locally but hang in CI environment",
  "resolution": "UNRESOLVED - Tried: 1) Added --no-watch flag, 2) Increased timeout to 60s, 3) Disabled coverage collection. Tests still hang.",
  "prevention": "Need deeper investigation into CI environment differences",
  "keywords": ["vitest", "ci", "hang", "timeout", "unresolved"]
}
- Report to orchestrator
Use structured format in your response:
Task Status: BLOCKED
Error
[ERROR_CATEGORY] Brief description
Attempts Made
- Attempt 1: What was tried -> Result
- Attempt 2: What was tried -> Result
- Attempt 3: What was tried -> Result
Root Cause Analysis
Best guess at why this is failing based on investigation.
Suggested Next Steps
- Option 1: [Describe approach that requires different permissions/access]
- Option 2: [Describe alternative architecture/design decision]
- Option 3: [Describe manual intervention needed]
Files Changed
- path/to/file.ts - [what was changed during troubleshooting]
- Do NOT mark task as complete
Leave the task in BLOCKED state. The orchestrator will decide whether to:
- Escalate to user
- Try a different approach
- Assign to a different agent with different capabilities
- Defer the task
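The three-attempt limit and the "Attempts Made" section of the report can be sketched as below. This is a simplified, synchronous illustration: `recover`, `formatAttempts`, and the `fixes` callbacks are hypothetical, and a real agent would be applying fixes and re-running the failing operation through its tools.

```typescript
interface Attempt {
  tried: string;
  result: string;
}

type Outcome<T> =
  | { status: "RESOLVED"; value: T; attempts: Attempt[] }
  | { status: "BLOCKED"; attempts: Attempt[] };

const MAX_ATTEMPTS = 3;

// Try each fix in order, stopping at the first success or after MAX_ATTEMPTS.
// Each `apply` is assumed to apply a fix, re-run the operation, and throw
// if the operation still fails.
function recover<T>(
  fixes: Array<{ tried: string; apply: () => T }>,
): Outcome<T> {
  const attempts: Attempt[] = [];
  for (const fix of fixes.slice(0, MAX_ATTEMPTS)) {
    try {
      const value = fix.apply();
      attempts.push({ tried: fix.tried, result: "succeeded" });
      return { status: "RESOLVED", value, attempts };
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      attempts.push({ tried: fix.tried, result: `failed: ${reason}` });
    }
  }
  return { status: "BLOCKED", attempts };
}

// Render the "Attempts Made" part of the report to the orchestrator.
function formatAttempts(attempts: Attempt[]): string {
  return attempts
    .map((a, i) => `- Attempt ${i + 1}: ${a.tried} -> ${a.result}`)
    .join("\n");
}
```

A `BLOCKED` outcome never carries a value, which mirrors the rule that a blocked task must not be marked complete.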
Immediate Escalation (Skip Recovery)
For certain error types, do NOT attempt recovery. Escalate immediately to the orchestrator:
- Permission Errors
EACCES: permission denied
EPERM: operation not permitted
Why: These require user intervention to grant permissions. Agents can't fix this.
Escalate with: "Permission error encountered. Need user to: [grant file access / run as sudo / change ownership]."
- Missing Credentials/Secrets
Error: OPENAI_API_KEY is not defined
Error: Database connection failed: authentication error
401 Unauthorized
Why: Agents can't create or modify secrets. Only users can provide credentials.
Escalate with: "Missing credential: [ENV_VAR_NAME or service name]. User needs to provide via .env or secrets management."
- Architectural Ambiguity
Context clues indicate multiple valid approaches:
- Use REST API vs GraphQL
- Use Prisma vs Drizzle
- Component composition pattern unclear
Why: These are design decisions that affect the broader system. Agents shouldn't make architectural choices unilaterally.
Escalate with: "Architectural decision required: [describe the choice]. Options: [list 2-3 options with tradeoffs]."
- Scope Change Discovered
Original task: "Add user profile page"
Discovered: Requires new database schema, auth changes, API endpoints
Why: The task is larger than originally scoped. Orchestrator needs to re-plan.
Escalate with: "Scope expansion discovered. Original: [task]. Required: [new dependencies/changes]. Recommend: [break into subtasks / get user approval]."
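Of the four cases, only permission and credential errors can be recognized from the error text; architectural ambiguity and scope changes need judgment. The sketch below checks a message against patterns drawn from the example messages above. The names are hypothetical and the patterns are deliberately narrow; a bare `401` match, for instance, could misfire on unrelated numbers, so treat the result as a hint.

```typescript
type EscalationReason = "PERMISSION" | "MISSING_CREDENTIAL";

// Patterns taken from the example messages in this section; extend per project.
const ESCALATION_PATTERNS: Array<{ reason: EscalationReason; pattern: RegExp }> = [
  { reason: "PERMISSION", pattern: /\b(EACCES|EPERM)\b/ },
  {
    reason: "MISSING_CREDENTIAL",
    pattern: /\b401\b|authentication error|_API_KEY is not defined/i,
  },
];

// Return the escalation reason if the message matches, else undefined.
function escalationReason(message: string): EscalationReason | undefined {
  return ESCALATION_PATTERNS.find(({ pattern }) => pattern.test(message))
    ?.reason;
}
```

Running this check before entering the recovery loop keeps agents from spending their three attempts on errors they cannot fix.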
Summary
When errors occur:
- Categorize immediately (TOOL_FAILURE, BUILD_ERROR, etc.)
- Check failures.json for known patterns
- Use one-shot multi-source recovery (internal + docs + community + web)
- Apply the best solution completely
- Log resolution to failures.json and errors.md
- Continue with task
After 3 attempts:
- Log as UNRESOLVED
- Report to orchestrator with structured summary
- Do NOT mark task complete
Escalate immediately for:
- Permission errors
- Missing credentials
- Architectural ambiguity
- Scope changes
Remember:
- Precision tools are the default, native tools are the fallback
- User error != tool failure
- Recovery attempts should be systematic, not random
- Memory is institutional knowledge -- use it and contribute to it