error-recovery

Resources

scripts/validate-error-recovery.sh
references/common-errors.md

Error Recovery Protocol

When tasks fail, agents must follow a systematic recovery process that balances efficiency with thoroughness. This skill defines how to categorize errors, leverage institutional memory, apply multi-source recovery strategies, and know when to escalate.

Immediate Response

When an error occurs during task execution:

DO NOT retry blindly. Read the full error message, stack trace, and any diagnostic output.

Categorize the error into one of six types:

TOOL_FAILURE -- Precision tool or MCP tool returned an error
BUILD_ERROR -- Build/compile command failed (npm, tsc, vite, etc.)
TEST_FAILURE -- Test suite failed (vitest, jest, etc.)
TYPE_ERROR -- TypeScript type checking failed
RUNTIME_ERROR -- Code crashed during execution
EXTERNAL_ERROR -- Third-party service or API failure

Check .goodvibes/memory/failures.json for matching keywords using precision_read:

Search for keywords from the error message
If a matching failure is found, apply the documented resolution
If the resolution doesn't work, note it and proceed to recovery phases
If not found, proceed directly to recovery phases
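The keyword check above can be sketched as a small lookup. This is only an illustration: it assumes failures.json holds an array of entries with a keywords field (as in the entry format later in this document), and the function name findKnownFailures and the sample entries are hypothetical.

```typescript
// Assumed subset of one failures.json entry; only the fields this lookup needs.
interface FailureEntry {
  id: string;
  error: string;
  resolution: string;
  keywords: string[];
}

// Return entries whose keywords appear in the error message, most matches first.
function findKnownFailures(errorMessage: string, entries: FailureEntry[]): FailureEntry[] {
  const haystack = errorMessage.toLowerCase();
  return entries
    .map((entry) => ({
      entry,
      hits: entry.keywords.filter((k) => haystack.includes(k.toLowerCase())).length,
    }))
    .filter((scored) => scored.hits > 0)
    .sort((a, b) => b.hits - a.hits)
    .map((scored) => scored.entry);
}

// Hypothetical in-memory copy of failures.json, for illustration only.
const memory: FailureEntry[] = [
  {
    id: "fail_a",
    error: "file not found",
    resolution: "RESOLVED - used an absolute path",
    keywords: ["precision_read", "path"],
  },
  {
    id: "fail_b",
    error: "tests hang on CI",
    resolution: "UNRESOLVED - tried increasing the timeout",
    keywords: ["vitest", "hang"],
  },
];

const matches = findKnownFailures("precision_read: invalid path ./src/a.ts", memory);
```

If matches is empty, or the documented resolution does not work, proceed to the recovery phases as described above.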

Error Categories: Detailed Guidance

TOOL_FAILURE

Common Causes:

Wrong file path (absolute vs relative, typo, file doesn't exist)
Bad syntax in tool parameters (malformed JSON, incorrect regex)
Sandbox blocking external paths
Missing required parameters
Tool used incorrectly (wrong extract mode, wrong output format)

Recovery Steps:

Re-read the tool's schema and parameter descriptions
Check if the file/path exists using precision_glob
Verify sandbox settings with precision_config get (check sandbox.enabled)
For precision tools, check if you're using the right extract/output mode
Try the operation with minimal parameters first, then add complexity
If precision tool fails repeatedly, check if it's a user error vs actual tool bug:
User error: wrong params, bad path, misunderstood tool behavior
Tool bug: correct params but tool crashes or returns wrong result
For user errors, fix and retry with precision tools
For tool bugs, use native tool fallback ONLY for that specific operation, then return to precision tools

Common Patterns from Memory:

Format/mode mismatch: MCP schema sends output.format, handlers read output.mode. Always check both with ??.
Ripgrep glob failures: Patterns like src/**/*.ts fail silently with ripgrep. Use regex to detect literal prefixes and force fast-glob.
Path normalization: Ripgrep returns absolute OR relative paths. Always use path.isAbsolute() before path.resolve().
Sandbox opt-in: Only enable when sandbox === true || sandbox === 'true'. Never use === false checks.
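Two of the patterns above, sketched in TypeScript. The option shape, the helper names, the precedence of format over mode, and the "default" fallback are assumptions made for illustration; they are not taken from the actual tool handlers.

```typescript
// Hypothetical option shape; the field names follow the pattern described above.
interface OutputOptions {
  format?: string; // what the MCP schema sends
  mode?: string; // what the handlers read
}

// Check both fields with ?? so a schema/handler mismatch cannot drop the setting.
// The order (format first) and the fallback value are assumptions.
function resolveOutputMode(output: OutputOptions, fallback = "default"): string {
  return output.format ?? output.mode ?? fallback;
}

// Sandbox is opt-in: only an explicit true (boolean or string) enables it.
// Checking `=== false` instead would enable it for undefined or missing values.
function isSandboxEnabled(sandbox: unknown): boolean {
  return sandbox === true || sandbox === "true";
}
```

With these helpers, a missing sandbox setting leaves the sandbox off, and a caller that sets either output.format or output.mode gets the value it asked for.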

BUILD_ERROR

Common Causes:

Missing dependencies (not installed, wrong version)
Type errors not caught during editing
Import errors (wrong path, circular dependency)
Configuration errors (tsconfig, vite.config, etc.)
Environment variable missing

Recovery Steps:

Read the full build output (use precision_exec with expect.exit_code to capture stderr)
Identify the first error in the chain (subsequent errors are often cascading)
For missing deps: Check package.json, run npm install
For type errors: See TYPE_ERROR section below
For import errors: Verify file exists, check import path, look for circular deps
For config errors: Compare with working examples in the codebase
Search first-party docs for the framework/build tool being used
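To find the first error in the chain, it can help to parse the build output rather than skim it. The sketch below assumes tsc's plain (non-pretty) diagnostic format, file(line,column): error TSxxxx: message; other build tools print differently, and the helper name firstBuildError is hypothetical.

```typescript
interface BuildError {
  file: string;
  line: number;
  column: number;
  code: string;
  message: string;
}

// Matches tsc's non-pretty diagnostic lines, for example:
// src/app.ts(12,5): error TS2322: Type 'string' is not assignable to type 'number'.
const TSC_ERROR = /^(.+?)\((\d+),(\d+)\): error (TS\d+): (.*)$/;

// Return the first diagnostic in the output; later ones are often cascading.
function firstBuildError(output: string): BuildError | undefined {
  for (const rawLine of output.split(/\r?\n/)) {
    const match = TSC_ERROR.exec(rawLine.trim());
    if (match) {
      const [, file, line, column, code, message] = match;
      return { file, line: Number(line), column: Number(column), code, message };
    }
  }
  return undefined;
}

// Hypothetical build output, for illustration only.
const sampleOutput = [
  "> tsc --noEmit",
  "src/app.ts(12,5): error TS2322: Type 'string' is not assignable to type 'number'.",
  "src/app.ts(40,9): error TS2339: Property 'id' does not exist on type '{}'.",
].join("\n");

const firstError = firstBuildError(sampleOutput);
```

Fix the reported location first, rebuild, and only then look at whatever errors remain.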

TEST_FAILURE

Common Causes:

Wrong test assertions (expected value changed)
Missing mocks or stubs
Changed API contract (function signature, return type)
Test environment not set up correctly
Async timing issues

Recovery Steps:

Read the test failure output (assertion error, expected vs actual)
Identify which test file and specific test case failed
Read the test file to understand what it's testing
Read the implementation to see if it matches the test expectations
Determine if the test is wrong or the implementation is wrong:
If implementation changed intentionally -> update the test
If implementation broke -> fix the implementation
For async issues, check for missing await, improper use of done(), or race conditions
For environment issues, check test setup files (vitest.config, jest.setup.js)

TYPE_ERROR

Common Causes:

Unsafe member access (obj.prop where obj could be null/undefined)
Type mismatch in assignments (string assigned to number)
Wrong function call signature (missing params, wrong types)
Unsafe any usage escaping type system
Generic constraints violated

Recovery Steps:

Read the TypeScript error message (it includes file, line, and specific type issue)
Use the explain_type_error analysis-engine tool if available
For member access: Add optional chaining (?.) or a null check
For assignments: Fix the type at the source or use proper type assertion
For function calls: Check the function signature and provide correct arguments
For any escapes: Replace any with proper types
For generics: Ensure type parameters satisfy constraints

Common Patterns:

Add type guards: if (obj && 'prop' in obj)
Use optional chaining: obj?.prop?.nestedProp
Narrow types with discriminated unions
Use type predicates for custom guards
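The patterns above, shown together in one small sketch. The types FetchResult and User and the helper names are hypothetical and exist only to illustrate the narrowing techniques.

```typescript
// Discriminated union: the `status` field tells the compiler which variant it has.
type FetchResult =
  | { status: "ok"; data: string }
  | { status: "error"; message: string };

function describeResult(result: FetchResult): string {
  switch (result.status) {
    case "ok":
      return `data: ${result.data}`; // narrowed to the "ok" variant
    case "error":
      return `failed: ${result.message}`; // narrowed to the "error" variant
  }
}

interface User {
  name: string;
  profile?: { email?: string };
}

// Type predicate: a custom guard that narrows `unknown` to `User` for callers.
// It only validates `name`; a real guard would also validate `profile`.
function isUser(value: unknown): value is User {
  return (
    typeof value === "object" &&
    value !== null &&
    "name" in value &&
    typeof (value as { name: unknown }).name === "string"
  );
}

// Optional chaining plus a default replaces unsafe member access.
function emailOf(value: unknown): string {
  return isUser(value) ? value.profile?.email ?? "unknown" : "not a user";
}
```

Each of these removes a class of TYPE_ERROR at the source instead of silencing it with any or a type assertion.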

RUNTIME_ERROR

Common Causes:

Null/undefined access (property of null, calling undefined as function)
Unhandled promise rejections
Uncaught exceptions
Missing environment variables
Network failures (fetch, API calls)
File system errors (ENOENT, EACCES)

Recovery Steps:

Read the stack trace to find the exact line that failed
Identify the root cause (null access, missing env var, network, etc.)
For null/undefined: Add runtime checks or use optional chaining
For promises: Add .catch() handlers or try/catch around await
For env vars: Check .env files, verify they're loaded
For network: Add retry logic, check credentials, verify URL
For file system: Verify paths exist, check permissions

Common Patterns:

Add error boundaries in React components
Use try/catch around all await expressions
Validate env vars at startup
Add defensive checks: if (!value) throw new Error('...')
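Validating env vars at startup could look like the sketch below. The helper name requireEnv and the variable names are hypothetical; in a Node.js process you would typically pass process.env as the first argument.

```typescript
// Fail fast if any required variable is missing or empty, listing all of them at once.
function requireEnv(
  env: Record<string, string | undefined>,
  names: readonly string[],
): Record<string, string> {
  const values: Record<string, string> = {};
  const missing: string[] = [];
  for (const name of names) {
    const value = env[name];
    if (value) {
      values[name] = value;
    } else {
      missing.push(name);
    }
  }
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }
  return values;
}

// Illustration with a plain object standing in for process.env.
const config = requireEnv({ DATABASE_URL: "postgres://localhost/app" }, ["DATABASE_URL"]);
```

Reporting every missing name in one error avoids fixing variables one crash at a time.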

EXTERNAL_ERROR

Common Causes:

Authentication expired or invalid (401)
Rate limit exceeded (429)
Service down or unreachable (503, ECONNREFUSED)
Invalid API request (400)
Quota exceeded

Recovery Steps:

Check the error status code or message
For auth errors (401): Check credentials in .goodvibes/secrets/, verify token hasn't expired
For rate limits (429): Implement exponential backoff, check if quota can be increased
For service down (503): Retry with backoff, check service status page
For bad requests (400): Read API docs, verify request format
For quota: Check usage dashboard, request increase, or wait for reset

Common Patterns:

Exponential backoff: 1s, 2s, 4s, 8s
Check service status via precision_fetch to status endpoint
Rotate API keys if multiple are available
Cache responses to reduce API calls
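The backoff schedule above (1s, 2s, 4s, 8s) can be sketched as follows. The function names, the retry count, and the injectable sleep parameter are assumptions for illustration; production code often adds random jitter and only retries errors that are actually transient, such as 429 and 503.

```typescript
// Delay before retry number `attempt` (0-based): 1000, 2000, 4000, 8000 ms by default.
function backoffDelayMs(attempt: number, baseMs = 1000): number {
  return baseMs * 2 ** attempt;
}

// Run `operation`, retrying up to `maxRetries` times with exponential backoff.
// `sleep` is injectable so callers (and tests) can avoid real waiting.
async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  maxRetries = 4,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      // Wait only if another attempt will follow.
      if (attempt < maxRetries) {
        await sleep(backoffDelayMs(attempt));
      }
    }
  }
  throw lastError;
}
```

With the defaults, the operation runs at most five times (one initial call plus four retries), waiting 1s, 2s, 4s, and 8s between calls.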

Recovery Strategy: One-Shot Multi-Source

After categorizing the error and checking failures.json, use a one-shot strategy where you consult ALL knowledge sources simultaneously (not sequentially) and apply the best solution:

Internal knowledge -- your training data, codebase patterns (discover, precision_grep, precision_read), GoodVibes memory
First-party docs -- official documentation, API references, changelogs, migration guides
Community knowledge -- Stack Overflow, GitHub Issues, forums
Open internet -- broader web search for edge cases

Applying the Best Solution

Once you've consulted all sources:

Evaluate solutions based on:

Recency (prefer solutions from similar versions)
Authority (official docs > community > random blog)
Specificity (exact error match > general guidance)
Project fit (matches your stack and patterns)

Apply the solution completely:

Don't half-apply a fix
Make all related changes at once
Make the changes with precision_edit in one pass, not as a series of incremental tweaks

Validate the fix:

Re-run the failing operation
Run related tests with precision_exec
Check for new errors introduced

After Resolution

Once you've successfully resolved the error:

Log to .goodvibes/memory/failures.json

Use precision_edit to append a new entry with:

{
  "id": "fail_YYYYMMDD_HHMMSS",
  "date": "ISO-8601 timestamp",
  "error": "Brief description of the error",
  "context": "What was being attempted when error occurred",
  "root_cause": "Why it happened (technical explanation)",
  "resolution": "How it was fixed (specific steps taken)",
  "prevention": "How to avoid this in the future",
  "keywords": ["relevant", "search", "terms"]
}

Example:

{
  "id": "fail_20260215_143000",
  "date": "2026-02-15T14:30:00Z",
  "error": "precision_read returns 'file not found' for valid path",
  "context": "Reading component file at src/components/Button.tsx",
  "root_cause": "Path was relative, but precision tools require absolute paths. Sandbox was enabled, blocking CWD resolution.",
  "resolution": "RESOLVED - Resolved the relative path against process.cwd() to get an absolute path before calling precision_read.",
  "prevention": "Always use absolute paths with precision tools. Use path.resolve() to convert relative to absolute.",
  "keywords": ["precision_read", "path", "absolute", "relative", "sandbox", "file-not-found"]
}
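The entry format above maps naturally onto a typed record. The following is a minimal sketch, assuming the id timestamp is in UTC (the example id and date above are consistent with that, but the document does not say so explicitly); the helper names are hypothetical.

```typescript
// Field names mirror the failures.json entry format shown above.
interface FailureEntry {
  id: string; // fail_YYYYMMDD_HHMMSS
  date: string; // ISO-8601 timestamp
  error: string;
  context: string;
  root_cause: string;
  resolution: string;
  prevention: string;
  keywords: string[];
}

// Build an id such as fail_20260215_143000 from a Date (UTC is an assumption).
function makeFailureId(when: Date): string {
  const iso = when.toISOString(); // e.g. 2026-02-15T14:30:00.000Z
  const day = iso.slice(0, 10).replace(/-/g, "");
  const time = iso.slice(11, 19).replace(/:/g, "");
  return `fail_${day}_${time}`;
}

// Fill in id and date so callers only supply the descriptive fields.
function makeFailureEntry(
  details: Omit<FailureEntry, "id" | "date">,
  when: Date = new Date(),
): FailureEntry {
  return {
    id: makeFailureId(when),
    date: when.toISOString().replace(/\.\d{3}Z$/, "Z"), // drop milliseconds
    ...details,
  };
}

const exampleEntry = makeFailureEntry(
  {
    error: "precision_read returns 'file not found' for valid path",
    context: "Reading component file at src/components/Button.tsx",
    root_cause: "Path was relative, but precision tools require absolute paths.",
    resolution: "RESOLVED - Resolved the path against process.cwd() first.",
    prevention: "Always use absolute paths with precision tools.",
    keywords: ["precision_read", "path", "absolute"],
  },
  new Date("2026-02-15T14:30:00Z"),
);
```

Generating the id and date from one Date keeps the two fields consistent with each other.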

Log to .goodvibes/logs/errors.md

Append a markdown entry using precision_edit:

[YYYY-MM-DD HH:MM] ERROR_CATEGORY: Brief Description

Error: Full error message

Context: What was being done

Resolution: How it was fixed

Time to resolve: X minutes
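A small formatter can keep errors.md entries uniform. This sketch emits the fields in the template above as plain lines; the input shape, the function name, and the use of UTC for the timestamp are assumptions.

```typescript
// Hypothetical input shape for one errors.md entry.
interface ErrorLogInput {
  when: Date;
  category: string; // e.g. TOOL_FAILURE, BUILD_ERROR
  description: string;
  error: string;
  context: string;
  resolution: string;
  minutesToResolve: number;
}

// Render one log entry; the timestamp is formatted as YYYY-MM-DD HH:MM in UTC.
function formatErrorLogEntry(input: ErrorLogInput): string {
  const iso = input.when.toISOString();
  const stamp = `${iso.slice(0, 10)} ${iso.slice(11, 16)}`;
  return [
    `[${stamp}] ${input.category}: ${input.description}`,
    "",
    `Error: ${input.error}`,
    "",
    `Context: ${input.context}`,
    "",
    `Resolution: ${input.resolution}`,
    "",
    `Time to resolve: ${input.minutesToResolve} minutes`,
  ].join("\n");
}

const logEntry = formatErrorLogEntry({
  when: new Date("2026-02-15T14:30:00Z"),
  category: "TOOL_FAILURE",
  description: "precision_read file not found",
  error: "file not found: src/components/Button.tsx",
  context: "Reading a component file",
  resolution: "Used an absolute path",
  minutesToResolve: 5,
});
```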

Continue with the task

Once logged, return to the original task. Don't wait for confirmation.

After Max Attempts (3 attempts)

If you've tried to fix the error 3 times and it's still failing:

Log the unresolved failure

Add to failures.json with "resolution": "UNRESOLVED - <what was tried>":

{
  "id": "fail_20260215_143500",
  "date": "2026-02-15T14:35:00Z",
  "error": "Vitest tests hang indefinitely on CI",
  "context": "Running npm run test in CI pipeline",
  "root_cause": "Unknown - tests pass locally but hang in CI environment",
  "resolution": "UNRESOLVED - Tried: 1) Added --no-watch flag, 2) Increased timeout to 60s, 3) Disabled coverage collection. Tests still hang.",
  "prevention": "Need deeper investigation into CI environment differences",
  "keywords": ["vitest", "ci", "hang", "timeout", "unresolved"]
}

Report to orchestrator

Use structured format in your response:

Task Status: BLOCKED

Error

[ERROR_CATEGORY] Brief description

Attempts Made

Attempt 1: What was tried -> Result
Attempt 2: What was tried -> Result
Attempt 3: What was tried -> Result

Root Cause Analysis

Best guess at why this is failing based on investigation.

Suggested Next Steps

Option 1: [Describe approach that requires different permissions/access]
Option 2: [Describe alternative architecture/design decision]
Option 3: [Describe manual intervention needed]

Files Changed

path/to/file.ts - [what was changed during troubleshooting]

Do NOT mark task as complete

Leave the task in BLOCKED state. The orchestrator will decide whether to:

Escalate to user
Try a different approach
Assign to a different agent with different capabilities
Defer the task
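The three-attempt limit and the "Attempts Made" part of the report can be sketched as a bounded loop. The CandidateFix shape and function names are hypothetical; each apply callback stands in for applying one fix and re-running the failing operation.

```typescript
const MAX_ATTEMPTS = 3;

interface Attempt {
  description: string;
  result: string;
}

interface RecoveryOutcome {
  status: "RESOLVED" | "BLOCKED";
  attempts: Attempt[];
}

// Hypothetical: `apply` performs one fix and returns true if the operation now passes.
interface CandidateFix {
  description: string;
  apply: () => boolean;
}

// Try at most MAX_ATTEMPTS fixes; stop at the first success, otherwise report BLOCKED.
function runRecovery(fixes: CandidateFix[]): RecoveryOutcome {
  const attempts: Attempt[] = [];
  for (const fix of fixes.slice(0, MAX_ATTEMPTS)) {
    const succeeded = fix.apply();
    attempts.push({
      description: fix.description,
      result: succeeded ? "fixed" : "still failing",
    });
    if (succeeded) {
      return { status: "RESOLVED", attempts };
    }
  }
  return { status: "BLOCKED", attempts };
}

// Render the "Attempts Made" section of the structured report.
function formatAttempts(attempts: Attempt[]): string {
  return attempts
    .map((a, i) => `Attempt ${i + 1}: ${a.description} -> ${a.result}`)
    .join("\n");
}
```

A BLOCKED outcome is the signal to log the failure as UNRESOLVED and report to the orchestrator rather than trying a fourth fix.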

Immediate Escalation (Skip Recovery)

For certain error types, do NOT attempt recovery. Escalate immediately to the orchestrator:

Permission Errors

EACCES: permission denied
EPERM: operation not permitted

Why: These require user intervention to grant permissions. Agents can't fix this.

Escalate with: "Permission error encountered. Need user to: [grant file access / run as sudo / change ownership]."

Missing Credentials/Secrets

Error: OPENAI_API_KEY is not defined
Error: Database connection failed: authentication error
401 Unauthorized

Why: Agents can't create or modify secrets. Only users can provide credentials.

Escalate with: "Missing credential: [ENV_VAR_NAME or service name]. User needs to provide via .env or secrets management."

Architectural Ambiguity

Context clues indicate multiple valid approaches:

Use REST API vs GraphQL
Use Prisma vs Drizzle
Component composition pattern unclear

Why: These are design decisions that affect the broader system. Agents shouldn't make architectural choices unilaterally.

Escalate with: "Architectural decision required: [describe the choice]. Options: [list 2-3 options with tradeoffs]."

Scope Change Discovered

Original task: "Add user profile page"
Discovered: Requires new database schema, auth changes, API endpoints

Why: The task is larger than originally scoped. Orchestrator needs to re-plan.

Escalate with: "Scope expansion discovered. Original: [task]. Required: [new dependencies/changes]. Recommend: [break into subtasks / get user approval]."
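The permission and credential cases above are recognizable from the error text, so a check for them can run before any recovery attempt. The patterns below are assumptions derived from the example messages in this section and are not exhaustive. Architectural ambiguity and scope changes need judgment and are not covered by pattern matching.

```typescript
type EscalationReason = "PERMISSION" | "MISSING_CREDENTIALS";

// Patterns based on the example messages above; extend them for your own stack.
const ESCALATION_RULES: Array<{ reason: EscalationReason; pattern: RegExp }> = [
  { reason: "PERMISSION", pattern: /\b(EACCES|EPERM)\b/ },
  {
    reason: "MISSING_CREDENTIALS",
    pattern: /401 Unauthorized|authentication error|\b[A-Z0-9_]+_API_KEY is not defined/,
  },
];

// Return why the error should skip recovery, or undefined if recovery may proceed.
function immediateEscalationReason(errorText: string): EscalationReason | undefined {
  return ESCALATION_RULES.find((rule) => rule.pattern.test(errorText))?.reason;
}
```

A defined result means the agent should stop and send the matching "Escalate with" message to the orchestrator instead of entering the recovery loop.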

Summary

When errors occur:

Categorize immediately (TOOL_FAILURE, BUILD_ERROR, etc.)
Check failures.json for known patterns
Use one-shot multi-source recovery (internal + docs + community + web)
Apply the best solution completely
Log resolution to failures.json and errors.md
Continue with task

After 3 attempts:

Log as UNRESOLVED
Report to orchestrator with structured summary
Do NOT mark task complete

Escalate immediately for:

Permission errors
Missing credentials
Architectural ambiguity
Scope changes

Remember:

Precision tools are the default, native tools are the fallback
User error != tool failure
Recovery attempts should be systematic, not random
Memory is institutional knowledge -- use it and contribute to it

