Error Recovery Protocol
When an error occurs, stop, think, and choose the recovery strategy that fits the error. No blind retries.
Core principle: Every error carries a signal. Read the signal first, then act.
Error Classification
Classify every error into one of 4 categories — the recovery strategy depends on the category:
Transient Error
Retrying usually fixes it. Infrastructure or network related.
- Examples: timeout, rate limit (429), connection drop, temporary service outage
- Strategy: Wait & Retry with exponential backoff
Configuration Error
Environment or setup issue. Code is correct but setup is wrong.
- Examples: missing env variable, wrong file path, permission denied, missing dependency
- Strategy: Fix & Continue — identify the issue, fix it, re-run
Logic Error
Code or approach is wrong. Retrying produces the same error.
- Examples: KeyError, TypeError, wrong algorithm, expectation mismatch
- Strategy: Alternative Approach — try a different method
Permanent / External Error
Outside your control and cannot be fixed locally. Caused by an external service or a permission boundary.
- Examples: 403 Forbidden, 404 Not Found, quota exceeded, API deprecated
- Strategy: Escalation — inform the user, ask for direction
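The four categories above can be sketched as a small classifier. This is only an illustration under stated assumptions: the Category enum, the classify function, the status-code table (including 503 for a temporary outage), and the choice of which Python exceptions land in which category are all hypothetical, not part of the protocol itself.

```python
from enum import Enum
from typing import Optional


class Category(Enum):
    TRANSIENT = "Transient"
    CONFIG = "Config"
    LOGIC = "Logic"
    PERMANENT = "Permanent"


# Assumed mapping from HTTP status codes to categories (illustrative only).
STATUS_CATEGORIES = {
    429: Category.TRANSIENT,  # rate limit
    503: Category.TRANSIENT,  # temporary service outage
    403: Category.PERMANENT,  # forbidden
    404: Category.PERMANENT,  # not found
}


def classify(error: BaseException, status: Optional[int] = None) -> Optional[Category]:
    """Return the error category, or None when it cannot be determined."""
    if status in STATUS_CATEGORIES:
        return STATUS_CATEGORIES[status]
    # Check specific exception types explicitly; several of them share the
    # OSError base class, so a broad OSError check would blur the categories.
    if isinstance(error, (TimeoutError, ConnectionError)):
        return Category.TRANSIENT
    if isinstance(error, (FileNotFoundError, PermissionError, ModuleNotFoundError)):
        return Category.CONFIG
    if isinstance(error, (KeyError, TypeError)):
        return Category.LOGIC
    return None  # unknown category: the protocol treats this as an escalation
```

Returning None for an unrecognized error keeps the "category cannot be determined" case explicit instead of guessing.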
Retry Strategy
For transient errors, use exponential backoff:
Attempt 1: Retry immediately
Attempt 2: Wait 2 seconds, then retry
Attempt 3: Wait 4 seconds, then retry
Maximum retries: 3 attempts. If all 3 fail → re-evaluate the category, then move on or escalate.
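The backoff schedule can be sketched as follows. This is a minimal example, assuming the first retry is immediate, each later wait doubles starting from 2 seconds, and the retry count is capped at three. TransientError, backoff_delay, and retry_with_backoff are hypothetical names introduced for the example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for errors already classified as transient."""


def backoff_delay(retry: int, base: float = 2.0) -> float:
    """Seconds to wait before the given retry (1-based): 0, 2, 4, 8, ..."""
    return 0.0 if retry == 1 else base * 2 ** (retry - 2)


def retry_with_backoff(
    operation: Callable[[], T],
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation once, then retry on TransientError up to max_retries times."""
    try:
        return operation()
    except TransientError as exc:
        last_error = exc
    for retry in range(1, max_retries + 1):
        sleep(backoff_delay(retry))
        try:
            return operation()
        except TransientError as exc:
            last_error = exc
    # Every retry failed: the caller should re-evaluate the category.
    raise last_error
```

Passing sleep as a parameter is a design choice for testability; a caller can substitute a recorder to check the waits without actually sleeping.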
Rate limit (429) special rule:
- If the response has a Retry-After header, wait that duration
- Otherwise wait 60 seconds, then retry
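The rate limit rule can be sketched as a helper that turns response headers into a wait time. The function name rate_limit_wait and the plain-mapping header lookup are assumptions for the example. Retry-After may carry either a number of seconds or an HTTP date, so the sketch handles both, and it falls back to the 60 second default when the value cannot be parsed (that fallback is also an assumption).

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional

DEFAULT_RATE_LIMIT_WAIT = 60.0  # seconds, used when no Retry-After is given


def rate_limit_wait(headers: Mapping[str, str], now: Optional[datetime] = None) -> float:
    """Seconds to wait after a 429 response."""
    value = headers.get("Retry-After")
    if value is None:
        return DEFAULT_RATE_LIMIT_WAIT
    value = value.strip()
    if value.isdigit():
        return float(value)  # delay-in-seconds form
    try:
        target = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return DEFAULT_RATE_LIMIT_WAIT  # assumed fallback for unparseable values
    if target.tzinfo is None:
        target = target.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means no further waiting is needed.
    return max(0.0, (target - now).total_seconds())
```

Note that a plain dict lookup is case-sensitive; an HTTP client's own header object usually handles case-insensitive names.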
Decision Tree
Error received
      |
Classify the error
      |
+-------------------------------------+
| Transient? -> Wait & Retry (max 3)  |
| Config?    -> Fix & Continue        |
| Logic?     -> Alternative approach  |
| Permanent? -> Escalation            |
+-------------------------------------+
      |
Every strategy fails -> Escalation
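The decision tree can be sketched as a dispatch table with escalation as the fallback. The strategy names, the STRATEGIES table, and the recover function are hypothetical. The handlers are assumed to be callables that return True when their strategy recovered from the error.

```python
from typing import Callable, Dict, Optional

# Hypothetical strategy names keyed by category label.
STRATEGIES: Dict[str, str] = {
    "Transient": "wait_and_retry",
    "Config": "fix_and_continue",
    "Logic": "alternative_approach",
    "Permanent": "escalate",
}


def recover(category: Optional[str], handlers: Dict[str, Callable[[], bool]]) -> str:
    """Run the strategy for the category; fall back to escalation on failure.

    Returns the name of the strategy that finally applied.
    """
    # An unknown or missing category goes straight to escalation.
    strategy = STRATEGIES.get(category or "", "escalate")
    if strategy != "escalate" and handlers[strategy]():
        return strategy
    handlers["escalate"]()
    return "escalate"
```

For example, recover("Config", handlers) runs the fix_and_continue handler and escalates only if that handler reports failure.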
Escalation Protocol
Escalate to the user when:
- 3 retries failed
- Permanent / external error
- 2 consecutive different strategies failed
- Error category cannot be determined
ERROR ESCALATION
================================
Failed step : [step name]
Error : [error message summary]
Category : [Transient / Config / Logic / Permanent]
Tried : [what was attempted — short list]
Result : All strategies exhausted
================================
Options:
A) [Alternative approach suggestion]
B) [Simpler / partial solution]
C) Skip this step, continue
D) Stop the task
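The escalation conditions and the report header above can be sketched as follows. RecoveryState, should_escalate, and render_escalation are hypothetical names, and the sketch assumes the category is stored as one of the four labels (or an empty string when unknown). The Options section is left out because its content depends on the task.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RecoveryState:
    """Hypothetical bookkeeping for one failing step."""
    step: str
    error: str
    category: str = ""  # empty when the category cannot be determined
    retries_failed: int = 0
    strategies_failed: List[str] = field(default_factory=list)


def should_escalate(state: RecoveryState) -> bool:
    """Apply the four escalation conditions from the list above."""
    tried = state.strategies_failed
    two_different_in_a_row = len(tried) >= 2 and tried[-1] != tried[-2]
    return (
        state.retries_failed >= 3
        or state.category == "Permanent"
        or two_different_in_a_row
        or not state.category
    )


def render_escalation(state: RecoveryState) -> str:
    """Build the report header shown to the user (options are added separately)."""
    rule = "=" * 32
    tried = ", ".join(state.strategies_failed) or "none"
    return "\n".join([
        "ERROR ESCALATION",
        rule,
        f"Failed step : {state.step}",
        f"Error : {state.error}",
        f"Category : {state.category or 'Unknown'}",
        f"Tried : {tried}",
        "Result : All strategies exhausted",
        rule,
    ])
```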
Partial Success
For bulk operations where some items succeed and some fail:
PARTIAL SUCCESS
================================
Successful : N / Total
Failed : M items
================================
Failed items:
- [item]: [reason]
Options:
A) Retry only failed items
B) Continue with successful items, skip failed
C) Cancel all
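A bulk operation that produces this report can be sketched as below. The names run_bulk and render_partial_success are hypothetical, and the sketch assumes each item is processed by a single callable that raises on failure.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

Item = TypeVar("Item")


def run_bulk(
    items: Iterable[Item], action: Callable[[Item], None]
) -> Tuple[List[Item], List[Tuple[Item, str]]]:
    """Apply action to every item, collecting failures instead of aborting."""
    succeeded: List[Item] = []
    failed: List[Tuple[Item, str]] = []
    for item in items:
        try:
            action(item)
            succeeded.append(item)
        except Exception as exc:  # record the reason and keep going
            failed.append((item, str(exc)))
    return succeeded, failed


def render_partial_success(succeeded: List[Item], failed: List[Tuple[Item, str]]) -> str:
    """Build the partial-success summary (options are added separately)."""
    total = len(succeeded) + len(failed)
    lines = [
        "PARTIAL SUCCESS",
        "=" * 32,
        f"Successful : {len(succeeded)} / {total}",
        f"Failed : {len(failed)} items",
        "=" * 32,
        "Failed items:",
    ]
    lines += [f"- {item}: {reason}" for item, reason in failed]
    return "\n".join(lines)
```

Option A (retry only failed items) then amounts to calling run_bulk again with the items from the failed list.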
Error Log
Log every error and recovery attempt:
[ERROR LOG]
Step : [step name / number]
Error : [message]
Category : [type]
Attempt 1: [strategy] -> [result]
Attempt 2: [strategy] -> [result]
Result : Recovered / Escalated
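The log format above can be sketched as a small record type. ErrorLogEntry and its methods are hypothetical names. The sketch only builds the text, and where the log is stored is left open.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ErrorLogEntry:
    step: str
    error: str
    category: str
    attempts: List[Tuple[str, str]] = field(default_factory=list)  # (strategy, result)
    recovered: bool = False

    def record(self, strategy: str, result: str) -> None:
        """Append one recovery attempt, successful or not."""
        self.attempts.append((strategy, result))

    def render(self) -> str:
        """Format the entry in the layout shown above."""
        lines = [
            "[ERROR LOG]",
            f"Step : {self.step}",
            f"Error : {self.error}",
            f"Category : {self.category}",
        ]
        for number, (strategy, result) in enumerate(self.attempts, start=1):
            lines.append(f"Attempt {number}: {strategy} -> {result}")
        lines.append(f"Result : {'Recovered' if self.recovered else 'Escalated'}")
        return "\n".join(lines)
```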
When to Skip
Skip the recovery protocol when:
- The error is expected behavior (e.g., "file not found" when checking existence)
- The user said "ignore errors, continue"
- The task is one-off and will not be repeated
Guardrails
- Never blind-retry a logic error — retrying won't help, change the approach.
- Always log every attempt — even successful recoveries need a record.
- Cross-skill: integrates with checkpoint-guardian (risk assessment before retry), memory-ledger (logs errors and fixes), and agent-reviewer (retrospective analysis).