LangGraph Error Handling

Use This Skill For

Adding RetryPolicy to flaky nodes (API, DB, model/tool calls)
Designing LLM recovery loops (Command + error state + retry counters)
Adding human approval/escalation with interrupt() and resume
Handling prebuilt ToolNode failures
Debugging transactional failure behavior in parallel supersteps

Strategy Selection

Use this order:

Transient/infrastructure issue (429, timeout, 5xx, temporary DB lock) -> RetryPolicy
Recoverable if the model corrects its tool call or arguments -> store error in state and route back with Command
Needs user approval or missing info -> interrupt() + resume
Unknown/programming bug -> let it bubble up and debug

Error Type	Owner	Primary Mechanism
Transient	System	`RetryPolicy`
LLM-recoverable	LLM	State update + `Command(goto=...)`
User-fixable	Human	`interrupt()` + `Command(resume=...)`
Unexpected	Developer	Raise/log/debug

For full taxonomy, load references/error-types.md.
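
The selection order above can be sketched as a small dispatcher. This is an illustrative sketch only, not the contents of scripts/classify_error.py: `Strategy`, `ToolArgumentError`, `ApprovalRequired`, and `choose_strategy` are hypothetical names, and a real application would map its own exception types.

```python
from enum import Enum


class Strategy(Enum):
    RETRY = "RetryPolicy"
    LLM_RECOVERY = "state update + Command(goto=...)"
    HUMAN = "interrupt() + Command(resume=...)"
    BUBBLE_UP = "raise/log/debug"


# Hypothetical exception types standing in for an application's own errors.
class ToolArgumentError(Exception):
    """The model produced arguments that a tool rejected."""


class ApprovalRequired(Exception):
    """The action needs a human decision or missing information."""


def choose_strategy(exc: Exception) -> Strategy:
    """Check categories in order: transient, LLM-recoverable, user-fixable, unexpected."""
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return Strategy.RETRY
    if isinstance(exc, ToolArgumentError):
        return Strategy.LLM_RECOVERY
    if isinstance(exc, ApprovalRequired):
        return Strategy.HUMAN
    # Anything unrecognized is treated as a bug and left to bubble up.
    return Strategy.BUBBLE_UP
```

Checking the transient case first keeps cheap automatic retries ahead of the more expensive model and human paths.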

Minimal Patterns

1) Retry Transient Failures

from langgraph.types import RetryPolicy

builder.add_node(
    "call_api",
    call_api,
    retry_policy=RetryPolicy(max_attempts=3, initial_interval=1.0),
)

builder.addNode("callApi", callApi, {
  // initialInterval is in milliseconds in JS (seconds in Python).
  retryPolicy: { maxAttempts: 3, initialInterval: 1000 },
});

Notes:

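The default retry filter differs between Python and JS, so the same exception may be retried in one runtime and not in the other.
Set retry_on (Python) or retryOn (JS) explicitly when a node can also fail for non-transient reasons, so that only errors worth retrying are retried.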
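
A targeted retry filter can be written as a plain predicate. The sketch below assumes `retry_on` accepts a callable that takes the exception and returns a bool; `HttpError` and `is_transient` are hypothetical names introduced for illustration.

```python
class HttpError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status: int) -> None:
        super().__init__(f"HTTP {status}")
        self.status = status


def is_transient(exc: Exception) -> bool:
    """Return True only for failures that are worth retrying."""
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return True
    if isinstance(exc, HttpError):
        # 429 (rate limit) and 5xx are usually temporary; other 4xx are not.
        return exc.status == 429 or 500 <= exc.status < 600
    return False


# Intended use (not executed here, requires langgraph):
#   RetryPolicy(max_attempts=3, retry_on=is_transient)
```

Keeping the predicate free of LangGraph imports makes it easy to unit-test on its own.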

2) LLM Recovery Loop

Use MessagesState in Python for message state.

from typing import Literal
from typing_extensions import NotRequired
from langgraph.graph import MessagesState
from langgraph.types import Command

class State(MessagesState):
    error: NotRequired[str]
    retry_count: NotRequired[int]

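def agent(state: State) -> Command[Literal["tool", "__end__"]]:
    # Stop once the retry budget is spent so the loop always terminates.
    if state.get("retry_count", 0) >= 3:
        return Command(goto="__end__")
    if state.get("error"):
        # A previous tool call failed: count the attempt, then route back.
        return Command(
            update={"retry_count": state.get("retry_count", 0) + 1},
            goto="tool",
        )
    return Command(goto="tool")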

import { StateGraph, Command, END } from "@langchain/langgraph";

// If a node returns Command in JS, add `ends` on addNode.
builder.addNode("agent", agentNode, { ends: ["tool", END] });
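
The control flow of the recovery loop can be simulated without LangGraph. This is a plain-Python sketch of the idea (error stored in state, bounded retry counter), not LangGraph API: `run_tool`, `tool_node`, `run_loop`, and the `correct` callback that stands in for the model are all hypothetical.

```python
from typing import Any, Callable

MAX_RETRIES = 3  # same budget as the retry_count check in the agent node above

State = dict[str, Any]


def run_tool(args: dict[str, Any]) -> str:
    """Hypothetical tool that rejects a non-positive limit."""
    if args["limit"] <= 0:
        raise ValueError("limit must be positive")
    return f"fetched {args['limit']} rows"


def tool_node(state: State) -> State:
    """Return a state update: a result, or the error text for the model to read."""
    try:
        return {"result": run_tool(state["args"]), "error": None}
    except ValueError as exc:
        return {"error": str(exc)}


def run_loop(state: State, correct: Callable[[State], State]) -> State:
    """Alternate correction and tool call until success or the budget is spent."""
    while True:
        if state.get("retry_count", 0) >= MAX_RETRIES:
            return state  # terminal branch: give up instead of looping forever
        if state.get("error"):
            # `correct` stands in for the model revising its tool arguments.
            state = {**state, **correct(state),
                     "retry_count": state.get("retry_count", 0) + 1}
        state = {**state, **tool_node(state)}
        if state["error"] is None:
            return state
```

Calling `run_loop({"args": {"limit": 0}}, lambda s: {"args": {"limit": 10}})` models a successful correction; passing `lambda s: {}` models a model that never fixes its arguments and exercises the terminal branch.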

3) Human-In-The-Loop Escalation

from langgraph.types import interrupt, Command

def human_review(state):
    approved = interrupt({
        "question": "Proceed?",
        "payload": state["pending_action"],
    })
    return Command(goto="execute" if approved else "cancel")

# resume
graph.invoke(Command(resume=True), config={"configurable": {"thread_id": "t-1"}})

import { Command, interrupt } from "@langchain/langgraph";

// inside a node function
const approved = interrupt({ question: "Proceed?" });
// later, from the caller
await graph.invoke(new Command({ resume: true }), {
  configurable: { thread_id: "t-1" },
});

Requirements:

Compile with a checkpointer for interrupt flows.
Reuse the same thread_id on resume.

For deep HITL patterns, load references/human-escalation.md.

ToolNode Error Handling

from langgraph.prebuilt import ToolNode

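# True: catch any tool exception and return it to the model as an error ToolMessage.
tool_node = ToolNode(tools, handle_tool_errors=True)
# str: catch any tool exception and return this fixed message instead.
tool_node = ToolNode(tools, handle_tool_errors="Please try again.")
# Tuple of exception types: catch only these; anything else propagates.
tool_node = ToolNode(tools, handle_tool_errors=(ValueError, TypeError))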

Pass a custom handler (a callable that returns the error string) when you need deterministic error shaping for model recovery. For broader tool-recovery design, load references/llm-recovery.md.
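
A custom handler can be an ordinary function that maps an exception to a stable message. The sketch below assumes `handle_tool_errors` accepts such a callable; `format_tool_error` and the message wording are illustrative, not part of the library.

```python
def format_tool_error(exc: Exception) -> str:
    """Turn an exception into a short, stable message the model can act on."""
    if isinstance(exc, ValueError):
        return f"Invalid argument: {exc}. Correct the arguments and call the tool again."
    if isinstance(exc, TimeoutError):
        return "The tool timed out. Retry once, or answer without this tool."
    # Avoid leaking internals for anything unexpected.
    return f"Tool failed with {type(exc).__name__}. Do not retry with the same arguments."


# Intended use (not executed here, requires langgraph):
#   ToolNode(tools, handle_tool_errors=format_tool_error)
```

Fixed wording per error class keeps the model's recovery behavior easier to test than raw exception text.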

Critical Behavior (Do Not Skip)

Supersteps are transactional: if one parallel branch fails, none of that superstep's state updates are applied.
RetryPolicy retries failing branches, not successful siblings.
interrupt() re-runs the node from its start on resume: side effects placed before interrupt() must be idempotent, or be moved after interrupt() or into a separate node.
JS Command routing requires ends metadata on addNode(...).
Use explicit retry limits (max_attempts, plus state counters for recovery loops).
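
The interrupt() re-run rule in the list above can be illustrated with a plain-Python simulation. Nothing here is LangGraph API: `send_email_once`, `review_node`, and the in-memory stores are hypothetical stand-ins, and a real idempotency store would need to be durable.

```python
from typing import Optional

sent_emails: list[str] = []       # stands in for an external side effect
completed_keys: set[str] = set()  # stands in for a durable idempotency store


def send_email_once(key: str, body: str) -> None:
    """Perform the side effect only the first time a given key is seen."""
    if key in completed_keys:
        return
    sent_emails.append(body)
    completed_keys.add(key)


def review_node(thread_id: str, resume_value: Optional[bool]) -> str:
    """Hypothetical node body: side effect first, then a pause for approval."""
    send_email_once(f"{thread_id}:notify-reviewer", "Approval needed")
    if resume_value is None:
        return "paused"  # first pass: interrupt() would pause the graph here
    return "execute" if resume_value else "cancel"


# The first call models the initial run; the second models the re-run on resume.
first = review_node("t-1", None)
second = review_node("t-1", True)
```

Because the side effect is keyed, the second pass skips it and only one email is recorded, even though the node body ran twice.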

Local Assets In This Skill

Scripts

scripts/classify_error.py: classify exception category and recommended handling
scripts/wrap_with_retry.py: generate boilerplate node wrappers with retry/recovery/escalation options

Run from repo root:

uv run skills/langgraph-error-handling/scripts/classify_error.py TimeoutError --verbose
uv run skills/langgraph-error-handling/scripts/wrap_with_retry.py call_llm --with-llm-recovery

Examples

assets/examples/retry-example/: retry + recovery loop (Python and JS)
assets/examples/human-loop-example/: interrupt/resume approval flow (Python and JS)

Load References On Demand

references/error-types.md: error taxonomy and classification rules
references/retry-strategies.md: retry tuning, backoff, circuit-breaker-style patterns
references/llm-recovery.md: recovery-loop and ToolNode strategies
references/human-escalation.md: human approval, interrupts, and escalation patterns

Common Failure Modes

Symptom	Root Cause	Fix
`interrupt()` fails at runtime	no checkpointer	compile with checkpointer
Resume starts new run	different `thread_id`	reuse same `thread_id`
JS Command route not taken	missing `ends`	add `ends` to `addNode`
Infinite loop	no termination counter/condition	add retry counter + terminal branch
Retry never triggers	exception excluded by retry filter	set explicit `retry_on`/`retryOn`
