# Hive Debugger

An interactive debugging companion that helps developers identify and fix runtime issues in Hive agents. The debugger analyzes runtime logs at three levels (L1/L2/L3), categorizes issues, and provides actionable fix recommendations.
## When to Use This Skill

Use /hive-debugger when:

- Your agent is failing or producing unexpected results
- You need to understand why a specific node is retrying repeatedly
- Tool calls are failing and you need to identify the root cause
- Agent execution is stalled or taking too long
- You want to monitor agent behavior in real-time during development

This skill works alongside agents running in TUI mode and provides supervisor-level insights into execution behavior.
## Forever-Alive Agent Awareness

Some agents use `terminal_nodes=[]` (the "forever-alive" pattern), meaning they loop indefinitely and never enter a "completed" execution state. For these agents:

- Sessions with status "in_progress" or "paused" are normal, not failures
- High step counts, long durations, and many node visits are expected behavior
- The agent stops only when the user explicitly exits — there is no graph-driven completion
- Debug focus should be on the quality of individual node visits and iterations, not on whether the session reached a terminal state
- Conversation memory accumulates across loops — watch for context overflow and stale data issues

**How to identify forever-alive agents:** Check agent.py or agent.json for `terminal_nodes=[]` (an empty list). If empty, the agent is forever-alive.
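The check above can be sketched in a few lines. This is a minimal sketch, assuming terminal node IDs live under a `graph.terminal_nodes` key in agent.json; the actual key location in your export may differ, so verify against a real config first.

```python
import json

def is_forever_alive(agent_json_text: str) -> bool:
    """Return True if the config declares an explicitly empty terminal node list.

    Assumption: terminal nodes are stored at graph["terminal_nodes"].
    """
    config = json.loads(agent_json_text)
    return config.get("graph", {}).get("terminal_nodes") == []

# Hypothetical minimal config for illustration
sample = '{"graph": {"nodes": ["intake"], "edges": [], "terminal_nodes": []}}'
print(is_forever_alive(sample))  # → True
```

Note the `== []` comparison: a config with no `terminal_nodes` key at all is treated as not forever-alive, since only an explicitly empty list signals the pattern.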
## Prerequisites

Before using this skill, ensure:

- You have an exported agent in exports/{agent_name}/
- The agent has been run at least once (logs exist)
- Runtime logging is enabled (the default in the Hive framework)
- You have access to the agent's working directory at ~/.hive/agents/{agent_name}/
## Workflow

### Stage 1: Setup & Context Gathering

**Objective:** Understand the agent being debugged

**What to do:**

1. **Ask the developer which agent needs debugging:**
   - Get the agent name (e.g., "deep_research_agent")
   - Confirm the agent exists in exports/{agent_name}/

2. **Determine the agent working directory:**
   - Calculate: ~/.hive/agents/{agent_name}/
   - Verify this directory exists and contains session logs

3. **Read the agent configuration:**
   - Read the file exports/{agent_name}/agent.json
   - Extract goal information from the JSON:
     - goal.id: the goal identifier
     - goal.success_criteria: what success looks like
     - goal.constraints: rules the agent must follow
   - Extract graph information:
     - List of node IDs from graph.nodes
     - List of edges from graph.edges

4. **Store context for the debugging session:**
   - agent_name
   - agent_work_dir (e.g., /home/user/.hive/agents/deep_research_agent)
   - goal_id
   - success_criteria
   - constraints
   - node_ids
**Example:**

Developer: "My deep_research_agent agent keeps failing"

You: "I'll help debug the deep_research_agent agent. Let me gather context..."

[Read exports/deep_research_agent/agent.json]

Context gathered:
- Agent: deep_research_agent
- Goal: deep-research
- Working Directory: /home/user/.hive/agents/deep_research_agent
- Success Criteria: ["Produce a comprehensive research report with cited sources"]
- Constraints: ["Must cite all sources", "Must cover multiple perspectives"]
- Nodes: ["intake", "research", "analysis", "report-writer"]
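The extraction step in item 3 can be sketched as follows. This assumes the field layout shown in the example above (goal.id, goal.success_criteria, goal.constraints, graph.nodes); real agent.json files may nest these differently, so read the actual file before relying on any key path.

```python
import json

# Hypothetical agent.json content, trimmed to the fields Stage 1 needs
agent_json = """
{
  "goal": {
    "id": "deep-research",
    "success_criteria": ["Produce a comprehensive research report with cited sources"],
    "constraints": ["Must cite all sources", "Must cover multiple perspectives"]
  },
  "graph": {
    "nodes": [{"id": "intake"}, {"id": "research"}, {"id": "analysis"}, {"id": "report-writer"}],
    "edges": []
  }
}
"""

config = json.loads(agent_json)
context = {
    "goal_id": config["goal"]["id"],
    "success_criteria": config["goal"]["success_criteria"],
    "constraints": config["goal"]["constraints"],
    # Nodes may be plain strings or objects with an "id" field
    "node_ids": [n["id"] if isinstance(n, dict) else n for n in config["graph"]["nodes"]],
}
print(context["node_ids"])  # → ['intake', 'research', 'analysis', 'report-writer']
```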
### Stage 2: Mode Selection

**Objective:** Choose the debugging approach that best fits the situation

**What to do:**

Ask the developer which debugging mode they want to use. Use AskUserQuestion with these options:

1. **Real-time Monitoring Mode**
   - Description: Monitor the active TUI session continuously, poll logs every 5-10 seconds, and alert on new issues immediately
   - Best for: Live debugging sessions where you want to catch issues as they happen
   - Note: Requires the agent to be currently running

2. **Post-Mortem Analysis Mode**
   - Description: Analyze completed or failed runs in detail, with a deep dive into a specific session
   - Best for: Understanding why a past execution failed
   - Note: The most common mode for debugging

3. **Historical Trends Mode**
   - Description: Analyze patterns across multiple runs and identify recurring issues
   - Best for: Finding systemic problems that happen repeatedly
   - Note: Useful for agents that have run many times

**Implementation:**

Use AskUserQuestion to present these options and let the developer choose. Store the selected mode for the session.
### Stage 3: Triage (L1 Analysis)

**Objective:** Identify which sessions need attention

**What to do:**

**Query high-level run summaries using the MCP tool:**

```python
query_runtime_logs(
    agent_work_dir="{agent_work_dir}",
    status="needs_attention",
    limit=20
)
```

**Analyze the results:**

- Look for runs with needs_attention: true
- Check attention_summary.categories for issue types
- Note the run_id of problematic sessions
- Check the status field: "degraded", "failure", "in_progress"
- For forever-alive agents: sessions with status "in_progress" or "paused" are normal — these agents never reach "completed". Only flag sessions with needs_attention: true or actual error indicators (tool failures, retry loops, missing outputs). High step counts alone do not indicate a problem.

**Attention flag triggers to understand:** From runtime_logger.py, runs are flagged when:

- retry_count > 3
- escalate_count > 2
- latency_ms > 60000
- tokens_used > 100000
- total_steps > 20
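The trigger logic above can be approximated in a few lines. The real implementation lives in runtime_logger.py and may differ in detail; treat this as a sketch of the thresholds listed, not a copy of the source.

```python
# Thresholds mirror the list above; field names are assumed to match
# the run summary keys returned by query_runtime_logs.
THRESHOLDS = {
    "retry_count": 3,
    "escalate_count": 2,
    "latency_ms": 60_000,
    "tokens_used": 100_000,
    "total_steps": 20,
}

def attention_reasons(run_summary: dict) -> list[str]:
    """Return the metrics that exceed their flagging thresholds."""
    return [
        metric
        for metric, limit in THRESHOLDS.items()
        if run_summary.get(metric, 0) > limit
    ]

run = {"retry_count": 5, "latency_ms": 12_000, "tokens_used": 130_000}
print(attention_reasons(run))  # → ['retry_count', 'tokens_used']
```

A run is "needs_attention" when this list is non-empty; the individual entries correspond to the attention categories you triage in the next step.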
**Present findings to the developer:**

- Summarize how many runs need attention
- List the most recent problematic runs
- Show attention categories for each
- Ask which run they want to investigate (if multiple)
**Example Output:**

Found 2 runs needing attention:

1. session_20260206_115718_e22339c5 (30 minutes ago)
   Status: degraded
   Categories: missing_outputs, retry_loops

2. session_20260206_103422_9f8d1b2a (2 hours ago)
   Status: failure
   Categories: tool_failures, high_latency

Which run would you like to investigate?
### Stage 4: Diagnosis (L2 Analysis)

**Objective:** Identify which nodes failed and what patterns exist

**What to do:**

**Query per-node details using the MCP tool:**

```python
query_runtime_log_details(
    agent_work_dir="{agent_work_dir}",
    run_id="{selected_run_id}",
    needs_attention_only=True
)
```

**Categorize issues using the Issue Taxonomy:**
| Category | Detection Pattern | Meaning |
|---|---|---|
| Missing Outputs | exit_status != "success", attention_reasons contains "missing_outputs" | Node didn't call set_output with required keys |
| Tool Errors | tool_error_count > 0, attention_reasons contains "tool_failures" | Tool calls failed (API errors, timeouts, auth issues) |
| Retry Loops | retry_count > 3, verdict_counts.RETRY > 5 | Judge repeatedly rejecting outputs |
| Guard Failures | guard_reject_count > 0 | Output validation failed (wrong types, missing keys) |
| Stalled Execution | total_steps > 20, verdict_counts.CONTINUE > 10 | EventLoopNode not making progress. Caveat: forever-alive agents may legitimately have high step counts; check whether the agent is blocked at a client-facing node (normal) or genuinely stuck in a loop |
| High Latency | latency_ms > 60000, avg_step_latency > 5000 | Slow tool calls or LLM responses |
| Client-Facing Issues | client_input_requested but no user_input_received | Premature set_output before user input |
| Edge Routing Errors | exit_status == "no_valid_edge", attention_reasons contains "routing_issue" | No edges match the current state |
| Memory/Context Issues | tokens_used > 100000, context_overflow_count > 0 | Conversation history too long |
| Constraint Violations | Compare output against goal constraints | Agent violated goal-level rules |
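A first-pass categorizer over the node details can be sketched as below. The field names follow the detection patterns in the table; only a subset of categories is shown, and a non-success exit_status is treated here as a rough Missing Outputs signal, which is weaker than checking attention_reasons directly.

```python
def categorize_node(details: dict) -> list[str]:
    """Map L2 node details to taxonomy categories (partial sketch).

    Thresholds mirror the table above; field names are assumptions
    about the query_runtime_log_details payload.
    """
    categories = []
    if details.get("exit_status") != "success":
        categories.append("Missing Outputs")
    if details.get("tool_error_count", 0) > 0:
        categories.append("Tool Errors")
    if details.get("retry_count", 0) > 3:
        categories.append("Retry Loops")
    if details.get("guard_reject_count", 0) > 0:
        categories.append("Guard Failures")
    if details.get("total_steps", 0) > 20:
        categories.append("Stalled Execution")
    return categories

node = {"exit_status": "escalate", "retry_count": 5, "total_steps": 8}
print(categorize_node(node))  # → ['Missing Outputs', 'Retry Loops']
```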
**Forever-Alive Agent Caveat:** If the agent uses `terminal_nodes=[]`, sessions will never reach "completed" status. This is by design. When debugging these agents, focus on:

- Whether individual node visits succeed (not whether the graph "finishes")
- Quality of each loop iteration — are outputs improving or degrading across loops?
- Whether client-facing nodes are correctly blocking for user input
- Memory accumulation issues: stale data from previous loops, context overflow across many iterations
- Conversation compaction behavior: is the conversation growing unbounded?
**Analyze each flagged node:**

- Node ID and name
- Exit status
- Retry count
- Verdict distribution (ACCEPT/RETRY/ESCALATE/CONTINUE)
- Attention reasons
- Total steps executed

**Present the diagnosis to the developer:**

- List the problematic nodes
- Categorize each issue
- Highlight the most severe problems
- Show evidence (retry counts, error types)
**Example Output:**

Diagnosis for session_20260206_115718_e22339c5:

Problem Node: research
├─ Exit Status: escalate
├─ Retry Count: 5 (HIGH)
├─ Verdict Counts: {RETRY: 5, ESCALATE: 1}
├─ Attention Reasons: ["high_retry_count", "missing_outputs"]
├─ Total Steps: 8
└─ Categories: Missing Outputs + Retry Loops

Root Issue: The research node is stuck in a retry loop because it's not setting required outputs.
### Stage 5: Root Cause Analysis (L3 Analysis)

**Objective:** Understand exactly what went wrong by examining detailed logs

**What to do:**

**Query detailed tool/LLM logs using the MCP tool:**

```python
query_runtime_log_raw(
    agent_work_dir="{agent_work_dir}",
    run_id="{run_id}",
    node_id="{problem_node_id}"
)
```
**Analyze based on the issue category:**

For Missing Outputs:
- Check step.tool_calls for set_output usage
- Look for conditional logic that skipped set_output
- Check whether the LLM is calling other tools instead

For Tool Errors:
- Check step.tool_results for error messages
- Identify error types: rate limits, auth failures, timeouts, network errors
- Note which specific tool is failing

For Retry Loops:
- Check step.verdict_feedback from the judge
- Look for repeated failure reasons
- Identify whether it's the same issue every time

For Guard Failures:
- Check step.guard_results for validation errors
- Identify missing keys or type mismatches
- Compare the actual output to the expected schema

For Stalled Execution:
- Check step.llm_response_text for repetition
- Look for the LLM stuck in the same action loop
- Check whether tool calls are succeeding but not progressing
**Extract evidence:**

- Specific error messages
- Tool call arguments and results
- LLM response text
- Judge feedback
- Step-by-step progression

**Formulate a root cause explanation:**

- Clearly state what is happening
- Explain why it is happening
- Show evidence from the logs
Example Output:
Root Cause Analysis for research:
Step-by-step breakdown:
Step 3:
- Tool Call: web_search(query="latest AI regulations 2026")
- Result: Found relevant articles and sources
- Verdict: RETRY
- Feedback: "Missing required output 'research_findings'. You found sources but didn't call set_output."
Step 4:
- Tool Call: web_search(query="AI regulation policy 2026")
- Result: Found additional policy information
- Verdict: RETRY
- Feedback: "Still missing 'research_findings'. Use set_output to save your findings."
Steps 5-7: Similar pattern continues...
ROOT CAUSE: The node is successfully finding research sources via web_search, but the LLM is not calling set_output to save the results. It keeps searching for more information instead of completing the task.
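Spotting a repeated-feedback pattern like the one above can be partially automated. This sketch counts the quoted output keys mentioned in judge feedback across RETRY steps; the field names (verdict, verdict_feedback) mirror the log excerpts above, and the single-quoted key convention is an assumption about how the judge phrases feedback.

```python
import re
from collections import Counter

def repeated_missing_keys(steps: list[dict]) -> Counter:
    """Count output keys that judge feedback mentions across RETRY steps."""
    keys: list[str] = []
    for step in steps:
        if step.get("verdict") == "RETRY":
            # Pull single-quoted identifiers out of the feedback text
            keys += re.findall(r"'(\w+)'", step.get("verdict_feedback", ""))
    return Counter(keys)

steps = [
    {"verdict": "RETRY", "verdict_feedback": "Missing required output 'research_findings'."},
    {"verdict": "RETRY", "verdict_feedback": "Still missing 'research_findings'."},
]
print(repeated_missing_keys(steps).most_common(1))  # → [('research_findings', 2)]
```

A key that recurs across most RETRY steps is strong evidence of a single root cause, as opposed to a series of unrelated transient failures.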
### Stage 6: Fix Recommendations

**Objective:** Provide actionable solutions the developer can implement

**What to do:**

Based on the issue category identified, provide specific fix recommendations using these templates:
#### Template 1: Missing Outputs (Client-Facing Nodes)

````markdown
## Issue: Premature set_output in Client-Facing Node

**Root Cause:** Node called set_output before receiving user input

**Fix:** Use the STEP 1/STEP 2 prompt pattern

**File to edit:** `exports/{agent_name}/nodes/{node_name}.py`

**Changes:**

1. Update the system_prompt to include explicit step guidance:

```python
system_prompt = """
STEP 1: Analyze the user input and decide what action to take.
DO NOT call set_output in this step.

STEP 2: After receiving feedback or completing analysis,
ONLY THEN call set_output with your results.
"""
```

2. If some inputs are optional (like feedback on retry edges), add nullable_output_keys:

```python
nullable_output_keys=["feedback"]
```

**Verification:**
- Run the agent with test input
- Verify the client-facing node waits for user input before calling set_output
````
#### Template 2: Retry Loops

````markdown
## Issue: Judge Repeatedly Rejecting Outputs

**Root Cause:** {Insert specific reason from verdict_feedback}

**Fix Options:**

**Option A - If outputs are actually correct:** Adjust judge evaluation rules
- File: `exports/{agent_name}/agent.json`
- Update the `evaluation_rules` section to accept the current output format
- Example: If the judge expects a list but gets a string, update the rule to accept both

**Option B - If the prompt is ambiguous:** Clarify node instructions
- File: `exports/{agent_name}/nodes/{node_name}.py`
- Make the system_prompt more explicit about output format and requirements
- Add examples of correct outputs

**Option C - If the tool is unreliable:** Add retry logic with a fallback
- Consider using alternative tools
- Add a manual fallback option
- Update the prompt to handle tool failures gracefully

**Verification:**
- Run the node with test input
- Confirm the judge accepts the output on the first try
- Check that retry_count stays at 0
````
#### Template 3: Tool Errors

````markdown
## Issue: {tool_name} Failing with {error_type}

**Root Cause:** {Insert specific error message from logs}

**Fix Strategy:**

**If API rate limit:**
1. Add exponential backoff to the tool retry logic
2. Reduce API call frequency
3. Consider caching results

**If auth failure:**
1. Check credentials using:
```bash
/hive-credentials --agent {agent_name}
```
2. Verify API key environment variables
3. Update mcp_servers.json if needed

**If timeout:**
1. Increase the timeout in mcp_servers.json:
```json
{
  "timeout_ms": 60000
}
```
2. Consider using faster alternative tools
3. Break large requests into smaller chunks

**Verification:**
- Test the tool call manually
- Confirm a successful response
- Monitor for recurring errors
````
#### Template 4: Edge Routing Errors

````markdown
## Issue: No Valid Edge from Node {node_id}

**Root Cause:** No edge condition matched the current state

**File to edit:** `exports/{agent_name}/agent.json`

**Analysis:**
- Current node output: {show actual output keys}
- Existing edge conditions: {list edge conditions}
- Why no match: {explain the mismatch}

**Fix:**

Add the missing edge to the graph:

```json
{
  "edge_id": "{node_id}_to_{target_node}",
  "source": "{node_id}",
  "target": "{target_node}",
  "condition": "on_success"
}
```

**Alternative:** Update an existing edge condition to cover this case

**Verification:**
- Run the agent with the same input
- Verify the edge is traversed successfully
- Check that execution continues to the next node
````
#### Template 5: Stalled Execution

````markdown
## Issue: EventLoopNode Not Making Progress

**Root Cause:** {Insert analysis - e.g., "LLM repeating the same failed action"}

**File to edit:** `exports/{agent_name}/nodes/{node_name}.py`

**Fix:** Update the system_prompt to guide the LLM out of loops

**Add this guidance:**

```python
system_prompt = """
{existing prompt}

IMPORTANT: If a tool call fails multiple times:
1. Try an alternative approach or a different tool
2. If no alternatives work, call set_output with partial results
3. DO NOT retry the same failed action more than 3 times

Progress is more important than perfection. Move forward even with incomplete data.
"""
```

**Additional fix:** Lower the node visit limit to prevent infinite loops:

```python
# In the node configuration
max_node_visits=3  # Prevent getting stuck
```

**Verification:**
- Run the node with the same input that caused the stall
- Verify it exits after reasonable attempts (< 10 steps)
- Confirm it calls set_output eventually
````
#### Template 6: Checkpoint Recovery (Post-Fix Resume)

````markdown
## Recovery Strategy: Resume from Last Clean Checkpoint

**Situation:** You've fixed the issue, but the failed session is stuck mid-execution

**Solution:** Resume execution from a checkpoint before the failure

### Option A: Auto-Resume from Latest Checkpoint (Recommended)

Use CLI arguments to auto-resume when launching the TUI:

```bash
PYTHONPATH=core:exports python -m {agent_name} --tui \
  --resume-session {session_id}
```

This will:
- Load session state from state.json
- Continue from where it paused/failed
- Apply your fixes immediately

### Option B: Resume from Specific Checkpoint (Time-Travel)

If you need to go back to an earlier point:

```bash
PYTHONPATH=core:exports python -m {agent_name} --tui \
  --resume-session {session_id} \
  --checkpoint {checkpoint_id}
```

Example:

```bash
PYTHONPATH=core:exports python -m deep_research_agent --tui \
  --resume-session session_20260208_143022_abc12345 \
  --checkpoint cp_node_complete_intake_143030
```

### Option C: Use TUI Commands

Alternatively, launch the TUI normally and use commands:

```bash
# Launch TUI
PYTHONPATH=core:exports python -m {agent_name} --tui

# In TUI, use commands:
/resume {session_id}                   # Resume from session state
/recover {session_id} {checkpoint_id}  # Recover from specific checkpoint
```

**When to Use Each Option:**

Use /resume (or --resume-session) when:
- You fixed credentials and want to retry
- The agent paused and you want to continue
- The agent failed and you want to retry from the last state

Use /recover (or --resume-session + --checkpoint) when:
- You need to go back to an earlier checkpoint
- You want to try a different path from a specific point
- Debugging requires time-travel to an earlier state

**Find Available Checkpoints:**

Use MCP tools to programmatically find and inspect checkpoints:

```python
# List all sessions to find the failed one
list_agent_sessions(agent_work_dir="~/.hive/agents/{agent_name}", status="failed")

# Inspect session state
get_agent_session_state(agent_work_dir="~/.hive/agents/{agent_name}", session_id="{session_id}")

# Find clean checkpoints to resume from
list_agent_checkpoints(agent_work_dir="~/.hive/agents/{agent_name}", session_id="{session_id}", is_clean="true")

# Compare checkpoints to understand what changed
compare_agent_checkpoints(
    agent_work_dir="~/.hive/agents/{agent_name}",
    session_id="{session_id}",
    checkpoint_id_before="cp_node_complete_intake_143030",
    checkpoint_id_after="cp_node_complete_research_143115"
)

# Inspect memory at a specific checkpoint
get_agent_checkpoint(agent_work_dir="~/.hive/agents/{agent_name}", session_id="{session_id}", checkpoint_id="cp_node_complete_intake_143030")
```

Or in the TUI:

```
/sessions {session_id}
```

**Verification:**
- Use --resume-session to test your fix immediately
- No need to re-run from the beginning
- The session continues with your code changes applied
````
**Selecting the right template:**
- Match the issue category from Stage 4
- Customize with specific details from Stage 5
- Include actual error messages and code snippets
- Provide file paths and line numbers when possible
- **Always include recovery commands** (Template 6) after providing fix recommendations
---
### Stage 7: Verification Support
**Objective:** Help the developer confirm their fixes work
**What to do:**
1. **Suggest appropriate tests based on fix type:**

   **For node-level fixes:**

   ```bash
   # Use hive-test to run goal-based tests
   /hive-test --agent {agent_name} --goal {goal_id}

   # Or run specific test scenarios
   /hive-test --agent {agent_name} --scenario {specific_input}
   ```

   **For quick manual tests:**

   ```bash
   # Launch the interactive TUI dashboard
   hive tui
   ```

   Then use the arrow keys to select the agent from the list and press Enter to run it.

2. **Provide MCP tool queries to validate the fix:**

   ```python
   # Check whether the issue is resolved
   query_runtime_logs(
       agent_work_dir="~/.hive/agents/{agent_name}",
       status="needs_attention",
       limit=5
   )
   # Should show 0 results if fully fixed

   # Verify specific node behavior
   query_runtime_log_details(
       agent_work_dir="~/.hive/agents/{agent_name}",
       run_id="{new_run_id}",
       node_id="{fixed_node_id}"
   )
   # Should show exit_status="success", retry_count=0
   ```

3. **Monitor for regression:**
   - Run the agent multiple times
   - Check for similar issues reappearing
   - Verify the fix works across different inputs

4. **Provide a verification checklist:**

   Verification Checklist:
   □ Applied recommended fix to code
   □ Ran agent with test input
   □ Checked runtime logs show no attention flags
   □ Verified specific node completes successfully
   □ Tested with multiple inputs
   □ No regression of original issue
   □ Agent meets success criteria
Example interaction:
Developer: "I applied the fix to research. How do I verify it works?"
You: "Great! Let's verify the fix with these steps:
1. Launch the TUI dashboard:
hive tui
Then select your agent from the list and press Enter to run it.
2. After it completes, check the logs:
[Use query_runtime_logs to check for attention flags]
3. Verify the specific node:
[Use query_runtime_log_details for research]
Expected results:
- No 'needs_attention' flags
- research shows exit_status='success'
- retry_count should be 0
Let me know when you've run it and I'll help check the logs!"
## MCP Tool Usage Guide

### Three Levels of Observability

**L1: query_runtime_logs** - Session-level summaries
- When to use: Initial triage, identifying problematic runs, monitoring trends
- Returns: List of runs with status, attention flags, timestamps
- Example:

```python
query_runtime_logs(
    agent_work_dir="/home/user/.hive/agents/deep_research_agent",
    status="needs_attention",
    limit=20
)
```

**L2: query_runtime_log_details** - Node-level details
- When to use: Diagnosing which nodes failed, understanding retry patterns
- Returns: Per-node completion details, retry counts, verdicts
- Example:

```python
query_runtime_log_details(
    agent_work_dir="/home/user/.hive/agents/deep_research_agent",
    run_id="session_20260206_115718_e22339c5",
    needs_attention_only=True
)
```

**L3: query_runtime_log_raw** - Step-level details
- When to use: Root cause analysis, understanding exact failures
- Returns: Full tool calls, LLM responses, judge feedback
- Example:

```python
query_runtime_log_raw(
    agent_work_dir="/home/user/.hive/agents/deep_research_agent",
    run_id="session_20260206_115718_e22339c5",
    node_id="research"
)
```
### Session & Checkpoint Tools

**list_agent_sessions** - Browse sessions with filtering
- When to use: Finding resumable sessions, identifying failed sessions, Stage 3 triage
- Returns: Session list with status, timestamps, is_resumable, current_node, quality
- Example:

```python
list_agent_sessions(
    agent_work_dir="/home/user/.hive/agents/twitter_outreach",
    status="failed",
    limit=10
)
```

**get_agent_session_state** - Load full session state (excludes memory values)
- When to use: Inspecting session progress, checking is_resumable, examining the path
- Returns: Full state with memory_keys/memory_size instead of memory values
- Example:

```python
get_agent_session_state(
    agent_work_dir="/home/user/.hive/agents/twitter_outreach",
    session_id="session_20260208_143022_abc12345"
)
```

**get_agent_session_memory** - Get memory contents from a session
- When to use: Stage 5 root cause analysis, inspecting produced data
- Returns: All memory keys and values, or a single key's value
- Example:

```python
get_agent_session_memory(
    agent_work_dir="/home/user/.hive/agents/twitter_outreach",
    session_id="session_20260208_143022_abc12345",
    key="twitter_handles"
)
```

**list_agent_checkpoints** - List checkpoints for a session
- When to use: Stage 6 recovery, finding clean checkpoints to resume from
- Returns: Checkpoint summaries with type, node, clean status
- Example:

```python
list_agent_checkpoints(
    agent_work_dir="/home/user/.hive/agents/twitter_outreach",
    session_id="session_20260208_143022_abc12345",
    is_clean="true"
)
```

**get_agent_checkpoint** - Load a specific checkpoint with full state
- When to use: Inspecting the exact state at a checkpoint, comparing it to the current state
- Returns: Full checkpoint: memory snapshot, execution path, metrics
- Example:

```python
get_agent_checkpoint(
    agent_work_dir="/home/user/.hive/agents/twitter_outreach",
    session_id="session_20260208_143022_abc12345",
    checkpoint_id="cp_node_complete_intake_143030"
)
```

**compare_agent_checkpoints** - Diff memory between two checkpoints
- When to use: Understanding data flow, finding where state diverged
- Returns: Memory diff (added/removed/changed keys) + execution path diff
- Example:

```python
compare_agent_checkpoints(
    agent_work_dir="/home/user/.hive/agents/twitter_outreach",
    session_id="session_20260208_143022_abc12345",
    checkpoint_id_before="cp_node_complete_intake_143030",
    checkpoint_id_after="cp_node_complete_research_143115"
)
```
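The added/removed/changed diff that compare_agent_checkpoints is described as returning can be re-implemented locally for illustration. This is a sketch over plain dicts; the MCP tool's actual output format may differ.

```python
def diff_memory(before: dict, after: dict) -> dict:
    """Compute added/removed/changed keys between two memory snapshots."""
    return {
        "added": sorted(set(after) - set(before)),
        "removed": sorted(set(before) - set(after)),
        "changed": sorted(
            k for k in set(before) & set(after) if before[k] != after[k]
        ),
    }

# Hypothetical memory snapshots from two consecutive checkpoints
before = {"topic": "AI regulation", "sources": []}
after = {"topic": "AI regulation", "sources": ["url1"], "research_findings": ["..."]}
print(diff_memory(before, after))
# → {'added': ['research_findings'], 'removed': [], 'changed': ['sources']}
```

Reading the diff this way tells you exactly what a node contributed: here the research step added `research_findings` and updated `sources`, which is the kind of evidence Pattern 4 below relies on.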
### Query Patterns

**Pattern 1: Top-Down Investigation (most common)**
1. L1: Find problematic runs
2. L2: Identify failing nodes
3. L3: Analyze specific failures

**Pattern 2: Node-Specific Debugging**
1. L2: Get details for a specific node across all runs
2. L3: Deep dive into the worst failures

**Pattern 3: Real-Time Monitoring**
Loop every 10 seconds:
1. L1: Check for new needs_attention runs
2. If found: Alert and drill into L2
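The monitoring loop above can be sketched as follows. Here `query_fn` stands in for the query_runtime_logs MCP call, and the loop is bounded by `max_polls` for demonstration; real monitoring would run until the developer interrupts it.

```python
import time

def poll_for_attention(query_fn, interval_s: float = 10.0, max_polls: int = 3):
    """Poll L1 summaries and surface runs not seen on earlier polls.

    query_fn must return a list of run dicts carrying a "run_id" field
    (an assumption about the L1 payload shape).
    """
    seen: set[str] = set()
    alerts = []
    for _ in range(max_polls):
        for run in query_fn():
            if run["run_id"] not in seen:
                seen.add(run["run_id"])
                alerts.append(run["run_id"])  # in practice: drill into L2 here
        time.sleep(interval_s)
    return alerts

# Stubbed query returning one flagged run on every poll
runs = [{"run_id": "session_20260206_115718_e22339c5"}]
print(poll_for_attention(lambda: runs, interval_s=0.0, max_polls=2))
# → ['session_20260206_115718_e22339c5']
```

Tracking the `seen` set is what makes the loop alert only once per run rather than on every poll.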
**Pattern 4: Session State + Checkpoint Recovery**
1. list_agent_sessions: Find failed/paused sessions
2. get_agent_session_state: Check is_resumable, see the execution path
3. get_agent_session_memory: Inspect what data was produced
4. list_agent_checkpoints: Find clean checkpoints before the failure
5. compare_agent_checkpoints: Understand what changed between checkpoints
6. Recommend a resume command with a specific checkpoint
## Complete Example Walkthrough

### Scenario: deep_research_agent Failing

Developer says: "My deep_research_agent keeps failing. It's not saving research findings correctly."

#### Stage 1: Setup

You: "I'll help debug the deep_research_agent agent. Let me gather context..."

[Read exports/deep_research_agent/agent.json]

Context:
- Agent: deep_research_agent
- Goal: deep-research
- Working Dir: ~/.hive/agents/deep_research_agent
- Success Criteria: ["Produce a comprehensive research report with cited sources"]
- Constraints: ["Must cite all sources", "Must cover multiple perspectives"]
- Nodes: intake, research, analysis, report-writer
#### Stage 2: Mode Selection
You: "Which debugging approach would you prefer?"
[Present 3 mode options using AskUserQuestion]
Developer selects: "Post-Mortem Analysis Mode"
#### Stage 3: Triage (L1)
You: "Querying runtime logs for problematic runs..."
[Call query_runtime_logs with status="needs_attention"]
Results:
- Found 1 run needing attention
- session_20260206_115718_e22339c5 (30 min ago)
- Status: degraded
- Categories: missing_outputs, retry_loops
#### Stage 4: Diagnosis (L2)
You: "Analyzing node-level details..."
[Call query_runtime_log_details for the session]
Results:
- Problem Node: research
- Exit Status: escalate
- Retry Count: 5
- Verdict Counts: {RETRY: 5, ESCALATE: 1}
- Attention Reasons: ["high_retry_count", "missing_outputs"]
- Total Steps: 8
Diagnosis: research has retry loop + missing outputs issue
Category: Missing Outputs + Retry Loops
#### Stage 5: Root Cause (L3)
You: "Drilling into detailed logs for research..."
[Call query_runtime_log_raw for node research]
Evidence:
Step 3: web_search("latest AI regulations 2026") → Found articles
Verdict: RETRY
Feedback: "Missing required output 'research_findings'. You found sources but didn't call set_output."
Step 4: web_search("AI regulation policy 2026") → Found more info
Verdict: RETRY
Feedback: "Still missing 'research_findings'. Use set_output to save your findings."
Steps 5-7: Same pattern repeats...
ROOT CAUSE: The node successfully finds sources, but the LLM doesn't call set_output. It keeps searching instead of completing.
#### Stage 6: Fix Recommendation

You: "Here's the fix for this issue:

## Issue: research Node Not Setting Outputs

**Root Cause:** The node successfully finds research sources but doesn't call set_output

**Category:** Missing Outputs

**File to edit:** exports/deep_research_agent/nodes/research.py

**Changes:**

Update the system_prompt to emphasize the output requirement:

```python
system_prompt = """
You are the Research node for the deep research agent.

Your task:
1. Search for relevant sources on the research topic
2. Extract key findings from each source
3. CRITICAL: Call set_output(research_findings=[...]) immediately after gathering sufficient findings

IMPORTANT: You MUST call set_output even if you found findings on the first search.
Do not continue searching indefinitely - set output and finish once you have enough data.

Output format:
{
  "research_findings": [{"source": "...", "summary": "...", "url": "..."}, ...]
}
"""
```

**Why this works:** The explicit instruction to call set_output immediately prevents the LLM from continuing to search after finding sufficient sources."
#### Stage 7: Verification
Developer: "I updated the prompt. How do I test it?"
You: "Let's verify the fix:

1. Launch the TUI dashboard:

   hive tui

   Then select your agent and press Enter to run it.

2. I'll check the logs after it runs..."
[Developer runs agent]
Developer: "It completed! Can you check if it worked?"
You: "Checking now..."
[Call query_runtime_log_details for the new run]
Results:
- Node: research
- Exit Status: success ✓
- Retry Count: 0 ✓
- Total Steps: 2 ✓
- No attention flags ✓
"Perfect! The fix worked. The research now:
- Completes successfully on first try
- No retries needed
- Calls set_output properly
Your agent should now work correctly!"
---
## Tips for Effective Debugging
1. **Always start with L1 logs** - Don't jump straight to detailed logs
2. **Focus on attention flags** - They highlight the real issues
3. **Compare verdict_feedback across steps** - Patterns reveal root causes
4. **Check tool error messages carefully** - They often contain the exact problem
5. **Consider the agent's goal** - Fixes should align with success criteria
6. **Test fixes immediately** - Quick verification prevents wasted effort
7. **Look for patterns across multiple runs** - One-time failures might be transient
## Common Pitfalls to Avoid
1. **Don't recommend code you haven't verified exists** - Always read files first
2. **Don't assume tool capabilities** - Check MCP server configs
3. **Don't ignore edge conditions** - Missing edges cause routing failures
4. **Don't overlook judge configuration** - Mismatched expectations cause retry loops
5. **Don't forget nullable_output_keys** - Optional inputs need explicit marking
6. **Don't diagnose "in_progress" as a failure for forever-alive agents** - Agents with `terminal_nodes=[]` are designed to never enter "completed" state. This is intentional. Focus on quality of individual node visits, not session completion status
7. **Don't ignore conversation memory issues in long-running sessions** - In continuous conversation mode, history grows across node transitions and loop iterations. Watch for context overflow (tokens_used > 100K), stale data from previous loops affecting edge conditions, and compaction failures that cause the LLM to lose important context
8. **Don't confuse "waiting for user" with "stalled"** - Client-facing nodes in forever-alive agents block for user input by design. A session paused at a client-facing node is working correctly, not stalled
---
## Storage Locations Reference
**New unified storage (default):**
- Logs: `~/.hive/agents/{agent_name}/sessions/session_YYYYMMDD_HHMMSS_{uuid}/logs/`
- State: `~/.hive/agents/{agent_name}/sessions/{session_id}/state.json`
- Conversations: `~/.hive/agents/{agent_name}/sessions/{session_id}/conversations/`
**Old storage (deprecated, still supported):**
- Logs: `~/.hive/agents/{agent_name}/runtime_logs/runs/{run_id}/`
The MCP tools automatically check both locations.
---
**Remember:** Your role is to be a debugging companion and thought partner. Guide the developer through the investigation, explain what you find, and provide actionable fixes. Don't just report errors - help understand and solve them.