Silent Failure Detection
Detect when LLM agents fail silently - appearing to work while producing incorrect results.
Overview
- Detecting when agents skip expected tool calls
- Identifying gibberish or degraded output quality
- Monitoring for infinite loops and token consumption spikes
- Setting up statistical baselines for anomaly detection
- Alerting on non-error failures (service up but logic broken)
Quick Reference
Tool Skipping Detection
    from langfuse import Langfuse


    def check_tool_usage(trace_id: str, expected_tools: list[str]) -> dict:
        """Detect when an agent skips expected tool calls.

        Based on Akamai's middleware bug: agents stopped using tools
        when hidden middleware injected unexpected instructions.
        """
        langfuse = Langfuse()
        # In the v2 Python SDK, fetch_trace() returns a response wrapper;
        # the trace itself is on `.data`.
        trace = langfuse.fetch_trace(trace_id).data

        # Extract tool calls from the trace. This assumes tool calls are
        # recorded as observations of type "tool"; adjust the filter to
        # match how your agent is instrumented.
        actual_tools = [
            obs.name for obs in trace.observations
            if obs.type == "tool"
        ]

        missing_tools = set(expected_tools) - set(actual_tools)
        if missing_tools:
            return {
                "alert": True,
                "type": "tool_skipping",
                "missing": sorted(missing_tools),
                "message": f"Agent skipped expected tools: {sorted(missing_tools)}"
            }
        return {"alert": False}
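The comparison at the heart of `check_tool_usage` does not depend on Langfuse, so it can be unit-tested on its own. The sketch below assumes the observations have already been flattened into plain dicts with `name` and `type` keys. That shape and the helper name `find_missing_tools` are illustrative assumptions, not part of the Langfuse API.

```python
from typing import Iterable


def find_missing_tools(
    expected_tools: Iterable[str],
    observations: Iterable[dict],
    tool_type: str = "tool",
) -> list[str]:
    """Return expected tools that never appear among the tool observations.

    `observations` is a hypothetical, pre-flattened list of dicts such as
    {"name": "search", "type": "tool"}; it is not a Langfuse object.
    """
    called = {
        obs["name"]
        for obs in observations
        if str(obs.get("type", "")).lower() == tool_type
    }
    # Preserve the caller's ordering so alert messages are stable across runs.
    return [tool for tool in expected_tools if tool not in called]


observations = [
    {"name": "search", "type": "tool"},
    {"name": "llm_call", "type": "generation"},
]
missing = find_missing_tools(["search", "calculate"], observations)
print(missing)  # ['calculate']
```

Keeping the set logic separate from trace fetching means the alerting rule can be tested without network access or credentials.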
Gibberish/Quality Detection
    from langfuse.decorators import observe, langfuse_context

    # `llm` is assumed to be an async client defined elsewhere whose
    # generate(prompt) coroutine returns the model's text reply.


    @observe(name="quality_check")
    async def detect_gibberish(response: str) -> dict:
        """Detect low-quality or gibberish outputs using LLM-as-judge."""
        # Quick heuristics first
        words = response.split()
        if len(response) < 10 or not words:
            return {"alert": True, "type": "too_short"}
        if len(set(words)) / len(words) < 0.3:
            return {"alert": True, "type": "repetitive"}

        # LLM-as-judge for quality
        judge_prompt = f"""
        Rate this response quality (0-1):
        - 0: Gibberish, nonsensical, or completely wrong
        - 0.5: Partially correct but missing key information
        - 1: High quality, accurate, complete
        Response: {response[:1000]}
        Score (just the number):
        """
        score = await llm.generate(judge_prompt)
        try:
            score_value = float(score.strip())
        except ValueError:
            # The judge did not return a bare number; report that explicitly
            # instead of crashing the check.
            return {"alert": True, "type": "judge_unparseable"}

        # Attach the score to the current observation (v2 decorator API).
        langfuse_context.score_current_observation(
            name="quality_check", value=score_value
        )
        if score_value < 0.5:
            return {"alert": True, "type": "low_quality", "score": score_value}
        return {"alert": False, "score": score_value}
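The heuristic pre-filter and the score parsing are the two parts of the quality check that can fail without any LLM involved, so it may help to isolate them. The following is a minimal sketch using only the standard library. The function names `heuristic_prefilter` and `parse_judge_score` are hypothetical, and the thresholds simply mirror the values used above.

```python
import re
from typing import Optional


def heuristic_prefilter(
    response: str, min_chars: int = 10, min_unique_ratio: float = 0.3
) -> dict:
    """Cheap checks to run before spending tokens on an LLM judge."""
    words = response.split()
    # Whitespace-only input has no words; treat it like a too-short reply
    # so the ratio below never divides by zero.
    if len(response.strip()) < min_chars or not words:
        return {"alert": True, "type": "too_short"}
    if len(set(words)) / len(words) < min_unique_ratio:
        return {"alert": True, "type": "repetitive"}
    return {"alert": False}


def parse_judge_score(raw: str) -> Optional[float]:
    """Extract the first number from the judge's reply.

    Returns None when no number is found or when it falls outside [0, 1],
    so the caller can treat an unusable verdict as its own failure mode.
    """
    match = re.search(r"\d*\.\d+|\d+", raw)
    if match is None:
        return None
    value = float(match.group())
    return value if 0.0 <= value <= 1.0 else None


print(heuristic_prefilter("spam " * 20))  # {'alert': True, 'type': 'repetitive'}
print(parse_judge_score("Score: 0.8"))    # 0.8
print(parse_judge_score("7/10"))          # None
```

Returning `None` for an out-of-range verdict, rather than clamping it, avoids silently turning a reply such as "7/10" into a perfect score.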
Loop Detection
    class LoopDetector:
        """Detect infinite loops and token consumption spikes."""

        def __init__(
            self,
            max_iterations: int = 10,
            token_spike_multiplier: float = 3.0,
            baseline_tokens: int = 2000
        ):
            self.max_iterations = max_iterations
            self.token_spike_multiplier = token_spike_multiplier
            self.baseline_tokens = baseline_tokens
            self.iteration_count = 0
            self.total_tokens = 0

        def check(self, tokens_used: int) -> dict:
            self.iteration_count += 1
            self.total_tokens += tokens_used

            # Check iteration count
            if self.iteration_count > self.max_iterations:
                return {
                    "alert": True,
                    "type": "max_iterations",
                    "iterations": self.iteration_count,
                    "message": f"Agent exceeded {self.max_iterations} iterations"
                }

            # Check token spike
            expected_tokens = self.baseline_tokens * self.iteration_count
            if self.total_tokens > expected_tokens * self.token_spike_multiplier:
                return {
                    "alert": True,
                    "type": "token_spike",
                    "tokens": self.total_tokens,
                    "expected": expected_tokens,
                    "message": f"Token consumption spike: {self.total_tokens} vs expected {expected_tokens}"
                }
            return {"alert": False}
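Iteration and token counts catch a runaway agent only after it has burned through its budget. A complementary signal, sketched here as an assumption rather than something the source prescribes, is an agent that repeats the exact same tool call several times in a row. The class name `RepeatedActionDetector`, the `repeated_action` alert type, and the default of three repeats are all hypothetical.

```python
import json
from collections import deque
from typing import Deque


class RepeatedActionDetector:
    """Flag an agent that issues the same tool call several times in a row."""

    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        # Only the last `max_repeats` calls matter, so older ones are dropped.
        self.recent: Deque[str] = deque(maxlen=max_repeats)

    def check(self, tool_name: str, arguments: dict) -> dict:
        # Serialize with sorted keys so argument order does not matter;
        # default=str keeps non-JSON values from raising.
        fingerprint = json.dumps(
            [tool_name, arguments], sort_keys=True, default=str
        )
        self.recent.append(fingerprint)
        window_full = len(self.recent) == self.max_repeats
        if window_full and len(set(self.recent)) == 1:
            return {
                "alert": True,
                "type": "repeated_action",
                "tool": tool_name,
                "repeats": self.max_repeats,
            }
        return {"alert": False}


detector = RepeatedActionDetector(max_repeats=3)
results = [detector.check("search", {"q": "weather"}) for _ in range(3)]
print([r["alert"] for r in results])  # [False, False, True]
```

A changed argument resets the streak, so legitimate retries with a refined query do not trigger the alert.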
Statistical Baseline Anomaly Detection
    import numpy as np


    class BaselineAnomalyDetector:
        """Detect anomalies vs statistical baseline."""

        def __init__(self, window_size: int = 100, z_threshold: float = 3.0):
            self.window_size = window_size
            self.z_threshold = z_threshold
            self.history: list[float] = []

        def add_observation(self, value: float) -> dict:
            self.history.append(value)
            # Keep only the most recent `window_size` observations
            if len(self.history) > self.window_size:
                self.history = self.history[-self.window_size:]
            if len(self.history) < 10:
                return {"alert": False, "reason": "insufficient_data"}

            # Baseline excludes the newest value so it cannot mask itself
            mean = np.mean(self.history[:-1])
            std = np.std(self.history[:-1])
            if std == 0:
                return {"alert": False}

            z_score = abs(value - mean) / std
            if z_score > self.z_threshold:
                return {
                    "alert": True,
                    "type": "statistical_anomaly",
                    "z_score": z_score,
                    "value": value,
                    "mean": mean,
                    "std": std
                }
            return {"alert": False, "z_score": z_score}
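To make the threshold concrete, the sketch below works through the z-score arithmetic on a handful of latency samples using only the standard library. It uses the population standard deviation (`statistics.pstdev`), which corresponds to NumPy's default `np.std` with `ddof=0`. The sample values are invented for illustration and the helper name `z_score` is hypothetical.

```python
from statistics import fmean, pstdev


def z_score(value: float, baseline: list[float]) -> float:
    """Absolute z-score of `value` against `baseline` (population std)."""
    std = pstdev(baseline)
    if std == 0:
        # A flat baseline cannot be scored; report no deviation, mirroring
        # the detector above, which does not alert in this case.
        return 0.0
    return abs(value - fmean(baseline)) / std


# Ten made-up latencies in seconds: mean 1.0, population std 0.1.
baseline = [0.9, 1.1] * 5

print(round(z_score(1.2, baseline), 2))  # 2.0, below a threshold of 3.0
print(round(z_score(1.5, baseline), 2))  # 5.0, above a threshold of 3.0
```

With a mean of 1.0 and a standard deviation of 0.1, a 1.2 s response sits two standard deviations out and passes, while a 1.5 s response sits five out and would raise an alert.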
Key Decisions
| Decision | Recommendation |
|----------|----------------|
| Detection priority | Tool skipping > Gibberish > Loops > Anomalies |
| Quality check | LLM-as-judge with heuristic pre-filter |
| Loop threshold | 10 iterations or 3x baseline tokens |
| Anomaly threshold | Z-score > 3.0 (about 99.7% of normally distributed values fall within 3 standard deviations) |
| Alert strategy | Alert on silent failures, not just errors |
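When several detectors fire on the same run, the detection priority above suggests which alert to surface first. The following is one possible sketch of that ordering. It keys on the `type` strings returned by the detectors in this document, while the `PRIORITY` mapping and the helper name `most_urgent_alert` are assumptions made for illustration.

```python
from typing import Optional

# Lower number = higher priority, following
# Tool skipping > Gibberish > Loops > Anomalies.
PRIORITY = {
    "tool_skipping": 0,
    "too_short": 1,
    "repetitive": 1,
    "low_quality": 1,
    "max_iterations": 2,
    "token_spike": 2,
    "statistical_anomaly": 3,
}


def most_urgent_alert(results: list[dict]) -> Optional[dict]:
    """Return the highest-priority alerting result, or None if all passed."""
    alerts = [r for r in results if r.get("alert")]
    if not alerts:
        return None
    # Unknown alert types sort last instead of raising a KeyError.
    return min(alerts, key=lambda r: PRIORITY.get(r.get("type"), len(PRIORITY)))


results = [
    {"alert": False},
    {"alert": True, "type": "token_spike"},
    {"alert": True, "type": "tool_skipping", "missing": ["search"]},
]
print(most_urgent_alert(results)["type"])  # tool_skipping
```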
Silent Failure Types
| Type | Detection Method | Alert Priority |
|------|------------------|----------------|
| Tool Skipping | Expected vs actual tool calls | Critical |
| Gibberish Output | LLM-as-judge + heuristics | High |
| Infinite Loop | Iteration count + token spike | Critical |
| Quality Degradation | Score < baseline | Medium |
| Latency Spike | p99 > threshold | Medium |
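The latency row is the only detection method in the table without an accompanying snippet, so here is a minimal sketch of a p99 check. It uses the nearest-rank definition of a percentile; the function names, the `latency_spike` alert type, and the sample values are hypothetical.

```python
def p99(samples: list[float]) -> float:
    """Nearest-rank 99th percentile of `samples`."""
    if not samples:
        raise ValueError("p99 requires at least one sample")
    ordered = sorted(samples)
    # ceil(0.99 * n) in integer arithmetic to avoid float rounding surprises.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def check_latency(samples: list[float], threshold_s: float) -> dict:
    """Alert when the p99 latency exceeds `threshold_s` seconds."""
    value = p99(samples)
    if value > threshold_s:
        return {
            "alert": True,
            "type": "latency_spike",
            "p99": value,
            "threshold": threshold_s,
        }
    return {"alert": False, "p99": value}


# 100 made-up latencies: mostly fast, with two slow outliers.
samples = [0.5] * 98 + [4.0, 9.0]
print(check_latency(samples, threshold_s=2.0))
# {'alert': True, 'type': 'latency_spike', 'p99': 4.0, 'threshold': 2.0}
```

A mean over the same samples would be about 0.62 s and would pass a 2 s threshold, which is why the tail percentile is the more useful signal for a degraded but non-failing service.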
Anti-Patterns
❌ NEVER assume success if no error raised

    result = await agent.run()
    # Missing: quality check, tool usage check

❌ NEVER ignore abnormal patterns

    if len(response) > 0:  # "Not empty" is not "correct"
        return response

✅ ALWAYS validate tool usage

    expected_tools = ["search", "calculate"]
    tool_check = check_tool_usage(trace_id, expected_tools)
    if tool_check["alert"]:
        alert(tool_check)

✅ ALWAYS check output quality

    quality = await detect_gibberish(response)
    if quality["alert"]:
        fallback_to_human_review()
Detailed Documentation
| Resource | Description |
|----------|-------------|
| references/tool-skipping-detection.md | Agent tool usage monitoring patterns |
| references/gibberish-detection.md | Output quality scoring, LLM-as-judge |
| references/loop-detection.md | Token spikes, retry patterns, circuit breakers |
| references/baseline-comparison.md | Statistical anomaly detection |
| checklists/silent-failure-setup-checklist.md | Implementation checklist |
Related Skills
- langfuse-observability: Trace analysis for tool usage
- quality-gates: Quality threshold enforcement
- observability-monitoring: General alerting patterns
- advanced-guardrails: LLM output safety checks
Capability Details
tool-skipping
Keywords: tool skip, missing tool, agent tools, expected behavior
Solves:
- Detect when agents don't use expected tools
- Monitor agent behavior consistency
- Debug middleware interference (Akamai scenario)
gibberish-detection
Keywords: gibberish, nonsense, quality check, llm judge
Solves:
- Detect low-quality LLM outputs
- Identify repetitive or nonsensical responses
- Quality gate for production outputs
loop-detection
Keywords: infinite loop, retry loop, token spike, stuck agent
Solves:
- Detect agents stuck in loops
- Monitor token consumption anomalies
- Prevent runaway costs
baseline-anomaly
Keywords: anomaly, baseline, z-score, statistical, deviation
Solves:
- Detect deviations from normal behavior
- Statistical anomaly detection
- Early warning for silent failures
latency-monitoring
Keywords: latency, slow, p99, degraded, performance
Solves:
- Detect degraded but non-failing service
- Monitor response time anomalies
- SLO compliance for LLM calls