Silent Failure Detection

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "silent-failure-detection" with this command: npx skills add yonatangross/orchestkit/yonatangross-orchestkit-silent-failure-detection

Detect when LLM agents fail silently - appearing to work while producing incorrect results.

Overview

  • Detecting when agents skip expected tool calls

  • Identifying gibberish or degraded output quality

  • Monitoring for infinite loops and token consumption spikes

  • Setting up statistical baselines for anomaly detection

  • Alerting on non-error failures (service up but logic broken)

Quick Reference

Tool Skipping Detection

from langfuse import Langfuse

def check_tool_usage(trace_id: str, expected_tools: list[str]) -> dict:
    """
    Detect when agent skips expected tool calls.

    Based on Akamai's middleware bug: agents stopped using tools
    when hidden middleware injected unexpected instructions.
    """
    langfuse = Langfuse()
    # In the v2 Python SDK, fetch_trace returns a response wrapper;
    # the trace itself is on `.data`.
    trace = langfuse.fetch_trace(trace_id).data

    # Extract tool calls from trace. The "tool" type is an assumption about
    # how tool calls are recorded; adjust the filter to your instrumentation.
    actual_tools = [
        span.name for span in trace.observations
        if span.type == "tool"
    ]

    missing_tools = set(expected_tools) - set(actual_tools)

    if missing_tools:
        return {
            "alert": True,
            "type": "tool_skipping",
            "missing": sorted(missing_tools),
            "message": f"Agent skipped expected tools: {sorted(missing_tools)}"
        }
    return {"alert": False}
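The comparison at the heart of this check can be kept separate from trace fetching, which makes it testable without a Langfuse connection. The following is a minimal sketch, assuming tool names arrive as plain strings; `find_missing_tools` is a hypothetical helper, not part of the skill or of Langfuse.

```python
def find_missing_tools(expected_tools: list[str], actual_tools: list[str]) -> list[str]:
    """Return expected tools that never appear in the observed calls.

    Hypothetical helper: the result keeps the order of `expected_tools`
    so alerts are stable across runs (a plain set difference is unordered).
    """
    seen = set(actual_tools)
    return [tool for tool in expected_tools if tool not in seen]


# The agent was expected to search and then calculate, but only searched.
missing = find_missing_tools(["search", "calculate"], ["search", "search"])
print(missing)  # ['calculate']
```

Keeping the comparison pure means the Langfuse-specific code only has to produce a list of tool names.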

Gibberish/Quality Detection

from langfuse.decorators import observe, langfuse_context

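@observe(name="quality_check")
async def detect_gibberish(response: str) -> dict:
    """Detect low-quality or gibberish outputs using LLM-as-judge."""
    # Quick heuristics first
    if len(response.strip()) < 10:
        return {"alert": True, "type": "too_short"}

    # A low ratio of unique words to total words suggests repetition.
    # `words` is non-empty here because of the length check above.
    words = response.split()
    if len(set(words)) / len(words) < 0.3:
        return {"alert": True, "type": "repetitive"}

    # LLM-as-judge for quality
    judge_prompt = f"""
    Rate this response quality (0-1):
    - 0: Gibberish, nonsensical, or completely wrong
    - 0.5: Partially correct but missing key information
    - 1: High quality, accurate, complete

    Response: {response[:1000]}

    Score (just the number):
    """

    # `llm` is assumed to be an async LLM client defined elsewhere,
    # exposing generate(prompt) -> str.
    score = await llm.generate(judge_prompt)
    try:
        score_value = float(score.strip())
    except ValueError:
        # The judge did not return a bare number; treat that as a failure too.
        return {"alert": True, "type": "judge_unparseable"}

    # Attach the score to the current observation (v2 decorator API).
    langfuse_context.score_current_observation(name="quality_check", value=score_value)

    if score_value < 0.5:
        return {"alert": True, "type": "low_quality", "score": score_value}
    return {"alert": False, "score": score_value}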
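Judge models do not always reply with a bare number, so the score parsing step benefits from some tolerance. The sketch below is one possible approach, not the skill's own implementation: `parse_judge_score` is a hypothetical helper that takes the first number in the reply and rejects values outside the requested 0-1 range.

```python
import re
from typing import Optional


def parse_judge_score(raw: str) -> Optional[float]:
    """Extract a score in [0, 1] from a judge reply, or None if absent.

    Hypothetical helper: judges often wrap the number in text such as
    "Score: 0.8", which a bare float() call cannot parse.
    """
    # Match decimals like "0.8" or ".5" first, then plain integers.
    match = re.search(r"\d*\.\d+|\d+", raw)
    if match is None:
        return None
    value = float(match.group())
    # Values outside the requested 0-1 range are treated as unparseable.
    return value if 0.0 <= value <= 1.0 else None


print(parse_judge_score("Score: 0.8"))  # 0.8
print(parse_judge_score("I cannot rate this."))  # None
```

Taking the first number is a simplification; a reply such as "On a 0-1 scale: 0.7" would be read as 0, so a stricter prompt or structured output is still worth having.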

Loop Detection

class LoopDetector:
    """Detect infinite loops and token consumption spikes."""

    def __init__(
        self,
        max_iterations: int = 10,
        token_spike_multiplier: float = 3.0,
        baseline_tokens: int = 2000
    ):
        self.max_iterations = max_iterations
        self.token_spike_multiplier = token_spike_multiplier
        self.baseline_tokens = baseline_tokens
        self.iteration_count = 0
        self.total_tokens = 0

    def check(self, tokens_used: int) -> dict:
        self.iteration_count += 1
        self.total_tokens += tokens_used

        # Check iteration count
        if self.iteration_count > self.max_iterations:
            return {
                "alert": True,
                "type": "max_iterations",
                "iterations": self.iteration_count,
                "message": f"Agent exceeded {self.max_iterations} iterations"
            }

        # Check token spike
        expected_tokens = self.baseline_tokens * self.iteration_count
        if self.total_tokens > expected_tokens * self.token_spike_multiplier:
            return {
                "alert": True,
                "type": "token_spike",
                "tokens": self.total_tokens,
                "expected": expected_tokens,
                "message": f"Token consumption spike: {self.total_tokens} vs expected {expected_tokens}"
            }

        return {"alert": False}
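Iteration and token limits only fire once a budget is exhausted. A complementary check, sketched here as an assumption rather than as part of the original skill, is to flag an agent that issues the same call several times in a row. `RepeatedCallDetector` and its call-signature format are hypothetical.

```python
from collections import deque


class RepeatedCallDetector:
    """Flag an agent that issues the same call several times in a row.

    Hypothetical sketch: a "call" is any hashable signature, such as a
    (tool_name, serialized_args) tuple.
    """

    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        # Only the most recent `max_repeats` signatures are kept.
        self.recent = deque(maxlen=max_repeats)

    def check(self, signature) -> dict:
        self.recent.append(signature)
        window_full = len(self.recent) == self.max_repeats
        if window_full and len(set(self.recent)) == 1:
            return {"alert": True, "type": "repeated_call", "call": signature}
        return {"alert": False}


detector = RepeatedCallDetector(max_repeats=3)
for _ in range(3):
    result = detector.check(("search", '{"q": "weather"}'))
print(result["alert"])  # True
```

This catches a stuck agent after three identical calls, well before a limit of 10 iterations would.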

Statistical Baseline Anomaly Detection

import numpy as np

class BaselineAnomalyDetector:
    """Detect anomalies vs statistical baseline."""

    def __init__(self, window_size: int = 100, z_threshold: float = 3.0):
        self.window_size = window_size
        self.z_threshold = z_threshold
        self.history = []

    def add_observation(self, value: float) -> dict:
        self.history.append(value)
        if len(self.history) > self.window_size:
            self.history = self.history[-self.window_size:]

        if len(self.history) < 10:
            return {"alert": False, "reason": "insufficient_data"}

        # Baseline excludes the value just added
        mean = np.mean(self.history[:-1])
        std = np.std(self.history[:-1])

        if std == 0:
            return {"alert": False}

        z_score = abs(value - mean) / std

        if z_score > self.z_threshold:
            return {
                "alert": True,
                "type": "statistical_anomaly",
                "z_score": z_score,
                "value": value,
                "mean": mean,
                "std": std
            }
        return {"alert": False, "z_score": z_score}
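To make the z-score arithmetic concrete, the sketch below works through one example using only the standard library. The latency samples are made up for illustration, and `z_score` is a hypothetical standalone function rather than part of the detector above.

```python
import statistics


def z_score(history: list[float], value: float) -> float:
    """Absolute z-score of `value` against `history`.

    Uses the population standard deviation, matching NumPy's default
    for np.std (ddof=0).
    """
    mean = statistics.fmean(history)
    std = statistics.pstdev(history)
    if std == 0:
        raise ValueError("history has zero variance; z-score is undefined")
    return abs(value - mean) / std


# Hypothetical latency samples in milliseconds, followed by a spike.
baseline = [100, 102, 98, 101, 99, 100, 103, 97, 100, 100]
print(round(z_score(baseline, 150), 1))  # 29.9
```

Here the baseline mean is 100 and the standard deviation is about 1.67, so a value of 150 sits far beyond the 3.0 threshold.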

Key Decisions

| Decision | Recommendation |
|----------|----------------|
| Detection priority | Tool skipping > Gibberish > Loops > Anomalies |
| Quality check | LLM-as-judge with heuristic pre-filter |
| Loop threshold | 10 iterations or 3x baseline tokens |
| Anomaly threshold | Z-score > 3.0 (about 99.7% of values fall within 3 standard deviations if the metric is roughly normal) |
| Alert strategy | Alert on silent failures, not just errors |
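When several checks fire at once, the detection priority above suggests which alert to surface first. The following is a minimal sketch of that ordering; the check names in `PRIORITY` and the `highest_priority_alert` function are hypothetical and assume each check returns a dict with an "alert" key, as the Quick Reference functions do.

```python
from typing import Optional

# Highest priority first, mirroring the detection priority recommendation.
PRIORITY = ("tool_skipping", "gibberish", "loop", "anomaly")


def highest_priority_alert(results: dict[str, dict]) -> Optional[dict]:
    """Return the alerting result that ranks highest in PRIORITY, if any."""
    for check_name in PRIORITY:
        result = results.get(check_name)
        if result and result.get("alert"):
            return {"check": check_name, **result}
    return None


results = {
    "anomaly": {"alert": True, "z_score": 4.2},
    "gibberish": {"alert": False, "score": 0.9},
    "tool_skipping": {"alert": True, "missing": ["calculate"]},
}
print(highest_priority_alert(results)["check"])  # tool_skipping
```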

Silent Failure Types

| Type | Detection Method | Alert Priority |
|------|------------------|----------------|
| Tool Skipping | Expected vs actual tool calls | Critical |
| Gibberish Output | LLM-as-judge + heuristics | High |
| Infinite Loop | Iteration count + token spike | Critical |
| Quality Degradation | Score < baseline | Medium |
| Latency Spike | p99 > threshold | Medium |
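The latency spike row has no counterpart in the Quick Reference, so here is one way the p99 check could look. This is a sketch under stated assumptions: latencies are collected in milliseconds, the percentile uses the nearest-rank method, and `p99` and `check_latency` are hypothetical names.

```python
def p99(latencies_ms: list[float]) -> float:
    """99th percentile using the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_ms)
    # 1-based rank = ceil(0.99 * n), computed with integers to avoid
    # floating-point rounding.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def check_latency(latencies_ms: list[float], threshold_ms: float) -> dict:
    """Alert when the p99 latency exceeds the threshold."""
    value = p99(latencies_ms)
    if value > threshold_ms:
        return {"alert": True, "type": "latency_spike", "p99_ms": value}
    return {"alert": False, "p99_ms": value}


# 98 fast requests and 2 slow ones: the p99 lands on a slow request.
samples = [200.0] * 98 + [5000.0] * 2
print(check_latency(samples, threshold_ms=2000.0)["alert"])  # True
```

With only one slow request in 100, the nearest-rank p99 would still be 200 ms, which is why a p99 check is usually paired with a maximum or a z-score check.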

Anti-Patterns

# ❌ NEVER assume success if no error raised
result = await agent.run()
# Missing: quality check, tool usage check

# ❌ NEVER ignore abnormal patterns
if len(response) > 0:  # "Not empty" is not "correct"
    return response

# ✅ ALWAYS validate tool usage
expected_tools = ["search", "calculate"]
tool_check = check_tool_usage(trace_id, expected_tools)
if tool_check["alert"]:
    alert(tool_check)

# ✅ ALWAYS check output quality
quality = await detect_gibberish(response)
if quality["alert"]:
    fallback_to_human_review()

Detailed Documentation

| Resource | Description |
|----------|-------------|
| references/tool-skipping-detection.md | Agent tool usage monitoring patterns |
| references/gibberish-detection.md | Output quality scoring, LLM-as-judge |
| references/loop-detection.md | Token spikes, retry patterns, circuit breakers |
| references/baseline-comparison.md | Statistical anomaly detection |
| checklists/silent-failure-setup-checklist.md | Implementation checklist |

Related Skills

  • langfuse-observability: Trace analysis for tool usage

  • quality-gates: Quality threshold enforcement

  • observability-monitoring: General alerting patterns

  • advanced-guardrails: LLM output safety checks

Capability Details

tool-skipping

Keywords: tool skip, missing tool, agent tools, expected behavior

Solves:

  • Detect when agents don't use expected tools

  • Monitor agent behavior consistency

  • Debug middleware interference (Akamai scenario)

gibberish-detection

Keywords: gibberish, nonsense, quality check, llm judge

Solves:

  • Detect low-quality LLM outputs

  • Identify repetitive or nonsensical responses

  • Quality gate for production outputs

loop-detection

Keywords: infinite loop, retry loop, token spike, stuck agent

Solves:

  • Detect agents stuck in loops

  • Monitor token consumption anomalies

  • Prevent runaway costs

baseline-anomaly

Keywords: anomaly, baseline, z-score, statistical, deviation

Solves:

  • Detect deviations from normal behavior

  • Statistical anomaly detection

  • Early warning for silent failures

latency-monitoring

Keywords: latency, slow, p99, degraded, performance

Solves:

  • Detect degraded but non-failing service

  • Monitor response time anomalies

  • SLO compliance for LLM calls

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
