llm-evaluator

LLM-as-a-Judge evaluator via Langfuse. Scores traces on relevance, accuracy, hallucination, and helpfulness using GPT-5-nano as judge. Supports single trace scoring, batch backfill, and test mode. Integrates with Langfuse dashboard for observability. Triggers: evaluate trace, score quality, check accuracy, backfill scores, test evaluator, LLM judge.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Installation

Install the skill with:

npx skills add aiwithabidi/llm-evaluator-pro

LLM Evaluator ⚖️

LLM-as-a-Judge evaluation system powered by Langfuse. Uses GPT-5-nano to score AI outputs.

When to Use

  • Evaluating quality of search results or AI responses
  • Scoring traces for relevance, accuracy, hallucination detection
  • Batch scoring recent unscored traces
  • Quality assurance on agent outputs

Usage

```shell
# Test with sample cases
python3 {baseDir}/scripts/evaluator.py test

# Score a specific Langfuse trace
python3 {baseDir}/scripts/evaluator.py score <trace_id>

# Score with a specific evaluator only
python3 {baseDir}/scripts/evaluator.py score <trace_id> --evaluators relevance

# Backfill scores on recent unscored traces
python3 {baseDir}/scripts/evaluator.py backfill --limit 20
```

Evaluators

| Evaluator     | Measures                           | Scale |
|---------------|------------------------------------|-------|
| relevance     | Response relevance to the query    | 0–1   |
| accuracy      | Factual correctness                | 0–1   |
| hallucination | Detection of made-up information   | 0–1   |
| helpfulness   | Overall usefulness                 | 0–1   |
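As a rough illustration of how an LLM-as-a-Judge evaluator like the ones above typically works, the sketch below builds a judge prompt, sends it to a model, and parses a 0–1 score from the reply. Everything here (prompt wording, the `call_judge` hook, the parsing rules) is an assumption for illustration, not the skill's actual code.

```python
import re

# Hypothetical LLM-as-a-Judge scorer sketch. `call_judge` stands in for a
# real model API call (e.g. to GPT-5-nano); swap in your own client.

JUDGE_PROMPT = (
    "You are an evaluator. Rate the {name} of the RESPONSE to the QUERY "
    "on a scale from 0 to 1. Reply with only the number.\n\n"
    "QUERY: {query}\nRESPONSE: {response}"
)

def parse_score(raw: str) -> float:
    """Extract the first number in the judge's reply, clamped to [0, 1]."""
    match = re.search(r"\d*\.?\d+", raw)
    if match is None:
        raise ValueError(f"no numeric score in judge output: {raw!r}")
    return max(0.0, min(1.0, float(match.group())))

def evaluate(query: str, response: str, name: str, call_judge) -> float:
    """Score one (query, response) pair with one named evaluator."""
    prompt = JUDGE_PROMPT.format(name=name, query=query, response=response)
    return parse_score(call_judge(prompt))

if __name__ == "__main__":
    # Stub judge so the sketch runs without API credentials.
    fake_judge = lambda prompt: "Score: 0.85"
    print(evaluate("capital of France?", "Paris", "accuracy", fake_judge))
```

Clamping and strict number extraction matter in practice: judge models occasionally wrap the score in prose or drift outside the requested range, and a batch backfill should fail loudly rather than record garbage scores.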

Credits

Built by M. Abidi (agxntsix.ai | YouTube | GitHub). Part of the AgxntSix Skill Suite for OpenClaw agents.

📅 Need help setting up OpenClaw for your business? Book a free consultation

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

General

Multi-Skill-Eval | Integrated Skill Evaluation System

Integrated multi-method skill evaluation system. Combines static analysis (skill-assessment), rubric-based quality scoring (skill-evaluator), and autonomous benchmarking (skill-eval). Used to comprehensively evaluate, compare, audit, or improve OpenClaw skills. Covers documentation completeness, code quality, 25-item rubric scoring, and multi-model benchmarking. Triggers (Chinese): 评估技...

Coding

SkillClinic

AI skill health-check diagnostics: inspects Gene structure integrity, trigger configuration, and content quality.

Research

Skill Quality Evaluator

Skill Quality Evaluator - Score any skill on 6 dimensions. Catch 30% of skills that look good but fail silently. Based on Tessl Research 2026 findings.

Research

PluginEval Core

Self-contained PluginEval quality evaluation engine. Measures 6 dimensions, detects anti-patterns, assigns badges. No external dependencies.
