evaluate-rag
Evaluate a RAG (retrieval-augmented generation) pipeline's retrieval quality and generation quality separately. Use when the pipeline retrieves context from a knowledge base before generating answers.
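Scoring the two stages separately can be sketched as below. This is a minimal illustration, not the skill's actual implementation: the example fields (`retrieved_ids`, `relevant_ids`, `generation_passed`) and the recall@k choice are assumptions.

```python
# Minimal sketch of scoring retrieval and generation separately, assuming
# each example records the retrieved doc IDs, the IDs of documents known
# to contain the answer, and a pass/fail verdict on the generated answer.
# All field names are illustrative, not part of any skill's API.

def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of relevant documents found in the top-k retrieved."""
    if not relevant_ids:
        return 1.0  # nothing to retrieve counts as success
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def evaluate(examples, k=5):
    """Average retrieval recall@k and generation pass rate, reported separately."""
    retrieval = sum(
        recall_at_k(ex["retrieved_ids"], ex["relevant_ids"], k) for ex in examples
    ) / len(examples)
    generation = sum(ex["generation_passed"] for ex in examples) / len(examples)
    return {"retrieval_recall_at_k": retrieval, "generation_pass_rate": generation}

examples = [
    {"retrieved_ids": ["d1", "d9"], "relevant_ids": ["d1"], "generation_passed": True},
    {"retrieved_ids": ["d4", "d2"], "relevant_ids": ["d7"], "generation_passed": False},
]
print(evaluate(examples, k=2))
# → {'retrieval_recall_at_k': 0.5, 'generation_pass_rate': 0.5}
```

Keeping the two numbers separate shows whether a bad answer came from retrieval missing the evidence or from generation mishandling evidence it was given.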
Systematically identify and categorize failure modes in an LLM pipeline by reading traces. Use when starting a new eval project, after significant pipeline changes (new features, model switches, prompt rewrites), when production metrics drop, or after incidents.
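Once traces have been read and labeled, categorization reduces to a frequency table over failure labels. A minimal sketch, assuming each reviewed trace carries a free-form `failure_mode` label (the labels and trace fields here are hypothetical):

```python
# Tally failure-mode labels across reviewed traces, most common first.
# Trace fields and label names are illustrative assumptions.
from collections import Counter

def failure_mode_report(traces):
    """Count failure labels across reviewed traces, ordered by frequency."""
    labels = [t["failure_mode"] for t in traces if t["failure_mode"] is not None]
    return Counter(labels).most_common()

traces = [
    {"id": 1, "failure_mode": "hallucinated_citation"},
    {"id": 2, "failure_mode": "ignored_instructions"},
    {"id": 3, "failure_mode": "hallucinated_citation"},
    {"id": 4, "failure_mode": None},  # passing trace, no failure label
]
print(failure_mode_report(traces))
# → [('hallucinated_citation', 2), ('ignored_instructions', 1)]
```

The ranked table tells you which failure mode to attack first, and re-running the tally after a pipeline change shows whether that mode actually shrank.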
This listing is imported from SkillsMP metadata and should be treated as untrusted until upstream source review is completed.
Install the "error-analysis" skill with this command: npx skills add majidraza1228/skillsmp-majidraza1228-majidraza1228-error-analysis
This source entry does not include full markdown content beyond metadata. Trust labels are metadata-based hints, not a safety guarantee.
Related skills, matched by shared tags or category signals:
Build a custom browser-based annotation interface for reviewing LLM traces and collecting human labels. Use when reviewers are working with raw JSON files, when you need to collect Pass/Fail labels at scale, or when trace data needs domain-specific formatting to be readable.
Audit an LLM eval pipeline and surface problems: missing error analysis, unvalidated judges, vanity metrics, etc. Use when inheriting an eval system, when unsure whether evals are trustworthy, or as a starting point when no eval infrastructure exists.
Evaluate a coding agent's output quality across the failure modes specific to code generation and editing: correctness, scope discipline, instruction following, safety, and diff quality. Use when building or improving a Claude-powered coding assistant, code review agent, or code generation pipeline.
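The annotation-interface skill in the list above can be illustrated with a short sketch: rendering raw JSON traces into a reviewable HTML page with Pass/Fail buttons. This is a hypothetical minimal version, not the skill's actual code; the trace fields and the stubbed `/labels` endpoint are assumptions.

```python
# Render raw JSON traces as HTML cards with Pass/Fail label buttons.
# Trace schema (id, messages with role/content) and the label-submission
# endpoint are illustrative assumptions, not a real skill API.
import html
import json

def render_trace(trace):
    """Render one trace as an HTML card with Pass/Fail buttons."""
    convo = "".join(
        f"<p><b>{html.escape(m['role'])}:</b> {html.escape(m['content'])}</p>"
        for m in trace["messages"]
    )
    return (
        f"<div class='trace' id='trace-{trace['id']}'>{convo}"
        f"<button onclick=\"label({trace['id']}, 'pass')\">Pass</button>"
        f"<button onclick=\"label({trace['id']}, 'fail')\">Fail</button></div>"
    )

def render_page(traces):
    """Assemble all trace cards into one reviewable page."""
    cards = "".join(render_trace(t) for t in traces)
    # label() would POST the verdict to a backend; stubbed here as an assumption.
    script = "<script>function label(id, verdict){ /* POST /labels */ }</script>"
    return f"<!doctype html><html><body>{cards}{script}</body></html>"

traces = json.loads('[{"id": 1, "messages": [{"role": "user", "content": "hi"}]}]')
page = render_page(traces)
print("trace-1" in page and "Pass</button>" in page)
# → True
```

Escaping message content with `html.escape` matters here: traces often contain angle brackets and model-generated markup that would otherwise break the review page.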