rag-auditor

Evaluates RAG (Retrieval-Augmented Generation) pipeline quality across retrieval and generation stages. Measures precision, recall, MRR for retrieval; groundedness, completeness, and hallucination rate for generation. Diagnoses failure root causes and recommends chunk, retrieval, and prompt improvements. Triggers on: "audit RAG pipeline", "RAG quality", "evaluate RAG retrieval", "hallucination detection", "retrieval precision", "why is RAG failing", "RAG diagnosis", "retrieval quality", "RAG evaluation", "chunk quality", "RAG pipeline review", "grounding check". Use this skill when diagnosing or evaluating a RAG pipeline's quality. For general architecture or system audits, use architecture-reviewer instead.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install the skill with: `npx skills add mathews-tom/praxis-skills/mathews-tom-praxis-skills-rag-auditor`

RAG Auditor

Systematic RAG pipeline evaluation across the full retrieval-generation chain: designs evaluation query sets, measures retrieval metrics (Precision@K, Recall@K, MRR), evaluates generation quality (groundedness, completeness, hallucination rate), diagnoses component-level failures, and recommends targeted improvements.

Reference Files

| File | Contents | Load When |
|------|----------|-----------|
| references/retrieval-metrics.md | Precision@K, Recall@K, MRR, NDCG definitions and calculation | Always |
| references/generation-metrics.md | Groundedness, completeness, hallucination detection methods | Generation evaluation needed |
| references/failure-taxonomy.md | RAG failure categories: retrieval, generation, chunking, embedding | Failure diagnosis needed |
| references/diagnostic-queries.md | Designing evaluation query sets, known-answer questions, difficulty levels | Evaluation setup |

Prerequisites

  • Access to the RAG pipeline (or its outputs for post-hoc evaluation)
  • A set of test queries with known-correct answers
  • Understanding of the pipeline components (embedding model, retriever, generator)

Workflow

Phase 1: Pipeline Inventory

Document the RAG pipeline configuration:

  1. Document source — What documents are indexed? Format, count, size.
  2. Chunking — Strategy (fixed-size, semantic, paragraph), chunk size, overlap.
  3. Embedding — Model name and version, dimensionality.
  4. Vector store — Type (FAISS, Pinecone, Chroma, pgvector), index type.
  5. Retrieval — Method (similarity, hybrid, reranking), top-K parameter.
  6. Generation — Model, prompt template, context window usage.
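The inventory can be captured in a small record so later phases can reference it consistently. The field names and example values below are illustrative, not a required schema:

```python
from dataclasses import dataclass

@dataclass
class PipelineConfig:
    """Snapshot of the RAG pipeline under audit (all fields illustrative)."""
    doc_count: int
    doc_format: str
    chunk_strategy: str    # "fixed", "semantic", or "paragraph"
    chunk_size: int        # tokens
    chunk_overlap: int     # tokens
    embedding_model: str
    embedding_dim: int
    vector_store: str      # e.g. "FAISS", "pgvector"
    retrieval_method: str  # "similarity", "hybrid", "rerank"
    top_k: int
    generator_model: str
    prompt_template: str

# Example inventory for a hypothetical pipeline
config = PipelineConfig(
    doc_count=1200, doc_format="PDF",
    chunk_strategy="fixed", chunk_size=512, chunk_overlap=64,
    embedding_model="text-embedding-3-small", embedding_dim=1536,
    vector_store="pgvector",
    retrieval_method="hybrid", top_k=5,
    generator_model="gpt-4o-mini",
    prompt_template="Answer using only the provided context.",
)
```

Keeping the configuration in one place makes it easy to include verbatim in the audit report's Pipeline Configuration table.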

Phase 2: Design Evaluation Queries

Create a diverse set of test queries:

| Query Type | Purpose | Count |
|------------|---------|-------|
| Known-answer (factoid) | Measure retrieval + generation accuracy | 10+ |
| Multi-hop | Require combining info from multiple chunks | 5+ |
| Unanswerable | Not in the corpus — should abstain | 3+ |
| Ambiguous | Multiple valid interpretations | 3+ |
| Recent/updated | Test freshness | 2+ |

For each query, document the expected answer and the source chunk(s).

Phase 3: Evaluate Retrieval

For each test query, measure:

  1. Precision@K — Of the K retrieved chunks, how many are relevant?
  2. Recall@K — Of all relevant chunks in the corpus, how many were retrieved?
  3. MRR (Mean Reciprocal Rank) — How high is the first relevant chunk ranked?
  4. Chunk relevance — Score each retrieved chunk: Relevant, Partially Relevant, Irrelevant.
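The three ranking metrics are a few lines of Python each. Here `retrieved` is the ranked list of chunk ids the retriever returned and `relevant` is the gold set for that query:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved chunk ids that are relevant."""
    top = retrieved[:k]
    return sum(1 for c in top if c in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant chunk ids that appear in the top-k."""
    if not relevant:
        return 0.0
    top = retrieved[:k]
    return sum(1 for c in relevant if c in top) / len(relevant)

def mrr(ranked_lists, relevant_sets):
    """Mean reciprocal rank of the first relevant chunk across queries."""
    total = 0.0
    for retrieved, relevant in zip(ranked_lists, relevant_sets):
        for rank, chunk in enumerate(retrieved, start=1):
            if chunk in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)
```

For example, with `retrieved = ["c2", "c7", "c1"]` and `relevant = {"c1", "c9"}`, Precision@3 is 1/3, Recall@3 is 1/2, and the reciprocal rank is 1/3 (first hit at rank 3).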

Phase 4: Evaluate Generation

For each test query with retrieved context:

  1. Groundedness — Is every claim in the response supported by the retrieved context? Score: 0 (hallucinated) to 1 (fully grounded).
  2. Completeness — Does the response use all relevant information from the context? Score: 0 (ignored context) to 1 (complete).
  3. Hallucination detection — Identify specific claims not supported by context.
  4. Abstention — For unanswerable queries, does the model correctly say "I don't know"?
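Groundedness scoring normally requires an NLI model or LLM judge (see references/generation-metrics.md). Purely as an illustration of the shape of the computation, a crude lexical proxy might look like this; the 0.5 threshold and the length-based word filter are arbitrary choices, not a recommended method:

```python
def groundedness(claims, context, threshold=0.5):
    """Crude lexical proxy: a claim counts as grounded if at least
    `threshold` of its content words (longer than 3 chars) appear in
    the retrieved context. Returns the supported fraction of claims."""
    context_words = set(context.lower().split())
    supported = 0
    for claim in claims:
        words = [w for w in claim.lower().split() if len(w) > 3]
        if words and sum(1 for w in words if w in context_words) / len(words) >= threshold:
            supported += 1
    return supported / len(claims) if claims else 0.0

ctx = "The warranty policy was updated in March 2024 by the legal team."
claims = [
    "The warranty policy was updated in March 2024",  # grounded
    "The CEO approved it personally",                 # not in context
]
```

Here one of two claims is supported, giving a score of 0.5; a real judge would additionally catch paraphrases and contradictions that lexical overlap misses.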

Phase 5: Diagnose Failures

For every incorrect or low-quality response, classify the root cause:

| Failure Type | Diagnosis | Indicator |
|--------------|-----------|-----------|
| Retrieval failure | Relevant chunks not retrieved | Low Recall@K |
| Ranking failure | Relevant chunk retrieved but ranked low | Low MRR, high Recall |
| Chunk boundary issue | Answer split across chunk boundaries | Partial matches in multiple chunks |
| Embedding mismatch | Query semantics don't match chunk embeddings | Relevant chunk has low similarity score |
| Generation failure | Correct context but wrong answer | High retrieval scores, low groundedness |
| Hallucination | Model invents facts not in context | Claims not traceable to any chunk |
| Over-abstention | Model refuses to answer when context is sufficient | Unanswered with relevant context present |
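The taxonomy can be approximated as a decision rule over per-query metrics, checked in order of the pipeline stages. The 0.5 thresholds below are placeholders; calibrate them against your own targets:

```python
def classify_failure(recall_k, mrr_score, groundedness_score,
                     abstained, context_sufficient):
    """Map per-query metrics to a coarse failure label (illustrative
    thresholds). Retrieval issues are checked before generation issues,
    since generation quality cannot exceed retrieval quality."""
    if recall_k < 0.5:
        return "retrieval_failure"      # relevant chunks never arrived
    if mrr_score < 0.5:
        return "ranking_failure"        # arrived, but ranked too low
    if abstained and context_sufficient:
        return "over_abstention"        # refused despite usable context
    if groundedness_score < 0.5:
        return "hallucination"          # answered, but not from context
    return "ok"
```

Chunk-boundary and embedding-mismatch diagnoses still need manual inspection of the retrieved chunks; metrics alone cannot distinguish them from a plain retrieval failure.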

Phase 6: Recommendations

Based on failure analysis, recommend specific improvements:

| Failure Pattern | Recommendation |
|-----------------|----------------|
| Chunk boundary issues | Increase overlap, try semantic chunking |
| Low Precision@K | Reduce K, add reranking stage |
| Low Recall@K | Increase K, try hybrid search |
| Embedding mismatch | Try different embedding model, add query expansion |
| Hallucination | Strengthen grounding instruction in prompt, reduce temperature |
| Over-abstention | Soften abstention criteria in prompt |
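One way to turn the table into the prioritized list the report expects is a lookup keyed by failure counts; the mapping keys are shorthand labels for this sketch, not a standard taxonomy:

```python
# Failure pattern -> candidate fixes, mirroring the table above.
RECOMMENDATIONS = {
    "chunk_boundary": ["increase chunk overlap", "try semantic chunking"],
    "low_precision": ["reduce K", "add a reranking stage"],
    "low_recall": ["increase K", "try hybrid search"],
    "embedding_mismatch": ["try a different embedding model",
                           "add query expansion"],
    "hallucination": ["strengthen grounding instruction in the prompt",
                      "reduce temperature"],
    "over_abstention": ["soften abstention criteria in the prompt"],
}

def recommend(failure_counts):
    """Rank fixes by how many observed failures each pattern accounts for.
    `failure_counts` maps a pattern label to its failure count."""
    ranked = sorted(failure_counts.items(), key=lambda kv: -kv[1])
    return [(pattern, RECOMMENDATIONS[pattern], n) for pattern, n in ranked]
```

For example, `recommend({"low_recall": 4, "hallucination": 7})` puts the hallucination fixes first because they address the most failures.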

Output Format

## RAG Audit Report

### Pipeline Configuration
| Component | Value |
|-----------|-------|
| Documents | {N} ({format}) |
| Chunking | {strategy}, {size} tokens, {overlap}% overlap |
| Embedding | {model} ({dimensions}d) |
| Retrieval | {method}, K={N} |
| Generation | {model}, temperature={T} |

### Evaluation Dataset
- **Total queries:** {N}
- **Known-answer:** {N}
- **Multi-hop:** {N}
- **Unanswerable:** {N}

### Retrieval Quality

| Metric | Score | Target | Status |
|--------|-------|--------|--------|
| Precision@{K} | {score} | {target} | {Pass/Fail} |
| Recall@{K} | {score} | {target} | {Pass/Fail} |
| MRR | {score} | {target} | {Pass/Fail} |

### Generation Quality

| Metric | Score | Target | Status |
|--------|-------|--------|--------|
| Groundedness | {score} | {target} | {Pass/Fail} |
| Completeness | {score} | {target} | {Pass/Fail} |
| Hallucination rate | {score} | {target} | {Pass/Fail} |
| Abstention accuracy | {score} | {target} | {Pass/Fail} |

### Failure Analysis

| # | Query | Failure Type | Root Cause | Recommendation |
|---|-------|-------------|------------|----------------|
| 1 | {query} | {type} | {cause} | {fix} |

### Recommendations (Priority Order)
1. **{Recommendation}** — addresses {N} failures, expected impact: {description}
2. **{Recommendation}** — addresses {N} failures, expected impact: {description}

### Sample Failures

#### Query: "{query}"
- **Expected:** {answer}
- **Retrieved chunks:** {chunk summaries with relevance scores}
- **Generated:** {response}
- **Issue:** {diagnosis}

Calibration Rules

  1. Component isolation. Evaluate retrieval and generation independently. A strong retriever paired with a weak generator is easily misdiagnosed as a retrieval failure if you only inspect the end output.
  2. Known answers first. Start with factoid questions where the correct answer is unambiguous. Multi-hop and ambiguous queries are harder to evaluate.
  3. Quantify, don't qualify. "Retrieval is bad" is not a finding. "Precision@5 is 0.3 (target: 0.8) with 70% of failures due to chunk boundary splits" is actionable.
  4. Sample failures deeply. Aggregate metrics identify WHERE the problem is. Individual failure analysis identifies WHY.

Error Handling

| Problem | Resolution |
|---------|------------|
| No known-answer queries available | Help design them from the document corpus. Pick 10 facts and formulate questions. |
| Pipeline access not available | Work from recorded inputs/outputs. Post-hoc evaluation is possible with query-context-response triples. |
| Corpus is too large to review | Sample-based evaluation. Select representative documents and generate queries from them. |
| Multiple failure types co-exist | Address retrieval failures first. Generation quality cannot exceed retrieval quality. |

When NOT to Audit

Push back if:

  • The pipeline hasn't been built yet — design it first, audit after
  • The corpus has fewer than 10 documents — too small for meaningful retrieval evaluation
  • The user wants to compare embedding models — that's a benchmark task, not an audit

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
