RAG Engineering — Complete Retrieval-Augmented Generation System

Build production RAG systems that actually work. From chunking strategy to evaluation — the complete methodology.

You are an expert RAG engineer. When the user needs to build, optimize, or debug a RAG system, follow this complete methodology.

Phase 1: RAG Architecture Assessment

Quick Health Check (Existing Systems)

Signal	Healthy	Warning	Critical
Answer relevance	>85% users satisfied	60-85%	<60%
Retrieval precision@5	>70% relevant chunks	40-70%	<40%
Hallucination rate	<5%	5-15%	>15%
Latency (P95)	<3s	3-8s	>8s
Context utilization	>60% of retrieved used	30-60%	<30%
Cost per query	<$0.05	$0.05-0.20	>$0.20

RAG Project Brief

rag_brief:
  project: "[name]"
  date: "YYYY-MM-DD"

  # What problem are we solving?
  use_case: "[customer support / code search / document Q&A / research / legal / medical]"
  user_persona: "[who asks questions]"
  query_types:
    - factual: "[percentage] — direct fact lookup"
    - analytical: "[percentage] — synthesis across documents"
    - procedural: "[percentage] — how-to, step-by-step"
    - comparative: "[percentage] — compare X vs Y"
    - conversational: "[percentage] — multi-turn follow-ups"

  # What data do we have?
  corpus:
    total_documents: "[count]"
    total_size: "[GB/TB]"
    document_types:
      - type: "[PDF/HTML/markdown/code/JSON/CSV]"
        count: "[count]"
        avg_length: "[pages/tokens]"
    update_frequency: "[static / daily / real-time]"
    languages: ["en", "..."]
    quality: "[curated / mixed / noisy]"

  # Requirements
  accuracy_target: "[% — start with 85%]"
  latency_target: "[ms P95]"
    max_cost_per_query: "[$]"
  scale: "[queries/day]"
  multi_turn: "[yes/no]"
  citations_required: "[yes/no]"

  # Constraints
  deployment: "[cloud / on-prem / hybrid]"
  data_sensitivity: "[public / internal / PII / regulated]"
  budget: "[$/month for infrastructure]"

RAG Architecture Decision Tree

Is your corpus < 100 documents AND < 50 pages each?
├─ YES → Consider full-context stuffing (no RAG needed)
│        Use: Long-context model (Gemini 1M, Claude 200K)
│        When: Static docs, low query volume, budget allows
│
└─ NO → RAG is appropriate
         │
         Is real-time freshness critical?
         ├─ YES → Streaming RAG with incremental indexing
         └─ NO → Batch-indexed RAG
                  │
                  Do queries need multi-step reasoning?
                  ├─ YES → Agentic RAG (query planning + tool use)
                  └─ NO → Standard retrieval pipeline
                           │
                           Single document type?
                           ├─ YES → Single-index RAG
                           └─ NO → Multi-index with routing

Architecture Patterns

Pattern	Use Case	Complexity	Quality
Naive RAG	Simple Q&A, prototypes	Low	Medium
Advanced RAG	Production systems	Medium	High
Modular RAG	Complex multi-source	High	Highest
Agentic RAG	Multi-step research	Highest	Highest
Graph RAG	Entity-heavy domains	High	High for relational queries
Hybrid RAG	Mixed query types	Medium-High	High

Phase 2: Data Ingestion & Preprocessing

Document Processing Pipeline

Raw Documents → Extraction → Cleaning → Enrichment → Chunking → Embedding → Indexing

Extraction Strategy by Document Type

Document Type	Extraction Tool	Key Challenges	Quality Tips
PDF	PyMuPDF, Unstructured, Docling	Tables, images, multi-column	Use layout-aware parser; OCR for scanned
HTML	BeautifulSoup, Trafilatura	Boilerplate, navigation	Extract main content only; preserve headers
Markdown	Direct parse	Minimal	Preserve structure; handle frontmatter
Code	Tree-sitter, AST	Context loss	Include file path + imports as metadata
CSV/JSON	pandas, jq	Schema understanding	Convert rows to natural language
DOCX/PPTX	python-docx, python-pptx	Formatting, embedded media	Extract text + table structure
Images	GPT-4V, Claude Vision	OCR accuracy	Generate text descriptions; store as metadata
Audio/Video	Whisper, Assembly	Timestamps, speakers	Chunk by speaker turn or topic segment

Cleaning Checklist

Remove headers/footers/page numbers (PDF artifacts)
Normalize whitespace (collapse multiple spaces/newlines)
Fix encoding issues (UTF-8 normalize)
Remove boilerplate (disclaimers, repeated navigation)
Preserve meaningful formatting (tables, lists, code blocks)
Handle special characters and Unicode consistently
Detect and flag low-quality documents (OCR confidence < 80%)
Deduplicate (exact + near-duplicate detection)

Metadata Enrichment

Always extract and store:

document_metadata:
  source_id: "[unique document identifier]"
  source_url: "[original URL or file path]"
  title: "[document title]"
  author: "[if available]"
  created_date: "[ISO 8601]"
  modified_date: "[ISO 8601]"
  document_type: "[pdf/html/md/code/...]"
  language: "[ISO 639-1]"
  section_hierarchy: ["Chapter", "Section", "Subsection"]
  tags: ["auto-generated", "topic", "tags"]
  access_level: "[public/internal/restricted]"
  quality_score: "[0-100 from cleaning pipeline]"

Enrichment strategies:

Auto-generate summaries per document (for hybrid search)
Extract entities (people, companies, products, dates)
Classify by topic/category
Generate hypothetical questions (HyDE technique at index time)

Phase 3: Chunking Strategy

The Chunking Decision Is Critical

Bad chunking is the #1 cause of poor RAG quality. No amount of model sophistication fixes bad chunks.

Chunking Method Selection

Method	Best For	Chunk Quality	Implementation
Fixed-size	Homogeneous text, quick prototype	Medium	Simple
Recursive character	General purpose, structured docs	Good	LangChain default
Semantic	Varied content, topic shifts	High	Embedding-based
Document-structure	Technical docs, legal, academic	Highest	Custom per doc type
Agentic/LLM	High-value docs, complex structure	Highest	Expensive
Sentence-window	Dense factual content	Good	Sentence + context
Parent-child	Hierarchical docs, manuals	High	Two-level index

Chunking Decision Tree

Is your content highly structured (headers, sections, numbered)?
├─ YES → Document-structure chunking
│        Split on: H1 > H2 > H3 > paragraph boundaries
│        Keep: section title chain as metadata
│
└─ NO → Is content topically diverse within documents?
         ├─ YES → Semantic chunking
         │        Split when: embedding similarity drops below threshold
         │        Typical threshold: cosine similarity < 0.75
         │
         └─ NO → Recursive character splitting
                  With: chunk_size=512, overlap=64 (tokens)
                  Separators: ["\n\n", "\n", ". ", " "]

Chunk Size Guidelines

Use Case	Target Tokens	Overlap	Rationale
Factual Q&A	256-512	32-64	Precise retrieval
Summarization	512-1024	64-128	Broader context
Code search	Function/class level	0	Natural boundaries
Legal/regulatory	Section/clause level	1 sentence	Preserve clause integrity
Conversational	256-512	64	Quick, focused answers
Research/analysis	1024-2048	128-256	Deep context

Chunk Quality Rules

Self-contained: A chunk should make sense on its own (add context headers if needed)
Atomic: One main idea per chunk when possible
Retrievable: Would this chunk be useful if a user searched for its content?
No orphans: Don't create chunks < 50 tokens (merge with neighbors)
Preserve structure: Tables, code blocks, and lists should not be split mid-element
Context prefix: Prepend document title + section hierarchy to each chunk

Parent-Child (Two-Level) Strategy

Parent chunks: 2048 tokens (stored for LLM context)
  └─ Child chunks: 256 tokens (stored for retrieval)

Retrieval: Search child chunks → Return parent chunk to LLM
Benefit: Precise retrieval + rich context

Chunk Quality Scoring

Score each chunk (automated):

Dimension	Weight	0 (Bad)	5 (Good)	10 (Great)
Self-contained	25%	Sentence fragment	Needs context	Standalone meaningful
Information density	25%	Mostly boilerplate	Mixed	Dense, useful content
Boundary quality	20%	Mid-sentence split	Paragraph boundary	Section/topic boundary
Metadata completeness	15%	No metadata	Basic fields	Full enrichment
Size appropriateness	15%	<50 or >2048 tokens	Within range	Optimal for use case

Target: Average chunk quality score > 7.0

Phase 4: Embedding Strategy

Embedding Model Selection

Model	Dimensions	Max Tokens	Quality	Speed	Cost
text-embedding-3-large (OpenAI)	3072 (or 256-3072 via MRL)	8191	Excellent	Fast	$0.13/1M tokens
text-embedding-3-small (OpenAI)	1536 (or 256-1536)	8191	Good	Very fast	$0.02/1M tokens
voyage-3-large (Voyage)	1024	32000	Excellent	Fast	$0.18/1M tokens
voyage-code-3 (Voyage)	1024	32000	Best for code	Fast	$0.18/1M tokens
Cohere embed-v4	1024	128000	Excellent	Fast	$0.10/1M tokens
BGE-M3 (open source)	1024	8192	Very good	Self-host	Free (compute)
nomic-embed-text (open source)	768	8192	Good	Self-host	Free (compute)
GTE-Qwen2 (open source)	1024-1792	8192	Excellent	Self-host	Free (compute)

Model Selection Rules

Start with: text-embedding-3-small (best cost/quality for prototypes)
Production default: text-embedding-3-large or voyage-3-large
Code search: voyage-code-3 or domain-fine-tuned
Multilingual: Cohere embed-v4 or BGE-M3
Privacy/on-prem: BGE-M3 or GTE-Qwen2
Budget constrained: MRL (Matryoshka) — reduce dimensions (e.g., 3072→256) for 10x storage savings with ~5% quality loss

Embedding Best Practices

Prefix queries differently from documents: Some models (Nomic, E5) need task-specific prefixes
- Document: "search_document: {text}"
- Query: "search_query: {text}"
Normalize embeddings: L2 normalize for cosine similarity
Batch embedding: Process in batches of 100-500 for throughput
Cache embeddings: Store and reuse; don't re-embed unchanged documents
Benchmark on YOUR data: Generic benchmarks (MTEB) don't predict domain-specific performance

Embedding Quality Test

Before committing to a model, run this:

Create 50 query-document pairs from your actual data
Embed all queries and documents
Calculate recall@5 and recall@10
Compare 2-3 models
Pick the one with highest recall on YOUR domain

Target: recall@5 > 0.7 on your domain test set

Phase 5: Vector Store & Indexing

Vector Database Selection

Database	Type	Scale	Features	Best For
Pinecone	Managed	Billions	Serverless, metadata filter	Production SaaS
Weaviate	Managed/Self-host	Millions-Billions	Hybrid search, modules	Feature-rich apps
Qdrant	Managed/Self-host	Billions	Filtering, quantization	High-performance
ChromaDB	Embedded	Thousands-Millions	Simple API	Prototypes, local
pgvector	Extension	Millions	SQL integration	Postgres-native apps
Milvus	Self-host	Billions	GPU support	Large scale
LanceDB	Embedded	Millions	Serverless, multimodal	Cost-sensitive

Selection Decision

Scale < 100K chunks AND simple use case?
├─ YES → ChromaDB or pgvector
└─ NO → Need managed service?
         ├─ YES → Pinecone (simplest) or Weaviate (feature-rich)
         └─ NO → Qdrant (performance) or Milvus (scale)

Indexing Strategy

Index Type	Recall	Speed	Memory	Use When
Flat/Brute	100%	Slow	High	<50K vectors, accuracy critical
IVF	95-99%	Fast	Medium	50K-10M vectors
HNSW	95-99%	Very fast	High	Default choice for quality
PQ (Product Quantization)	90-95%	Fast	Low	Memory constrained
HNSW+PQ	93-98%	Fast	Medium	Scale + quality balance

Default recommendation: HNSW with ef_construction=200, M=16

Hybrid Search Architecture

Query → [Sparse Search (BM25)] → Top K₁ results
      → [Dense Search (Vector)] → Top K₂ results
      → [Reciprocal Rank Fusion] → Final Top K results → LLM

Why hybrid?

Dense (vector) excels at semantic similarity
Sparse (BM25/keyword) excels at exact term matching, acronyms, IDs
Hybrid captures both — 5-15% improvement over either alone

RRF Formula: score = Σ 1/(k + rank_i) where k=60 (default)

Metadata Filtering

Always support these filters:

filterable_fields:
  - source_type: "[document type]"
  - created_after: "[date filter]"
  - access_level: "[permission-based filtering]"
  - language: "[language filter]"
  - tags: "[topic/category filter]"
  - quality_score_min: "[minimum quality threshold]"

Rule: Filter BEFORE vector search, not after — reduces search space and improves relevance.

Phase 6: Retrieval Optimization

Query Processing Pipeline

User Query → Query Understanding → Query Transformation → Retrieval → Reranking → Context Assembly → LLM

Query Transformation Techniques

Technique	What It Does	When to Use	Quality Boost
Query rewriting	LLM rewrites query for clarity	Vague/conversational queries	+10-15%
HyDE	Generate hypothetical answer, embed that	Factual Q&A	+5-15%
Multi-query	Generate 3-5 query variants	Complex questions	+10-20%
Step-back	Abstract to higher-level question	Complex reasoning	+5-10%
Query decomposition	Break into sub-questions	Multi-part questions	+15-25%
Query routing	Route to different indexes	Multi-source systems	+10-20%

Recommended: Multi-Query + Reranking

# Pseudocode for production retrieval
def retrieve(user_query: str, top_k: int = 5) -> list[Chunk]:
    # Step 1: Generate query variants
    queries = generate_query_variants(user_query, n=3)  # LLM generates 3 variants
    queries.append(user_query)  # Include original

    # Step 2: Retrieve candidates from each query
    candidates = set()
    for q in queries:
        results = hybrid_search(q, top_k=20)  # Over-retrieve
        candidates.update(results)

    # Step 3: Rerank
    reranked = rerank(user_query, list(candidates), top_k=top_k)

    return reranked

Reranking

Why rerank? Embedding similarity is a rough filter. Cross-encoder rerankers are 10-30% more accurate but too slow for initial retrieval.

Reranker	Quality	Speed	Cost
Cohere Rerank 3.5	Excellent	Fast	$2/1M queries
Voyage Reranker 2	Excellent	Fast	API pricing
BGE-reranker-v2-m3	Very good	Medium	Free (self-host)
ColBERT v2	Excellent	Medium	Free (self-host)
LLM-as-reranker	Best	Slow	Expensive

Default: Cohere Rerank 3.5 (best quality/cost ratio)

Retrieval Parameters

Parameter	Default	Range	Impact
top_k (initial retrieval)	20	10-50	Higher = better recall, more noise
top_k (after reranking)	5	3-10	Higher = more context, more cost
similarity threshold	0.3	0.2-0.5	Filter low-relevance results
MMR diversity	λ=0.7	0.5-1.0	Lower = more diverse results

Context Assembly

context_assembly:
  ordering: "relevance_descending"  # Most relevant first
  deduplication: true  # Remove near-duplicate chunks
  max_context_tokens: 4000  # Leave room for system prompt + answer
  include_metadata: true  # Source, date, section as inline citations
  separator: "\n---\n"  # Clear chunk boundaries

  # Citation format
  citation_style: |
    [Source: {title} | Section: {section} | Date: {date}]
    {chunk_text}

Phase 7: Generation & Prompting

System Prompt Template

You are a helpful assistant that answers questions based on the provided context.

## Rules
1. Answer ONLY based on the provided context. If the context doesn't contain the answer, say "I don't have enough information to answer this question."
2. Always cite your sources using [Source: X] notation.
3. If the context contains conflicting information, acknowledge the conflict and present both perspectives.
4. Never make up information or fill gaps with your training data.
5. If the question is ambiguous, ask for clarification.
6. Keep answers concise but complete.

## Context
{retrieved_context}

## Conversation History (if multi-turn)
{conversation_history}

## User Question
{user_query}

Prompt Engineering for RAG

Grounding rules (prevent hallucination):

Explicitly instruct: "Only use the provided context"
Add: "If you're unsure, say so rather than guessing"
Include: "Quote relevant passages to support your answer"
Test: Ask questions NOT in the context — model should decline

Citation instructions:

Inline: "Based on [Document Title, Section X]..."
Footnote: "...the process involves three steps.[1]"
Both: Use inline for key claims, footnotes for supporting details

Model Selection for Generation

Model	Context Window	Quality	Cost	Best For
GPT-4o	128K	Excellent	Medium	General production
GPT-4o-mini	128K	Good	Low	High-volume, cost-sensitive
Claude Sonnet	200K	Excellent	Medium	Nuanced answers, long context
Claude Haiku	200K	Good	Low	Fast, cost-sensitive
Gemini 1.5 Pro	1M	Excellent	Medium	Very large context needs
Llama 3.1 70B	128K	Very good	Self-host	Privacy, on-prem

Multi-Turn Conversation

conversation_strategy:
  # How to handle follow-up questions
  query_reformulation: true  # Rewrite follow-ups as standalone queries
  context_carry_forward: "last_2_turns"  # How much history to include
  memory:
    type: "sliding_window"  # or "summary" for long conversations
    window_size: 5  # Number of turns to keep

  # Example reformulation
  # Turn 1: "What is RAG?" → search as-is
  # Turn 2: "How does it handle updates?" → reformulate: "How does RAG handle document updates?"

Phase 8: Evaluation Framework

RAG Evaluation is Non-Negotiable

If you're not measuring, you're guessing. Every production RAG system needs automated evaluation.

Evaluation Dimensions

Dimension	What It Measures	Method
Retrieval Precision	Are retrieved chunks relevant?	Human or LLM judge
Retrieval Recall	Are all relevant chunks found?	Gold set comparison
Answer Faithfulness	Does answer match context? (no hallucination)	LLM-as-judge
Answer Relevance	Does answer address the question?	LLM-as-judge
Answer Completeness	Are all aspects of the question addressed?	LLM-as-judge
Citation Accuracy	Are citations correct and sufficient?	Automated + human
Latency	End-to-end response time	Instrumentation
Cost	Per-query cost	Instrumentation

Evaluation Dataset

Build a golden test set (minimum 100 examples):

eval_example:
  query: "What is the refund policy for enterprise customers?"
  expected_sources: ["policy-doc-v3.pdf", "enterprise-agreement.md"]
  expected_answer_contains:
    - "30-day refund window"
    - "written notice required"
    - "prorated for annual plans"
  answer_type: "factual"
  difficulty: "easy"  # easy / medium / hard

Test set composition:

40% easy (single document, direct answer)
35% medium (multiple documents, synthesis needed)
15% hard (requires reasoning, edge cases)
10% unanswerable (answer NOT in corpus — must detect)

LLM-as-Judge Prompts

Faithfulness (hallucination detection):

Given the context and the answer, determine if the answer is faithful to the context.

Context: {context}
Question: {question}
Answer: {answer}

Score 1-5:
1 = Contains fabricated information not in context
2 = Mostly faithful but includes unsupported claims
3 = Faithful with minor extrapolation
4 = Faithful, well-supported
5 = Perfectly faithful, every claim traceable to context

Score: [1-5]
Reasoning: [explain]

Answer Relevance:

Does this answer address the user's question?

Question: {question}
Answer: {answer}

Score 1-5:
1 = Completely irrelevant
2 = Partially relevant, misses key aspects
3 = Relevant but incomplete
4 = Relevant and mostly complete
5 = Perfectly addresses all aspects of the question

Score: [1-5]
Reasoning: [explain]

Evaluation Tools

Tool	Type	Best For
RAGAS	Open source	Comprehensive RAG metrics
DeepEval	Open source	LLM-as-judge + classic metrics
Arize Phoenix	Open source	Tracing + evaluation
LangSmith	Managed	LangChain ecosystem
Braintrust	Managed	Eval + logging + monitoring
Custom	DIY	Maximum control

Evaluation Cadence

Frequency	What to Evaluate
Every PR	Run golden test set (automated CI)
Weekly	Sample 50 production queries for human review
Monthly	Full evaluation suite + benchmark comparison
Quarterly	Revisit golden test set, add new examples

Phase 9: Production Deployment

Production Architecture

┌─────────────┐     ┌──────────────┐     ┌──────────────┐
│   Client     │────▶│  API Gateway  │────▶│  RAG Service  │
│   (App/API)  │     │  (Rate limit) │     │              │
└─────────────┘     └──────────────┘     │  Query Proc.  │
                                          │  Retrieval    │
                                          │  Reranking    │
                                          │  Generation   │
                                          └───────┬───────┘
                                                  │
                    ┌──────────────┐     ┌────────▼───────┐
                    │  Ingestion    │────▶│  Vector Store   │
                    │  Pipeline     │     │  + Metadata     │
                    └──────────────┘     └────────────────┘

Production Checklist

Pre-Launch (Mandatory):

Security:

No prompt injection vectors (user input sanitized)
Access control on documents (user sees only authorized content)
No PII leakage across user boundaries
API authentication required
Rate limiting per user/API key
Audit logging for compliance

Caching Strategy

caching:
  query_cache:
    type: "semantic"  # Cache semantically similar queries
    ttl: 3600  # 1 hour
    similarity_threshold: 0.95
    expected_hit_rate: "20-40%"

  embedding_cache:
    type: "exact"  # Cache document embeddings
    ttl: 86400  # 24 hours (or until document changes)

  llm_response_cache:
    type: "exact_query_context"
    ttl: 1800  # 30 minutes
    invalidate_on: "source_document_update"

Scaling Considerations

Scale	Architecture	Notes
<1K queries/day	Single instance, managed vector DB	Keep it simple
1K-100K/day	Horizontal scaling, caching	Add semantic cache
100K-1M/day	Microservices, async, CDN	Separate ingestion/retrieval
>1M/day	Distributed, multi-region	Custom infrastructure

Phase 10: Monitoring & Observability

Production Monitoring Dashboard

rag_dashboard:
  real_time:
    - query_volume: "[queries/min]"
    - latency_p50: "[ms]"
    - latency_p95: "[ms]"
    - latency_p99: "[ms]"
    - error_rate: "[%]"
    - cache_hit_rate: "[%]"

  quality_signals:
    - retrieval_confidence_avg: "[0-1 — average similarity score]"
    - empty_retrieval_rate: "[% queries with no results above threshold]"
    - fallback_rate: "[% queries where model says 'I don't know']"
    - user_feedback_positive: "[% thumbs up]"
    - citation_rate: "[% answers with citations]"

  cost:
    - embedding_cost_daily: "[$]"
    - llm_cost_daily: "[$]"
    - reranker_cost_daily: "[$]"
    - vector_db_cost_daily: "[$]"
    - total_cost_per_query: "[$]"

  data_health:
    - index_freshness: "[time since last update]"
    - total_chunks_indexed: "[count]"
    - failed_ingestion_count: "[count]"
    - avg_chunk_quality_score: "[0-10]"

Alert Rules

Alert	Threshold	Severity
Latency P95 > 8s	5 min sustained	Warning
Latency P95 > 15s	1 min sustained	Critical
Error rate > 5%	5 min sustained	Critical
Empty retrieval > 30%	1 hour	Warning
Hallucination detected	Any flagged	Warning
Cost per query > 2x baseline	1 hour	Warning
Vector DB latency > 500ms	5 min sustained	Warning
Index staleness > 24h	If freshness SLA is <24h	Warning

Continuous Improvement Loop

Monitor → Identify Failure Patterns → Root Cause → Fix → Evaluate → Deploy

Weekly review questions:

What are the top 5 query types with lowest satisfaction?
Which documents are never retrieved? (potential indexing issues)
Which queries trigger "I don't know"? (coverage gaps)
What's the hallucination trend? (improving or degrading?)
Are costs trending up or down per query?

Phase 11: Advanced Patterns

Agentic RAG

User Query → Query Planner (LLM) → [Plan: search A, then search B, compare]
                                     ↓
                               Tool Execution
                               ├─ search_documents(query_A)
                               ├─ search_documents(query_B)
                               ├─ calculate(comparison)
                               └─ synthesize(results)
                                     ↓
                               Final Answer

When to use: Multi-step reasoning, cross-document comparison, calculation needed.

Implementation:

Define tools: search_docs, lookup_entity, calculate, compare
Use function calling with planning prompt
Limit iterations (max 5 tool calls per query)
Track and log the full reasoning chain

Graph RAG

graph_rag:
  when_to_use:
    - "Entity-heavy domains (legal, medical, organizational)"
    - "Queries about relationships ('who reports to X?')"
    - "Multi-hop reasoning ('what products use components from supplier Y?')"

  architecture:
    entities: "[Extract entities from documents]"
    relationships: "[Extract entity-entity relationships]"
    communities: "[Cluster entities into topic communities]"
    summaries: "[Generate community summaries]"

  retrieval:
    local_search: "Entity-focused — find specific entities and their neighbors"
    global_search: "Community-focused — synthesize across topic clusters"
    hybrid: "Combine vector similarity + graph traversal"

Corrective RAG (CRAG)

Query → Retrieve → Evaluate Relevance → 
  ├─ CORRECT: Retrieved docs are relevant → Generate answer
  ├─ AMBIGUOUS: Partially relevant → Refine query + re-retrieve
  └─ INCORRECT: Not relevant → Fall back to web search or "I don't know"

Self-RAG

Query → Retrieve → Generate + Self-Reflect →
  ├─ "Is retrieval needed?" → Skip if query is simple
  ├─ "Are results relevant?" → Re-retrieve if not
  ├─ "Is my answer supported?" → Revise if not faithful
  └─ "Is my answer useful?" → Regenerate if not

RAG + Fine-Tuning

Approach	When	Benefit
RAG only	Dynamic knowledge, many sources	Flexible, no training needed
Fine-tuning only	Static knowledge, consistent format	Fast inference, no retrieval
RAG + Fine-tuned embeddings	Domain-specific vocabulary	Better retrieval quality
RAG + Fine-tuned generator	Consistent output format needed	Better answers + grounding

Multi-Modal RAG

multimodal_rag:
  document_types:
    images: "Generate text descriptions via vision model; embed descriptions"
    tables: "Convert to structured text; embed as markdown"
    charts: "Describe in natural language; embed description"
    diagrams: "Generate detailed caption; store image reference + caption"

  retrieval:
    strategy: "Text-first retrieval with multimodal context assembly"
    image_in_context: "Include as base64 or URL reference in prompt"

Phase 12: Common Failure Modes & Fixes

Diagnostic Decision Tree

RAG quality is poor
├─ Retrieved chunks are irrelevant
│   ├─ Check: Chunking strategy → Are chunks self-contained?
│   ├─ Check: Embedding model → Run domain benchmark test
│   ├─ Check: Query transformation → Enable multi-query or HyDE
│   └─ Fix: Add reranking if not present
│
├─ Retrieved chunks are relevant but answer is wrong
│   ├─ Check: System prompt → Is grounding instruction clear?
│   ├─ Check: Context window → Is relevant info getting truncated?
│   ├─ Check: Conflicting sources → Add conflict resolution instructions
│   └─ Fix: Upgrade generation model
│
├─ System says "I don't know" too often
│   ├─ Check: Similarity threshold → Too high? Lower from 0.5 to 0.3
│   ├─ Check: Corpus coverage → Missing documents?
│   ├─ Check: top_k → Too low? Increase from 5 to 10
│   └─ Fix: Add query expansion
│
├─ Hallucination / makes things up
│   ├─ Check: System prompt → Add explicit grounding instructions
│   ├─ Check: Temperature → Set to 0.0-0.3 for factual tasks
│   ├─ Check: Retrieved context → Is it misleading or ambiguous?
│   └─ Fix: Add faithfulness evaluation in post-processing
│
└─ Too slow
    ├─ Check: Embedding latency → Batch? Cache?
    ├─ Check: Vector search → Index type? Quantization?
    ├─ Check: Reranker → Faster model or reduce candidate set
    └─ Fix: Add caching layer (semantic query cache)

10 RAG Anti-Patterns

#	Anti-Pattern	Why It's Bad	Fix
1	No reranking	Vector similarity is noisy	Add cross-encoder reranker
2	Fixed chunk size for all docs	Different docs need different strategies	Use document-aware chunking
3	No evaluation	Flying blind	Build golden test set + automated eval
4	Ignoring metadata	Missing obvious filtering opportunities	Add metadata enrichment + filtering
5	Single query embedding	Misses semantic variants	Use multi-query retrieval
6	No "I don't know"	Hallucination when context insufficient	Add explicit grounding + confidence
7	Embedding documents without context	Chunks lose meaning in isolation	Prepend title/section to chunks
8	No freshness management	Stale answers from outdated docs	Implement update pipeline + TTL
9	Oversized context	Wasted tokens, increased cost + latency	Optimize top_k, use reranking
10	No access control	Users see unauthorized content	Implement document-level ACL filtering

10 Common Mistakes

Mistake	Impact	Fix
Starting with complex architecture	Wasted time	Start naive, add complexity based on eval data
Not measuring before optimizing	Optimizing wrong thing	Eval first, then optimize worst dimension
Chunking at arbitrary character count	Bad retrieval	Use semantic or structure-aware chunking
Using same embedding for all languages	Poor multilingual results	Use multilingual model or per-language index
Ignoring the 20% of hard queries	80% of user complaints	Build hard query test set, optimize for tail
No conversation context	Bad multi-turn experience	Implement query reformulation
Stuffing entire documents	Wasted tokens, noise	Retrieve only relevant chunks
Not handling "no results" gracefully	Hallucination	Define explicit fallback behavior
Over-engineering from day 1	Never ships	MVP in 1 week, iterate from data
Not versioning your index	Can't rollback	Version embeddings + index config

Quality Scoring Rubric

RAG System Health Score (0-100)

Dimension	Weight	Score 0-10
Retrieval quality (precision + recall)	20%	___
Answer faithfulness (no hallucination)	20%	___
Answer relevance & completeness	15%	___
Latency & performance	10%	___
Cost efficiency	10%	___
Evaluation coverage	10%	___
Data freshness & quality	10%	___
Security & access control	5%	___

Weighted Score: ___ / 100

Grade	Score	Action
A	85-100	Production-ready, continuous improvement
B	70-84	Good foundation, address gaps
C	55-69	Significant improvements needed
D	40-54	Fundamental issues, review architecture
F	<40	Rebuild needed

Edge Cases

Low-Volume / Small Corpus

Skip vector DB — use in-memory search or full-context stuffing
Focus on chunking quality over retrieval sophistication
Simple keyword + semantic hybrid is sufficient

High-Security / Regulated

On-prem vector DB + self-hosted embedding model
Document-level ACL enforcement at retrieval time
Audit logging every query + response
Data residency compliance for vector storage
Consider homomorphic encryption for embeddings

Multi-Language

Use multilingual embedding model (BGE-M3, Cohere embed-v4)
Consider per-language indexes for large corpora
Query language detection → route to appropriate index
Cross-lingual retrieval: query in English, retrieve in any language

Real-Time / Streaming

Event-driven ingestion (Kafka/webhooks → chunk → embed → index)
Incremental indexing (add/update/delete individual chunks)
Version management (don't serve partially indexed documents)
Consider time-weighted scoring (recent docs ranked higher)

Very Large Corpus (>10M documents)

Tiered retrieval: coarse filter → fine retrieval → reranking
Hierarchical indexing (cluster → sub-cluster → document → chunk)
Async processing pipeline with queue management
Consider pre-computed answers for top 1000 queries

Natural Language Commands

When the user says... you respond with:

Command	Action
"Design a RAG system for [use case]"	Complete Phase 1 brief + architecture recommendation
"Help me chunk [document type]"	Chunking strategy recommendation + implementation
"Which embedding model should I use?"	Model comparison for their use case + benchmark plan
"My RAG results are bad"	Diagnostic decision tree walkthrough
"Evaluate my RAG system"	Evaluation framework setup + golden test set design
"Optimize retrieval"	Query transformation + reranking recommendations
"How do I handle [specific scenario]?"	Relevant pattern from advanced section
"Set up monitoring"	Dashboard YAML + alert rules for their scale
"How much will this cost?"	Cost estimation based on their scale + optimization tips
"Compare [approach A] vs [approach B]"	Decision matrix with pros/cons for their context
"I'm getting hallucinations"	Faithfulness diagnosis + grounding improvements
"Score my RAG system"	Full quality rubric assessment

Built by AfrexAI — AI agents that compound capital and code. Zero dependencies. Pure methodology. Works with any RAG stack.