RAG Architecture

When to Use This Skill

Use this skill when:

  • Designing RAG pipelines for LLM applications

  • Choosing chunking and embedding strategies

  • Optimizing retrieval quality and relevance

  • Building knowledge-grounded AI systems

  • Implementing hybrid search (dense + sparse)

  • Designing multi-stage retrieval pipelines

Keywords: RAG, retrieval-augmented generation, embeddings, chunking, vector search, semantic search, context window, grounding, knowledge base, hybrid search, reranking, BM25, dense retrieval

RAG Architecture Overview

┌─────────────────────────────────────────────────────────────────────┐
│                            RAG Pipeline                             │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────────────┐   │
│  │  Ingestion   │    │   Indexing   │    │     Vector Store     │   │
│  │  Pipeline    │───▶│   Pipeline   │───▶│     (Embeddings)     │   │
│  └──────────────┘    └──────────────┘    └──────────────────────┘   │
│     Documents      Chunks + Embeddings      Indexed Vectors         │
│                                                                     │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────────────┐   │
│  │    Query     │    │  Retrieval   │    │  Context Assembly    │   │
│  │  Processing  │───▶│   Engine     │───▶│   + Generation       │   │
│  └──────────────┘    └──────────────┘    └──────────────────────┘   │
│     User Query        Top-K Chunks          LLM Response            │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Document Ingestion Pipeline

Document Processing Steps

Raw Documents
      │
      ▼
┌─────────────┐
│  Extract    │  ← PDF, HTML, DOCX, Markdown
│  Content    │
└─────────────┘
      │
      ▼
┌─────────────┐
│  Clean &    │  ← Remove boilerplate, normalize
│  Normalize  │
└─────────────┘
      │
      ▼
┌─────────────┐
│   Chunk     │  ← Split into retrievable units
│  Documents  │
└─────────────┘
      │
      ▼
┌─────────────┐
│  Generate   │  ← Create vector representations
│ Embeddings  │
└─────────────┘
      │
      ▼
┌─────────────┐
│   Store     │  ← Persist vectors + metadata
│  in Index   │
└─────────────┘
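
In code, the whole flow is a short loop. Below is a minimal sketch only: `extract_text()`, `embed()`, and the store's `upsert()` are hypothetical stand-ins for your extraction library, embedding model, and vector database client, and `chunk_fixed()` is the fixed-size chunker sketched in the chunking section below.

```python
# Minimal ingestion sketch. extract_text(), embed(), and store.upsert()
# are HYPOTHETICAL placeholders for your extraction library, embedding
# model, and vector DB client; chunk_fixed() is defined further below.
import hashlib

def ingest(paths, store, chunk_size=400, overlap=60):
    for path in paths:
        text = extract_text(path)        # PDF/HTML/DOCX/Markdown -> plain text
        text = " ".join(text.split())    # crude clean/normalize pass
        for chunk in chunk_fixed(text, chunk_size, overlap):
            store.upsert(
                id=hashlib.sha1(chunk.encode()).hexdigest(),  # stable, dedupe-friendly ID
                vector=embed([chunk])[0],                     # batch API, one chunk here
                metadata={"source": path, "text": chunk},
            )
```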

Chunking Strategies

Strategy Comparison

Strategy           Description                      Best For                     Chunk Size
Fixed-size         Split by token/character count   Simple documents             256-512 tokens
Sentence-based     Split at sentence boundaries     Narrative text               Variable
Paragraph-based    Split at paragraph boundaries    Structured docs              Variable
Semantic           Split by topic/meaning           Long documents               Variable
Recursive          Hierarchical splitting           Mixed content                Configurable
Document-specific  Custom per doc type              Specialized (code, tables)   Variable

Chunking Decision Tree

What type of content?
├── Code
│   └── AST-based or function-level chunking
├── Tables/Structured
│   └── Keep tables intact, chunk surrounding text
├── Long narrative
│   └── Semantic or recursive chunking
├── Short documents (<1 page)
│   └── Whole document as chunk
└── Mixed content
    └── Recursive with type-specific handlers

Chunk Overlap

Without overlap:

  [Chunk 1: "The quick brown"] [Chunk 2: "fox jumps over"]
                             ↑
                Information lost at boundary

With overlap (20%):

  [Chunk 1: "The quick brown fox"] [Chunk 2: "brown fox jumps over"]
                             ↑
              Context preserved across boundaries

Recommended overlap: 10-20% of chunk size
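
Fixed-size chunking with overlap is simple enough to sketch directly. The version below is word-based to stay dependency-free (roughly 400-word chunks with a 60-word, i.e. 15%, overlap); a production implementation would count tokens with the model's actual tokenizer (e.g. tiktoken) instead.

```python
def chunk_fixed(text, chunk_size=400, overlap=60):
    """Split text into fixed-size word windows with overlap.

    Word counts approximate token counts here; swap in a real tokenizer
    for chunk sizes that are faithful to the embedding model's limits.
    """
    words = text.split()
    step = chunk_size - overlap          # advance by size minus overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(words):
            break                        # last window already covers the tail
    return chunks
```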

Chunk Size Trade-offs

Smaller Chunks (128-256 tokens)       Larger Chunks (512-1024 tokens)
├── More precise retrieval            ├── More context per chunk
├── Less context per chunk            ├── May include irrelevant content
├── More chunks to search             ├── Fewer chunks to search
├── Better for factoid Q&A            ├── Better for summarization
└── Higher retrieval recall           └── Higher retrieval precision

Embedding Models

Model Comparison

Model                           Dimensions   Context   Strengths
OpenAI text-embedding-3-large   3072         8K        High quality, expensive
OpenAI text-embedding-3-small   1536         8K        Good quality/cost ratio
Cohere embed-v3                 1024         512       Multilingual, fast
BGE-large                       1024         512       Open source, competitive
E5-large-v2                     1024         512       Open source, instruction-tuned
GTE-large                       1024         512       Alibaba, good for Chinese
Sentence-BERT                   768          512       Classic, well-understood

Embedding Selection

Need best quality, cost OK?
├── Yes → OpenAI text-embedding-3-large
└── No
    └── Need self-hosted/open source?
        ├── Yes → BGE-large or E5-large-v2
        └── No
            └── Need multilingual?
                ├── Yes → Cohere embed-v3
                └── No  → OpenAI text-embedding-3-small

Embedding Optimization

Technique               Description                     When to Use
Matryoshka embeddings   Truncatable to smaller dims     Memory-constrained
Quantized embeddings    INT8/binary embeddings          Large-scale search
Instruction-tuned       Prefix with task instruction    Specialized retrieval
Fine-tuned embeddings   Domain-specific training        Specialized domains
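
To illustrate the first row: truncating a Matryoshka-trained embedding is just slicing followed by re-normalization. A sketch assuming NumPy arrays of shape (n, dims) from a model trained with Matryoshka representation learning (e.g. OpenAI text-embedding-3-*); this is not valid for embeddings from ordinary models.

```python
import numpy as np

def truncate_matryoshka(embeddings: np.ndarray, dims: int = 256) -> np.ndarray:
    """Keep the first `dims` dimensions and re-normalize.

    Only meaningful for Matryoshka-trained models, where the leading
    dimensions carry a usable coarse representation on their own.
    """
    truncated = embeddings[:, :dims]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)  # keep cosine similarity well-behaved
```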

Retrieval Strategies

Dense Retrieval (Semantic Search)

Query: "How to deploy containers" │ ▼ ┌─────────┐ │ Embed │ │ Query │ └─────────┘ │ ▼ ┌─────────────────────────────────┐ │ Vector Similarity Search │ │ (Cosine, Dot Product, L2) │ └─────────────────────────────────┘ │ ▼ Top-K semantically similar chunks

Sparse Retrieval (BM25/TF-IDF)

Query: "Kubernetes pod deployment YAML" │ ▼ ┌─────────┐ │Tokenize │ │ + Score │ └─────────┘ │ ▼ ┌─────────────────────────────────┐ │ BM25 Ranking │ │ (Term frequency × IDF) │ └─────────────────────────────────┘ │ ▼ Top-K lexically matching chunks

Hybrid Search (Best of Both)

Query ──┬──▶ Dense Search  ──┬──▶ Fusion ──▶ Final Ranking
        │                    │
        └──▶ Sparse Search ──┘

Fusion Methods:
  • RRF (Reciprocal Rank Fusion)
  • Linear combination
  • Learned reranking

Reciprocal Rank Fusion (RRF)

RRF(d) = Σ_i 1 / (k + rank_i(d))

Where:

  • k = constant (typically 60)
  • rank_i(d) = rank of document d in retriever i's result list (dense, sparse, ...)

Example:

  Doc A: Dense rank=1, Sparse rank=5
  RRF(A) = 1/(60+1) + 1/(60+5) = 0.0164 + 0.0154 = 0.0318

  Doc B: Dense rank=3, Sparse rank=1
  RRF(B) = 1/(60+3) + 1/(60+1) = 0.0159 + 0.0164 = 0.0323

Result: Doc B ranks higher (better combined relevance)
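
RRF generalizes to any number of ranked lists. A short sketch; `rrf_fuse` and the tiny example lists are illustrative names, but the arithmetic reproduces the worked example above.

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion over several best-first lists of doc IDs."""
    scores = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Reproducing the example: A is dense rank 1 / sparse rank 5,
# B is dense rank 3 / sparse rank 1 -> B wins on combined score.
dense  = ["A", "x", "B", "y", "z"]
sparse = ["B", "y", "z", "x", "A"]
assert rrf_fuse([dense, sparse])[0] == "B"
```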

Multi-Stage Retrieval

Two-Stage Pipeline

┌─────────────────────────────────────────────────────────┐
│ Stage 1: Recall (Fast, High Recall)                     │
│   • ANN search (HNSW, IVF)                              │
│   • Retrieve top-100 candidates                         │
│   • Latency: 10-50ms                                    │
└─────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│ Stage 2: Rerank (Slow, High Precision)                  │
│   • Cross-encoder or LLM reranking                      │
│   • Score top-100 → return top-10                       │
│   • Latency: 100-500ms                                  │
└─────────────────────────────────────────────────────────┘
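
A minimal sketch of the two stages. `index.search()` is a hypothetical ANN index returning `(doc_id, text)` pairs; stage 2 uses the sentence-transformers CrossEncoder API, and the model name shown is one common public checkpoint, not a recommendation.

```python
from sentence_transformers import CrossEncoder

# One widely used public cross-encoder checkpoint; swap for your own.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve(query, index, recall_k=100, final_k=10):
    # Stage 1: fast, high-recall ANN search (index is a hypothetical client).
    candidates = index.search(query, top_k=recall_k)
    # Stage 2: slow, high-precision rerank of the candidate pool.
    scores = reranker.predict([(query, text) for _, text in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)
    return [doc for _, doc in ranked[:final_k]]
```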

Reranking Options

Reranker                Latency   Quality   Cost
Cross-encoder (local)   Medium    High      Compute
Cohere Rerank           Fast      High      API cost
LLM-based rerank        Slow      Highest   High API cost
BGE-reranker            Fast      Good      Compute

Context Assembly

Context Window Management

Context Budget: 128K tokens
├── System prompt:         500 tokens    (fixed)
├── Conversation history:  4K tokens     (sliding window)
├── Retrieved context:     8K tokens     (dynamic)
└── Generation buffer:     ~115K tokens  (available)

Strategy: Maximize retrieved context quality within budget
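
A greedy packing sketch for the retrieved-context slice of the budget. The 4-characters-per-token default is a rough stand-in for a real tokenizer; everything else follows from the budget table above.

```python
def assemble_context(chunks, budget_tokens=8000,
                     count_tokens=lambda s: len(s) // 4):
    """Pack the highest-ranked chunks into the retrieval token budget.

    Assumes `chunks` is ordered by relevance; `count_tokens` defaults to
    a crude 4-chars-per-token heuristic (swap in a real tokenizer).
    """
    selected, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > budget_tokens:
            break                      # budget exhausted; drop the rest
        selected.append(chunk)
        used += cost
    return "\n\n---\n\n".join(selected)
```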

Context Assembly Strategies

Strategy              Description           When to Use
Simple concatenation  Join top-K chunks     Small context, simple Q&A
Relevance-ordered     Most relevant first   General retrieval
Chronological         Time-ordered          Temporal queries
Hierarchical          Summary + details     Long-form generation
Interleaved           Mix sources           Multi-source queries

Lost-in-the-Middle Problem

LLM Attention Pattern:
┌─────────────────────────────────────────────────────────┐
│  Beginning            Middle               End          │
│  ████                 ░░░░                 ████         │
│  High attention       Low attention        High attention│
└─────────────────────────────────────────────────────────┘

Mitigation:

  1. Put the most relevant chunks at the beginning AND end (see the sketch below)
  2. Use shorter context windows when possible
  3. Use hierarchical summarization
  4. Fine-tune for long-context attention
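
Mitigation 1 is a simple reorder. This sketch takes chunks sorted best-first and alternates them between the head and tail of the context, so the weakest chunks land in the low-attention middle.

```python
def order_for_attention(chunks_by_relevance):
    """Place the most relevant chunks at the start and end of the context.

    Input is best-first; output alternates head/tail, e.g.
    [1, 2, 3, 4, 5] (1 = most relevant) -> [1, 3, 5, 4, 2].
    """
    head, tail = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        (head if i % 2 == 0 else tail).append(chunk)
    return head + tail[::-1]   # reverse tail so rank 2 sits at the very end
```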

Advanced RAG Patterns

Query Transformation

Original Query: "Tell me about the project"
                        │
       ┌────────────────┼────────────────┐
       ▼                ▼                ▼
  ┌─────────┐      ┌──────────┐     ┌──────────┐
  │  HyDE   │      │  Query   │     │ Sub-query│
  │ (Hypo   │      │ Expansion│     │ Decomp.  │
  │  Doc)   │      │          │     │          │
  └─────────┘      └──────────┘     └──────────┘
       │                │                │
       ▼                ▼                ▼
  Hypothetical     "project, goals,  "What is the
  answer to        timeline,          project scope?"
  embed            deliverables"     "What are the
                                      deliverables?"

HyDE (Hypothetical Document Embeddings)

Query: "How does photosynthesis work?" │ ▼ ┌───────────────┐ │ LLM generates │ │ hypothetical │ │ answer │ └───────────────┘ │ ▼ "Photosynthesis is the process by which plants convert sunlight into energy..." │ ▼ ┌───────────────┐ │ Embed hypo │ │ document │ └───────────────┘ │ ▼ Search with hypothetical embedding (Better matches actual documents)

Self-RAG (Retrieval-Augmented LM with Self-Reflection)

┌─────────────────────────────────────────────────────────┐
│ 1. Generate initial response                            │
│ 2. Decide: Need more retrieval? (critique token)        │
│    ├── Yes → Retrieve more, regenerate                  │
│    └── No  → Check factuality (isRel, isSup tokens)     │
│ 3. Verify claims against sources                        │
│ 4. Regenerate if needed                                 │
│ 5. Return verified response                             │
└─────────────────────────────────────────────────────────┘

Agentic RAG

Query: "Compare Q3 revenue across regions" │ ▼ ┌───────────────┐ │ Query Agent │ │ (Plan steps) │ └───────────────┘ │ ┌───────────┼───────────┐ ▼ ▼ ▼ ┌───────┐ ┌───────┐ ┌───────┐ │Search │ │Search │ │Search │ │ EMEA │ │ APAC │ │ AMER │ │ docs │ │ docs │ │ docs │ └───────┘ └───────┘ └───────┘ │ │ │ └───────────┼───────────┘ ▼ ┌───────────────┐ │ Synthesize │ │ Comparison │ └───────────────┘

Evaluation Metrics

Retrieval Metrics

Metric                       Description                     Target
Recall@K                     % relevant docs in top-K        >80%
Precision@K                  % of top-K that are relevant    >60%
MRR (Mean Reciprocal Rank)   1/rank of first relevant        >0.5
NDCG                         Graded relevance ranking        >0.7
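
Recall@K and MRR take only a few lines to compute offline. A sketch assuming lists of doc IDs per query, with relevance judgments coming from your ground-truth set.

```python
def recall_at_k(retrieved, relevant, k=10):
    """Fraction of relevant doc IDs that appear in the top-k retrieved."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant) if relevant else 0.0

def mrr(all_retrieved, all_relevant):
    """Mean Reciprocal Rank over a query set (0 when nothing relevant is found)."""
    total = 0.0
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank   # only the first relevant hit counts
                break
    return total / len(all_retrieved)
```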

End-to-End Metrics

Metric               Description                           Target
Answer correctness   Is the answer factually correct?      >90%
Faithfulness         Is the answer grounded in context?    >95%
Answer relevance     Does it answer the question?          >90%
Context relevance    Is retrieved context relevant?        >80%

Evaluation Framework

┌─────────────────────────────────────────────────────────┐
│               RAG Evaluation Pipeline                   │
├─────────────────────────────────────────────────────────┤
│ 1. Query Set: Representative questions                  │
│ 2. Ground Truth: Expected answers + source docs         │
│ 3. Metrics:                                             │
│    • Retrieval: Recall@K, MRR, NDCG                     │
│    • Generation: Correctness, Faithfulness              │
│ 4. A/B Testing: Compare configurations                  │
│ 5. Error Analysis: Identify failure patterns            │
└─────────────────────────────────────────────────────────┘

Common Failure Modes

Failure Mode     Cause                 Mitigation
Retrieval miss   Query-doc mismatch    Hybrid search, query expansion
Wrong chunk      Poor chunking         Better segmentation, overlap
Hallucination    Poor grounding        Faithfulness training, citations
Lost context     Long-context issues   Hierarchical assembly, summarization
Stale data       Outdated index        Incremental updates, TTL

Scaling Considerations

Index Scaling

Scale          Approach
<1M docs       Single node, exact search
1-10M docs     Single node, HNSW
10-100M docs   Distributed, sharded
>100M docs     Distributed + aggressive filtering

Latency Budget

Typical RAG Pipeline Latency:

Query embedding:   10-50ms
Vector search:     20-100ms
Reranking:         100-300ms
LLM generation:    500-2000ms
────────────────────────────
Total:             630-2450ms

Target p95: <3 seconds for interactive use

Related Skills

  • llm-serving-patterns - LLM inference infrastructure

  • vector-databases - Vector store selection and optimization

  • ml-system-design - End-to-end ML pipeline design

  • estimation-techniques - Capacity planning for RAG systems

Version History

  • v1.0.0 (2025-12-26): Initial release - RAG architecture patterns for systems design

Last Updated

Date: 2025-12-26
