RAG Architecture
When to Use This Skill
Use this skill when:
- Designing RAG pipelines for LLM applications
- Choosing chunking and embedding strategies
- Optimizing retrieval quality and relevance
- Building knowledge-grounded AI systems
- Implementing hybrid search (dense + sparse)
- Designing multi-stage retrieval pipelines
Keywords: RAG, retrieval-augmented generation, embeddings, chunking, vector search, semantic search, context window, grounding, knowledge base, hybrid search, reranking, BM25, dense retrieval
RAG Architecture Overview
RAG Pipeline

Ingestion Pipeline ──▶ Indexing Pipeline ──▶ Vector Store (Embeddings)
     Documents          Chunks + Embeddings       Indexed Vectors

Query Processing ──▶ Retrieval Engine ──▶ Context Assembly + Generation
     User Query          Top-K Chunks             LLM Response
Document Ingestion Pipeline
Document Processing Steps
Raw Documents
  → Extract Content      (PDF, HTML, DOCX, Markdown)
  → Clean & Normalize    (remove boilerplate, normalize text)
  → Chunk Documents      (split into retrievable units)
  → Generate Embeddings  (create vector representations)
  → Store in Index       (persist vectors + metadata)
Chunking Strategies
Strategy Comparison
| Strategy | Description | Best For | Chunk Size |
|---|---|---|---|
| Fixed-size | Split by token/character count | Simple documents | 256-512 tokens |
| Sentence-based | Split at sentence boundaries | Narrative text | Variable |
| Paragraph-based | Split at paragraph boundaries | Structured docs | Variable |
| Semantic | Split by topic/meaning | Long documents | Variable |
| Recursive | Hierarchical splitting | Mixed content | Configurable |
| Document-specific | Custom per doc type | Specialized (code, tables) | Variable |
Chunking Decision Tree
What type of content?
├── Code
│   └── AST-based or function-level chunking
├── Tables/Structured
│   └── Keep tables intact, chunk surrounding text
├── Long narrative
│   └── Semantic or recursive chunking
├── Short documents (<1 page)
│   └── Whole document as chunk
└── Mixed content
    └── Recursive with type-specific handlers
Chunk Overlap
Without overlap:
[Chunk 1: "The quick brown"] [Chunk 2: "fox jumps over"]
                            ↑ Information lost at boundary

With overlap (20%):
[Chunk 1: "The quick brown fox"] [Chunk 2: "brown fox jumps over"]
                                ↑ Context preserved across boundaries
Recommended overlap: 10-20% of chunk size
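A minimal sketch of fixed-size chunking with overlap, assuming tiktoken for token counting; any tokenizer with encode/decode methods works the same way, and the 64-token overlap (~12% of a 512-token chunk) sits inside the recommended range.

```python
# Minimal fixed-size chunking sketch with overlap; assumes the tiktoken
# package for token counts (any encode/decode tokenizer would do).
import tiktoken

def chunk_by_tokens(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks, step = [], chunk_size - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(enc.decode(window))
        if start + chunk_size >= len(tokens):
            break  # the last window already covers the tail of the document
    return chunks
```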
Chunk Size Trade-offs
| Smaller Chunks (128-256 tokens) | Larger Chunks (512-1024 tokens) |
|---|---|
| More precise retrieval | More context per chunk |
| Less context per chunk | May include irrelevant content |
| More chunks to search | Fewer chunks to search |
| Better for factoid Q&A | Better for summarization |
| Higher retrieval recall | Higher retrieval precision |
Embedding Models
Model Comparison
| Model | Dimensions | Context | Strengths |
|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | 8K | High quality, expensive |
| OpenAI text-embedding-3-small | 1536 | 8K | Good quality/cost ratio |
| Cohere embed-v3 | 1024 | 512 | Multilingual, fast |
| BGE-large | 1024 | 512 | Open source, competitive |
| E5-large-v2 | 1024 | 512 | Open source, instruction-tuned |
| GTE-large | 1024 | 512 | Alibaba, good for Chinese |
| Sentence-BERT | 768 | 512 | Classic, well-understood |
Embedding Selection
Need best quality, cost OK?
├── Yes → OpenAI text-embedding-3-large
└── No
    └── Need self-hosted/open source?
        ├── Yes → BGE-large or E5-large-v2
        └── No
            └── Need multilingual?
                ├── Yes → Cohere embed-v3
                └── No → OpenAI text-embedding-3-small
Embedding Optimization
| Technique | Description | When to Use |
|---|---|---|
| Matryoshka embeddings | Truncatable to smaller dimensions | Memory-constrained deployments |
| Quantized embeddings | INT8/binary embeddings | Large-scale search |
| Instruction-tuned | Prefix with task instruction | Specialized retrieval |
| Fine-tuned embeddings | Domain-specific training | Specialized domains |
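As a sketch of the instruction-tuned row above: E5-family models expect `query:` / `passage:` prefixes at encode time. This assumes the sentence-transformers library and the `intfloat/e5-large-v2` checkpoint; other model families use different prefix strings.

```python
# Instruction-prefixed embedding sketch; assumes sentence-transformers and
# the intfloat/e5-large-v2 checkpoint (prefixes are model-family specific).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-large-v2")

passages = ["passage: Pods are the smallest deployable units in Kubernetes."]
queries = ["query: How do I deploy a container?"]

# normalize_embeddings=True lets cosine similarity reduce to a dot product
passage_vecs = model.encode(passages, normalize_embeddings=True)
query_vecs = model.encode(queries, normalize_embeddings=True)

scores = query_vecs @ passage_vecs.T  # similarity matrix: queries x passages
```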
Retrieval Strategies
Dense Retrieval (Semantic Search)
Query: "How to deploy containers" │ ▼ ┌─────────┐ │ Embed │ │ Query │ └─────────┘ │ ▼ ┌─────────────────────────────────┐ │ Vector Similarity Search │ │ (Cosine, Dot Product, L2) │ └─────────────────────────────────┘ │ ▼ Top-K semantically similar chunks
Sparse Retrieval (BM25/TF-IDF)
Query: "Kubernetes pod deployment YAML" │ ▼ ┌─────────┐ │Tokenize │ │ + Score │ └─────────┘ │ ▼ ┌─────────────────────────────────┐ │ BM25 Ranking │ │ (Term frequency × IDF) │ └─────────────────────────────────┘ │ ▼ Top-K lexically matching chunks
Hybrid Search (Best of Both)
Query ──┬──▶ Dense Search  ──┬──▶ Fusion ──▶ Final Ranking
        │                    │
        └──▶ Sparse Search ──┘

Fusion methods:
- RRF (Reciprocal Rank Fusion)
- Linear combination
- Learned reranking
Reciprocal Rank Fusion (RRF)
RRF(d) = Σ_i 1 / (k + rank_i(d))

Where:
- k = constant (typically 60)
- rank_i(d) = rank of document d in retriever i's result list
Example:

Doc A: Dense rank=1, Sparse rank=5
RRF(A) = 1/(60+1) + 1/(60+5) = 0.0164 + 0.0154 = 0.0318

Doc B: Dense rank=3, Sparse rank=1
RRF(B) = 1/(60+3) + 1/(60+1) = 0.0159 + 0.0164 = 0.0323
Result: Doc B ranks higher (better combined relevance)
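A small fusion helper that reproduces the worked example above; the document IDs and the two ranked lists are illustrative.

```python
# Reciprocal Rank Fusion sketch; ranked lists are document IDs ordered
# best-first, and k defaults to the conventional value of 60.
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Doc A: dense rank 1, sparse rank 5; Doc B: dense rank 3, sparse rank 1
dense = ["A", "C", "B"]
sparse = ["B", "C", "D", "E", "A"]
print(reciprocal_rank_fusion([dense, sparse]))  # B (~0.0323) ranks ahead of A (~0.0318)
```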
Multi-Stage Retrieval
Two-Stage Pipeline
Stage 1: Recall (fast, high recall)
- ANN search (HNSW, IVF)
- Retrieve top-100 candidates
- Latency: 10-50ms
        │
        ▼
Stage 2: Rerank (slow, high precision)
- Cross-encoder or LLM reranking
- Score top-100 → return top-10
- Latency: 100-500ms
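A sketch of the two-stage shape, assuming a hypothetical `ann_search` callable for stage 1 and the sentence-transformers CrossEncoder with an assumed ms-marco checkpoint for stage 2.

```python
# Two-stage retrieval sketch: ann_search() is a hypothetical stage-1 ANN
# lookup; the cross-encoder checkpoint is an assumption for illustration.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_and_rerank(query: str, ann_search, top_n: int = 100, final_k: int = 10):
    candidates = ann_search(query, top_n)          # stage 1: fast, high recall
    pairs = [(query, chunk.text) for chunk in candidates]
    scores = reranker.predict(pairs)               # stage 2: slow, high precision
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [chunk for chunk, _ in ranked[:final_k]]
```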
Reranking Options
| Reranker | Latency | Quality | Cost |
|---|---|---|---|
| Cross-encoder (local) | Medium | High | Compute |
| Cohere Rerank | Fast | High | API cost |
| LLM-based rerank | Slow | Highest | High API cost |
| BGE-reranker | Fast | Good | Compute |
Context Assembly
Context Window Management
Context budget: 128K tokens
├── System prompt: 500 tokens (fixed)
├── Conversation history: 4K tokens (sliding window)
├── Retrieved context: 8K tokens (dynamic)
└── Generation buffer: ~115K tokens (available)
Strategy: Maximize retrieved context quality within budget
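A minimal budget-packing sketch; `count_tokens` is a hypothetical tokenizer wrapper (for example around tiktoken), and the 8K default matches the retrieved-context line in the budget above.

```python
# Context assembly sketch: pack the highest-scoring chunks into a fixed
# token budget; count_tokens is a hypothetical tokenizer wrapper.
def assemble_context(chunks, count_tokens, budget_tokens: int = 8000) -> str:
    """chunks: (score, text) pairs; returns chunks joined best-first within budget."""
    selected, used = [], 0
    for score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        cost = count_tokens(text)
        if used + cost > budget_tokens:
            continue  # skip chunks that would exceed the budget
        selected.append(text)
        used += cost
    return "\n\n---\n\n".join(selected)
```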
Context Assembly Strategies
| Strategy | Description | When to Use |
|---|---|---|
| Simple concatenation | Join top-K chunks | Small context, simple Q&A |
| Relevance-ordered | Most relevant first | General retrieval |
| Chronological | Time-ordered | Temporal queries |
| Hierarchical | Summary + details | Long-form generation |
| Interleaved | Mix sources | Multi-source queries |
Lost-in-the-Middle Problem
LLM attention over a long context:

Beginning          Middle             End
████               ░░░░               ████
High attention     Low attention      High attention
Mitigation:
- Put the most relevant chunks at the beginning AND end (see the reordering sketch after this list)
- Use shorter context windows when possible
- Use hierarchical summarization
- Fine-tune for long-context attention
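A small reordering sketch for the first mitigation above: the strongest chunks go at both ends of the context and the weakest are buried in the middle (similar in spirit to long-context reordering postprocessors).

```python
# Lost-in-the-middle mitigation sketch: alternate ranked chunks between the
# front and back of the context so the best material sits at both ends.
def reorder_for_attention(chunks_best_first: list[str]) -> list[str]:
    front, back = [], []
    for i, chunk in enumerate(chunks_best_first):
        # Ranks 1, 3, 5, ... go to the front; ranks 2, 4, 6, ... go to the back
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# Ranked best-first: ["c1", "c2", "c3", "c4", "c5"] → ["c1", "c3", "c5", "c4", "c2"]
```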
Advanced RAG Patterns
Query Transformation
Original Query: "Tell me about the project" │ ┌─────────────────┼─────────────────┐ ▼ ▼ ▼ ┌─────────┐ ┌──────────┐ ┌──────────┐ │ HyDE │ │ Query │ │ Sub-query│ │ (Hypo │ │ Expansion│ │ Decomp. │ │ Doc) │ │ │ │ │ └─────────┘ └──────────┘ └──────────┘ │ │ │ ▼ ▼ ▼ Hypothetical "project, "What is the answer to goals, project scope?" embed timeline, "What are the deliverables" deliverables?"
HyDE (Hypothetical Document Embeddings)
Query: "How does photosynthesis work?" │ ▼ ┌───────────────┐ │ LLM generates │ │ hypothetical │ │ answer │ └───────────────┘ │ ▼ "Photosynthesis is the process by which plants convert sunlight into energy..." │ ▼ ┌───────────────┐ │ Embed hypo │ │ document │ └───────────────┘ │ ▼ Search with hypothetical embedding (Better matches actual documents)
Self-RAG (Retrieval-Augmented LM with Self-Reflection)
1. Generate initial response
2. Decide: need more retrieval? (critique token)
   ├── Yes → retrieve more, regenerate
   └── No → check factuality (isRel, isSup tokens)
3. Verify claims against sources
4. Regenerate if needed
5. Return verified response
Agentic RAG
Query: "Compare Q3 revenue across regions" │ ▼ ┌───────────────┐ │ Query Agent │ │ (Plan steps) │ └───────────────┘ │ ┌───────────┼───────────┐ ▼ ▼ ▼ ┌───────┐ ┌───────┐ ┌───────┐ │Search │ │Search │ │Search │ │ EMEA │ │ APAC │ │ AMER │ │ docs │ │ docs │ │ docs │ └───────┘ └───────┘ └───────┘ │ │ │ └───────────┼───────────┘ ▼ ┌───────────────┐ │ Synthesize │ │ Comparison │ └───────────────┘
Evaluation Metrics
Retrieval Metrics
| Metric | Description | Target |
|---|---|---|
| Recall@K | % of relevant docs retrieved in top-K | >80% |
| Precision@K | % of top-K that are relevant | >60% |
| MRR (Mean Reciprocal Rank) | 1/rank of the first relevant doc | >0.5 |
| NDCG | Graded relevance ranking | >0.7 |
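A minimal sketch of computing Recall@K and MRR over a labeled query set; `retrieved` is a ranked list of document IDs and `relevant` is the set of IDs judged relevant for that query.

```python
# Retrieval-metric sketch: Recall@K for one query, MRR over a query set.
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / max(len(relevant), 1)

def mean_reciprocal_rank(results: list[tuple[list[str], set[str]]]) -> float:
    total = 0.0
    for retrieved, relevant in results:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts toward MRR
    return total / max(len(results), 1)
```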
End-to-End Metrics
| Metric | Description | Target |
|---|---|---|
| Answer correctness | Is the answer factually correct? | >90% |
| Faithfulness | Is the answer grounded in the retrieved context? | >95% |
| Answer relevance | Does it answer the question? | >90% |
| Context relevance | Is the retrieved context relevant? | >80% |
Evaluation Framework
RAG Evaluation Pipeline
1. Query set: representative questions
2. Ground truth: expected answers + source docs
3. Metrics:
   - Retrieval: Recall@K, MRR, NDCG
   - Generation: correctness, faithfulness
4. A/B testing: compare configurations
5. Error analysis: identify failure patterns
Common Failure Modes
| Failure Mode | Cause | Mitigation |
|---|---|---|
| Retrieval miss | Query-document mismatch | Hybrid search, query expansion |
| Wrong chunk | Poor chunking | Better segmentation, overlap |
| Hallucination | Poor grounding | Faithfulness training, citations |
| Lost context | Long-context issues | Hierarchical assembly, summarization |
| Stale data | Outdated index | Incremental updates, TTL |
Scaling Considerations
Index Scaling
| Scale | Approach |
|---|---|
| <1M docs | Single node, exact search |
| 1-10M docs | Single node, HNSW |
| 10-100M docs | Distributed, sharded |
| >100M docs | Distributed + aggressive filtering |
Latency Budget
Typical RAG Pipeline Latency:
Query embedding:   10-50ms
Vector search:     20-100ms
Reranking:         100-300ms
LLM generation:    500-2000ms
─────────────────────────────
Total:             630-2450ms
Target p95: <3 seconds for interactive use
Related Skills
- llm-serving-patterns - LLM inference infrastructure
- vector-databases - Vector store selection and optimization
- ml-system-design - End-to-end ML pipeline design
- estimation-techniques - Capacity planning for RAG systems
Version History
- v1.0.0 (2025-12-26): Initial release - RAG architecture patterns for systems design
Last Updated
Date: 2025-12-26