# RAG Implementation

Master Retrieval-Augmented Generation (RAG) to build LLM applications that provide accurate, grounded responses using external knowledge sources.
## Use this skill when

- Building Q&A systems over proprietary documents
- Creating chatbots with current, factual information
- Implementing semantic search with natural language queries
- Reducing hallucinations with grounded responses
- Enabling LLMs to access domain-specific knowledge
- Building documentation assistants
- Creating research tools with source citation
## Do not use this skill when

- You only need purely generative writing without retrieval
- The dataset is too small to justify embeddings
- You cannot store or process the source data safely
## Instructions

- Define the corpus, update cadence, and evaluation targets.
- Choose embedding models and a vector store based on scale.
- Build ingestion, chunking, and retrieval with reranking.
- Evaluate with grounded QA metrics and monitor drift.
## Safety

- Redact sensitive data and enforce access controls.
- Avoid exposing source documents in responses when restricted.
## Core Components

### Vector Databases

Purpose: Store and retrieve document embeddings efficiently.

Options:

- Pinecone: Managed, scalable, fast queries
- Weaviate: Open-source, hybrid search
- Milvus: High performance, on-premise
- Chroma: Lightweight, easy to use
- Qdrant: Fast, filtered search
- FAISS: Meta's library, local deployment
### Embeddings

Purpose: Convert text to numerical vectors for similarity search (see the sketch after this list).

Models:

- text-embedding-ada-002 (OpenAI): General purpose, 1536 dims
- all-MiniLM-L6-v2 (Sentence Transformers): Fast, lightweight
- e5-large-v2: High quality, multilingual
- Instructor: Task-specific instructions
- bge-large-en-v1.5: Strong performance on retrieval benchmarks
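To make the idea concrete, here is a minimal sketch of embedding-based similarity using the all-MiniLM-L6-v2 model listed above; the example sentences and query are illustrative.

```python
# Minimal sketch: embed texts and rank them by cosine similarity.
# Assumes the sentence-transformers package is installed; texts are illustrative.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "RAG grounds LLM answers in retrieved documents.",
    "The weather in Paris is mild in spring.",
]
query = "How does retrieval-augmented generation reduce hallucinations?"

# normalize_embeddings=True makes the dot product equal cosine similarity
doc_vecs = model.encode(docs, normalize_embeddings=True)
query_vec = model.encode(query, normalize_embeddings=True)

scores = doc_vecs @ query_vec  # cosine similarities
for doc, score in sorted(zip(docs, scores), key=lambda x: x[1], reverse=True):
    print(f"{score:.3f}  {doc}")
```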
### Retrieval Strategies

Approaches:

- Dense Retrieval: Semantic similarity via embeddings
- Sparse Retrieval: Keyword matching (BM25, TF-IDF)
- Hybrid Search: Combine dense + sparse
- Multi-Query: Generate multiple query variations
- HyDE: Generate hypothetical documents and search with their embeddings (sketched below)
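HyDE is easiest to see as three steps: ask the LLM for a hypothetical answer, embed that answer instead of the raw question, and search with the resulting vector. A minimal sketch using the classic LangChain API follows; the prompt wording and variable names are illustrative, and `vectorstore` is the store built in the Quick Start below.

```python
# Sketch of HyDE (Hypothetical Document Embeddings); prompt and names are illustrative.
from langchain.llms import OpenAI
from langchain.embeddings import OpenAIEmbeddings

llm = OpenAI()
embeddings = OpenAIEmbeddings()

question = "What are the main features?"

# 1. Ask the LLM to write a hypothetical passage that would answer the question
hypothetical_doc = llm(f"Write a short passage that answers: {question}")

# 2. Embed the hypothetical passage instead of the raw question
hyde_vector = embeddings.embed_query(hypothetical_doc)

# 3. Search the vector store with that embedding (vectorstore from the Quick Start)
results = vectorstore.similarity_search_by_vector(hyde_vector, k=5)
```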
### Reranking

Purpose: Improve retrieval quality by reordering results.

Methods:

- Cross-Encoders: BERT-based reranking
- Cohere Rerank: API-based reranking
- Maximal Marginal Relevance (MMR): Diversity + relevance
- LLM-based: Use an LLM to score relevance (sketched below)
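Cross-encoder reranking is shown later under Retrieval Optimization. For the LLM-based variant there is no fixed API; a minimal sketch might look like the following, where the scoring prompt and the 0-10 scale are assumptions, not an established interface.

```python
# Sketch of LLM-based reranking; the scoring prompt and scale are illustrative.
from langchain.llms import OpenAI

llm = OpenAI()

def llm_rerank(query, docs, top_k=5):
    scored = []
    for doc in docs:
        prompt = (
            "Rate the relevance of the passage to the question on a 0-10 scale. "
            "Answer with only the number.\n"
            f"Question: {query}\nPassage: {doc.page_content}\nScore:"
        )
        try:
            score = float(llm(prompt).strip())
        except ValueError:
            score = 0.0  # an unparseable answer counts as irrelevant
        scored.append((doc, score))
    return [doc for doc, _ in sorted(scored, key=lambda x: x[1], reverse=True)[:top_k]]
```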
## Quick Start

```python
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

# 1. Load documents
loader = DirectoryLoader('./docs', glob="**/*.txt")
documents = loader.load()

# 2. Split into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len
)
chunks = text_splitter.split_documents(documents)

# 3. Create embeddings and vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(chunks, embeddings)

# 4. Create retrieval chain
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
    return_source_documents=True
)

# 5. Query
result = qa_chain({"query": "What are the main features?"})
print(result['result'])
print(result['source_documents'])
```
## Advanced RAG Patterns

### Pattern 1: Hybrid Search

```python
from langchain.retrievers import BM25Retriever, EnsembleRetriever

# Sparse retriever (BM25)
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 5

# Dense retriever (embeddings)
embedding_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Combine with weights
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, embedding_retriever],
    weights=[0.3, 0.7]
)
```
### Pattern 2: Multi-Query Retrieval

```python
from langchain.retrievers.multi_query import MultiQueryRetriever

# Generate multiple query perspectives
retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(),
    llm=OpenAI()
)

# Single query → multiple variations → combined results
results = retriever.get_relevant_documents("What is the main topic?")
```
### Pattern 3: Contextual Compression

```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain.llms import OpenAI

llm = OpenAI()
compressor = LLMChainExtractor.from_llm(llm)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever()
)

# Returns only the relevant parts of each retrieved document
compressed_docs = compression_retriever.get_relevant_documents("query")
```
### Pattern 4: Parent Document Retriever

```python
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore

# Store for parent documents
store = InMemoryStore()

# Small chunks for retrieval, large chunks for context
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter
)
```
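Unlike the other retrievers, `ParentDocumentRetriever` indexes documents through its own `add_documents` method rather than reusing a pre-populated vector store; a short usage sketch (the query string is illustrative):

```python
# Index documents through the retriever so child chunks point back to parents
retriever.add_documents(documents)

# Queries match the small child chunks but return the larger parent chunks
parent_docs = retriever.get_relevant_documents("What are the main features?")
```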
## Document Chunking Strategies

### Recursive Character Text Splitter

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    separators=["\n\n", "\n", " ", ""]  # Try these in order
)
```

### Token-Based Splitting

```python
from langchain.text_splitter import TokenTextSplitter

splitter = TokenTextSplitter(
    chunk_size=512,
    chunk_overlap=50
)
```

### Semantic Chunking

```python
from langchain_experimental.text_splitter import SemanticChunker

splitter = SemanticChunker(
    embeddings=OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile"
)
```

### Markdown Header Splitter

```python
from langchain.text_splitter import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
```
## Vector Store Configurations

### Pinecone

```python
import pinecone
from langchain.vectorstores import Pinecone

pinecone.init(api_key="your-api-key", environment="us-west1-gcp")
index = pinecone.Index("your-index-name")

vectorstore = Pinecone(index, embeddings.embed_query, "text")
```

### Weaviate

```python
import weaviate
from langchain.vectorstores import Weaviate

client = weaviate.Client("http://localhost:8080")
vectorstore = Weaviate(client, "Document", "content", embeddings)
```

### Chroma (Local)

```python
from langchain.vectorstores import Chroma

vectorstore = Chroma(
    collection_name="my_collection",
    embedding_function=embeddings,
    persist_directory="./chroma_db"
)
```
## Retrieval Optimization

### Metadata Filtering

```python
# Add metadata during indexing
# (determine_category is a placeholder for your own tagging logic)
chunks_with_metadata = []
for i, chunk in enumerate(chunks):
    chunk.metadata = {
        "source": chunk.metadata.get("source"),
        "page": i,
        "category": determine_category(chunk.page_content)
    }
    chunks_with_metadata.append(chunk)

# Filter during retrieval
results = vectorstore.similarity_search(
    "query",
    filter={"category": "technical"},
    k=5
)
```
### Maximal Marginal Relevance

```python
# Balance relevance with diversity
results = vectorstore.max_marginal_relevance_search(
    "query",
    k=5,
    fetch_k=20,      # Fetch 20 candidates, return the top 5 diverse ones
    lambda_mult=0.5  # 0 = max diversity, 1 = max relevance
)
```
### Reranking with Cross-Encoder

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

query = "query"

# Get initial results
candidates = vectorstore.similarity_search(query, k=20)

# Rerank: score each (query, document) pair
pairs = [[query, doc.page_content] for doc in candidates]
scores = reranker.predict(pairs)

# Sort by score and take the top k
reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)[:5]
```
## Prompt Engineering for RAG

### Contextual Prompt

```python
prompt_template = """Use the following context to answer the question. If you cannot answer based on the context, say "I don't have enough information."

Context: {context}

Question: {question}

Answer:"""
```

### With Citations

```python
prompt_template = """Answer the question based on the context below. Include citations using [1], [2], etc.

Context: {context}

Question: {question}

Answer (with citations):"""
```

### With Confidence

```python
prompt_template = """Answer the question using the context. Provide a confidence score (0-100%) for your answer.

Context: {context}

Question: {question}

Answer:

Confidence:"""
```
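To wire one of these templates into the Quick Start chain, the classic LangChain `RetrievalQA` API accepts a prompt through `chain_type_kwargs`. A minimal sketch, assuming the contextual template above is stored in `prompt_template`:

```python
from langchain.prompts import PromptTemplate

# Wrap the raw template string so the chain can fill in context and question
PROMPT = PromptTemplate(
    template=prompt_template,
    input_variables=["context", "question"]
)

qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
    chain_type_kwargs={"prompt": PROMPT},
    return_source_documents=True
)
```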
Evaluation Metrics
def evaluate_rag_system(qa_chain, test_cases): metrics = { 'accuracy': [], 'retrieval_quality': [], 'groundedness': [] }
for test in test_cases:
result = qa_chain({"query": test['question']})
# Check if answer matches expected
accuracy = calculate_accuracy(result['result'], test['expected'])
metrics['accuracy'].append(accuracy)
# Check if relevant docs were retrieved
retrieval_quality = evaluate_retrieved_docs(
result['source_documents'],
test['relevant_docs']
)
metrics['retrieval_quality'].append(retrieval_quality)
# Check if answer is grounded in context
groundedness = check_groundedness(
result['result'],
result['source_documents']
)
metrics['groundedness'].append(groundedness)
return {k: sum(v)/len(v) for k, v in metrics.items()}
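The scoring functions are left to the implementer. As one illustrative sketch, `check_groundedness` could be a token-overlap heuristic between the answer and the retrieved sources; production systems more often use an LLM or NLI model as judge.

```python
def check_groundedness(answer, source_documents):
    """Illustrative heuristic: fraction of answer tokens that appear in the sources."""
    answer_tokens = set(answer.lower().split())
    source_text = " ".join(doc.page_content.lower() for doc in source_documents)
    source_tokens = set(source_text.split())
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & source_tokens) / len(answer_tokens)
```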
## Resources

- references/vector-databases.md: Detailed comparison of vector DBs
- references/embeddings.md: Embedding model selection guide
- references/retrieval-strategies.md: Advanced retrieval techniques
- references/reranking.md: Reranking methods and when to use them
- references/context-window.md: Managing context limits
- assets/vector-store-config.yaml: Configuration templates
- assets/retriever-pipeline.py: Complete RAG pipeline
- assets/embedding-models.md: Model comparison and benchmarks
## Best Practices

- Chunk Size: Balance context and specificity (500-1000 tokens)
- Overlap: Use 10-20% overlap to preserve context at chunk boundaries
- Metadata: Include source, page, and timestamp for filtering and debugging
- Hybrid Search: Combine semantic and keyword search for the best results
- Reranking: Improve top results with a cross-encoder
- Citations: Always return source documents for transparency
- Evaluation: Continuously test retrieval quality and answer accuracy
- Monitoring: Track retrieval metrics in production
## Common Issues

- Poor Retrieval: Check embedding quality, chunk size, and query formulation
- Irrelevant Results: Add metadata filtering, use hybrid search, rerank
- Missing Information: Ensure documents are properly indexed
- Slow Queries: Optimize the vector store, use caching, reduce k
- Hallucinations: Strengthen the grounding prompt, add a verification step