rag-pipeline-builder

Design end-to-end RAG pipelines for accurate document retrieval and generation.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install the "rag-pipeline-builder" skill with: npx skills add monkey1sai/openai-cli/monkey1sai-openai-cli-rag-pipeline-builder

RAG Pipeline Builder


Pipeline Architecture

Documents → Chunking → Embedding → Vector Store → Retrieval → Reranking → Generation
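
Assembled end to end, these stages reduce to a single answer function. A minimal sketch, assuming the helpers defined in the sections below (hybrid_retrieval, rerank_results, fit_context_window) plus an llm() call that returns the model's text:

def answer(query: str) -> str:
    # Retrieve wide, rerank narrow, then pack what fits into the prompt
    candidates = hybrid_retrieval(query, k=10)
    top_docs = rerank_results(query, candidates, top_k=5)
    context_docs = fit_context_window(top_docs, max_tokens=4000)
    context = "\n\n".join(doc.page_content for doc in context_docs)
    prompt = f"Answer using only this context:\n\n{context}\n\nQuestion: {query}"
    return llm(prompt)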

Chunking Strategy

Recursive character chunking (recommended)

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,      # Characters per chunk
    chunk_overlap=200,    # Overlap between chunks
    separators=["\n\n", "\n", ". ", " ", ""],
    length_function=len,
)

chunks = splitter.split_text(document.text)

Add metadata to each chunk

for i, chunk in enumerate(chunks):
    chunks[i] = {
        "text": chunk,
        "metadata": {
            "source": document.filename,
            "page": calculate_page(i),
            "chunk_id": f"{document.id}_chunk_{i}",
        },
    }

Metadata Schema

interface ChunkMetadata {
  // Source information
  document_id: string;
  source: string;
  url?: string;

  // Location
  page?: number;
  section?: string;
  chunk_index: number;

  // Content classification
  content_type: "text" | "code" | "table" | "list";
  language?: string;

  // Timestamps
  created_at: Date;
  updated_at: Date;

  // Retrieval optimization
  keywords: string[];
  summary?: string;
  importance_score?: number;
}

Vector Store Setup

Pinecone example

import pinecone
from langchain.vectorstores import Pinecone
from langchain.embeddings import OpenAIEmbeddings

pinecone.init(api_key="...", environment="...")

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Pinecone.from_documents(
    documents=chunks,
    embedding=embeddings,
    index_name="knowledge-base",
    namespace="production",
)
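
Once the index is populated, a quick sanity query confirms the store is wired up (the query text here is illustrative):

docs = vectorstore.similarity_search("How do I authenticate users?", k=3)
for doc in docs:
    print(doc.metadata["source"], doc.page_content[:80])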

Retrieval Strategies

Hybrid search (dense + sparse)

def hybrid_retrieval(query: str, k: int = 5):
    # Dense retrieval (semantic)
    dense_results = vectorstore.similarity_search(query, k=k*2)

    # Sparse retrieval (keyword - BM25)
    sparse_results = bm25_search(query, k=k*2)

    # Combine and rerank
    combined = reciprocal_rank_fusion(dense_results, sparse_results)

    return combined[:k]
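
Both bm25_search and reciprocal_rank_fusion are used above but not defined in the upstream skill. Minimal sketches follow, assuming the rank_bm25 package for the sparse side and assuming corpus_docs holds the same chunk Documents that were embedded:

from rank_bm25 import BM25Okapi

# Sparse index over the same chunks that were embedded; BM25Okapi
# expects a pre-tokenized corpus (corpus_docs is an assumed name).
tokenized_corpus = [doc.page_content.lower().split() for doc in corpus_docs]
bm25_index = BM25Okapi(tokenized_corpus)

def bm25_search(query: str, k: int = 10):
    # get_top_n returns the k highest-scoring corpus entries for the query
    return bm25_index.get_top_n(query.lower().split(), corpus_docs, n=k)

def reciprocal_rank_fusion(dense_results, sparse_results, k: int = 60):
    # Standard RRF: each result list contributes 1 / (k + rank) per document,
    # keyed on the chunk_id metadata defined in the schema above
    scores, docs = {}, {}
    for results in (dense_results, sparse_results):
        for rank, doc in enumerate(results):
            doc_id = doc.metadata["chunk_id"]
            docs[doc_id] = doc
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [docs[doc_id] for doc_id in ranked]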

Metadata filtering

results = vectorstore.similarity_search(
    query,
    k=5,
    filter={
        "content_type": "code",
        "language": "python",
    },
)

Reranking

from typing import List

from langchain.schema import Document
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank_results(query: str, results: List[Document], top_k: int = 3):
    # Score each result against the query
    pairs = [(query, doc.page_content) for doc in results]
    scores = reranker.predict(pairs)

    # Sort by score
    scored_results = list(zip(results, scores))
    scored_results.sort(key=lambda x: x[1], reverse=True)

    return [doc for doc, score in scored_results[:top_k]]
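
Typical usage pairs a wide retrieval pass with a narrow rerank, for example:

# Retrieve a broad candidate set, then let the cross-encoder pick the best 3
candidates = hybrid_retrieval("How do I authenticate users?", k=10)
top_docs = rerank_results("How do I authenticate users?", candidates, top_k=3)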

Query Enhancement

Query expansion

import json

def expand_query(query: str) -> List[str]:
    expansion_prompt = f"""
    Generate 3 alternative phrasings of this query: "{query}"

    Return as JSON array of strings.
    """
    # llm() is assumed to return the raw text completion; parse it as JSON
    alternatives = json.loads(llm(expansion_prompt))
    return [query] + alternatives

Multi-query retrieval

def multi_query_retrieval(query: str, k: int = 5):
    queries = expand_query(query)
    all_results = []

    for q in queries:
        results = vectorstore.similarity_search(q, k=k)
        all_results.extend(results)

    # Deduplicate and rerank
    unique_results = deduplicate(all_results)
    return rerank_results(query, unique_results, k)
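
deduplicate is also left undefined upstream; a minimal sketch that keys on the chunk_id metadata from the schema above:

def deduplicate(docs: List[Document]) -> List[Document]:
    # Keep the first occurrence of each chunk, preserving retrieval order
    seen = set()
    unique = []
    for doc in docs:
        doc_id = doc.metadata["chunk_id"]
        if doc_id not in seen:
            seen.add(doc_id)
            unique.append(doc)
    return unique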

Evaluation Plan

Define golden dataset

golden_dataset = [
    {
        "query": "How do I authenticate users?",
        "expected_docs": ["auth_guide.md", "user_management.md"],
        "relevant_chunks": ["chunk_123", "chunk_456"],
    },
]

Metrics

def evaluate_retrieval(dataset):
    results = {
        "precision": [],
        "recall": [],
        "mrr": [],  # Mean Reciprocal Rank
    }

    for item in dataset:
        retrieved = retrieval_fn(item["query"])
        retrieved_ids = [doc.metadata["chunk_id"] for doc in retrieved]

        # Calculate metrics
        relevant = set(item["relevant_chunks"])
        retrieved_set = set(retrieved_ids)

        precision = len(relevant & retrieved_set) / len(retrieved_set) if retrieved_set else 0.0
        recall = len(relevant & retrieved_set) / len(relevant)

        # Reciprocal rank of the first relevant hit (0 if none retrieved)
        rr = next(
            (1 / (rank + 1) for rank, doc_id in enumerate(retrieved_ids)
             if doc_id in relevant),
            0.0,
        )

        results["precision"].append(precision)
        results["recall"].append(recall)
        results["mrr"].append(rr)

    # NDCG (Normalized Discounted Cumulative Gain) can be added the same way
    return {k: sum(v) / len(v) for k, v in results.items()}
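
evaluate_retrieval reads retrieval_fn as a module-level name; to establish a baseline, bind it to the pipeline under test and run the golden set:

def retrieval_fn(query: str):
    return multi_query_retrieval(query, k=5)

baseline = evaluate_retrieval(golden_dataset)
print(baseline)  # averaged precision, recall, and MRR across the dataset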

Context Window Management

def fit_context_window(chunks: List[Document], max_tokens: int = 4000):
    """Select chunks that fit in the context window."""
    total_tokens = 0
    selected_chunks = []

    for chunk in chunks:
        chunk_tokens = count_tokens(chunk.page_content)
        if total_tokens + chunk_tokens <= max_tokens:
            selected_chunks.append(chunk)
            total_tokens += chunk_tokens
        else:
            break

    return selected_chunks
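
count_tokens is assumed above; a minimal sketch using tiktoken, assuming an OpenAI-style tokenizer matches the target model:

import tiktoken

_encoding = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    return len(_encoding.encode(text))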

Best Practices

  • Chunk size: 500-1000 chars for general text

  • Overlap: 10-20% overlap between chunks

  • Metadata: Rich metadata for filtering

  • Hybrid search: Combine semantic + keyword

  • Reranking: Cross-encoder for final ranking

  • Evaluation: Golden dataset with metrics

  • Context management: Don't exceed model limits

Output Checklist

  • Chunking strategy defined

  • Metadata schema documented

  • Vector store configured

  • Retrieval algorithm implemented

  • Reranking pipeline added

  • Query enhancement (optional)

  • Context window management

  • Evaluation dataset created

  • Metrics implementation

  • Performance baseline established
