Chunking Strategies
Purpose: Provide production-ready document chunking implementations, benchmarking tools, and strategy selection guidance for RAG pipelines.
Activation Triggers:
-
Implementing document chunking for RAG
-
Optimizing chunk size and overlap
-
Comparing different chunking strategies
-
Benchmarking chunking performance
-
Processing different document types (markdown, code, PDFs)
-
Evaluating retrieval quality with different chunk strategies
Key Resources:
-
scripts/chunk-fixed-size.py
-
Fixed-size chunking implementation
-
scripts/chunk-semantic.py
-
Semantic chunking with paragraph preservation
-
scripts/chunk-recursive.py
-
Recursive chunking for hierarchical documents
-
scripts/benchmark-chunking.py
-
Benchmark and compare chunking strategies
-
templates/chunking-config.yaml
-
Chunking configuration template
-
templates/custom-splitter.py
-
Template for custom chunking logic
-
examples/chunk-markdown.py
-
Markdown-specific chunking
-
examples/chunk-code.py
-
Source code chunking
-
examples/chunk-pdf.py
-
PDF document chunking
Chunking Strategy Overview
Strategy Selection Guide
Fixed-Size Chunking:
-
Best for: Uniform documents, simple content, consistent structure
-
Pros: Fast, predictable, simple implementation
-
Cons: May split semantic units, no context awareness
-
Use when: Speed matters more than semantic coherence
Semantic Chunking:
-
Best for: Natural language documents, articles, books
-
Pros: Preserves semantic boundaries, better context
-
Cons: Slower, variable chunk sizes
-
Use when: Content has clear paragraph/section structure
Recursive Chunking:
-
Best for: Hierarchical documents, technical docs, code
-
Pros: Preserves structure, handles nested content
-
Cons: Most complex, requires structure detection
-
Use when: Documents have clear hierarchical organization
Sentence-Based Chunking:
-
Best for: Q&A pairs, chatbots, precise retrieval
-
Pros: Natural boundaries, good for citations
-
Cons: Small chunks may lack context
-
Use when: Need precise attribution and citations
Implementation Scripts
- Fixed-Size Chunking
Script: scripts/chunk-fixed-size.py
Usage:
python scripts/chunk-fixed-size.py
--input document.txt
--chunk-size 1000
--overlap 200
--output chunks.json
Parameters:
-
chunk-size : Number of characters per chunk (default: 1000)
-
overlap : Character overlap between chunks (default: 200)
-
split-on : Split on sentences, words, or characters (default: sentences)
Best Practices:
-
Use 500-1000 character chunks for most RAG applications
-
Set overlap to 10-20% of chunk size
-
Split on sentences for better coherence
- Semantic Chunking
Script: scripts/chunk-semantic.py
Usage:
python scripts/chunk-semantic.py
--input document.txt
--max-chunk-size 1500
--output chunks.json
How it works:
-
Detects natural boundaries (paragraphs, headings, line breaks)
-
Groups content while respecting max chunk size
-
Preserves semantic units (paragraphs stay together)
-
Adds context headers for nested sections
Best for: Articles, blog posts, documentation, books
- Recursive Chunking
Script: scripts/chunk-recursive.py
Usage:
python scripts/chunk-recursive.py
--input document.md
--chunk-size 1000
--separators '["\n## ", "\n### ", "\n\n", "\n", " "]'
--output chunks.json
How it works:
-
Tries to split on first separator (e.g., ## headings)
-
If chunks still too large, recursively splits on next separator
-
Continues until all chunks are within size limit
-
Preserves hierarchical context
Separator hierarchy examples:
-
Markdown: ["\n## ", "\n### ", "\n\n", "\n", " "]
-
Python: ["\nclass ", "\ndef ", "\n\n", "\n", " "]
-
General: ["\n\n", "\n", ". ", " "]
Best for: Structured documents, source code, technical manuals
- Benchmark Chunking Strategies
Script: scripts/benchmark-chunking.py
Usage:
python scripts/benchmark-chunking.py
--input document.txt
--strategies fixed,semantic,recursive
--chunk-sizes 500,1000,1500
--output benchmark-results.json
Metrics Evaluated:
-
Processing time: Speed of chunking
-
Chunk count: Total chunks generated
-
Chunk size variance: Consistency of chunk sizes
-
Context preservation: Semantic unit integrity (scored)
-
Retrieval quality: Simulated query performance
Output:
{ "fixed-1000": { "time_ms": 45, "chunk_count": 127, "avg_size": 982, "size_variance": 12.3, "context_score": 0.72 }, "semantic-1000": { "time_ms": 156, "chunk_count": 114, "avg_size": 1087, "size_variance": 234.5, "context_score": 0.91 } }
Configuration Template
Template: templates/chunking-config.yaml
Complete configuration:
chunking:
Global defaults
default_strategy: semantic default_chunk_size: 1000 default_overlap: 200
Strategy-specific configs
strategies: fixed_size: chunk_size: 1000 overlap: 200 split_on: sentence # sentence, word, character
semantic:
max_chunk_size: 1500
min_chunk_size: 200
preserve_paragraphs: true
add_headers: true # Include section headers
recursive:
chunk_size: 1000
overlap: 100
separators:
markdown: ["\\n## ", "\\n### ", "\\n\\n", "\\n", " "]
code: ["\\nclass ", "\\ndef ", "\\n\\n", "\\n", " "]
text: ["\\n\\n", ". ", " "]
Document type mappings
document_types: ".md": semantic ".py": recursive ".txt": fixed_size ".pdf": semantic
Custom Splitter Template
Template: templates/custom-splitter.py
Create your own chunking logic:
from typing import List, Dict import re
class CustomChunker: def init(self, chunk_size: int = 1000, overlap: int = 200): self.chunk_size = chunk_size self.overlap = overlap
def chunk(self, text: str, metadata: Dict = None) -> List[Dict]:
"""
Implement custom chunking logic here.
Returns:
List of chunks with metadata:
[
{
"text": "chunk content",
"metadata": {
"chunk_id": 0,
"source": "document.txt",
"start_char": 0,
"end_char": 1000
}
}
]
"""
chunks = []
# Your custom chunking logic here
# Example: Split on custom pattern
sections = self._split_sections(text)
for i, section in enumerate(sections):
chunks.append({
"text": section,
"metadata": {
"chunk_id": i,
"source": metadata.get("source", "unknown"),
"chunk_size": len(section)
}
})
return chunks
def _split_sections(self, text: str) -> List[str]:
# Implement your splitting logic
pass
Document-Specific Examples
Markdown Chunking
Example: examples/chunk-markdown.py
Features:
-
Preserves heading hierarchy
-
Keeps code blocks together
-
Maintains list structure
-
Adds parent section context to chunks
Usage:
python examples/chunk-markdown.py README.md --output readme-chunks.json
Code Chunking
Example: examples/chunk-code.py
Features:
-
Splits on class and function boundaries
-
Preserves complete functions
-
Includes docstrings with implementations
-
Language-aware separator selection
Supported languages: Python, JavaScript, TypeScript, Java, Go
Usage:
python examples/chunk-code.py src/main.py --language python --output code-chunks.json
PDF Chunking
Example: examples/chunk-pdf.py
Features:
-
Extracts text from PDF
-
Preserves page boundaries
-
Maintains formatting clues
-
Handles multi-column layouts
Dependencies: pypdf , pdfminer.six
Usage:
python examples/chunk-pdf.py research-paper.pdf --strategy semantic --output pdf-chunks.json
Optimization Guidelines
Chunk Size Selection
General recommendations:
Content Type Chunk Size Overlap Strategy
Q&A / FAQs 200-400 50 Sentence
Articles 500-1000 100-200 Semantic
Documentation 1000-1500 200-300 Recursive
Books 1000-2000 300-400 Semantic
Source code 500-1000 100 Recursive
Test with your data: Use benchmark-chunking.py to find optimal settings
Overlap Strategies
Why overlap matters:
-
Prevents information loss at boundaries
-
Improves retrieval of cross-boundary information
-
Balances redundancy vs. coverage
Overlap guidelines:
-
10-15%: Minimal overlap for speed
-
15-20%: Standard overlap for most use cases
-
20-30%: High overlap for critical applications
Performance Optimization
Fast chunking (large documents):
Use fixed-size for speed
python scripts/chunk-fixed-size.py --input large-doc.txt --chunk-size 1000
Quality chunking (smaller documents):
Use semantic for better context
python scripts/chunk-semantic.py --input article.txt --max-chunk-size 1500
Batch processing:
Process multiple files
for file in documents/*.txt; do python scripts/chunk-semantic.py --input "$file" --output "chunks/$(basename $file .txt).json" done
Evaluation Workflow
Step 1: Benchmark Strategies
python scripts/benchmark-chunking.py
--input sample-document.txt
--strategies fixed,semantic,recursive
--chunk-sizes 500,1000,1500
Step 2: Analyze Results
Review metrics:
-
Processing time (prefer < 100ms per document)
-
Context preservation score (target > 0.85)
-
Chunk size variance (lower is more predictable)
Step 3: A/B Test Retrieval
Compare retrieval quality:
-
Chunk same corpus with different strategies
-
Run identical test queries against each
-
Measure precision@k and recall@k
-
Select strategy with best retrieval metrics
Step 4: Production Deployment
Use configuration file:
import yaml from chunking_strategies import get_chunker
config = yaml.safe_load(open('chunking-config.yaml')) chunker = get_chunker(config['chunking']['default_strategy'], config)
chunks = chunker.chunk(document_text)
Common Issues & Solutions
Issue: Chunks too small/large
-
Adjust chunk_size parameter
-
Check document structure (may need different strategy)
-
Verify separator selection for recursive chunking
Issue: Lost context at boundaries
-
Increase overlap
-
Switch to semantic chunking
-
Add parent context to metadata
Issue: Slow processing
-
Use fixed-size chunking for large batches
-
Reduce overlap
-
Process files in parallel
Issue: Poor retrieval quality
-
Benchmark different strategies
-
Increase chunk size
-
Try hybrid approach (semantic + fixed fallback)
Dependencies
Core libraries:
pip install tiktoken # Token counting pip install nltk # Sentence splitting pip install spacy # Advanced NLP (optional)
For PDF support:
pip install pypdf pdfminer.six
For benchmarking:
pip install pandas numpy scikit-learn
Best Practices Summary
-
Start with semantic chunking for most documents
-
Use recursive chunking for structured/hierarchical content
-
Benchmark on your actual data before production
-
Set overlap to 15-20% of chunk size
-
Include metadata (source, page, section) in chunks
-
Test retrieval quality, not just chunking speed
-
Use appropriate chunk size for your embedding model token limit
-
Document your chunking strategy choice and parameters
Supported Strategies: Fixed-Size, Semantic, Recursive, Sentence-Based, Custom Output Format: JSON with text and metadata Version: 1.0.0