Chunking Strategies

Purpose: Provide production-ready document chunking implementations, benchmarking tools, and strategy selection guidance for RAG pipelines.

Activation Triggers:

Implementing document chunking for RAG
Optimizing chunk size and overlap
Comparing different chunking strategies
Benchmarking chunking performance
Processing different document types (markdown, code, PDFs)
Evaluating retrieval quality with different chunk strategies

Key Resources:

scripts/chunk-fixed-size.py
Fixed-size chunking implementation
scripts/chunk-semantic.py
Semantic chunking with paragraph preservation
scripts/chunk-recursive.py
Recursive chunking for hierarchical documents
scripts/benchmark-chunking.py
Benchmark and compare chunking strategies
templates/chunking-config.yaml
Chunking configuration template
templates/custom-splitter.py
Template for custom chunking logic
examples/chunk-markdown.py
Markdown-specific chunking
examples/chunk-code.py
Source code chunking
examples/chunk-pdf.py
PDF document chunking

Chunking Strategy Overview

Strategy Selection Guide

Fixed-Size Chunking:

Best for: Uniform documents, simple content, consistent structure
Pros: Fast, predictable, simple implementation
Cons: May split semantic units, no context awareness
Use when: Speed matters more than semantic coherence

Semantic Chunking:

Best for: Natural language documents, articles, books
Pros: Preserves semantic boundaries, better context
Cons: Slower, variable chunk sizes
Use when: Content has clear paragraph/section structure

Recursive Chunking:

Best for: Hierarchical documents, technical docs, code
Pros: Preserves structure, handles nested content
Cons: Most complex, requires structure detection
Use when: Documents have clear hierarchical organization

Sentence-Based Chunking:

Best for: Q&A pairs, chatbots, precise retrieval
Pros: Natural boundaries, good for citations
Cons: Small chunks may lack context
Use when: Need precise attribution and citations

Implementation Scripts

Fixed-Size Chunking

Script: scripts/chunk-fixed-size.py

Usage:

python scripts/chunk-fixed-size.py
--input document.txt
--chunk-size 1000
--overlap 200
--output chunks.json

Parameters:

chunk-size : Number of characters per chunk (default: 1000)
overlap : Character overlap between chunks (default: 200)
split-on : Split on sentences, words, or characters (default: sentences)

Best Practices:

Use 500-1000 character chunks for most RAG applications
Set overlap to 10-20% of chunk size
Split on sentences for better coherence

Semantic Chunking

Script: scripts/chunk-semantic.py

Usage:

python scripts/chunk-semantic.py
--input document.txt
--max-chunk-size 1500
--output chunks.json

How it works:

Detects natural boundaries (paragraphs, headings, line breaks)
Groups content while respecting max chunk size
Preserves semantic units (paragraphs stay together)
Adds context headers for nested sections

Best for: Articles, blog posts, documentation, books

Recursive Chunking

Script: scripts/chunk-recursive.py

Usage:

python scripts/chunk-recursive.py
--input document.md
--chunk-size 1000
--separators '["\n## ", "\n### ", "\n\n", "\n", " "]'
--output chunks.json

How it works:

Tries to split on first separator (e.g., ## headings)
If chunks still too large, recursively splits on next separator
Continues until all chunks are within size limit
Preserves hierarchical context

Separator hierarchy examples:

Markdown: ["\n## ", "\n### ", "\n\n", "\n", " "]
Python: ["\nclass ", "\ndef ", "\n\n", "\n", " "]
General: ["\n\n", "\n", ". ", " "]

Best for: Structured documents, source code, technical manuals

Benchmark Chunking Strategies

Script: scripts/benchmark-chunking.py

Usage:

python scripts/benchmark-chunking.py
--input document.txt
--strategies fixed,semantic,recursive
--chunk-sizes 500,1000,1500
--output benchmark-results.json

Metrics Evaluated:

Processing time: Speed of chunking
Chunk count: Total chunks generated
Chunk size variance: Consistency of chunk sizes
Context preservation: Semantic unit integrity (scored)
Retrieval quality: Simulated query performance

Output:

{ "fixed-1000": { "time_ms": 45, "chunk_count": 127, "avg_size": 982, "size_variance": 12.3, "context_score": 0.72 }, "semantic-1000": { "time_ms": 156, "chunk_count": 114, "avg_size": 1087, "size_variance": 234.5, "context_score": 0.91 } }

Configuration Template

Template: templates/chunking-config.yaml

Complete configuration:

chunking:

Global defaults

default_strategy: semantic default_chunk_size: 1000 default_overlap: 200

Strategy-specific configs

strategies: fixed_size: chunk_size: 1000 overlap: 200 split_on: sentence # sentence, word, character

semantic:
  max_chunk_size: 1500
  min_chunk_size: 200
  preserve_paragraphs: true
  add_headers: true  # Include section headers

recursive:
  chunk_size: 1000
  overlap: 100
  separators:
    markdown: ["\\n## ", "\\n### ", "\\n\\n", "\\n", " "]
    code: ["\\nclass ", "\\ndef ", "\\n\\n", "\\n", " "]
    text: ["\\n\\n", ". ", " "]

Document type mappings

document_types: ".md": semantic ".py": recursive ".txt": fixed_size ".pdf": semantic

Custom Splitter Template

Template: templates/custom-splitter.py

Create your own chunking logic:

from typing import List, Dict import re

class CustomChunker: def init(self, chunk_size: int = 1000, overlap: int = 200): self.chunk_size = chunk_size self.overlap = overlap

def chunk(self, text: str, metadata: Dict = None) -> List[Dict]:
    """
    Implement custom chunking logic here.

    Returns:
        List of chunks with metadata:
        [
            {
                "text": "chunk content",
                "metadata": {
                    "chunk_id": 0,
                    "source": "document.txt",
                    "start_char": 0,
                    "end_char": 1000
                }
            }
        ]
    """
    chunks = []

    # Your custom chunking logic here
    # Example: Split on custom pattern
    sections = self._split_sections(text)

    for i, section in enumerate(sections):
        chunks.append({
            "text": section,
            "metadata": {
                "chunk_id": i,
                "source": metadata.get("source", "unknown"),
                "chunk_size": len(section)
            }
        })

    return chunks

def _split_sections(self, text: str) -> List[str]:
    # Implement your splitting logic
    pass

Document-Specific Examples

Markdown Chunking

Example: examples/chunk-markdown.py

Features:

Preserves heading hierarchy
Keeps code blocks together
Maintains list structure
Adds parent section context to chunks

Usage:

python examples/chunk-markdown.py README.md --output readme-chunks.json

Code Chunking

Example: examples/chunk-code.py

Features:

Splits on class and function boundaries
Preserves complete functions
Includes docstrings with implementations
Language-aware separator selection

Supported languages: Python, JavaScript, TypeScript, Java, Go

Usage:

python examples/chunk-code.py src/main.py --language python --output code-chunks.json

PDF Chunking

Example: examples/chunk-pdf.py

Features:

Extracts text from PDF
Preserves page boundaries
Maintains formatting clues
Handles multi-column layouts

Dependencies: pypdf , pdfminer.six

Usage:

python examples/chunk-pdf.py research-paper.pdf --strategy semantic --output pdf-chunks.json

Optimization Guidelines

Chunk Size Selection

General recommendations:

Content Type Chunk Size Overlap Strategy

Q&A / FAQs 200-400 50 Sentence

Articles 500-1000 100-200 Semantic

Documentation 1000-1500 200-300 Recursive

Books 1000-2000 300-400 Semantic

Source code 500-1000 100 Recursive

Test with your data: Use benchmark-chunking.py to find optimal settings

Overlap Strategies

Why overlap matters:

Prevents information loss at boundaries
Improves retrieval of cross-boundary information
Balances redundancy vs. coverage

Overlap guidelines:

10-15%: Minimal overlap for speed
15-20%: Standard overlap for most use cases
20-30%: High overlap for critical applications

Performance Optimization

Fast chunking (large documents):

Use fixed-size for speed

python scripts/chunk-fixed-size.py --input large-doc.txt --chunk-size 1000

Quality chunking (smaller documents):

Use semantic for better context

python scripts/chunk-semantic.py --input article.txt --max-chunk-size 1500

Batch processing:

Process multiple files

for file in documents/*.txt; do python scripts/chunk-semantic.py --input "$file" --output "chunks/$(basename $file .txt).json" done

Evaluation Workflow

Step 1: Benchmark Strategies

python scripts/benchmark-chunking.py
--input sample-document.txt
--strategies fixed,semantic,recursive
--chunk-sizes 500,1000,1500

Step 2: Analyze Results

Review metrics:

Processing time (prefer < 100ms per document)
Context preservation score (target > 0.85)
Chunk size variance (lower is more predictable)

Step 3: A/B Test Retrieval

Compare retrieval quality:

Chunk same corpus with different strategies
Run identical test queries against each
Measure precision@k and recall@k
Select strategy with best retrieval metrics

Step 4: Production Deployment

Use configuration file:

import yaml from chunking_strategies import get_chunker

config = yaml.safe_load(open('chunking-config.yaml')) chunker = get_chunker(config['chunking']['default_strategy'], config)

chunks = chunker.chunk(document_text)

Common Issues & Solutions

Issue: Chunks too small/large

Adjust chunk_size parameter
Check document structure (may need different strategy)
Verify separator selection for recursive chunking

Issue: Lost context at boundaries

Increase overlap
Switch to semantic chunking
Add parent context to metadata

Issue: Slow processing

Use fixed-size chunking for large batches
Reduce overlap
Process files in parallel

Issue: Poor retrieval quality

Benchmark different strategies
Increase chunk size
Try hybrid approach (semantic + fixed fallback)

Dependencies

Core libraries:

pip install tiktoken # Token counting pip install nltk # Sentence splitting pip install spacy # Advanced NLP (optional)

For PDF support:

pip install pypdf pdfminer.six

For benchmarking:

pip install pandas numpy scikit-learn

Best Practices Summary

Start with semantic chunking for most documents
Use recursive chunking for structured/hierarchical content
Benchmark on your actual data before production
Set overlap to 15-20% of chunk size
Include metadata (source, page, section) in chunks
Test retrieval quality, not just chunking speed
Use appropriate chunk size for your embedding model token limit
Document your chunking strategy choice and parameters

Supported Strategies: Fixed-Size, Semantic, Recursive, Sentence-Based, Custom Output Format: JSON with text and metadata Version: 1.0.0

chunking-strategies

Safety Notice

Copy this and send it to your AI assistant to learn

Global defaults

Strategy-specific configs

Document type mappings

Use fixed-size for speed

Use semantic for better context

Process multiple files

Source Transparency

Related Skills

document-parsers

stt-integration

model-routing-patterns

react-email-templates