Large Document Processing & Intelligent Text Chunking
Overview
Two tightly related concerns combined here:
-
Large document parsing — DOCX/PDF/EPUB ingestion with structure preservation
-
Intelligent text chunking — splitting parsed text into semantically coherent pieces for AI training or RAG
Source Files
File Purpose
src/utils/nwt_epub_parser.py
EPUB parser for NWT Bible (English + Chuukese)
scripts/extract_jwpub.py
Extract JW publication .jwpub archives
scripts/setup_large_document_processing.py
One-time document pipeline setup
output/processed_document/
Output directory for processed content
Document Processing
Supported Formats
-
DOCX via python-docx
-
PDF via PyMuPDF (import as fitz ) — note: fitz==0.0.1.dev2 is NOT in requirements; use PyMuPDF only
-
EPUB via ebooklib
- NWTEpubParser
- Plain text / CSV — direct read
EPUB Pattern (NWT Bible)
from src.utils.nwt_epub_parser import NWTEpubParser
parser = NWTEpubParser('data/bible/nwt_E.epub') verse_text = parser.get_verse('John', 3, 16) chapter_verses = parser.get_chapter('Genesis', 1)
PDF/DOCX Pattern
import fitz # PyMuPDF — installed as PyMuPDF, exposed as fitz
doc = fitz.open('large_document.pdf') for page_num, page in enumerate(doc): text = page.get_text() # process text...
Intelligent Text Chunking
Strategy Selection
Strategy Use case
Semantic AI training data — respect topic/paragraph boundaries
Structural Documents with clear headings/sections
Fixed-size RAG systems needing predictable chunk sizes
Sliding window QA tasks needing context overlap
Implementation Pattern
Sentence-boundary-aware chunking
def chunk_text(text: str, max_chars: int = 1024, overlap: int = 100) -> list[str]: sentences = re.split(r'(?<=[.!?])\s+', text) chunks, current = [], '' for sent in sentences: if len(current) + len(sent) > max_chars and current: chunks.append(current.strip()) current = current[-overlap:] + ' ' + sent # overlap else: current += ' ' + sent if current.strip(): chunks.append(current.strip()) return chunks
Chuukese-aware chunking
Chuukese uses the same sentence terminators as English
SENTENCE_ENDINGS = re.compile(r'(?<=[.!?])\s+')
def detect_language(text: str) -> str: has_accents = bool(re.search(r'[áéíóú]', text)) return 'chuukese' if has_accents else 'english'
Memory Efficiency
-
Process large PDFs page-by-page, not loading the full DOM into memory
-
Stream EPUB chapters — do not load the entire book at once
-
Write chunk output incrementally to JSONL files rather than accumulating in RAM
Output Formats
-
JSONL: one JSON object per line — best for large training datasets
-
JSON array: for smaller batches consumed by the frontend
-
Plain text: cleaned extracted text for inspection
Dependencies
-
PyMuPDF==1.23.8 — PDF processing (do NOT add fitz==0.0.1.dev2 )
-
python-docx>=1.2.0
-
ebooklib>=0.18
-
beautifulsoup4>=4.12.0