Smart Document Pipeline
Quick Reference
Convert document (auto-caches, auto-summarizes if >100KB)
python ~/.claude/lib/document-converter.py "/path/to/file.pdf"
Force regenerate
python ~/.claude/lib/document-converter.py "/path/to/file.pdf" --force
List cached documents
python ~/.claude/lib/document-converter.py --list
Cleanup old cache (>1 week)
python ~/.claude/lib/document-converter.py --cleanup
Supported Formats
Format Extension Tool Notes
PDF .pdf PyMuPDF Text extraction, page-by-page
Word .docx, .doc pandoc/python-docx Full markdown
PowerPoint .pptx, .ppt python-pptx Slide-by-slide with notes
Excel .xlsx, .xls openpyxl Tables as markdown
RTF .rtf pandoc Rich text
Output Structure
{ "cache_path": "/path/to/cached/file.md", "summary_path": "/path/to/cached/file_summary.md", // if >100KB "from_cache": false, "original_size": 26744198, "converted_size": 129844, "summary_size": 30638, "savings_percent": 99.5, "recommendation": "summary" // "summary" or "full" }
Auto-Summary
Documents >100KB automatically get a summary version:
Version Purpose Size Target
Full Complete content As converted
Summary Quick overview ~30KB
The summary preserves:
-
All headers and structure
-
First portion of each section
-
Metadata and source reference
Automatic Integration
The smart-read-interceptor hook automatically triggers when you read:
-
PDF, Word, PowerPoint, Excel files
-
Any file >200KB
It will suggest:
-
Use summary - If summary exists (best for overview)
-
Use cache - If full cached version exists
-
Convert first - If no cache exists
-
Delegate - For very large files, use subagent
Subagent Delegation Pattern
For very large documents, delegate to isolated context:
Task( subagent_type="Explore", prompt="Read and summarize key points from: /path/to/large-file.pdf. Focus on: [specific topics]. Max 500 words summary." )
This keeps the large content OUT of main context.
Cache Location
~/.claude/cache/documents/ ├── filename_hash.md # Full converted version ├── filename_hash_summary.md # Summary (if >100KB) └── ...
Cache expires after 1 week. Run --cleanup to remove old files.
Real-World Results
Document Original Converted Summary Savings
Google AI Guide (PDF) 26.7 MB 127 KB 30 KB 99.9%
Debatt (Word) 206 KB 5.4 KB
97%
Övning (PowerPoint) 7.2 MB 3.1 KB
99.96%
Workflow Examples
Reading a PDF for research
- User asks to analyze a PDF
- Hook detects: "📄 DOCUMENT FILE: .PDF"
- Convert: python ~/.claude/lib/document-converter.py "file.pdf"
- Read the summary for overview
- Read specific sections from full version if needed
Processing multiple documents
-
Convert all documents first (batch): for f in *.pdf; do python ~/.claude/lib/document-converter.py "$f"; done
-
Read summaries in main context
-
Delegate deep analysis to subagents