Smart Document Pipeline

Quick Reference

Convert document (auto-caches, auto-summarizes if >100KB)

python ~/.claude/lib/document-converter.py "/path/to/file.pdf"

Force regenerate

python ~/.claude/lib/document-converter.py "/path/to/file.pdf" --force

List cached documents

python ~/.claude/lib/document-converter.py --list

Clean up old cache entries (older than 1 week)

python ~/.claude/lib/document-converter.py --cleanup

Supported Formats

| Format | Extension | Tool | Notes |
| --- | --- | --- | --- |
| PDF | .pdf | PyMuPDF | Text extraction, page-by-page |
| Word | .docx, .doc | pandoc/python-docx | Full markdown |
| PowerPoint | .pptx, .ppt | python-pptx | Slide-by-slide with notes |
| Excel | .xlsx, .xls | openpyxl | Tables as markdown |
| RTF | .rtf | pandoc | Rich text |

Output Structure

{
  "cache_path": "/path/to/cached/file.md",
  "summary_path": "/path/to/cached/file_summary.md",  // if >100KB
  "from_cache": false,
  "original_size": 26744198,
  "converted_size": 129844,
  "summary_size": 30638,
  "savings_percent": 99.5,
  "recommendation": "summary"  // "summary" or "full"
}
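A minimal sketch of how a caller might act on this result. It assumes the JSON is already available as a string (how the converter emits it is not stated here), and `pick_document` is a hypothetical helper name, not part of the converter.

```python
import json


def pick_document(result: dict) -> str:
    """Return the path to read, following the converter's recommendation."""
    summary_path = result.get("summary_path")
    # Prefer the summary only when it is recommended AND actually present.
    if result.get("recommendation") == "summary" and summary_path:
        return summary_path
    return result["cache_path"]


# Sample result shaped like the structure above (comments removed so it is valid JSON).
raw = """
{
  "cache_path": "/path/to/cached/file.md",
  "summary_path": "/path/to/cached/file_summary.md",
  "from_cache": false,
  "recommendation": "summary"
}
"""
result = json.loads(raw)
print(pick_document(result))
```

Falling back to `cache_path` keeps the helper safe for small documents, where no summary is generated.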

Auto-Summary

Converted documents larger than 100KB automatically get a summary version:

| Version | Purpose | Size Target |
| --- | --- | --- |
| Full | Complete content | As converted |
| Summary | Quick overview | ~30KB |

The summary preserves:

All headers and structure
First portion of each section
Metadata and source reference
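The idea of keeping every header plus the first portion of each section can be sketched as below. This is an illustration only, not the converter's actual summarizer: the function name `summarize_markdown` and the per-section character budget are assumptions, and the sketch treats any line starting with `#` as a header.

```python
def summarize_markdown(text: str, chars_per_section: int = 200) -> str:
    """Keep all headers and roughly the first `chars_per_section` characters of each section."""
    out = []
    budget = chars_per_section  # also applies to any preamble (e.g. metadata) before the first header
    for line in text.splitlines():
        if line.startswith("#"):
            out.append(line)            # headers are always kept
            budget = chars_per_section  # each new section gets a fresh budget
        elif budget > 0:
            kept = line[:budget]
            out.append(kept)
            budget -= len(kept) + 1     # +1 accounts for the newline
    return "\n".join(out)


doc = "# A\n" + "x" * 50 + "\n" + "y" * 50 + "\n## B\nshort\n"
print(summarize_markdown(doc, chars_per_section=60))
```

A real summarizer would likely cut on sentence or paragraph boundaries and scale the budget toward the ~30KB target rather than truncate mid-line.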

Automatic Integration

The smart-read-interceptor hook automatically triggers when you read:

PDF, Word, PowerPoint, Excel files
Any file >200KB

It will suggest:

Use summary - If summary exists (best for overview)
Use cache - If full cached version exists
Convert first - If no cache exists
Delegate - For very large files, use subagent
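The decision logic described above might look roughly like this. It is a sketch under stated assumptions, not the hook's real code: the priority order, the single-suggestion return value, the 1024-byte KB, and the `DELEGATE_THRESHOLD` value are all hypothetical, since the source does not define "very large".

```python
from pathlib import Path

# Extensions for the formats the hook is said to trigger on.
DOCUMENT_EXTENSIONS = {".pdf", ".docx", ".doc", ".pptx", ".ppt", ".xlsx", ".xls"}
SIZE_TRIGGER = 200 * 1024            # "Any file >200KB"
DELEGATE_THRESHOLD = 10 * 1024 ** 2  # hypothetical cutoff for "very large"


def suggest(path, size, has_summary, has_cache):
    """Return one suggestion string, or None if the hook would not trigger."""
    ext = Path(path).suffix.lower()
    if ext not in DOCUMENT_EXTENSIONS and size <= SIZE_TRIGGER:
        return None
    if has_summary:
        return "use summary"
    if has_cache:
        return "use cache"
    if size > DELEGATE_THRESHOLD:
        return "delegate"
    return "convert first"


print(suggest("report.pdf", 5_000_000, has_summary=False, has_cache=False))
```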

Subagent Delegation Pattern

For very large documents, delegate to isolated context:

Task(
  subagent_type="Explore",
  prompt="Read and summarize key points from: /path/to/large-file.pdf. Focus on: [specific topics]. Max 500 words summary."
)

This keeps the large content OUT of main context.

Cache Location

~/.claude/cache/documents/
├── filename_hash.md          # Full converted version
├── filename_hash_summary.md  # Summary (if >100KB)
└── ...

Cache expires after 1 week. Run --cleanup to remove old files.
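To make the naming and expiry rules concrete, here is a small sketch. The hashing scheme is an assumption (the source only shows the `filename_hash` pattern, not how the hash is derived), and `cache_paths` and `is_expired` are hypothetical helper names.

```python
import hashlib
import time
from pathlib import Path

CACHE_DIR = Path.home() / ".claude" / "cache" / "documents"
MAX_AGE_SECONDS = 7 * 24 * 60 * 60  # one week


def cache_paths(source, cache_dir=CACHE_DIR):
    """Return (full_path, summary_path) following the filename_hash pattern."""
    source = Path(source)
    # Assumption: the hash is derived from the absolute source path.
    digest = hashlib.sha256(str(source.resolve()).encode("utf-8")).hexdigest()[:12]
    stem = f"{source.stem}_{digest}"
    return cache_dir / f"{stem}.md", cache_dir / f"{stem}_summary.md"


def is_expired(mtime, now=None):
    """True if a cache file with modification time `mtime` is older than one week."""
    now = time.time() if now is None else now
    return now - mtime > MAX_AGE_SECONDS


full, summary = cache_paths("/tmp/report.pdf")
print(full.name, summary.name)
```

Hashing the path keeps two files with the same name in different directories from colliding in the cache; hashing file contents instead would also detect edits, at the cost of reading the whole file.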

Real-World Results

| Document | Original | Converted | Summary | Savings |
| --- | --- | --- | --- | --- |
| Google AI Guide (PDF) | 26.7 MB | 127 KB | 30 KB | 99.9% |
| Debatt (Word) | 206 KB | 5.4 KB | - | 97% |
| Övning (PowerPoint) | 7.2 MB | 3.1 KB | - | 99.96% |
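The savings figures can be checked with simple arithmetic. The sketch below uses the byte counts from the Output Structure example; `savings_percent` is a hypothetical helper, and the formula (one minus the size ratio) is an assumption about how the converter computes the value.

```python
def savings_percent(original_size: int, final_size: int) -> float:
    """Percentage of bytes saved relative to the original, rounded to one decimal."""
    return round(100 * (1 - final_size / original_size), 1)


print(savings_percent(26744198, 129844))  # full converted version -> 99.5
print(savings_percent(26744198, 30638))   # summary version        -> 99.9
```

Under this formula, the 99.5 in the JSON example matches the full converted file, while the 99.9% reported for the Google AI Guide appears to be measured against the summary.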

Workflow Examples

Reading a PDF for research

User asks to analyze a PDF
Hook detects: "📄 DOCUMENT FILE: .PDF"
Convert: python ~/.claude/lib/document-converter.py "file.pdf"
Read the summary for overview
Read specific sections from full version if needed

Processing multiple documents

Convert all documents first (batch): for f in *.pdf; do python ~/.claude/lib/document-converter.py "$f"; done
Read summaries in main context
Delegate deep analysis to subagents
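The batch step can also be driven from Python instead of a shell loop. This sketch uses only the command line and the `--force` flag shown in the Quick Reference; `build_command` and `convert_all` are hypothetical helper names.

```python
import os
import subprocess
from pathlib import Path

CONVERTER = os.path.expanduser("~/.claude/lib/document-converter.py")


def build_command(path, force=False):
    """Build the argv list for converting one document."""
    cmd = ["python", CONVERTER, str(path)]
    if force:
        cmd.append("--force")
    return cmd


def convert_all(directory=".", pattern="*.pdf"):
    """Convert every matching file; failures are not raised, so the batch continues."""
    for path in sorted(Path(directory).glob(pattern)):
        subprocess.run(build_command(path), check=False)


print(build_command("file.pdf", force=True))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.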

