Word Document Parsing
Parse Word documents (.docx) into Markdown, JSON, and image artifacts by running several independent extraction methods on the same file.
Usage
Run the parsing script directly:
./scripts/parse_docx.py <path_to_file.docx> <output_dir>
Example:
./scripts/parse_docx.py ~/documents/report.docx ./parsed/
The script runs four extraction methods:
- python-docx (basic) - Fast text extraction
- python-docx (detailed) - Full structure with tables
- docx2txt - Simple text-only fallback
- markitdown - Microsoft's Markdown converter
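For orientation, the four libraries are typically called roughly as follows. This is a minimal sketch, not the script's actual code: the function names and the `METHODS` mapping are hypothetical, only the table part of the "detailed" method is shown, and imports are deferred so that a missing package would affect only its own method.

```python
from typing import Callable, Dict


def extract_python_docx_basic(path: str) -> str:
    """Paragraph text only, via python-docx."""
    from docx import Document  # provided by the python-docx package

    doc = Document(path)
    return "\n\n".join(p.text for p in doc.paragraphs if p.text.strip())


def extract_python_docx_tables(path: str) -> list:
    """Tables as nested lists (table -> rows -> cell text), via python-docx."""
    from docx import Document

    doc = Document(path)
    return [
        [[cell.text for cell in row.cells] for row in table.rows]
        for table in doc.tables
    ]


def extract_docx2txt(path: str) -> str:
    """Plain text via docx2txt."""
    import docx2txt

    return docx2txt.process(path)


def extract_markitdown(path: str) -> str:
    """Markdown via Microsoft's markitdown."""
    from markitdown import MarkItDown

    return MarkItDown().convert(path).text_content


# Hypothetical registry; the keys mirror the output directory names.
METHODS: Dict[str, Callable[[str], object]] = {
    "python_docx_basic": extract_python_docx_basic,
    "python_docx_detailed": extract_python_docx_tables,
    "docx2txt": extract_docx2txt,
    "markitdown": extract_markitdown,
}
```

Comparing the outputs of these methods side by side is usually the quickest way to spot content that one library dropped.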
Output Structure
output_dir/
├── file.docx/
│   ├── parsing_summary.json
│   ├── python_docx_basic/
│   │   └── content.md
│   ├── python_docx_detailed/
│   │   ├── content.md
│   │   ├── tables.json
│   │   └── images/
│   ├── docx2txt/
│   │   └── content.txt
│   └── markitdown/
│       └── content.md
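After a run, a short helper can list what each method produced. The sketch below assumes only the directory layout shown above; `list_artifacts` and `load_summary` are hypothetical helper names, and nothing is assumed about the fields inside `parsing_summary.json`.

```python
import json
from pathlib import Path


def list_artifacts(doc_dir):
    """Map each method subdirectory to the files it contains (relative paths)."""
    doc_dir = Path(doc_dir)
    artifacts = {}
    for method_dir in sorted(p for p in doc_dir.iterdir() if p.is_dir()):
        artifacts[method_dir.name] = sorted(
            f.relative_to(method_dir).as_posix()
            for f in method_dir.rglob("*")
            if f.is_file()
        )
    return artifacts


def load_summary(doc_dir):
    """Load parsing_summary.json if present; its schema is not assumed here."""
    summary_path = Path(doc_dir) / "parsing_summary.json"
    if not summary_path.exists():
        return None
    return json.loads(summary_path.read_text(encoding="utf-8"))
```

For the earlier example this might be called as `list_artifacts("./parsed/report.docx")`, assuming the per-document directory is named after the input file as the tree suggests.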
Script Features
- Self-contained Python script with inline uv metadata
- Runs multiple extraction methods for redundancy
- Creates JSON metadata for tables and document structure
- Extracts images with dimensions and metadata
- Continues on errors (a failure in one method doesn't stop the others)
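Two of these features, inline uv metadata and continuing on errors, can be sketched together. The following is an illustration under stated assumptions rather than the script's real contents: the dependency list, the `run_methods` name, and the summary fields (`status`, `chars`, `error`) are all hypothetical. The comment header follows the inline script metadata format (PEP 723) that `uv run --script` reads.

```python
#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.9"
# dependencies = [
#   "python-docx",
#   "docx2txt",
#   "markitdown",
# ]
# ///
# The dependency list above is an assumption based on the methods named earlier.
import json
from pathlib import Path


def run_methods(docx_path, output_dir, methods):
    """Run each extraction callable in isolation and record its outcome.

    `methods` maps a method name to a callable that takes the document path
    and returns text. A failing method is recorded and skipped; the loop
    always moves on to the next one.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    summary = {}
    for name, extract in methods.items():
        try:
            text = extract(docx_path)
            method_dir = output_dir / name
            method_dir.mkdir(parents=True, exist_ok=True)
            (method_dir / "content.md").write_text(text, encoding="utf-8")
            summary[name] = {"status": "ok", "chars": len(text)}
        except Exception as exc:  # broad on purpose: isolate each method
            summary[name] = {
                "status": "error",
                "error": f"{type(exc).__name__}: {exc}",
            }
    # Hypothetical summary layout; the real parsing_summary.json may differ.
    (output_dir / "parsing_summary.json").write_text(
        json.dumps(summary, indent=2), encoding="utf-8"
    )
    return summary
```

Catching `Exception` per method is deliberately coarse here: the goal is that a missing package or a malformed document degrades one output directory instead of aborting the whole run.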