Word Document Parsing

Parse Word documents (.docx) into Markdown, JSON, and image artifacts using several independent extraction methods.

Usage

Run the parsing script directly:

./scripts/parse_docx.py <path_to_file.docx> <output_dir>

Example:

./scripts/parse_docx.py ~/documents/report.docx ./parsed/

The script runs four extraction methods and writes each method's output to its own subdirectory:

python-docx (basic) - Fast text extraction
python-docx (detailed) - Full structure with tables
docx2txt - Simple text-only fallback
markitdown - Microsoft's markdown converter

Output Structure

output_dir/
├── file.docx/
│   ├── parsing_summary.json
│   ├── python_docx_basic/
│   │   └── content.md
│   ├── python_docx_detailed/
│   │   ├── content.md
│   │   ├── tables.json
│   │   └── images/
│   ├── docx2txt/
│   │   └── content.txt
│   └── markitdown/
│       └── content.md
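
To see what a run actually produced, the tree above can be walked with the standard library. The following is a minimal sketch, not part of the parsing script: the helper name `list_artifacts` is hypothetical, and it assumes only the layout shown above (one subdirectory per method inside the per-document directory).

```python
from pathlib import Path


def list_artifacts(doc_dir: Path) -> dict[str, list[str]]:
    """Map each method subdirectory to the files found beneath it.

    Hypothetical helper: `doc_dir` is the per-document directory,
    e.g. output_dir/file.docx/ in the tree above.
    """
    artifacts: dict[str, list[str]] = {}
    # Only subdirectories are method outputs; top-level files such as
    # parsing_summary.json are skipped.
    for method_dir in sorted(p for p in doc_dir.iterdir() if p.is_dir()):
        artifacts[method_dir.name] = sorted(
            f.relative_to(method_dir).as_posix()
            for f in method_dir.rglob("*")
            if f.is_file()
        )
    return artifacts


if __name__ == "__main__":
    import sys

    # Usage: python list_artifacts.py <output_dir>/<file.docx>
    for method, files in list_artifacts(Path(sys.argv[1])).items():
        print(f"{method}: {', '.join(files) or '(no files)'}")
```

A method that failed would presumably show up here with no files or no directory at all, which makes this a quick way to check which outputs are usable.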

Script Features

Self-contained Python script with inline uv metadata
Runs multiple extraction methods for redundancy
Creates JSON metadata for tables and document structure
Extracts images with dimensions and metadata
Continues on errors (a failure in one method doesn't stop the others)
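
The inline-metadata and continue-on-errors features listed above can be illustrated with a short sketch. This is not the script's actual code: the function `run_methods`, the summary fields (`status`, `chars`, `error`), and the empty dependency list are all assumptions made for illustration. The comment header follows the standard inline script metadata format that uv reads.

```python
# /// script
# requires-python = ">=3.10"
# dependencies = []
# ///
# The header above is inline script metadata. The real script would list
# its parsing libraries under `dependencies`; it is left empty here
# because this sketch uses only the standard library.
from typing import Callable


def run_methods(methods: dict[str, Callable[[], str]]) -> dict[str, dict]:
    """Run every extraction method, recording failures instead of raising.

    Hypothetical sketch: each value in `methods` is a zero-argument
    callable that returns the extracted text.
    """
    summary: dict[str, dict] = {}
    for name, extract in methods.items():
        try:
            text = extract()
            summary[name] = {"status": "ok", "chars": len(text)}
        except Exception as exc:
            # One method failing must not stop the remaining ones.
            summary[name] = {
                "status": "failed",
                "error": f"{type(exc).__name__}: {exc}",
            }
    return summary
```

Catching the exception per method, rather than around the whole loop, is what lets a later method still run after an earlier one fails. A summary like this could then be serialized with `json.dump` into something resembling parsing_summary.json, though the real file's schema is not documented here.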
