office-doc-extractor

Convert Microsoft Office documents (DOCX, XLSX, PPTX) to Markdown without installing any external dependencies. Use when the user needs to extract text from Word documents, Excel spreadsheets, or PowerPoint presentations for analysis, indexing, or LLM processing. Pure Python implementation: no pip install, no subprocess calls, and no network downloads required. Works offline.

Install skill "office-doc-extractor" with this command: npx skills add michealxie001/office-doc-extractor

Office Document Extractor

Zero-dependency converter for Microsoft Office documents. Extracts text and structure from DOCX, XLSX, and PPTX files into clean Markdown.

Quick Start

# Single file
python3 scripts/main.py report.docx -o report.md

# Batch convert a directory
python3 scripts/main.py ./documents --batch -o ./markdown

Supported Formats

Format      Extension  Output
Word        .docx      Headings, paragraphs
Excel       .xlsx      Tables (one per sheet)
PowerPoint  .pptx      Slides as sections

How It Works

  • DOCX: Parses the ZIP archive's XML directly using Python's zipfile and xml.etree
  • XLSX: Uses bundled openpyxl (pure Python, no C extensions)
  • PPTX: Parses the ZIP archive's slide XML directly

No external commands, no network calls, no pip install required.
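
As an illustration of the DOCX approach, here is a minimal sketch using only zipfile and xml.etree.ElementTree. It is not the code in docx_extractor.py: the function name docx_to_markdown is hypothetical, and the mapping of built-in "HeadingN" paragraph styles to Markdown heading levels is an assumption about how headings might be detected.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_markdown(source):
    """Return Markdown text for a DOCX file (hypothetical helper).

    `source` may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    blocks = []
    for para in root.iter(W + "p"):
        # A paragraph's text is split across one or more <w:t> elements.
        text = "".join(node.text or "" for node in para.iter(W + "t")).strip()
        if not text:
            continue
        # Assumption: headings use built-in styles such as "Heading1".
        style = para.find(f"{W}pPr/{W}pStyle")
        name = style.get(W + "val", "") if style is not None else ""
        if name.startswith("Heading") and name[7:].isdigit():
            level = max(1, min(int(name[7:]), 6))  # Markdown has six levels
            blocks.append("#" * level + " " + text)
        else:
            blocks.append(text)
    return "\n\n".join(blocks)
```

Slide XML in a PPTX archive can be read the same way, with different part names and namespaces.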

Usage

Single File

python3 scripts/main.py <input_file> [-o <output.md>]

The format is detected from the file extension. If -o is omitted, the output is written to <input>.md.
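
The extension-based detection and default output path could be sketched as follows. This is an illustration, not the logic in main.py: plan_conversion and the EXTRACTORS mapping are hypothetical names, and the sketch assumes that extension matching is case-insensitive and that "<input>.md" means the original extension is replaced rather than appended.

```python
from pathlib import Path

# Hypothetical mapping from extension to the extractor module listed under scripts/.
EXTRACTORS = {
    ".docx": "docx_extractor",
    ".xlsx": "xlsx_extractor",
    ".pptx": "pptx_extractor",
}


def plan_conversion(input_file, output_file=None):
    """Return (extractor_name, output_path) for a single input file."""
    source = Path(input_file)
    suffix = source.suffix.lower()  # assumption: ".DOCX" is treated like ".docx"
    if suffix not in EXTRACTORS:
        raise ValueError(f"Unsupported format: {suffix or '(no extension)'}")
    # Assumption: report.docx maps to report.md when -o is omitted.
    target = Path(output_file) if output_file else source.with_suffix(".md")
    return EXTRACTORS[suffix], target
```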

Batch Conversion

python3 scripts/main.py <input_directory> --batch [-o <output_directory>]

Converts all .docx, .xlsx, and .pptx files in the directory. Results are saved to markdown_output/ by default.
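
A minimal sketch of how the batch file selection might work, under stated assumptions: plan_batch is a hypothetical name, the scan is non-recursive, and Office lock files (names starting with "~$") are skipped. The actual behavior of main.py may differ on each of these points.

```python
from pathlib import Path

SUPPORTED = {".docx", ".xlsx", ".pptx"}


def plan_batch(input_dir, output_dir="markdown_output"):
    """Return sorted (source, target) path pairs for supported files."""
    out = Path(output_dir)
    pairs = []
    for source in sorted(Path(input_dir).iterdir()):
        # Assumption: top-level files only; skip "~$..." lock files.
        if not source.is_file() or source.name.startswith("~$"):
            continue
        if source.suffix.lower() in SUPPORTED:
            pairs.append((source, out / (source.stem + ".md")))
    return pairs
```

Note that in this sketch two inputs with the same stem, such as a.docx and a.xlsx, would map to the same a.md target; a real implementation would need a rule for that case.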

Resources

scripts/

  • main.py — Unified CLI for single-file and batch conversion
  • docx_extractor.py — DOCX → Markdown (standard library only)
  • xlsx_extractor.py — XLSX → Markdown tables (bundled openpyxl)
  • pptx_extractor.py — PPTX → Markdown (standard library only)

Bundled Dependencies

  • openpyxl/ — Pure Python Excel library (v3.1.5)
  • et_xmlfile/ — openpyxl dependency (pure Python)

Limitations

  • Does not extract images or embedded objects (text only)
  • Does not preserve complex formatting (colors, fonts, layouts)
  • Does not handle encrypted/password-protected files
  • No OCR for scanned documents (use OpenClaw's native pdf tool for that)
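
Password-protected Office files are stored in an OLE2 compound file rather than a plain ZIP archive, so a caller can detect them before attempting conversion. The sketch below is one way to do that check; classify_container is a hypothetical helper and is not part of this skill's scripts.

```python
import zipfile

# Signature of an OLE2 Compound File, the container used for
# password-protected OOXML files and for legacy .doc/.xls/.ppt files.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def classify_container(path):
    """Return 'ole2', 'zip', or 'unknown' for the file at `path`."""
    with open(path, "rb") as handle:
        header = handle.read(8)
    if header == OLE2_MAGIC:
        return "ole2"  # encrypted or legacy binary format; not convertible here
    if zipfile.is_zipfile(path):
        return "zip"  # ordinary DOCX/XLSX/PPTX packages are ZIP archives
    return "unknown"
```

A file that reports "ole2" despite a .docx, .xlsx, or .pptx extension is most likely encrypted and should be skipped or reported to the user.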

Why This Skill?

Existing markitdown-based skills require pip install or external CLI tools, either of which triggers ClawHub security warnings. This skill is fully self-contained: install it and use it immediately, even offline.
