document_parser

Parse and extract content from .docx, .pdf, and .txt documents. Extracts plain text and tables for analysis. Use when the user uploads a document file or asks to analyze/extract/read content from Word documents, PDFs, or text files. Also use when the user asks questions about document content that requires parsing first.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Install skill "document_parser" with this command: npx skills add mjk39966-glitch/mjk39966-document-parser

Document Parser

Extract text and tables from documents (.docx, .pdf, .txt) for analysis and question-answering.

Quick Start

Parse a document:

python scripts/parse_document.py /path/to/document.pdf

Output is JSON with extracted text, tables, and metadata.
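
The same call can be made from Python. The sketch below assumes the script prints a single JSON object to standard output, as described above; the helper name `parse_document` and its `script` parameter are hypothetical and not part of the skill.

```python
import json
import subprocess
import sys
from pathlib import Path


def parse_document(path, script="scripts/parse_document.py"):
    """Run the parser script on `path` and return its JSON output as a dict.

    Assumption: the script writes one JSON object to stdout.
    """
    result = subprocess.run(
        [sys.executable, str(script), str(Path(path))],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(result.stdout)


# Usage (requires the skill's script to be present):
#   doc = parse_document("/path/to/document.pdf")
#   print(doc["metadata"])
```

Using `sys.executable` keeps the child process on the same interpreter, and therefore the same installed dependencies, as the caller.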

Installation

First use only: Install dependencies by running:

  • Linux/macOS: bash scripts/install_dependencies.sh
  • Windows: scripts\install_dependencies.bat

This installs: python-docx, PyPDF2, pdfplumber

Supported Formats

Format | Text | Tables | Notes
.txt | Yes | No | Direct text extraction
.docx | Yes | Yes | Paragraphs + structured tables
.pdf | Yes | Yes | Page-by-page extraction

Workflow

  1. Parse the document using scripts/parse_document.py
  2. Analyze the output (text and tables in JSON)
  3. Answer the user's question using extracted content

Example: Answering questions about a document

User: "What's the total revenue in quarterly_report.docx?"

Steps:

  1. Run: python scripts/parse_document.py quarterly_report.docx
  2. Locate tables in output
  3. Find revenue column and calculate total
  4. Reply with answer
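
Step 3 can be sketched as a small helper that sums one column of an extracted table. It assumes the first row of the table is the header and that numbers may carry thousands separators or a currency sign; the function name `column_total` and the sample table are hypothetical.

```python
def column_total(table, header):
    """Sum the numeric cells under `header` in a table given as a list of rows."""
    head, *rows = table
    idx = head.index(header)  # raises ValueError if the column is absent
    total = 0.0
    for row in rows:
        if idx >= len(row):
            continue  # skip rows that are shorter than the header
        cell = str(row[idx]).replace(",", "").replace("$", "").strip()
        try:
            total += float(cell)
        except ValueError:
            continue  # skip blanks and non-numeric cells such as "N/A"
    return total


# Hypothetical table, shaped like one entry of the "tables" list
table = [
    ["Quarter", "Revenue"],
    ["Q1", "1,200"],
    ["Q2", "1,500"],
    ["Q3", "N/A"],
]
print(column_total(table, "Revenue"))  # 2700.0
```

Skipping non-numeric cells silently is a design choice for the sketch; when a total must be exact, it may be safer to report which rows were skipped.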

Output Format

Default JSON output:

{
  "text": "Full document text...",
  "tables": [
    [["Header 1", "Header 2"], ["Data 1", "Data 2"]]
  ],
  "metadata": {
    "format": "pdf",
    "pages": 3,
    "tables": 1
  }
}
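
To work with a table by column name, each one can be turned into a list of records. This is a minimal sketch that assumes the first row of every table is its header, which the format itself does not guarantee; `table_to_records` is a hypothetical helper.

```python
import json

# The sample output shown above, as a string
raw = """
{
  "text": "Full document text...",
  "tables": [
    [["Header 1", "Header 2"], ["Data 1", "Data 2"]]
  ],
  "metadata": {"format": "pdf", "pages": 3, "tables": 1}
}
"""


def table_to_records(table):
    """Convert a list of rows into a list of dicts keyed by the header row."""
    if not table:
        return []
    header, *rows = table
    return [dict(zip(header, row)) for row in rows]


doc = json.loads(raw)
records = [table_to_records(t) for t in doc["tables"]]
print(records[0])  # [{'Header 1': 'Data 1', 'Header 2': 'Data 2'}]
```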

Human-readable format (add --format text):

==========================================================
EXTRACTED TEXT:
==========================================================
Document content here...

==========================================================
TABLES FOUND: 2
==========================================================

Table 1:
Name | Age | City
John | 30 | NYC
Jane | 25 | LA
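
The relationship between the two formats can be sketched as a function that renders the JSON structure into a layout resembling the sample above. This is an illustration only: the script's actual renderer is not shown here, and the name `render_text` and the rule width are assumptions.

```python
RULE = "=" * 58  # width chosen to resemble the sample; the real width may differ


def render_text(doc):
    """Render a parsed-document dict as human-readable text."""
    tables = doc.get("tables", [])
    lines = [RULE, "EXTRACTED TEXT:", RULE, doc.get("text", ""), ""]
    lines += [RULE, f"TABLES FOUND: {len(tables)}", RULE]
    for number, table in enumerate(tables, start=1):
        lines += ["", f"Table {number}:"]
        for row in table:
            # Empty cells may be None; show them as blanks
            lines.append(" | ".join("" if c is None else str(c) for c in row))
    return "\n".join(lines)


sample = {
    "text": "Document content here...",
    "tables": [[["Name", "Age", "City"], ["John", 30, "NYC"], ["Jane", 25, "LA"]]],
}
print(render_text(sample))
```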

Advanced Usage

For detailed examples and edge cases, see references/usage_examples.md.

Error Handling

If dependencies are missing, the script returns an error with installation instructions. Run the appropriate install script to resolve.
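
A pre-flight check along these lines can detect missing packages before the script runs. It is a sketch, not the skill's own check; the import names `docx`, `PyPDF2`, and `pdfplumber` are the modules that the three listed packages provide, and `missing_packages` is a hypothetical helper.

```python
import importlib.util

# pip package name -> module name used in an import statement
REQUIRED = {
    "python-docx": "docx",
    "PyPDF2": "PyPDF2",
    "pdfplumber": "pdfplumber",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose module cannot be found."""
    return [
        package
        for package, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


missing = missing_packages()
if missing:
    print("Missing dependencies:", ", ".join(missing))
    print("Run the install script for your platform to resolve.")
```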

Notes

  • Large PDFs: Processing can be slow for documents longer than 50 pages
  • Scanned PDFs: OCR not supported; text must be selectable
  • Complex tables: PDF table extraction works best with clear borders
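
Because scanned PDFs yield little or no text, a simple heuristic on the parsed output can flag them before any analysis is attempted. The sketch below assumes the `metadata` fields from the default JSON output; the function name `looks_scanned` and the threshold of 20 characters per page are arbitrary choices for illustration.

```python
def looks_scanned(doc, min_chars_per_page=20):
    """Guess whether a parsed PDF is a scan with no selectable text."""
    meta = doc.get("metadata", {})
    if meta.get("format") != "pdf":
        return False  # the heuristic only applies to PDFs
    pages = max(int(meta.get("pages", 1)), 1)  # guard against zero pages
    text = doc.get("text", "")
    return len(text.strip()) / pages < min_chars_per_page


scanned = {"text": "  \n", "tables": [], "metadata": {"format": "pdf", "pages": 3}}
print(looks_scanned(scanned))  # True
```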
