document_parser

Parse and extract content from .docx, .pdf, and .txt documents. Extracts plain text and tables for analysis. Use when the user uploads a document file or asks to analyze/extract/read content from Word documents, PDFs, or text files. Also use when the user asks questions about document content that requires parsing first.

Document Parser

Extract text and tables from documents (.docx, .pdf, .txt) for analysis and question-answering.

Quick Start

Parse a document:

python scripts/parse_document.py /path/to/document.pdf

Output is JSON with extracted text, tables, and metadata.

Installation

First use only: Install dependencies by running:

Linux/macOS: bash scripts/install_dependencies.sh
Windows: scripts\install_dependencies.bat

This installs: python-docx, PyPDF2, pdfplumber

Supported Formats

Format	Text	Tables	Notes
.txt	✅	❌	Direct text extraction
.docx	✅	✅	Paragraphs + structured tables
.pdf	✅	✅	Page-by-page extraction

Workflow

Parse the document using scripts/parse_document.py
Analyze the output (text and tables in JSON)
Answer the user's question using extracted content

Example: Answering questions about a document

User: "What's the total revenue in quarterly_report.docx?"

Steps:

Run: python scripts/parse_document.py quarterly_report.docx
Locate tables in output
Find revenue column and calculate total
Reply with answer

Output Format

Default JSON output:

{
  "text": "Full document text...",
  "tables": [
    [["Header 1", "Header 2"], ["Data 1", "Data 2"]]
  ],
  "metadata": {
    "format": "pdf",
    "pages": 3,
    "tables": 1
  }
}

Human-readable format (add --format text):

==========================================================
EXTRACTED TEXT:
==========================================================
Document content here...

==========================================================
TABLES FOUND: 2
==========================================================

Table 1:
Name | Age | City
John | 30 | NYC
Jane | 25 | LA

Advanced Usage

For detailed examples and edge cases, see references/usage_examples.md.

Error Handling

If dependencies are missing, the script returns an error with installation instructions. Run the appropriate install script to resolve.

Notes

Large PDFs: Processing may take time for documents >50 pages
Scanned PDFs: OCR not supported; text must be selectable
Complex tables: PDF table extraction works best with clear borders

document_parser

Safety Notice

Copy this and send it to your AI assistant to learn

Document Parser

Quick Start

Installation

Supported Formats

Workflow

Example: Answering questions about a document

Output Format

Advanced Usage

Error Handling

Notes

Source Transparency

Related Skills

OpenDataLoader PDF

文档内容总结 Summary & Analysis txt/docx/pdf/xlsx/xls

文件总结 File Summary & Analysis

DOCX Formatter