document-parsers

Purpose: Autonomously parse and extract content from multiple document formats (PDF, DOCX, HTML, Markdown) using industry-standard libraries and AI-powered parsing tools.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy the following and send it to your AI assistant so it can install and learn this skill:

Install skill "document-parsers" with this command: npx skills add vanman2024/ai-dev-marketplace/vanman2024-ai-dev-marketplace-document-parsers

Document Parsers


Activation Triggers:

  • Building RAG (Retrieval-Augmented Generation) pipelines

  • Extracting text, tables, or metadata from documents

  • Processing large document collections

  • Converting documents to structured formats

  • Handling complex PDFs with tables and layouts

  • OCR for scanned documents

  • Chunking documents for vector embeddings

  • Building document search systems

Key Resources:

  • scripts/setup-llamaparse.sh

  • Install and configure LlamaParse (AI-powered parsing)

  • scripts/setup-unstructured.sh

  • Install Unstructured.io library

  • scripts/parse-pdf.py

  • Functional PDF parser with multiple backend options

  • scripts/parse-docx.py

  • DOCX document parser

  • scripts/parse-html.py

  • HTML to structured text parser

  • templates/multi-format-parser.py

  • Universal document parser template

  • templates/table-extraction.py

  • Specialized table extraction template

  • examples/parse-research-paper.py

  • Research paper parsing with citations

  • examples/parse-legal-document.py

  • Legal document parsing with sections

Parser Comparison & Selection Guide

  1. LlamaParse (AI-Powered Premium)

Best For:

  • Complex PDFs with tables, charts, and mixed layouts

  • Scanned documents requiring OCR

  • Documents where accuracy is critical

  • Multi-column layouts and scientific papers

  • Financial reports and invoices

Pros:

  • AI-powered layout understanding

  • Excellent table extraction accuracy

  • Built-in OCR support

  • Handles complex formatting

  • Structured output (Markdown/JSON)

Cons:

  • Requires API key (paid service)

  • API rate limits

  • Network dependency

  • Slower than local parsers

Documentation: https://docs.cloud.llamaindex.ai/llamaparse

Setup:

./scripts/setup-llamaparse.sh

Usage Pattern:

from llama_parse import LlamaParse

parser = LlamaParse(
    api_key="llx-...",
    result_type="markdown",  # or "text"
    language="en",
    verbose=True,
)

documents = parser.load_data("document.pdf")
for doc in documents:
    print(doc.text)

  2. Unstructured.io (Local Processing)

Best For:

  • Batch processing many documents

  • Multiple format support (PDF, DOCX, HTML, PPTX, Images)

  • Local processing without API dependencies

  • Structured element extraction

  • Production RAG pipelines

Pros:

  • Open-source and free

  • Multi-format support

  • Runs locally (no API keys)

  • Good table detection

  • Element-based chunking

Cons:

  • Requires system dependencies (poppler, tesseract)

  • Complex installation

  • Less accurate than LlamaParse for complex layouts

Documentation: https://unstructured-io.github.io/unstructured/

Setup:

./scripts/setup-unstructured.sh

Usage Pattern:

from unstructured.partition.auto import partition

elements = partition("document.pdf")
for element in elements:
    print(f"{element.category}: {element.text}")

  3. PyPDF2 (Simple PDF Text Extraction)

Best For:

  • Simple text-based PDFs

  • Quick prototyping

  • Metadata extraction

  • PDF manipulation (merge, split)

Pros:

  • Pure Python (no dependencies)

  • Fast and lightweight

  • Good for simple PDFs

  • Actively maintained (development continues under the renamed pypdf project)

Cons:

  • Poor table extraction

  • Struggles with complex layouts

  • No OCR support

  • Limited formatting preservation

Documentation: https://github.com/py-pdf/pypdf2

Setup:

pip install pypdf2

Usage Pattern:

from PyPDF2 import PdfReader

reader = PdfReader("document.pdf")
for page in reader.pages:
    print(page.extract_text())

  4. PDFPlumber (Advanced PDF Analysis)

Best For:

  • Table extraction from PDFs

  • PDF with tabular data

  • Financial statements and reports

  • Coordinate-based extraction

Pros:

  • Excellent table extraction

  • Visual debugging tools

  • Coordinate-level control

  • Metadata and layout info

Cons:

  • Slower than PyPDF2

  • Requires pdfminer.six dependency

  • No OCR support

Documentation: https://github.com/jsvine/pdfplumber

Setup:

pip install pdfplumber

Usage Pattern:

import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        tables = page.extract_tables()
        text = page.extract_text()

  5. python-docx (Word Documents)

Best For:

  • Microsoft Word (.docx) documents

  • Extracting paragraphs, tables, headers

  • Document metadata

  • Template-based document generation

Pros:

  • Native DOCX support

  • Preserves structure (paragraphs, tables, sections)

  • Access to styles and formatting

  • Can also write/modify DOCX

Cons:

  • Only works with .docx (not .doc)

  • Limited image extraction

Documentation: https://github.com/python-openxml/python-docx

Setup:

pip install python-docx

Usage Pattern:

from docx import Document

doc = Document("document.docx")
for para in doc.paragraphs:
    print(para.text)
for table in doc.tables:
    for row in table.rows:
        print([cell.text for cell in row.cells])

Decision Matrix

| Use Case | Recommended Parser | Alternative |
|---|---|---|
| Simple PDF text extraction | PyPDF2 | Unstructured |
| Complex PDFs with tables | LlamaParse | PDFPlumber |
| Scanned documents (OCR) | LlamaParse | Unstructured + Tesseract |
| Word documents (.docx) | python-docx | Unstructured |
| HTML to text | parse-html.py | Unstructured |
| Multi-format batch processing | Unstructured | multi-format-parser.py |
| Table extraction | PDFPlumber | LlamaParse |
| Research papers | LlamaParse | Unstructured |
| Legal documents | LlamaParse | PDFPlumber |
| Production RAG pipeline | Unstructured | LlamaParse |
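The matrix above could be encoded as a small lookup helper, e.g. to auto-select a backend inside a pipeline. This is an illustrative sketch: the dictionary keys and function name are assumptions, not part of the skill's scripts.

```python
# Hypothetical encoding of the decision matrix; keys are illustrative.
RECOMMENDATIONS = {
    "simple_pdf":       ("PyPDF2",      "Unstructured"),
    "complex_tables":   ("LlamaParse",  "PDFPlumber"),
    "scanned_ocr":      ("LlamaParse",  "Unstructured + Tesseract"),
    "docx":             ("python-docx", "Unstructured"),
    "table_extraction": ("PDFPlumber",  "LlamaParse"),
    "production_rag":   ("Unstructured", "LlamaParse"),
}

def pick_parser(use_case: str, allow_paid_api: bool = True) -> str:
    """Return the recommended parser for a use case, falling back to the
    local alternative when the paid LlamaParse API is not an option."""
    primary, alternative = RECOMMENDATIONS[use_case]
    if not allow_paid_api and primary == "LlamaParse":
        return alternative
    return primary
```

For example, `pick_parser("complex_tables", allow_paid_api=False)` falls back to the local PDFPlumber backend.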

Functional Scripts

  1. Parse PDF (scripts/parse-pdf.py)

Command-line PDF parser supporting multiple backends:

Using PyPDF2 (default)

python scripts/parse-pdf.py document.pdf

Using PDFPlumber (better for tables)

python scripts/parse-pdf.py document.pdf --backend pdfplumber

Using LlamaParse (AI-powered)

python scripts/parse-pdf.py document.pdf --backend llamaparse --api-key llx-...

Output to file

python scripts/parse-pdf.py document.pdf --output output.txt

Extract tables as JSON

python scripts/parse-pdf.py document.pdf --backend pdfplumber --tables-only --output tables.json

Features:

  • Multiple backend support (PyPDF2, PDFPlumber, LlamaParse)

  • Table extraction

  • Metadata extraction

  • Page range selection

  • JSON/Text output formats

  2. Parse DOCX (scripts/parse-docx.py)

Word document parser with structure preservation:

Basic extraction

python scripts/parse-docx.py document.docx

Extract with structure

python scripts/parse-docx.py document.docx --preserve-structure

Extract tables only

python scripts/parse-docx.py document.docx --tables-only

Output as JSON

python scripts/parse-docx.py document.docx --output output.json --format json

Features:

  • Paragraph extraction with styles

  • Table extraction

  • Header/footer extraction

  • Metadata (author, created date, etc.)

  • Structured JSON output

  3. Parse HTML (scripts/parse-html.py)

HTML to clean text converter:

Basic HTML parsing

python scripts/parse-html.py document.html

From URL

python scripts/parse-html.py https://example.com/article

Preserve links

python scripts/parse-html.py document.html --preserve-links

Extract specific selector

python scripts/parse-html.py document.html --selector "article.content"

Features:

  • Clean text extraction (removes scripts, styles)

  • Link preservation

  • CSS selector support

  • URL fetching

  • Markdown output option

Templates

Multi-Format Parser (templates/multi-format-parser.py)

Universal parser handling multiple formats with automatic format detection:

from multi_format_parser import MultiFormatParser

parser = MultiFormatParser(
    llamaparse_api_key="llx-...",  # Optional
    use_ocr=True,
    chunk_size=1000,
)

Automatic format detection

result = parser.parse_file("document.pdf")
print(result.text)
print(result.metadata)
print(result.tables)

Batch processing

results = parser.parse_directory("./documents/")
for filename, result in results.items():
    print(f"{filename}: {len(result.text)} characters")

Supports:

  • PDF, DOCX, HTML, Markdown, TXT

  • Automatic chunking for RAG

  • Metadata extraction

  • Table extraction across all formats

  • Error handling and fallbacks

Table Extraction (templates/table-extraction.py)

Specialized table extraction with multiple strategies:

from table_extraction import TableExtractor

extractor = TableExtractor(
    prefer_llamaparse=True,
    fallback_to_pdfplumber=True,
)

Extract all tables from document

tables = extractor.extract_tables("financial_report.pdf")

for i, table in enumerate(tables):
    print(f"Table {i + 1}:")
    print(table.to_markdown())  # or .to_csv(), .to_json()
    print(f"Confidence: {table.confidence}")

Features:

  • Multiple extraction strategies

  • Automatic fallback

  • Table validation

  • Format conversion (CSV, JSON, Markdown, DataFrame)

  • Confidence scoring

Examples

Research Paper Parsing (examples/parse-research-paper.py)

Complete example for parsing academic papers:

Extracts title, abstract, sections, citations, tables, figures

python examples/parse-research-paper.py paper.pdf --output paper.json

Extracts:

  • Title and authors

  • Abstract

  • Section structure (Introduction, Methods, Results, etc.)

  • Citations and references

  • Tables and figures with captions

  • Metadata (DOI, publication date, journal)

Legal Document Parsing (examples/parse-legal-document.py)

Specialized parser for legal documents:

Extracts clauses, sections, definitions, parties

python examples/parse-legal-document.py contract.pdf --output contract.json

Extracts:

  • Document type (contract, agreement, etc.)

  • Parties involved

  • Definitions section

  • Numbered clauses and sections

  • Signature blocks

  • Dates and deadlines

RAG Pipeline Integration

Document Chunking for Embeddings

from multi_format_parser import MultiFormatParser

parser = MultiFormatParser(chunk_size=512, chunk_overlap=50)
result = parser.parse_file("document.pdf")

Chunks ready for embedding

for chunk in result.chunks:
    print(f"Chunk {chunk.id}: {chunk.text[:100]}...")
    print(f"Metadata: {chunk.metadata}")
    # Send to embedding model

Batch Processing Pipeline

import glob
from multi_format_parser import MultiFormatParser

parser = MultiFormatParser()

Process all documents in directory

for filepath in glob.glob("./documents/**/*", recursive=True):
    try:
        result = parser.parse_file(filepath)
        # Store in vector database
        store_embeddings(result.chunks)
        print(f"✓ Processed {filepath}")
    except Exception as e:
        print(f"✗ Failed {filepath}: {e}")

Best Practices

Parser Selection:

  • Start with PyPDF2 for simple PDFs, upgrade if needed

  • Use LlamaParse for complex layouts (budget permitting)

  • Use Unstructured for multi-format production systems

  • Use PDFPlumber specifically for table extraction

Performance:

  • Cache parsed results to avoid re-processing

  • Use batch processing for multiple documents

  • Consider async processing for large collections

  • Monitor API rate limits for LlamaParse
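The caching advice above can be sketched with the standard library alone. In this sketch, `parse_with_cache` is a hypothetical helper (not one of this skill's scripts), and `parse_fn` stands in for whichever backend you choose:

```python
import hashlib
from pathlib import Path

def parse_with_cache(filepath, parse_fn, cache_dir=".parse_cache"):
    """Skip re-parsing when the file's content hash has been seen before.
    parse_fn is any callable that takes a path and returns extracted text."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    # Key on file content, not the path, so renamed copies still hit the cache.
    digest = hashlib.sha256(Path(filepath).read_bytes()).hexdigest()
    cached = cache / f"{digest}.txt"
    if cached.exists():
        return cached.read_text()
    text = parse_fn(filepath)  # the expensive call (e.g. LlamaParse over the network)
    cached.write_text(text)
    return text
```

Because the key is a content hash, re-running a pipeline over an unchanged document collection costs only file reads, which also helps stay under LlamaParse rate limits.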

Accuracy:

  • Validate table extraction results

  • Implement fallback strategies

  • Log parsing errors for debugging

  • Use confidence scores when available
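A fallback strategy from the bullets above might look like this minimal sketch. The function name is hypothetical; the caller supplies `(name, callable)` pairs for whichever backends are installed:

```python
def parse_with_fallback(filepath, backends):
    """Try each backend in order; return (backend_name, text) for the first
    one that succeeds with non-empty output. backends is a list of
    (name, callable) pairs, e.g. [("pdfplumber", fn1), ("pypdf2", fn2)]."""
    errors = []
    for name, fn in backends:
        try:
            text = fn(filepath)
            if text and text.strip():
                return name, text
        except Exception as exc:  # log and move on to the next backend
            errors.append(f"{name}: {exc}")
    raise RuntimeError("All backends failed: " + "; ".join(errors))
```

Logging which backend actually produced the result (the returned name) makes it easy to spot documents that consistently defeat your primary parser.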

RAG Optimization:

  • Chunk size: 512-1024 tokens for embeddings

  • Overlap: 10-20% for context preservation

  • Preserve metadata (page numbers, sections) for retrieval

  • Clean extracted text (remove headers/footers)
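As a rough sketch of the chunking guidance above, using whitespace-delimited words as a cheap proxy for tokens (for real token counts you would use your embedding model's tokenizer; the helper name is illustrative):

```python
def chunk_text(text, chunk_size=512, overlap_ratio=0.15):
    """Split text into word-based chunks of ~chunk_size words with
    overlap_ratio (10-20% recommended) carried over between chunks."""
    words = text.split()
    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        piece = words[start:start + chunk_size]
        if piece:
            chunks.append(" ".join(piece))
        if start + chunk_size >= len(words):
            break  # last window already covered the tail
    return chunks
```

Each chunk repeats the trailing ~15% of its predecessor, so a sentence split at a chunk boundary still appears whole in at least one chunk.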

Troubleshooting

PyPDF2 returns garbled text:

  • Try PDFPlumber or LlamaParse

  • PDF may have non-standard encoding

  • Check if PDF is scanned (needs OCR)
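A quick way to test the "is it scanned?" hypothesis is to look at how much text each page yields. This hypothetical helper takes the per-page results of `page.extract_text()` (which returns near-empty strings for image-only pages):

```python
def looks_scanned(page_texts, min_chars_per_page=20, empty_fraction=0.8):
    """Heuristic: if most pages yield almost no extractable text, the PDF is
    probably a scan and needs an OCR-capable backend (LlamaParse,
    Unstructured + Tesseract)."""
    if not page_texts:
        return True
    near_empty = sum(
        1 for t in page_texts if len((t or "").strip()) < min_chars_per_page
    )
    return near_empty / len(page_texts) > empty_fraction
```

The thresholds are guesses worth tuning: title pages and figure-heavy pages legitimately yield little text, which is why the heuristic only fires when the large majority of pages are near-empty.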

Unstructured installation fails:

  • Install system dependencies: sudo apt-get install poppler-utils tesseract-ocr

  • On macOS: brew install poppler tesseract

LlamaParse API errors:

  • Verify API key is correct

  • Check rate limits in dashboard

  • Ensure document size is within limits

Table extraction misses columns:

  • Try different parser (PDFPlumber vs LlamaParse)

  • Adjust table detection settings

  • Validate table structure manually

DOCX parsing fails:

  • Ensure file is .docx not .doc

  • Check file is not corrupted

  • Try converting to .docx with LibreOffice

Dependencies

Core:

pip install pypdf2 pdfplumber python-docx beautifulsoup4 lxml markdown

Optional (Unstructured):

pip install "unstructured[local-inference]"
sudo apt-get install poppler-utils tesseract-ocr  # Linux
brew install poppler tesseract                    # macOS

Optional (LlamaParse):

pip install llama-parse

Requires API key from https://cloud.llamaindex.ai

Supported Formats: PDF, DOCX, HTML, Markdown, TXT
Parsers: LlamaParse, Unstructured.io, PyPDF2, PDFPlumber, python-docx
Version: 1.0.0

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

  • stt-integration (Coding)

  • model-routing-patterns (Coding)

  • react-email-templates (Coding)

  • rag-implementation (Coding)