ocr-and-documents

PDF & Document Extraction

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "ocr-and-documents" with this command: npx skills add nousresearch/hermes-agent/nousresearch-hermes-agent-ocr-and-documents

PDF & Document Extraction

For DOCX: use python-docx (parses actual document structure, far better than OCR). For PPTX: see the powerpoint skill (uses python-pptx with full slide/notes support). This skill covers PDFs and scanned documents.

Step 1: Remote URL Available?

If the document has a URL, always try web_extract first:

web_extract(urls=["https://arxiv.org/pdf/2402.03300"])
web_extract(urls=["https://example.com/report.pdf"])

This handles PDF-to-markdown conversion via Firecrawl with no local dependencies.

Only use local extraction when: the file is local, web_extract fails, or you need batch processing.
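The Step 1 rule can be sketched as a small routing function. This is only an illustration: `extraction_route` and its parameters are hypothetical names, and `web_extract` itself is an agent tool rather than something this code calls.

```python
from urllib.parse import urlparse


def extraction_route(source: str, web_extract_failed: bool = False,
                     batch: bool = False) -> str:
    """Return 'web_extract' or 'local' following the Step 1 rule.

    Hypothetical helper: a source counts as remote only if it is an
    http(s) URL; anything else is treated as a local file path.
    """
    is_url = urlparse(source).scheme in ("http", "https")
    # Local extraction applies when the file is local, web_extract
    # has already failed, or batch processing is needed.
    if is_url and not web_extract_failed and not batch:
        return "web_extract"
    return "local"
```

For example, `extraction_route("report.pdf")` falls through to local extraction, while an arXiv PDF URL is routed to `web_extract` unless a failure or batch flag is set.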

Step 2: Choose Local Extractor

| Feature | pymupdf (~25MB) | marker-pdf (~3-5GB) |
|---|---|---|
| Text-based PDF | ✅ | ✅ |
| Scanned PDF (OCR) | ❌ | ✅ (90+ languages) |
| Tables | ✅ (basic) | ✅ (high accuracy) |
| Equations / LaTeX | ❌ | ✅ |
| Code blocks | ❌ | ✅ |
| Forms | ❌ | ✅ |
| Headers/footers removal | ❌ | ✅ |
| Reading order detection | ❌ | ✅ |
| Images extraction | ✅ (embedded) | ✅ (with context) |
| Images → text (OCR) | ❌ | ✅ |
| EPUB | ✅ | ✅ |
| Markdown output | ✅ (via pymupdf4llm) | ✅ (native, higher quality) |
| Install size | ~25MB | ~3-5GB (PyTorch + models) |
| Speed | Instant | ~1-14s/page (CPU), ~0.2s/page (GPU) |

Decision: Use pymupdf unless you need OCR, equations, forms, or complex layout analysis.
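As a rough sketch, the decision rule above reduces to a single predicate. The function name and flag names here are hypothetical, chosen only to mirror the four conditions in the text.

```python
def choose_extractor(needs_ocr: bool = False, needs_equations: bool = False,
                     needs_forms: bool = False,
                     complex_layout: bool = False) -> str:
    """Pick a local extractor per the decision rule.

    Hypothetical helper: any one of the four needs is enough to
    justify the heavier marker-pdf install; otherwise pymupdf wins.
    """
    if needs_ocr or needs_equations or needs_forms or complex_layout:
        return "marker-pdf"
    return "pymupdf"
```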

If the user needs marker capabilities but the system lacks ~5GB free disk:

"This document needs OCR/advanced extraction (marker-pdf), which requires ~5GB for PyTorch and models. Your system has [X]GB free. Options: free up space, provide a URL so I can use web_extract, or I can try pymupdf which works for text-based PDFs but not scanned documents or equations."

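One way to fill in the [X] in that message is to measure free disk space with the standard library. This is a minimal sketch under assumptions: the 5GB threshold is the approximate figure quoted above, the function names are hypothetical, and it is not known whether `scripts/extract_marker.py --check` works the same way.

```python
import shutil

REQUIRED_GB = 5.0  # approximate marker-pdf footprint quoted in the text


def free_disk_gb(path: str = ".") -> float:
    """Free space, in GiB, on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / 1024**3


def marker_space_warning(free_gb: float, required_gb: float = REQUIRED_GB):
    """Return None if there is enough room, else a message for the user."""
    if free_gb >= required_gb:
        return None
    return (
        "This document needs OCR/advanced extraction (marker-pdf), which "
        f"requires ~{required_gb:.0f}GB for PyTorch and models. "
        f"Your system has {free_gb:.1f}GB free."
    )
```

Calling `marker_space_warning(free_disk_gb())` yields either `None` or the first part of the message, to which the listed options can be appended.
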
pymupdf (lightweight)

pip install pymupdf pymupdf4llm

Via helper script:

python scripts/extract_pymupdf.py document.pdf                # Plain text
python scripts/extract_pymupdf.py document.pdf --markdown     # Markdown
python scripts/extract_pymupdf.py document.pdf --tables       # Tables
python scripts/extract_pymupdf.py document.pdf --images out/  # Extract images
python scripts/extract_pymupdf.py document.pdf --metadata     # Title, author, pages
python scripts/extract_pymupdf.py document.pdf --pages 0-4    # Specific pages

Inline:

python3 -c "
import pymupdf
doc = pymupdf.open('document.pdf')
for page in doc:
    print(page.get_text())
"
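The inline loop can be wrapped in a small function when only some pages are wanted. The sketch below assumes nothing beyond what the one-liner uses: an iterable of page objects that each offer `get_text()`. The name `extract_text` and its half-open `first`/`last` slice are illustrative choices; whether the helper script's `--pages 0-4` range is inclusive is not stated in this document.

```python
def extract_text(doc, first: int = 0, last=None) -> str:
    """Join the text of pages[first:last], separated by blank lines.

    `doc` is any iterable of objects with a get_text() method, such as
    the document returned by pymupdf.open(). The slice is half-open,
    like ordinary Python slicing.
    """
    pages = list(doc)[first:last]
    return "\n\n".join(page.get_text().strip() for page in pages)
```

With pymupdf installed, `extract_text(pymupdf.open("document.pdf"), 0, 5)` would return the text of the first five pages.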

marker-pdf (high-quality OCR)

Check disk space first

python scripts/extract_marker.py --check

pip install marker-pdf

Via helper script:

python scripts/extract_marker.py document.pdf                    # Markdown
python scripts/extract_marker.py document.pdf --json             # JSON with metadata
python scripts/extract_marker.py document.pdf --output_dir out/  # Save images
python scripts/extract_marker.py scanned.pdf                     # Scanned PDF (OCR)
python scripts/extract_marker.py document.pdf --use_llm          # LLM-boosted accuracy

CLI (installed with marker-pdf):

marker_single document.pdf --output_dir ./output
marker /path/to/folder --workers 4                               # Batch
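If the CLI needs to be driven from Python, the two commands above can be assembled as argument lists and passed to `subprocess`. This is a sketch only: the function names are hypothetical, and it uses only the flags that appear in this document (`--output_dir`, `--workers`).

```python
import subprocess


def marker_single_cmd(pdf_path: str, output_dir: str = "./output") -> list:
    """Argument list for converting one PDF with marker_single."""
    return ["marker_single", pdf_path, "--output_dir", output_dir]


def marker_batch_cmd(folder: str, workers: int = 4) -> list:
    """Argument list for batch-converting a folder with marker."""
    return ["marker", folder, "--workers", str(workers)]


def run(cmd: list) -> subprocess.CompletedProcess:
    """Run a command; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(cmd, check=True)
```

Passing a list rather than a shell string avoids quoting problems with file names that contain spaces.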

Arxiv Papers

Abstract only (fast)

web_extract(urls=["https://arxiv.org/abs/2402.03300"])

Full paper

web_extract(urls=["https://arxiv.org/pdf/2402.03300"])

Search

web_search(query="arxiv GRPO reinforcement learning 2026")
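The abstract and full-paper URLs above differ only in the path segment, so both can be derived from a paper ID. A minimal sketch, with `arxiv_urls` as a hypothetical helper name:

```python
def arxiv_urls(paper_id: str) -> dict:
    """Build the abstract and PDF URLs for an arXiv ID such as '2402.03300'."""
    base = "https://arxiv.org"
    return {
        "abstract": f"{base}/abs/{paper_id}",  # fast: abstract page only
        "pdf": f"{base}/pdf/{paper_id}",       # full paper
    }
```

Either URL can then be handed to `web_extract(urls=[...])` as shown above.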

Notes

  • web_extract is always first choice for URLs

  • pymupdf is the safe default — instant, no models, works everywhere

  • marker-pdf is for OCR, scanned docs, equations, complex layouts — install only when needed

  • Both helper scripts accept --help for full usage

  • marker-pdf downloads ~2.5GB of models to ~/.cache/huggingface/ on first use

  • For Word docs: pip install python-docx (better than OCR — parses actual structure)

  • For PowerPoint: see the powerpoint skill (uses python-pptx)

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
