pdf-reader

Extract text from PDF files, with automatic OCR fallback for scanned or image-based PDFs. Use when: (1) a user sends a PDF file and the framework did not auto-inject its text content, (2) the injected text is empty or garbled, (3) a PDF file exists on disk and needs text extraction, or (4) the user mentions "read PDF", "extract PDF", "PDF content", "scan PDF", or "OCR". Handles both text-layer PDFs (fast, via pdftotext) and scanned/image PDFs (via tesseract OCR). Defaults to Chinese and English; other languages are configurable.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Install the "pdf-reader" skill with this command:

npx skills add panpeter2024/pdf-reader

PDF Reader

Extract text from any PDF — text-layer or scanned image.

How It Works

PDF received
  ├─ Has text layer? ──→ pdftotext (fast, high quality)
  │     └─ Text too sparse? ──→ Fall back to OCR
  └─ Detected as scan? ──→ Skip text, go straight to OCR
                               pdftoppm → tesseract
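
The fallback decision in the diagram can be sketched in shell as follows. This is an illustration under stated assumptions, not the bundled script's code: the function names `text_is_sparse` and `extract_pdf` are hypothetical, and the threshold of 50 non-whitespace characters per page is an assumed value that the real script may not use.

```shell
#!/usr/bin/env bash
# Hypothetical helper: decide whether extracted text is too sparse to trust.
# $1 = extracted text, $2 = page count, $3 = min chars per page (assumed default: 50).
text_is_sparse() {
  local text=$1 pages=$2 min_per_page=${3:-50}
  local stripped chars
  # Drop all whitespace so blank lines and form feeds do not count as content.
  stripped=$(printf '%s' "$text" | tr -d '[:space:]')
  chars=${#stripped}
  # Sparse when the total falls below pages * threshold.
  [ "$chars" -lt $(( pages * min_per_page )) ]
}

# Sketch of the pipeline: try pdftotext first, fall back to pdftoppm + tesseract.
extract_pdf() {
  local pdf=$1 lang=${2:-chi_sim+eng} dpi=${3:-300}
  local pages text tmp img
  pages=$(pdfinfo "$pdf" | awk '/^Pages:/ {print $2}')
  text=$(pdftotext -layout "$pdf" -)
  if ! text_is_sparse "$text" "$pages"; then
    printf '%s\n' "$text"
    return 0
  fi
  # Text layer missing or too thin: render pages to PNG and OCR each one.
  tmp=$(mktemp -d)
  pdftoppm -r "$dpi" -png "$pdf" "$tmp/page"
  for img in "$tmp"/page-*.png; do
    tesseract "$img" stdout -l "$lang"
  done
  rm -rf "$tmp"
}
```

Counting only non-whitespace characters keeps a scanned PDF that yields nothing but form feeds and blank lines from being mistaken for a text-layer PDF.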

Quick Start

Run the bundled script via exec:

bash <skill-dir>/scripts/pdf-extract.sh /path/to/file.pdf

Save to file:

bash <skill-dir>/scripts/pdf-extract.sh /path/to/file.pdf --output /tmp/result.txt

Then read /tmp/result.txt with the read tool.

When This Skill Triggers

  1. User sends a PDF in chat but no <file> text content was injected (only file path visible)
  2. Injected content is empty, garbled, or truncated
  3. User explicitly asks to read/extract/OCR a PDF file
  4. A PDF on disk needs text extraction for downstream processing

Typical Workflow

  1. Identify the PDF file path (usually /root/.openclaw/media/inbound/...)
  2. Run the extraction script
  3. Read the output and respond to the user

Example:

# Extract and save
bash <skill-dir>/scripts/pdf-extract.sh "/root/.openclaw/media/inbound/document.pdf" -o /tmp/pdf-text.txt

# Then use read tool on /tmp/pdf-text.txt

Script Options

Flag            Description                                        Default
--lang          Tesseract languages (validated against allowlist)  chi_sim+eng
--dpi           Image resolution for OCR, in DPI                   300
--output / -o   Save to file instead of stdout                     stdout
--ocr-only      Force OCR, skip text extraction                    off
--text-only     Text extraction only, no OCR fallback              off
--auto-install  Auto-install missing tools (poppler, tesseract)    off

Dependencies

By default, the script does not install packages automatically. If tools are missing, it prints install instructions and exits.

To enable auto-install, pass --auto-install:

bash <skill-dir>/scripts/pdf-extract.sh file.pdf --auto-install

This installs poppler-utils and tesseract-ocr via apt-get, yum, or brew as needed.
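
A rough sketch of how the missing-tool check and package-manager selection might look is shown here. The function names are hypothetical, and the yum and brew package names are assumptions based on common packaging (only the apt-get names appear in this document); the bundled script may do this differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each given tool that is not on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Hypothetical helper: print the first available package manager, in the
# order named above (apt-get, yum, brew); fail if none is found.
detect_pkg_manager() {
  local pm
  for pm in apt-get yum brew; do
    if command -v "$pm" >/dev/null 2>&1; then
      printf '%s\n' "$pm"
      return 0
    fi
  done
  return 1
}

# Print an install command for the given package manager.
# Package names for yum and brew are assumptions, not taken from the script.
install_hint() {
  case $1 in
    apt-get) echo "apt-get install -y poppler-utils tesseract-ocr" ;;
    yum)     echo "yum install -y poppler-utils tesseract" ;;
    brew)    echo "brew install poppler tesseract" ;;
    *)       return 1 ;;
  esac
}
```

For example, `missing_tools pdftotext pdftoppm tesseract` lists whichever of the three tools still needs installing, and `install_hint "$(detect_pkg_manager)"` prints a command to show the user when --auto-install is off.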

Pre-installing the tools is recommended (run once on the server):

apt-get install -y poppler-utils tesseract-ocr tesseract-ocr-chi-sim

Language Support

Default: Chinese Simplified + English (chi_sim+eng).

The --lang parameter is validated against a strict allowlist of official tesseract language codes. Invalid or malformed values are rejected.
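
To illustrate the idea, allowlist validation of a `+`-joined language spec could look roughly like this. The function name `validate_lang` is hypothetical and the allowlist below is a small illustrative subset chosen for this sketch, not the script's actual list.

```shell
#!/usr/bin/env bash
# Illustrative subset of Tesseract language codes (assumed for this sketch).
ALLOWED_LANGS="chi_sim chi_tra eng jpn kor fra deu spa"

# Hypothetical validator: accepts codes joined by '+', e.g. "chi_sim+eng".
validate_lang() {
  local spec=$1 code
  # Reject empty input, any character outside [a-z_+], and stray '+' signs.
  # This also blocks shell metacharacters before the value is used anywhere.
  case $spec in
    ''|*[!a-z_+]*|+*|*+|*++*) return 1 ;;
  esac
  # Split on '+' and require every code to be on the allowlist.
  local IFS='+'
  for code in $spec; do
    case " $ALLOWED_LANGS " in
      *" $code "*) ;;
      *) return 1 ;;
    esac
  done
  return 0
}
```

Checking the character set before splitting means a value such as `eng;rm -rf /` is rejected outright and never reaches the tesseract command line.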

Other languages:

# Japanese + English
bash <skill-dir>/scripts/pdf-extract.sh file.pdf --lang jpn+eng

# Korean
bash <skill-dir>/scripts/pdf-extract.sh file.pdf --lang kor

When --auto-install is passed, missing Tesseract language packs are installed based on --lang.

Limitations

  • OCR quality depends on scan quality; low-resolution or handwritten PDFs may produce errors
  • Encrypted/password-protected PDFs are not supported
  • Large PDFs (50+ pages) may take 1-2 minutes for OCR
  • Pure-image pages (photos, diagrams without text) produce noise — this is expected
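
Since encrypted PDFs are not supported, a caller may want to detect them before running extraction. The sketch below assumes poppler's `pdfinfo` is available and relies on its "Encrypted:" output field; the helper names are hypothetical and are not part of the bundled script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `pdfinfo` output on stdin and succeeds
# when the "Encrypted:" field starts with "yes".
is_encrypted() {
  grep -Eq '^Encrypted:[[:space:]]+yes'
}

# Hypothetical pre-check to run before extraction.
precheck_pdf() {
  local info
  # pdfinfo exits non-zero when it cannot open the file at all,
  # which is typically the case when a password is needed to open it.
  if ! info=$(pdfinfo "$1" 2>/dev/null); then
    echo "cannot read PDF (possibly password-protected or corrupt)" >&2
    return 1
  fi
  if printf '%s\n' "$info" | is_encrypted; then
    echo "encrypted PDF: not supported" >&2
    return 1
  fi
}
```

A caller could run `precheck_pdf /path/to/file.pdf` first and report the limitation to the user instead of returning empty or garbled text.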
