Local OCR Pipeline Skill

Robust Optical Character Recognition (OCR) pipeline driven by ocrmypdf and tesseract. Handles scanned PDFs, rotated image inputs, and raw text extraction securely and locally, without external APIs.

Why not GPU via PyTorch/EasyOCR? The ocrmypdf tool is a widely used open-source tool for producing searchable PDFs. It uses tesseract to recognize text and to position the text layer where the words appear in the scanned image. A pure-CPU pipeline is also leaner, since it avoids a 1.5GB PyTorch payload.

Capabilities

Searchable PDF Generation: Converts rasterized/scanned PDFs or raw images (.jpg, .png, etc.) into PDFs with a selectable, searchable text layer.
Auto-Rotation & Deskew: Automatically detects incorrectly rotated text and straightens crooked scans.
Idempotent In-Place Processing: Safely processes files in-place using --skip-text, preventing double-processing of a PDF that already has embedded text.
Structured JSON Output: All commands output structured JSON, making failure states (like missing dependencies) parseable by agents.
Raw Text Extraction: Raw string extraction fallback for when agents need text directly in-memory instead of a PDF file.
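
As an illustration of the structured JSON failure reporting described above, here is a minimal Python sketch of a dependency check. It is not the wrapper's actual code: the function name, the JSON field names other than "status", and the use of gs as the ghostscript executable name are assumptions.

```python
import json
import shutil

# Assumed executable names, based on the tools listed under Setup.
# "gs" is the usual ghostscript binary name on Linux and macOS.
REQUIRED_TOOLS = ("tesseract", "ocrmypdf", "gs")


def check_dependencies(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return a JSON-serializable dict naming any executables not found on PATH."""
    missing = [tool for tool in tools if which(tool) is None]
    if missing:
        # Field names here are hypothetical; only "status" appears in this README.
        return {"status": "error", "error": "missing_dependencies", "missing": missing}
    return {"status": "success", "missing": []}


def check_dependencies_json(tools=REQUIRED_TOOLS, which=shutil.which):
    """Same check, serialized so an agent can parse it from stdout."""
    return json.dumps(check_dependencies(tools, which))
```

An agent could parse the result and rerun the setup script whenever "status" is "error".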

Setup

Installs system dependencies (tesseract, ocrmypdf, ghostscript) and sets up an isolated venv:

bash skills/ocr/scripts/setup.sh

Usage

uv run --project ~/.local-ocr scripts/ocr.py <command>

Generate a Searchable PDF (pdf)

Produces a standard, layered PDF. If you give it an image, it wraps it in a PDF. If you give it a scanned PDF, it adds the invisible text layer.

Overwrites the file in-place, skipping it safely if it already contains text

uv run --project ~/.local-ocr scripts/ocr.py pdf ./scanned_invoice.pdf

Output to a different file

uv run --project ~/.local-ocr scripts/ocr.py pdf ./scan_001.png -o ./contract.pdf

Force reprocessing (ignore existing text layer)

uv run --project ~/.local-ocr scripts/ocr.py pdf ./scanned_invoice.pdf --force

Note: By default, auto-rotate and deskew are enabled. Disable with --no-rotate or --no-deskew.
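
The options above plausibly translate into an ocrmypdf command line along the following lines. This is a sketch under assumptions, not the wrapper's real implementation: the function name and parameters are hypothetical, and mapping --force to ocrmypdf's --force-ocr (rather than --redo-ocr) is a guess. The ocrmypdf flags themselves (--skip-text, --force-ocr, --rotate-pages, --deskew) are real.

```python
def build_ocrmypdf_command(src, dst=None, force=False, rotate=True, deskew=True):
    """Build an ocrmypdf argument list from wrapper-style options (hypothetical mapping)."""
    # With no explicit output, write back to the input path (in-place processing).
    dst = dst if dst is not None else src
    cmd = ["ocrmypdf"]
    # Assumed mapping: --force re-rasterizes and OCRs everything; otherwise
    # pages that already carry text are skipped.
    cmd.append("--force-ocr" if force else "--skip-text")
    if rotate:
        cmd.append("--rotate-pages")  # auto-rotate pages based on detected text orientation
    if deskew:
        cmd.append("--deskew")  # straighten crooked scans
    cmd += [str(src), str(dst)]
    return cmd
```

For example, build_ocrmypdf_command("scan_001.png", "contract.pdf", deskew=False) returns a list that can be passed to subprocess.run.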

Batch Process a Directory (batch)

Recursively scans a directory for images and PDFs, applying OCR.

Process all files. Skips already-OCRed PDFs.

uv run --project ~/.local-ocr scripts/ocr.py batch ./archives/
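
The recursive scan can be pictured with a short pathlib sketch. The suffix set and function name are assumptions; the README only says "images and PDFs", so the exact extensions the real script accepts are not known.

```python
from pathlib import Path

# Assumed set of extensions; the actual script may accept a different list.
OCR_SUFFIXES = {".pdf", ".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def find_ocr_inputs(root):
    """Recursively collect files under root whose extension looks OCR-able."""
    root = Path(root)
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in OCR_SUFFIXES
    )
```

Each returned path would then go through the same per-file processing as the pdf command, so PDFs that already contain text are skipped.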

Extract Raw Text (text)

Does not create a PDF. Just reads the words off the page and returns them as a JSON string. Good for agents reading documents on the fly.

uv run --project ~/.local-ocr scripts/ocr.py text ./han_solo_invoice.png
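
One way such a command could work is to call the tesseract CLI with stdout as its output target and wrap the result in JSON. The sketch below assumes that approach; the function name, the injectable run parameter, and the "text" and "message" field names are hypothetical.

```python
import json
import subprocess


def extract_text(image_path, run=subprocess.run):
    """Run tesseract on an image and return a JSON string with the recognized text."""
    # "tesseract <image> stdout" prints the recognized text instead of writing a file.
    proc = run(
        ["tesseract", str(image_path), "stdout"],
        capture_output=True,
        text=True,
    )
    if proc.returncode != 0:
        return json.dumps({"status": "error", "message": proc.stderr.strip()})
    return json.dumps({"status": "success", "text": proc.stdout.strip()})
```

The run parameter exists only so the function can be exercised without tesseract installed; in normal use the default subprocess.run is fine.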

Franchise Examples (Star Wars)

Process the Death Star blueprints: uv run --project ~/.local-ocr scripts/ocr.py pdf ./ds-1_schematics.pdf
Extract raw orders: uv run --project ~/.local-ocr scripts/ocr.py text ./order_66_memo.jpg
Archive run: uv run --project ~/.local-ocr scripts/ocr.py batch /archives/jedi_temple

Troubleshooting

File already contains text: This is the most common "error", but it is not a failure. ocrmypdf exits with code 6 when it finds that a file already has a text layer. The wrapper script catches this and reports a JSON "status": "success" with a message noting that the file was skipped.
Dependencies Missing: Run the setup.sh script again if the agent complains about missing tesseract or Python modules.
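
The exit-code handling described above might look roughly like this. It is a sketch, not the wrapper's source: the function name, the message strings, and every field other than "status" are assumptions.

```python
# Exit code that ocrmypdf uses when the input already has a text layer,
# as stated in the troubleshooting note above.
ALREADY_HAS_TEXT = 6


def result_from_exit_code(code, path, stderr=""):
    """Translate an ocrmypdf exit code into a JSON-serializable status dict."""
    if code == 0:
        return {"status": "success", "file": path, "message": "OCR complete"}
    if code == ALREADY_HAS_TEXT:
        # Treated as success: nothing needed to be done for this file.
        return {"status": "success", "file": path, "message": "Skipped: file already contains text"}
    return {"status": "error", "file": path, "exit_code": code, "message": stderr.strip()}
```

Treating code 6 as success keeps batch runs from reporting failures for files that simply needed no work.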
