local-ocr

[macOS only] Use this skill when the user requests OCR (Optical Character Recognition), image/PDF text extraction. Uses macOS native Vision/PDFKit frameworks. Triggers: '识别图片', 'OCR', '提取图片文字', '提取PDF文字', '识别PDF', 'extract text from image', 'PDF OCR'.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "local-ocr" with this command: npx skills add ltryee/ocr-locally

Local OCR (macOS Only)

Overview

⚠️ Platform Requirement: This skill is macOS only. It requires macOS 10.15+ (Catalina or later) and uses macOS native frameworks:

  • Vision framework - For OCR text recognition
  • PDFKit framework - For PDF processing
  • Core Graphics - For image rendering

This skill provides OCR (Optical Character Recognition) capabilities using macOS native Vision framework. It extracts text from images and PDFs without requiring any third-party libraries or internet connection.

Platform Requirements

⚠️ macOS Only - This skill cannot run on Linux, Windows, or other operating systems.

Required:

  • macOS 10.15+ (Catalina or later)
  • Vision framework (pre-installed on macOS)
  • PDFKit framework (pre-installed on macOS)

Why macOS Only?

  • Uses Vision framework for OCR (macOS/iOS only)
  • Uses PDFKit framework for PDF processing (macOS/iOS only)
  • Uses AppKit/Core Graphics for image handling (macOS only)

When to Use This Skill

Trigger this skill when the user:

  • Requests OCR or image text extraction
  • Mentions extracting text from images, screenshots, PDF files, or scanned documents
  • Uses keywords like: "识别图片", "OCR", "提取文字", "提取PDF文字", "识别PDF", "extract text from image", "PDF OCR"
  • Provides an image file or PDF file and asks to read or extract its content

Core Capabilities

1. Text Extraction from Images

Use scripts/ocr_vision_pro.swift for comprehensive OCR with the following features:

  • Multi-language support (Chinese, English, Japanese, Korean, and more)
  • Two output modes (mutually exclusive):
    • Text Mode (-t): Output only extracted text (default)
    • JSON Mode (-j): Output complete raw info including text, position, and confidence as JSON
  • Confidence scores for each detected text block
  • Bounding box information (text position in image)
  • Output to console or file
  • Precise or fast recognition modes

Basic usage:

swift scripts/ocr_vision_pro.swift <image_path>

With options:

swift scripts/ocr_vision_pro.swift <image_path> -l zh-Hans,en -o output.txt -f

2. Text Extraction from PDF Files

Use scripts/pdf_ocr.swift to extract text from PDF files with the following features:

  • Extract text from specific pages or all pages
  • Support page range specification (e.g., 1-5, 1,3,5)
  • Two output modes (mutually exclusive):
    • Text Mode (-t): Output only extracted text (default)
    • JSON Mode (-j): Output complete raw info as JSON
  • Same multi-language support as image OCR
  • Precise or fast recognition modes

Basic usage (all pages):

swift scripts/pdf_ocr.swift <pdf_path>

With page specification:

# Single page
swift scripts/pdf_ocr.swift document.pdf -p 1

# Multiple pages
swift scripts/pdf_ocr.swift document.pdf -p 1,3,5

# Page range
swift scripts/pdf_ocr.swift document.pdf -p 1-5

# JSON mode
swift scripts/pdf_ocr.swift document.pdf -p 1 -j

3. Output Modes (Mutually Exclusive)

The script supports two output modes that cannot be used simultaneously:

Text Mode (Default, -t)

Outputs only the extracted text:

  • Console output: Pure text
  • File output (-o or -t): Saves text to file, optionally with separate confidence file

JSON Mode (-j)

Outputs complete raw information as JSON:

  • Contains: image path, total blocks, average confidence, and per-block details
  • Per-block info: index, text, confidence, bounding box (x, y, width, height)
  • Outputs to stdout only (no file output options in JSON mode)

JSON output structure:

{
  "imagePath": "/path/to/image.png",
  "totalBlocks": 25,
  "averageConfidence": 0.85,
  "blocks": [
    {
      "index": 1,
      "text": "recognized text",
      "confidence": 0.95,
      "boundingBox": {
        "x": 0.10,
        "y": 0.20,
        "width": 0.30,
        "height": 0.05
      }
    }
  ]
}

4. Supported File Formats

Image Formats (ocr_vision_pro.swift):

  • PNG (.png)
  • JPEG (.jpg, .jpeg)
  • TIFF (.tiff, .tif)
  • BMP (.bmp)

PDF Format (pdf_ocr.swift):

  • PDF (.pdf) - support single page, multiple pages, or page ranges
  • Specify pages with -p option: 1, 1,3,5, or 1-5

4. Command-Line Options

OptionDescription
-h, --helpShow help information
-t, --textText mode (default, output only extracted text)
-j, --jsonJSON mode (output complete raw info as JSON)
-l, --language <lang>Specify recognition language (comma-separated)
-o, --output <file>Output text to file, auto-generate confidence file (<file>_confidence.txt)
-t, --text <file>Output only complete text to specified file (text mode)
-c, --confidence <file>Output only confidence details to specified file (text mode)
-f, --fastUse fast mode (default: precise mode)

Note: -t (text mode) and -j (JSON mode) are mutually exclusive. JSON mode outputs to stdout only.

Supported languages:

  • zh-Hans - Simplified Chinese
  • zh-Hant - Traditional Chinese
  • en - English
  • ja - Japanese
  • ko - Korean
  • fr - French
  • de - German
  • es - Spanish
  • it - Italian
  • pt - Portuguese
  • ru - Russian

Workflow

Step 1: Identify the Image Path

When the user requests OCR:

  1. Ask for the image path if not provided
  2. Accept common path formats: absolute paths, ~/path, or relative paths
  3. Validate that the file exists before proceeding

Step 2: Determine Recognition Parameters

Based on user request or context:

  1. Language: Default to zh-Hans,en for Chinese users, or en for English users
  2. Mode: Use precise mode (default) for accuracy, fast mode (-f) for quick preview
  3. Output: Ask if user wants results saved to file (-o option)

Step 3: Execute OCR

Run the OCR script with appropriate parameters:

swift scripts/ocr_vision_pro.swift "<image_path>" -l zh-Hans,en

For saving to separate files (recommended):

swift scripts/ocr_vision_pro.swift "<image_path>" -o "<output>"

This automatically creates two files:

  • <output>.txt - Complete extracted text (pure text, no formatting)
  • <output>_confidence.txt - Confidence details with statistics and per-block info

For separate text and confidence files with custom names:

swift scripts/ocr_vision_pro.swift "<image_path>" -t "text.txt" -c "confidence.txt"

Step 4: Present Results

After OCR completes:

  1. Display the extracted text to the user
  2. If saved to file, inform the user of the output file path
  3. Ask if user wants to:
    • Correct misrecognized characters
    • Process another image
    • Save results in a different format

Step 5: PDF Processing (if PDF file)

When processing a PDF file:

  1. Identify PDF path and pages:

    • Ask for PDF path if not provided
    • Ask which pages to process (default: all pages)
    • Support formats: 1, 1,3,5, or 1-5
  2. Determine recognition parameters:

    • Language: Default to zh-hans,zh-hant,en
    • Mode: Precise (default) or fast (-f)
    • Output: Text mode (default) or JSON mode (-j)
  3. Execute PDF OCR:

# All pages
swift scripts/pdf_ocr.swift "<pdf_path>"

# Specific pages
swift scripts/pdf_ocr.swift "<pdf_path>" -p 1,3,5

# Page range
swift scripts/pdf_ocr.swift "<pdf_path>" -p 1-5

# JSON mode
swift scripts/pdf_ocr.swift "<pdf_path>" -p 1 -j
  1. Present results:
    • Text mode: Display text by page
    • JSON mode: Output complete JSON to stdout
    • Inform user of output format and options

Output Format

The script supports two mutually exclusive output modes:

Text Mode (Default, -t)

Outputs only the extracted text.

Console Output (without -o or -t <file>):

[Extracted text content]

File Output (-o option):

Creates <output>.txt with pure text.

With Confidence Details (console, when not using -j):

=== 置信度详情 ===

总识别块数: 25
平均置信度: 0.85

--- 逐块详情 ---

[1] Text content
    置信度: 0.95
    位置: x=0.10, y=0.20, w=0.30, h=0.05

--- 低置信度警告 (< 0.8) ---
[3] "Some text" - 置信度: 0.50

File Output with Confidence (-o option):

  • <output>.txt - Complete extracted text
  • <output>_confidence.txt - Confidence details

JSON Mode (-j)

Outputs complete raw information as JSON to stdout (no file output in JSON mode).

JSON Structure:

{
  "imagePath": "/path/to/image.png",
  "totalBlocks": 25,
  "averageConfidence": 0.85,
  "blocks": [
    {
      "index": 1,
      "text": "recognized text",
      "confidence": 0.95,
      "boundingBox": {
        "x": 0.10,
        "y": 0.20,
        "width": 0.30,
        "height": 0.05
      }
    }
  ]
}

Fields:

  • imagePath: Path to the processed image
  • totalBlocks: Total number of recognized text blocks
  • averageConfidence: Average confidence score (0.0 - 1.0)
  • blocks: Array of recognized text blocks
    • index: Block index (1-based)
    • text: Recognized text content
    • confidence: Confidence score (0.0 - 1.0)
    • boundingBox: Normalized bounding box coordinates (0.0 - 1.0)

Tips and Best Practices

  1. macOS Only: This skill requires macOS. It will not work on Linux, Windows, or other operating systems.
  2. Image Quality: Higher resolution images produce better results
  3. Language Specification: Always specify language for better accuracy
  4. Precise vs Fast: Use precise mode for final results, fast mode for testing
  5. Batch Processing: For multiple images, use shell loop:
    for img in *.png; do
        swift scripts/ocr_vision_pro.swift "$img" -o "${img%.png}.txt"
    done
    
  6. Confidence Threshold: Results with confidence < 0.5 may need manual verification

References

For detailed usage instructions and examples, load references/usage.md.

Resources

scripts/

  • ocr_vision_pro.swift - Enhanced OCR script for images with full feature support
  • ocr_vision.swift - Basic OCR script for simple use cases
  • pdf_ocr.swift - PDF OCR script for extracting text from PDF files

references/

  • usage.md - Comprehensive usage guide with examples

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

General

Google Workspace (No Cloud Console)

Gmail, Calendar, Drive, Docs, Sheets — NO Google Cloud Console required. Just OAuth sign-in. Zero setup complexity vs traditional Google API integrations.

Registry SourceRecently Updated
11.3K43dru-ca
General

powershell-module-architect

> PowerShell architecture expert specializing in module design, function structure, reusable libraries, profile optimization, and cross-version compatibility...

Registry SourceRecently Updated
General

powershell-ui-architect

> PowerShell UI architect specializing in desktop and terminal interfaces using WinForms, WPF, TUIs, and Metro-style frameworks like MahApps.Metro and Elysiu...

Registry SourceRecently Updated
General

postgres-pro

Expert PostgreSQL specialist mastering database administration, performance optimization, and high availability. Deep expertise in PostgreSQL internals, adva...

Registry SourceRecently Updated