Local OCR (macOS Only)

Overview

⚠️ Platform Requirement: This skill is macOS only. It requires macOS 10.15+ (Catalina or later) and uses macOS native frameworks:

Vision framework - For OCR text recognition
PDFKit framework - For PDF processing
Core Graphics - For image rendering

This skill provides OCR (Optical Character Recognition) capabilities using macOS native Vision framework. It extracts text from images and PDFs without requiring any third-party libraries or internet connection.

Platform Requirements

⚠️ macOS Only - This skill cannot run on Linux, Windows, or other operating systems.

Required:

macOS 10.15+ (Catalina or later)
Vision framework (pre-installed on macOS)
PDFKit framework (pre-installed on macOS)

Why macOS Only?

Uses Vision framework for OCR (macOS/iOS only)
Uses PDFKit framework for PDF processing (macOS/iOS only)
Uses AppKit/Core Graphics for image handling (macOS only)

When to Use This Skill

Trigger this skill when the user:

Requests OCR or image text extraction
Mentions extracting text from images, screenshots, PDF files, or scanned documents
Uses keywords like: "识别图片", "OCR", "提取文字", "提取PDF文字", "识别PDF", "extract text from image", "PDF OCR"
Provides an image file or PDF file and asks to read or extract its content

Core Capabilities

1. Text Extraction from Images

Use scripts/ocr_vision_pro.swift for comprehensive OCR with the following features:

Multi-language support (Chinese, English, Japanese, Korean, and more)
Two output modes (mutually exclusive):
- Text Mode (-t): Output only extracted text (default)
- JSON Mode (-j): Output complete raw info including text, position, and confidence as JSON
Confidence scores for each detected text block
Bounding box information (text position in image)
Output to console or file
Precise or fast recognition modes

Basic usage:

swift scripts/ocr_vision_pro.swift <image_path>

With options:

swift scripts/ocr_vision_pro.swift <image_path> -l zh-Hans,en -o output.txt -f

2. Text Extraction from PDF Files

Use scripts/pdf_ocr.swift to extract text from PDF files with the following features:

Extract text from specific pages or all pages
Support page range specification (e.g., 1-5, 1,3,5)
Two output modes (mutually exclusive):
- Text Mode (-t): Output only extracted text (default)
- JSON Mode (-j): Output complete raw info as JSON
Same multi-language support as image OCR
Precise or fast recognition modes

Basic usage (all pages):

swift scripts/pdf_ocr.swift <pdf_path>

With page specification:

# Single page
swift scripts/pdf_ocr.swift document.pdf -p 1

# Multiple pages
swift scripts/pdf_ocr.swift document.pdf -p 1,3,5

# Page range
swift scripts/pdf_ocr.swift document.pdf -p 1-5

# JSON mode
swift scripts/pdf_ocr.swift document.pdf -p 1 -j

3. Output Modes (Mutually Exclusive)

The script supports two output modes that cannot be used simultaneously:

Text Mode (Default, `-t`)

Outputs only the extracted text:

Console output: Pure text
File output (-o or -t): Saves text to file, optionally with separate confidence file

JSON Mode (`-j`)

Outputs complete raw information as JSON:

Contains: image path, total blocks, average confidence, and per-block details
Per-block info: index, text, confidence, bounding box (x, y, width, height)
Outputs to stdout only (no file output options in JSON mode)

JSON output structure:

{
  "imagePath": "/path/to/image.png",
  "totalBlocks": 25,
  "averageConfidence": 0.85,
  "blocks": [
    {
      "index": 1,
      "text": "recognized text",
      "confidence": 0.95,
      "boundingBox": {
        "x": 0.10,
        "y": 0.20,
        "width": 0.30,
        "height": 0.05
      }
    }
  ]
}

4. Supported File Formats

Image Formats (ocr_vision_pro.swift):

PNG (.png)
JPEG (.jpg, .jpeg)
TIFF (.tiff, .tif)
BMP (.bmp)

PDF Format (pdf_ocr.swift):

PDF (.pdf) - support single page, multiple pages, or page ranges
Specify pages with -p option: 1, 1,3,5, or 1-5

4. Command-Line Options

Option	Description
`-h`, `--help`	Show help information
`-t`, `--text`	Text mode (default, output only extracted text)
`-j`, `--json`	JSON mode (output complete raw info as JSON)
`-l`, `--language <lang>`	Specify recognition language (comma-separated)
`-o`, `--output <file>`	Output text to file, auto-generate confidence file (`<file>_confidence.txt`)
`-t`, `--text <file>`	Output only complete text to specified file (text mode)
`-c`, `--confidence <file>`	Output only confidence details to specified file (text mode)
`-f`, `--fast`	Use fast mode (default: precise mode)

Note: -t (text mode) and -j (JSON mode) are mutually exclusive. JSON mode outputs to stdout only.

Supported languages:

zh-Hans - Simplified Chinese
zh-Hant - Traditional Chinese
en - English
ja - Japanese
ko - Korean
fr - French
de - German
es - Spanish
it - Italian
pt - Portuguese
ru - Russian

Workflow

Step 1: Identify the Image Path

When the user requests OCR:

Ask for the image path if not provided
Accept common path formats: absolute paths, ~/path, or relative paths
Validate that the file exists before proceeding

Step 2: Determine Recognition Parameters

Based on user request or context:

Language: Default to zh-Hans,en for Chinese users, or en for English users
Mode: Use precise mode (default) for accuracy, fast mode (-f) for quick preview
Output: Ask if user wants results saved to file (-o option)

Step 3: Execute OCR

Run the OCR script with appropriate parameters:

swift scripts/ocr_vision_pro.swift "<image_path>" -l zh-Hans,en

For saving to separate files (recommended):

swift scripts/ocr_vision_pro.swift "<image_path>" -o "<output>"

This automatically creates two files:

<output>.txt - Complete extracted text (pure text, no formatting)
<output>_confidence.txt - Confidence details with statistics and per-block info

For separate text and confidence files with custom names:

swift scripts/ocr_vision_pro.swift "<image_path>" -t "text.txt" -c "confidence.txt"

Step 4: Present Results

After OCR completes:

Display the extracted text to the user
If saved to file, inform the user of the output file path
Ask if user wants to:
- Correct misrecognized characters
- Process another image
- Save results in a different format

Step 5: PDF Processing (if PDF file)

When processing a PDF file:

Identify PDF path and pages:
- Ask for PDF path if not provided
- Ask which pages to process (default: all pages)
- Support formats: 1, 1,3,5, or 1-5
Determine recognition parameters:
- Language: Default to zh-hans,zh-hant,en
- Mode: Precise (default) or fast (-f)
- Output: Text mode (default) or JSON mode (-j)
Execute PDF OCR:

# All pages
swift scripts/pdf_ocr.swift "<pdf_path>"

# Specific pages
swift scripts/pdf_ocr.swift "<pdf_path>" -p 1,3,5

# Page range
swift scripts/pdf_ocr.swift "<pdf_path>" -p 1-5

# JSON mode
swift scripts/pdf_ocr.swift "<pdf_path>" -p 1 -j

Present results:
- Text mode: Display text by page
- JSON mode: Output complete JSON to stdout
- Inform user of output format and options

Output Format

The script supports two mutually exclusive output modes:

Text Mode (Default, `-t`)

Outputs only the extracted text.

Console Output (without `-o` or `-t <file>`):

[Extracted text content]

File Output (`-o` option):

Creates <output>.txt with pure text.

With Confidence Details (console, when not using `-j`):

=== 置信度详情 ===

总识别块数: 25
平均置信度: 0.85

--- 逐块详情 ---

[1] Text content
    置信度: 0.95
    位置: x=0.10, y=0.20, w=0.30, h=0.05

--- 低置信度警告 (< 0.8) ---
[3] "Some text" - 置信度: 0.50

File Output with Confidence (`-o` option):

<output>.txt - Complete extracted text
<output>_confidence.txt - Confidence details

JSON Mode (`-j`)

Outputs complete raw information as JSON to stdout (no file output in JSON mode).

JSON Structure:

{
  "imagePath": "/path/to/image.png",
  "totalBlocks": 25,
  "averageConfidence": 0.85,
  "blocks": [
    {
      "index": 1,
      "text": "recognized text",
      "confidence": 0.95,
      "boundingBox": {
        "x": 0.10,
        "y": 0.20,
        "width": 0.30,
        "height": 0.05
      }
    }
  ]
}

Fields:

imagePath: Path to the processed image
totalBlocks: Total number of recognized text blocks
averageConfidence: Average confidence score (0.0 - 1.0)
blocks: Array of recognized text blocks
- index: Block index (1-based)
- text: Recognized text content
- confidence: Confidence score (0.0 - 1.0)
- boundingBox: Normalized bounding box coordinates (0.0 - 1.0)

Tips and Best Practices

macOS Only: This skill requires macOS. It will not work on Linux, Windows, or other operating systems.
Image Quality: Higher resolution images produce better results
Language Specification: Always specify language for better accuracy
Precise vs Fast: Use precise mode for final results, fast mode for testing

Batch Processing: For multiple images, use shell loop:

for img in *.png; do
    swift scripts/ocr_vision_pro.swift "$img" -o "${img%.png}.txt"
done

Confidence Threshold: Results with confidence < 0.5 may need manual verification

References

For detailed usage instructions and examples, load references/usage.md.

Resources

scripts/

ocr_vision_pro.swift - Enhanced OCR script for images with full feature support
ocr_vision.swift - Basic OCR script for simple use cases
pdf_ocr.swift - PDF OCR script for extracting text from PDF files

references/

usage.md - Comprehensive usage guide with examples

local-ocr

Safety Notice

Copy this and send it to your AI assistant to learn

Local OCR (macOS Only)

Overview

Platform Requirements

When to Use This Skill

Core Capabilities

1. Text Extraction from Images

2. Text Extraction from PDF Files

3. Output Modes (Mutually Exclusive)

Text Mode (Default, `-t`)

JSON Mode (`-j`)

4. Supported File Formats

4. Command-Line Options

Workflow

Step 1: Identify the Image Path

Step 2: Determine Recognition Parameters

Step 3: Execute OCR

Step 4: Present Results

Step 5: PDF Processing (if PDF file)

Output Format

Text Mode (Default, `-t`)

Console Output (without `-o` or `-t <file>`):

File Output (`-o` option):

With Confidence Details (console, when not using `-j`):

File Output with Confidence (`-o` option):

JSON Mode (`-j`)

Tips and Best Practices

References

Resources

scripts/

references/

Source Transparency

Related Skills

Google Workspace (No Cloud Console)

powershell-module-architect

powershell-ui-architect

postgres-pro

local-ocr

Safety Notice

Copy this and send it to your AI assistant to learn

Local OCR (macOS Only)

Overview

Platform Requirements

When to Use This Skill

Core Capabilities

1. Text Extraction from Images

2. Text Extraction from PDF Files

3. Output Modes (Mutually Exclusive)

Text Mode (Default, -t)

JSON Mode (-j)

4. Supported File Formats

4. Command-Line Options

Workflow

Step 1: Identify the Image Path

Step 2: Determine Recognition Parameters

Step 3: Execute OCR

Step 4: Present Results

Step 5: PDF Processing (if PDF file)

Output Format

Text Mode (Default, -t)

Console Output (without -o or -t <file>):

File Output (-o option):

With Confidence Details (console, when not using -j):

File Output with Confidence (-o option):

JSON Mode (-j)

Tips and Best Practices

References

Resources

scripts/

references/

Source Transparency

Related Skills

Google Workspace (No Cloud Console)

powershell-module-architect

powershell-ui-architect

postgres-pro

Text Mode (Default, `-t`)

JSON Mode (`-j`)

Text Mode (Default, `-t`)

Console Output (without `-o` or `-t <file>`):

File Output (`-o` option):

With Confidence Details (console, when not using `-j`):

File Output with Confidence (`-o` option):

JSON Mode (`-j`)