Local OCR (macOS Only)
Overview
⚠️ Platform Requirement: This skill is macOS only. It requires macOS 10.15+ (Catalina or later) and uses macOS native frameworks:
- Vision framework - For OCR text recognition
- PDFKit framework - For PDF processing
- Core Graphics - For image rendering
This skill provides OCR (Optical Character Recognition) capabilities using macOS native Vision framework. It extracts text from images and PDFs without requiring any third-party libraries or internet connection.
Platform Requirements
⚠️ macOS Only - This skill cannot run on Linux, Windows, or other operating systems.
Required:
- macOS 10.15+ (Catalina or later)
- Vision framework (pre-installed on macOS)
- PDFKit framework (pre-installed on macOS)
Why macOS Only?
- Uses
Visionframework for OCR (macOS/iOS only) - Uses
PDFKitframework for PDF processing (macOS/iOS only) - Uses
AppKit/Core Graphicsfor image handling (macOS only)
When to Use This Skill
Trigger this skill when the user:
- Requests OCR or image text extraction
- Mentions extracting text from images, screenshots, PDF files, or scanned documents
- Uses keywords like: "识别图片", "OCR", "提取文字", "提取PDF文字", "识别PDF", "extract text from image", "PDF OCR"
- Provides an image file or PDF file and asks to read or extract its content
Core Capabilities
1. Text Extraction from Images
Use scripts/ocr_vision_pro.swift for comprehensive OCR with the following features:
- Multi-language support (Chinese, English, Japanese, Korean, and more)
- Two output modes (mutually exclusive):
- Text Mode (
-t): Output only extracted text (default) - JSON Mode (
-j): Output complete raw info including text, position, and confidence as JSON
- Text Mode (
- Confidence scores for each detected text block
- Bounding box information (text position in image)
- Output to console or file
- Precise or fast recognition modes
Basic usage:
swift scripts/ocr_vision_pro.swift <image_path>
With options:
swift scripts/ocr_vision_pro.swift <image_path> -l zh-Hans,en -o output.txt -f
2. Text Extraction from PDF Files
Use scripts/pdf_ocr.swift to extract text from PDF files with the following features:
- Extract text from specific pages or all pages
- Support page range specification (e.g.,
1-5,1,3,5) - Two output modes (mutually exclusive):
- Text Mode (
-t): Output only extracted text (default) - JSON Mode (
-j): Output complete raw info as JSON
- Text Mode (
- Same multi-language support as image OCR
- Precise or fast recognition modes
Basic usage (all pages):
swift scripts/pdf_ocr.swift <pdf_path>
With page specification:
# Single page
swift scripts/pdf_ocr.swift document.pdf -p 1
# Multiple pages
swift scripts/pdf_ocr.swift document.pdf -p 1,3,5
# Page range
swift scripts/pdf_ocr.swift document.pdf -p 1-5
# JSON mode
swift scripts/pdf_ocr.swift document.pdf -p 1 -j
3. Output Modes (Mutually Exclusive)
The script supports two output modes that cannot be used simultaneously:
Text Mode (Default, -t)
Outputs only the extracted text:
- Console output: Pure text
- File output (
-oor-t): Saves text to file, optionally with separate confidence file
JSON Mode (-j)
Outputs complete raw information as JSON:
- Contains: image path, total blocks, average confidence, and per-block details
- Per-block info: index, text, confidence, bounding box (x, y, width, height)
- Outputs to stdout only (no file output options in JSON mode)
JSON output structure:
{
"imagePath": "/path/to/image.png",
"totalBlocks": 25,
"averageConfidence": 0.85,
"blocks": [
{
"index": 1,
"text": "recognized text",
"confidence": 0.95,
"boundingBox": {
"x": 0.10,
"y": 0.20,
"width": 0.30,
"height": 0.05
}
}
]
}
4. Supported File Formats
Image Formats (ocr_vision_pro.swift):
- PNG (.png)
- JPEG (.jpg, .jpeg)
- TIFF (.tiff, .tif)
- BMP (.bmp)
PDF Format (pdf_ocr.swift):
- PDF (.pdf) - support single page, multiple pages, or page ranges
- Specify pages with
-poption:1,1,3,5, or1-5
4. Command-Line Options
| Option | Description |
|---|---|
-h, --help | Show help information |
-t, --text | Text mode (default, output only extracted text) |
-j, --json | JSON mode (output complete raw info as JSON) |
-l, --language <lang> | Specify recognition language (comma-separated) |
-o, --output <file> | Output text to file, auto-generate confidence file (<file>_confidence.txt) |
-t, --text <file> | Output only complete text to specified file (text mode) |
-c, --confidence <file> | Output only confidence details to specified file (text mode) |
-f, --fast | Use fast mode (default: precise mode) |
Note: -t (text mode) and -j (JSON mode) are mutually exclusive. JSON mode outputs to stdout only.
Supported languages:
zh-Hans- Simplified Chinesezh-Hant- Traditional Chineseen- Englishja- Japaneseko- Koreanfr- Frenchde- Germanes- Spanishit- Italianpt- Portugueseru- Russian
Workflow
Step 1: Identify the Image Path
When the user requests OCR:
- Ask for the image path if not provided
- Accept common path formats: absolute paths, ~/path, or relative paths
- Validate that the file exists before proceeding
Step 2: Determine Recognition Parameters
Based on user request or context:
- Language: Default to
zh-Hans,enfor Chinese users, orenfor English users - Mode: Use precise mode (default) for accuracy, fast mode (
-f) for quick preview - Output: Ask if user wants results saved to file (
-ooption)
Step 3: Execute OCR
Run the OCR script with appropriate parameters:
swift scripts/ocr_vision_pro.swift "<image_path>" -l zh-Hans,en
For saving to separate files (recommended):
swift scripts/ocr_vision_pro.swift "<image_path>" -o "<output>"
This automatically creates two files:
<output>.txt- Complete extracted text (pure text, no formatting)<output>_confidence.txt- Confidence details with statistics and per-block info
For separate text and confidence files with custom names:
swift scripts/ocr_vision_pro.swift "<image_path>" -t "text.txt" -c "confidence.txt"
Step 4: Present Results
After OCR completes:
- Display the extracted text to the user
- If saved to file, inform the user of the output file path
- Ask if user wants to:
- Correct misrecognized characters
- Process another image
- Save results in a different format
Step 5: PDF Processing (if PDF file)
When processing a PDF file:
-
Identify PDF path and pages:
- Ask for PDF path if not provided
- Ask which pages to process (default: all pages)
- Support formats:
1,1,3,5, or1-5
-
Determine recognition parameters:
- Language: Default to
zh-hans,zh-hant,en - Mode: Precise (default) or fast (
-f) - Output: Text mode (default) or JSON mode (
-j)
- Language: Default to
-
Execute PDF OCR:
# All pages
swift scripts/pdf_ocr.swift "<pdf_path>"
# Specific pages
swift scripts/pdf_ocr.swift "<pdf_path>" -p 1,3,5
# Page range
swift scripts/pdf_ocr.swift "<pdf_path>" -p 1-5
# JSON mode
swift scripts/pdf_ocr.swift "<pdf_path>" -p 1 -j
- Present results:
- Text mode: Display text by page
- JSON mode: Output complete JSON to stdout
- Inform user of output format and options
Output Format
The script supports two mutually exclusive output modes:
Text Mode (Default, -t)
Outputs only the extracted text.
Console Output (without -o or -t <file>):
[Extracted text content]
File Output (-o option):
Creates <output>.txt with pure text.
With Confidence Details (console, when not using -j):
=== 置信度详情 ===
总识别块数: 25
平均置信度: 0.85
--- 逐块详情 ---
[1] Text content
置信度: 0.95
位置: x=0.10, y=0.20, w=0.30, h=0.05
--- 低置信度警告 (< 0.8) ---
[3] "Some text" - 置信度: 0.50
File Output with Confidence (-o option):
<output>.txt- Complete extracted text<output>_confidence.txt- Confidence details
JSON Mode (-j)
Outputs complete raw information as JSON to stdout (no file output in JSON mode).
JSON Structure:
{
"imagePath": "/path/to/image.png",
"totalBlocks": 25,
"averageConfidence": 0.85,
"blocks": [
{
"index": 1,
"text": "recognized text",
"confidence": 0.95,
"boundingBox": {
"x": 0.10,
"y": 0.20,
"width": 0.30,
"height": 0.05
}
}
]
}
Fields:
imagePath: Path to the processed imagetotalBlocks: Total number of recognized text blocksaverageConfidence: Average confidence score (0.0 - 1.0)blocks: Array of recognized text blocksindex: Block index (1-based)text: Recognized text contentconfidence: Confidence score (0.0 - 1.0)boundingBox: Normalized bounding box coordinates (0.0 - 1.0)
Tips and Best Practices
- macOS Only: This skill requires macOS. It will not work on Linux, Windows, or other operating systems.
- Image Quality: Higher resolution images produce better results
- Language Specification: Always specify language for better accuracy
- Precise vs Fast: Use precise mode for final results, fast mode for testing
- Batch Processing: For multiple images, use shell loop:
for img in *.png; do swift scripts/ocr_vision_pro.swift "$img" -o "${img%.png}.txt" done - Confidence Threshold: Results with confidence < 0.5 may need manual verification
References
For detailed usage instructions and examples, load references/usage.md.
Resources
scripts/
ocr_vision_pro.swift- Enhanced OCR script for images with full feature supportocr_vision.swift- Basic OCR script for simple use casespdf_ocr.swift- PDF OCR script for extracting text from PDF files
references/
usage.md- Comprehensive usage guide with examples