SmolVLM - Local Image Analysis

Analyze images locally with SmolVLM-2B, a compact vision-language model that runs on Apple Silicon via mlx-vlm.

Quick Usage

Describe an Image

python ~/.claude/skills/smolvlm/scripts/view_image.py /path/to/image.png

Ask a Question About an Image

python ~/.claude/skills/smolvlm/scripts/view_image.py /path/to/image.png "What text is visible?"

Specific Tasks

Extract text (OCR)

python ~/.claude/skills/smolvlm/scripts/view_image.py screenshot.png "Extract all text"

UI analysis

python ~/.claude/skills/smolvlm/scripts/view_image.py ui.png "Describe the UI elements"

Detailed description

python ~/.claude/skills/smolvlm/scripts/view_image.py photo.jpg --detailed
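When the same tasks need to run over several images, it can help to build the commands programmatically. The following is a minimal sketch, assuming only the invocation forms shown above (an image path, an optional prompt, and the --detailed flag). The helper name build_command is hypothetical, and whether --detailed may be combined with a prompt is not documented here.

```python
import os
import shlex

# Script path taken from the usage examples above.
SCRIPT = os.path.expanduser("~/.claude/skills/smolvlm/scripts/view_image.py")


def build_command(image, prompt=None, detailed=False):
    """Return the argv list for one view_image.py call (hypothetical helper)."""
    cmd = ["python", SCRIPT, image]
    if prompt is not None:
        cmd.append(prompt)  # optional question or instruction
    if detailed:
        cmd.append("--detailed")  # flag shown in the examples above
    return cmd


if __name__ == "__main__":
    jobs = [
        ("screenshot.png", "Extract all text"),
        ("ui.png", "Describe the UI elements"),
    ]
    for image, prompt in jobs:
        # Print the shell-quoted command instead of running it.
        print(shlex.join(build_command(image, prompt)))
```

To execute a command rather than print it, the returned list can be passed to subprocess.run(cmd, check=True).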

Effective Prompts

General Description

"Describe this image": Basic description.
"Describe this image in detail, including colors, composition, and any text": Comprehensive description.

Text Extraction (OCR)

"Extract all visible text from this image"
"What text appears in this screenshot?"
"Read the text in this document"

UI/Screenshot Analysis

"Describe the user interface elements"
"What buttons and controls are visible?"
"Identify the application and its current state"

Visual Question Answering

"How many [objects] are in this image?"
"What color is the [object]?"
"Is there a [object] in this image?"

Code/Technical

"What programming language is shown?"
"Describe what this code does"
"Identify any errors in this code screenshot"
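The bracketed placeholders in the prompts above ([object], [objects]) are meant to be filled in before use. One way to keep the prompts reusable is a small template table. This is only a sketch: the task keys and the make_prompt helper are hypothetical names, and the prompt wording is copied from the lists above.

```python
# Prompt templates copied from the lists above; the keys are illustrative.
PROMPTS = {
    "describe": "Describe this image",
    "describe_detailed": (
        "Describe this image in detail, including colors, composition, and any text"
    ),
    "ocr": "Extract all visible text from this image",
    "ui": "Describe the user interface elements",
    "count": "How many {objects} are in this image?",
    "color": "What color is the {object}?",
    "presence": "Is there a {object} in this image?",
    "code_errors": "Identify any errors in this code screenshot",
}


def make_prompt(task, **fields):
    """Fill the template for `task` with the given placeholder values."""
    return PROMPTS[task].format(**fields)


if __name__ == "__main__":
    print(make_prompt("count", objects="cars"))
    print(make_prompt("color", object="door"))
```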

Model Details

Model: SmolVLM-2B-Instruct
Size: ~4GB
Peak memory: 5.8GB
Speed: ~94 tok/s (M-series)
Supported formats: PNG, JPG, JPEG, GIF, WebP
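Because only the formats listed above are supported, a caller may want to filter files before invoking the script. A minimal sketch follows, assuming the format is judged by file extension alone. The helper is_supported is a hypothetical name and does not inspect file contents.

```python
from pathlib import Path

# Extensions corresponding to the supported formats listed above.
SUPPORTED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".webp"}


def is_supported(path):
    """Return True if the file extension matches a supported format."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


if __name__ == "__main__":
    for name in ["screenshot.PNG", "photo.jpeg", "scan.bmp"]:
        print(name, is_supported(name))
```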

Requirements

macOS with Apple Silicon (M1/M2/M3)
Python 3.10+
mlx-vlm package: uv pip install mlx-vlm --system
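The requirements above can be checked before the first run. The sketch below makes a few assumptions: Apple Silicon is detected as macOS reporting the arm64 architecture (a Python running under Rosetta would report x86_64 instead), and the mlx-vlm package is looked up under the import name mlx_vlm. The function name check_environment is hypothetical.

```python
import importlib.util
import platform
import sys


def check_environment(system=None, machine=None, version=None):
    """Report which requirements are met; arguments override detection."""
    system = system or platform.system()
    machine = machine or platform.machine()
    version = version or sys.version_info[:2]
    return {
        # macOS identifies itself as "Darwin"; Apple Silicon as "arm64".
        "apple_silicon": system == "Darwin" and machine == "arm64",
        "python_3_10_plus": tuple(version) >= (3, 10),
        # Assumes the mlx-vlm package is importable as mlx_vlm.
        "mlx_vlm_installed": importlib.util.find_spec("mlx_vlm") is not None,
    }


if __name__ == "__main__":
    for requirement, ok in check_environment().items():
        print(f"{requirement}: {'ok' if ok else 'missing'}")
```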

Troubleshooting

"Model not found": First run downloads the model (~4GB). Wait for completion.

Out of memory: Close other applications. Model needs ~6GB free RAM.

Slow first inference: Model loading takes 10-15s on first use; subsequent calls are faster.
