vision-multimodal

Vision & Multimodal Skill

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "vision-multimodal" with this command: npx skills add lobbi-docs/claude/lobbi-docs-claude-vision-multimodal

Vision & Multimodal Skill

Leverage Claude's vision capabilities for image analysis, document processing, and multimodal understanding.

When to Use This Skill

  • Image analysis and description

  • Document/PDF processing

  • Screenshot analysis

  • OCR-like text extraction

  • Visual comparison

  • Chart and diagram interpretation

Supported Formats

Format Status Best For

JPEG ✓ Photos, natural scenes

PNG ✓ Screenshots, UI, text

GIF ✓ Animated (first frame)

WebP ✓ Modern, compressed

PDF ✓ Documents (via Files API)

Image Size Guidelines

  • Minimum: 200 pixels (smaller = reduced accuracy)

  • Optimal: 1000x1000 pixels

  • Maximum: 8000x8000 pixels

  • Token cost: ~(width × height) / 1000

  • Tip: Resize to 1568px max dimension for 30-50% token savings

Core Patterns

Pattern 1: Single Image Analysis

import anthropic import base64

client = anthropic.Anthropic()

Load and encode image

with open("image.jpg", "rb") as f: image_data = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create( model="claude-sonnet-4-20250514", max_tokens=1024, messages=[{ "role": "user", "content": [ { "type": "image", "source": { "type": "base64", "media_type": "image/jpeg", "data": image_data } }, { "type": "text", "text": "Describe this image in detail." } ] }] )

Pattern 2: Image from URL

import httpx

Fetch and encode from URL

image_url = "https://example.com/image.jpg" response = httpx.get(image_url) image_data = base64.standard_b64encode(response.content).decode("utf-8")

Then use same pattern as above

Pattern 3: Multiple Images

Compare multiple images (up to 100 per request)

messages = [{ "role": "user", "content": [ {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": image1}}, {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": image2}}, {"type": "text", "text": "Compare these two images and list the differences."} ] }]

Pattern 4: Few-Shot with Images

Teach by example

messages = [ # Example 1 {"role": "user", "content": [ {"type": "image", "source": {...}}, {"type": "text", "text": "Classify this image."} ]}, {"role": "assistant", "content": "Category: Landscape\nElements: Mountains, lake, trees"},

# Example 2
{"role": "user", "content": [
    {"type": "image", "source": {...}},
    {"type": "text", "text": "Classify this image."}
]},
{"role": "assistant", "content": "Category: Portrait\nElements: Person, indoor, professional"},

# Target image
{"role": "user", "content": [
    {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": target_image}},
    {"type": "text", "text": "Classify this image."}
]}

]

Pattern 5: PDF Processing

Using Files API (beta)

with open("document.pdf", "rb") as f: pdf_data = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create( model="claude-sonnet-4-20250514", max_tokens=4096, messages=[{ "role": "user", "content": [ { "type": "document", "source": { "type": "base64", "media_type": "application/pdf", "data": pdf_data } }, {"type": "text", "text": "Summarize this document."} ] }] )

Prompt Engineering for Vision

Strategy 1: Role Assignment

prompt = """You have perfect vision and exceptional attention to detail, making you an expert at analyzing technical diagrams.

Analyze this architecture diagram and identify:

  1. All components
  2. Data flow between components
  3. Potential bottlenecks"""

Strategy 2: Step-by-Step Thinking

prompt = """Before answering, analyze the image systematically:

<thinking>

  1. What is the overall subject?
  2. What are the key elements?
  3. How do elements relate to each other?
  4. What details stand out? </thinking>

Then provide your answer based on this analysis."""

Strategy 3: Structured Output

prompt = """Extract information from this receipt and return as JSON:

{ "vendor": "", "date": "", "items": [{"name": "", "price": 0}], "total": 0 }"""

Image Optimization

from PIL import Image import io

def optimize_for_claude(image_path, max_dimension=1568): """Resize image to reduce token usage by 30-50%""" with Image.open(image_path) as img: # Calculate new dimensions ratio = min(max_dimension / img.width, max_dimension / img.height) if ratio < 1: new_size = (int(img.width * ratio), int(img.height * ratio)) img = img.resize(new_size, Image.LANCZOS)

    # Convert to bytes
    buffer = io.BytesIO()
    img.save(buffer, format="JPEG", quality=85)
    return base64.standard_b64encode(buffer.getvalue()).decode("utf-8")

Common Use Cases

Text Extraction (OCR-like)

prompt = """Extract all text from this image. Preserve the original formatting and structure as much as possible. If text is unclear, indicate with [unclear]."""

Table Extraction

prompt = """Extract the table data from this image. Return as a markdown table with proper headers and alignment."""

Chart Analysis

prompt = """Analyze this chart:

  1. What type of chart is this?
  2. What are the axes/labels?
  3. What are the key data points?
  4. What trends or patterns are visible?"""

Best Practices

DO:

  • Use high-quality images (≥1000px)

  • Resize large images to save tokens

  • Provide context about what to look for

  • Use few-shot examples for consistent output

DON'T:

  • Send images smaller than 200px

  • Expect perfect OCR for handwriting

  • Send very large images (>8000px)

  • Ignore token costs for multiple images

Limitations

  • Cannot identify specific individuals

  • May struggle with very small text

  • Animated GIFs: only first frame analyzed

  • Some specialized symbols may be misread

See Also

  • [[llm-integration]] - API basics

  • [[extended-thinking]] - Complex reasoning

  • [[citations-retrieval]] - Document citations

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

General

design-system

No summary provided by upstream source.

Repository SourceNeeds Review
General

kanban

No summary provided by upstream source.

Repository SourceNeeds Review
General

complex-reasoning

No summary provided by upstream source.

Repository SourceNeeds Review
General

gcp

No summary provided by upstream source.

Repository SourceNeeds Review