# Multimodal LLM Patterns
Integrate vision, audio, and video generation capabilities from leading multimodal models. Covers image analysis, document understanding, real-time voice agents, speech-to-text, text-to-speech, and AI video generation (Kling 3.0, Sora 2, Veo 3.1, Runway Gen-4.5).
## Quick Reference
| Category | Rules | Impact | When to Use |
| --- | --- | --- | --- |
| Vision: Image Analysis | 1 | HIGH | Image captioning, VQA, multi-image comparison, object detection |
| Vision: Document Understanding | 1 | HIGH | OCR, chart/diagram analysis, PDF processing, table extraction |
| Vision: Model Selection | 1 | MEDIUM | Choosing provider, cost optimization, image size limits |
| Audio: Speech-to-Text | 1 | HIGH | Transcription, speaker diarization, long-form audio |
| Audio: Text-to-Speech | 1 | MEDIUM | Voice synthesis, expressive TTS, multi-speaker dialogue |
| Audio: Model Selection | 1 | MEDIUM | Real-time voice agents, provider comparison, pricing |
| Video: Model Selection | 1 | HIGH | Choosing video gen provider (Kling, Sora, Veo, Runway) |
| Video: API Patterns | 1 | HIGH | Async task polling, SDK integration, webhook callbacks |
| Video: Multi-Shot | 1 | HIGH | Storyboarding, character elements, scene consistency |

Total: 9 rules across 3 categories (Vision, Audio, Video Generation)
## Vision: Image Analysis

Send images to multimodal LLMs for captioning, visual QA, and object detection. Always set `max_tokens` and resize images before encoding.

| Rule | File | Key Pattern |
| --- | --- | --- |
| Image Analysis | `rules/vision-image-analysis.md` | Base64 encoding, multi-image, bounding boxes |
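A minimal sketch of the resize-then-encode flow. The use of Pillow, the filename, and the 2048px cap are assumptions for illustration (the cap mirrors the oversized-image pitfall in Common Mistakes below); the request shape matches the example at the end of this page.

```python
import base64
import io

import anthropic
from PIL import Image  # library choice is an assumption; any image tool works

MAX_DIM = 2048  # >2048px images are flagged as a common mistake below

def encode_image(path: str) -> str:
    """Downscale the longest edge to MAX_DIM, then base64-encode as PNG."""
    img = Image.open(path)
    img.thumbnail((MAX_DIM, MAX_DIM))  # in-place, aspect-preserving; no-op if small
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return base64.standard_b64encode(buf.getvalue()).decode("utf-8")

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-opus-4-6",  # model name reused from the example below
    max_tokens=1024,          # always set; truncated responses are a common mistake
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64", "media_type": "image/png",
                        "data": encode_image("photo.png")}},
            {"type": "text", "text": "Describe this image."},
        ],
    }],
)
print(response.content[0].text)
```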
## Vision: Document Understanding

Extract structured data from documents, charts, and PDFs using vision models.

| Rule | File | Key Pattern |
| --- | --- | --- |
| Document Vision | `rules/vision-document.md` | PDF page ranges, detail levels, OCR strategies |
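A sketch of PDF extraction using Anthropic's base64 `document` content block (per Anthropic's PDF support docs); the model name is reused from the example at the end of this page, and the filename and extraction prompt are illustrative.

```python
import base64

import anthropic

with open("report.pdf", "rb") as f:
    pdf_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=2048,
    messages=[{
        "role": "user",
        "content": [
            # PDFs go in a "document" block rather than an "image" block
            {"type": "document",
             "source": {"type": "base64",
                        "media_type": "application/pdf",
                        "data": pdf_b64}},
            {"type": "text",
             "text": "Extract every table as CSV, one code block per table."},
        ],
    }],
)
print(response.content[0].text)
```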
## Vision: Model Selection

Choose the right vision provider based on accuracy, cost, and context-window needs.

| Rule | File | Key Pattern |
| --- | --- | --- |
| Vision Models | `rules/vision-models.md` | Provider comparison, token costs, image limits |
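For cost comparisons, a back-of-envelope estimator helps. The tokens-per-image formula below is Anthropic's published approximation (width × height / 750); the per-million-token rate is the Gemini 2.5 Flash figure from the Key Decisions table and is illustrative only, so check current provider pricing.

```python
def image_tokens(width: int, height: int) -> int:
    # Anthropic's approximation: tokens ~= (width px * height px) / 750
    return (width * height) // 750

def image_cost_usd(tokens: int, usd_per_million_tokens: float) -> float:
    return tokens * usd_per_million_tokens / 1_000_000

tokens = image_tokens(1568, 1568)  # ~3278 tokens for a 1568x1568 image
print(f"{tokens} tokens, ${image_cost_usd(tokens, 0.15):.6f} per image")
```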
## Audio: Speech-to-Text

Convert audio to text with speaker diarization, timestamps, and sentiment analysis.

| Rule | File | Key Pattern |
| --- | --- | --- |
| Speech-to-Text | `rules/audio-speech-to-text.md` | Gemini long-form, GPT-4o-Transcribe, AssemblyAI features |
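A minimal transcription sketch against OpenAI's audio endpoint, using the GPT-4o-Transcribe model named above; the filename is illustrative. Diarization and word-level timestamps vary by provider (AssemblyAI exposes both natively).

```python
from openai import OpenAI

client = OpenAI()

with open("meeting.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",  # successor to the deprecated Whisper-1
        file=f,
    )

print(transcript.text)
```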
## Audio: Text-to-Speech

Generate natural speech from text with voice selection and expressive cues.

| Rule | File | Key Pattern |
| --- | --- | --- |
| Text-to-Speech | `rules/audio-text-to-speech.md` | Gemini TTS, voice config, auditory cues |
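A sketch of Gemini TTS with an explicit voice config via the `google-genai` SDK. The preview model name was current at the time of writing, so verify it before use; output is raw PCM per Gemini's TTS docs, so wrap it in a WAV header for playback.

```python
from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment

response = client.models.generate_content(
    model="gemini-2.5-flash-preview-tts",  # verify the current model name
    contents="Say warmly: Welcome back! It's great to hear from you.",
    config=types.GenerateContentConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            voice_config=types.VoiceConfig(
                prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Kore")
            )
        ),
    ),
)

audio = response.candidates[0].content.parts[0].inline_data.data
with open("welcome.pcm", "wb") as f:  # raw PCM; add a WAV header to play it
    f.write(audio)
```

Note the style cue embedded in the prompt ("Say warmly:"); this is how Gemini TTS takes expressive direction rather than via a separate parameter.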
## Audio: Model Selection

Select the right audio/voice provider for real-time, transcription, or TTS use cases.

| Rule | File | Key Pattern |
| --- | --- | --- |
| Audio Models | `rules/audio-models.md` | Real-time voice comparison, STT benchmarks, pricing |
## Video: Model Selection

Choose the right video generation provider based on use case, duration, and budget.

| Rule | File | Key Pattern |
| --- | --- | --- |
| Video Models | `rules/video-generation-models.md` | Kling vs Sora vs Veo vs Runway, pricing, capabilities |
## Video: API Patterns

Integrate video generation APIs with proper async polling, SDKs, and webhook callbacks.

| Rule | File | Key Pattern |
| --- | --- | --- |
| API Integration | `rules/video-generation-patterns.md` | Kling REST, fal.ai SDK, Vercel AI SDK, task polling |
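The pattern common to all providers is submit-then-poll. Everything provider-specific in this sketch is a hypothetical placeholder: the base URL, endpoint paths, payload fields, and status values are illustrative, not any real provider's schema. Substitute your provider's actual routes (Kling REST, fal.ai SDK, etc.), and prefer webhook callbacks where supported to avoid polling entirely.

```python
import time

import requests

API = "https://api.example-video.com/v1"  # HYPOTHETICAL base URL
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

# 1. Submit the generation task (video APIs are async; never block on this call).
task = requests.post(
    f"{API}/tasks",
    headers=HEADERS,
    json={"prompt": "A drone shot over a misty forest at dawn", "duration": 5},
    timeout=30,
).json()

# 2. Poll with exponential backoff until a terminal status.
#    Field names and status values below are illustrative placeholders.
delay = 5
while True:
    status = requests.get(
        f"{API}/tasks/{task['id']}", headers=HEADERS, timeout=30
    ).json()
    if status["state"] in ("succeeded", "failed"):
        break
    time.sleep(delay)
    delay = min(delay * 2, 60)  # cap the backoff at 60s

if status["state"] == "succeeded":
    print("Video URL:", status["video_url"])
```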
## Video: Multi-Shot

Generate multi-scene videos with consistent characters using storyboarding and character elements.

| Rule | File | Key Pattern |
| --- | --- | --- |
| Multi-Shot | `rules/video-multi-shot.md` | Kling 3.0 character elements, 6-shot storyboards, identity binding |
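A hypothetical storyboard payload illustrating the pattern; the field names are illustrative, not Kling's actual schema. The idea is to register a character element once with reference images, then reference it in every shot so identity stays bound across scenes instead of regenerating the character per clip.

```python
# HYPOTHETICAL payload shape -- consult rules/video-multi-shot.md and the
# provider docs for the real schema.
storyboard = {
    "elements": [
        {
            "id": "hero",
            "type": "character",
            # Multiple angles help the model bind a stable identity.
            "reference_images": ["hero_front.png", "hero_side.png"],
        },
    ],
    "shots": [
        {"prompt": "hero walks into a neon-lit alley, wide shot", "elements": ["hero"]},
        {"prompt": "close-up of hero's face, rain streaking down", "elements": ["hero"]},
        {"prompt": "hero turns and runs, tracking shot", "elements": ["hero"]},
    ],
}
```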
## Key Decisions

| Decision | Recommendation |
| --- | --- |
| High accuracy vision | Claude Opus 4.6 or GPT-5 |
| Long documents | Gemini 2.5 Pro (1M context) |
| Cost-efficient vision | Gemini 2.5 Flash ($0.15/M tokens) |
| Video analysis | Gemini 2.5/3 Pro (native video) |
| Voice assistant | Grok Voice Agent (fastest, <1s) |
| Emotional voice AI | Gemini Live API |
| Long audio transcription | Gemini 2.5 Pro (9.5hr) |
| Speaker diarization | AssemblyAI or Gemini |
| Self-hosted STT | Whisper Large V3 |
| Character-consistent video | Kling 3.0 (Character Elements 3.0) |
| Narrative video / storytelling | Sora 2 (best cause-and-effect coherence) |
| Cinematic B-roll | Veo 3.1 (camera control + polished motion) |
| Professional VFX | Runway Gen-4.5 (Act-Two motion transfer) |
| High-volume social video | Kling 3.0 Standard ($0.20/video) |
| Open-source video gen | Wan 2.6 or LTX-2 |
| Lip-sync / avatar video | Kling 3.0 (native lip-sync API) |
## Example

```python
import anthropic
import base64

client = anthropic.Anthropic()

with open("image.png", "rb") as f:
    b64 = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64", "media_type": "image/png", "data": b64}},
            {"type": "text", "text": "Describe this image"},
        ],
    }],
)
```
## Common Mistakes

- Not setting `max_tokens` on vision requests (responses get truncated)
- Sending oversized images (>2048px) without resizing
- Using a high detail level for simple yes/no classification
- Using an STT + LLM + TTS pipeline instead of native speech-to-speech
- Not leveraging barge-in support for natural voice conversations
- Using deprecated models (GPT-4V, Whisper-1)
- Ignoring rate limits on vision and audio endpoints
- Calling video generation APIs synchronously (they are async: poll or use webhook callbacks)
- Generating separate clips without character elements (characters look different in each clip)
- Using Sora for high-volume social content (expensive and slow; use Kling Standard instead)
## Related Skills

- `ork:rag-retrieval`: Multimodal RAG with image + text retrieval
- `ork:llm-integration`: General LLM function-calling patterns
- `streaming-api-patterns`: WebSocket patterns for real-time audio
- `ork:demo-producer`: Terminal demo videos (VHS, asciinema), not AI video generation