faion-multimodal-ai

Multimodal AI: vision, image/video generation, speech-to-text, text-to-speech, voice synthesis.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy the following and send it to your AI assistant:

Install skill "faion-multimodal-ai" with this command: npx skills add faionfaion/faion-network/faionfaion-faion-network-faion-multimodal-ai

Entry point: /faion-net — invoke this skill for automatic routing to the appropriate domain.

Multimodal AI Skill

Communication: User's language. Code: English.

Purpose

Handles multimodal AI applications. Covers vision, image generation, video generation, speech, and voice synthesis.

Context Discovery

Auto-Investigation

Check these project signals before asking questions:

| Signal | Where to Check | What to Look For |
|---|---|---|
| Dependencies | package.json, requirements.txt | openai, PIL/pillow, ffmpeg-python, elevenlabs |
| Media files | /images, /audio, /video | Input files to process |
| API usage | Grep for "images.generate", "audio.transcriptions" | Existing multimodal APIs |
| Output dirs | /generated, /output | Where generated content goes |
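
These checks can be scripted. A minimal sketch using only the standard library; the signal lists mirror the table above and are illustrative, not exhaustive:

```python
import json
from pathlib import Path

# Illustrative signal lists mirroring the table above.
DEPS = {"openai", "pillow", "ffmpeg-python", "elevenlabs"}
MEDIA_DIRS = ("images", "audio", "video", "generated", "output")
API_CALLS = ("images.generate", "audio.transcriptions")

def scan_project(root="."):
    """Collect multimodal signals: dependencies, media dirs, API usage."""
    root = Path(root)
    found = {"deps": set(), "dirs": [], "api_calls": set()}

    # Dependencies: requirements.txt and package.json
    req = root / "requirements.txt"
    if req.exists():
        for line in req.read_text().splitlines():
            name = line.split("==")[0].strip().lower()
            if name in DEPS:
                found["deps"].add(name)
    pkg = root / "package.json"
    if pkg.exists():
        deps = json.loads(pkg.read_text()).get("dependencies", {})
        found["deps"].update(n.lower() for n in deps if n.lower() in DEPS)

    # Media and output directories
    found["dirs"] = [d for d in MEDIA_DIRS if (root / d).is_dir()]

    # Existing multimodal API usage in Python sources
    for py in root.rglob("*.py"):
        text = py.read_text(errors="ignore")
        found["api_calls"].update(c for c in API_CALLS if c in text)

    return found
```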

Discovery Questions

question: "Which modality are you working with?"
header: "Modality"
multiSelect: true
options:
  - label: "Vision (image understanding)"
    description: "GPT-4o Vision, Gemini Vision for OCR/analysis"
  - label: "Image generation"
    description: "DALL-E 3, Midjourney, Stable Diffusion"
  - label: "Video generation/understanding"
    description: "Sora, Runway, or video analysis"
  - label: "Speech-to-text"
    description: "Whisper, Deepgram for transcription"
  - label: "Text-to-speech"
    description: "OpenAI TTS, ElevenLabs for voice synthesis"

question: "What's your primary use case?"
header: "Use Case"
multiSelect: false
options:
  - label: "Document/receipt OCR and analysis"
    description: "Extract structured data from images"
  - label: "Content generation (images/videos)"
    description: "Create marketing/creative assets"
  - label: "Accessibility (vision/speech conversion)"
    description: "Convert between modalities for a11y"
  - label: "Voice assistant/bot"
    description: "Speech → Text → LLM → TTS pipeline"

question: "Volume and latency requirements?"
header: "Scale"
multiSelect: false
options:
  - label: "Low volume, quality over speed"
    description: "Use premium models (HD TTS, GPT-4o Vision)"
  - label: "High volume, optimize for cost"
    description: "Batch APIs, smaller models"
  - label: "Real-time required"
    description: "Streaming APIs (Deepgram, OpenAI TTS)"
  - label: "Async processing OK"
    description: "Queue-based approach"
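
The "Voice assistant/bot" use case above is a three-stage round trip (Speech → Text → LLM → TTS). A hedged sketch with the official openai Python client; the `voice_turn` helper and the model/voice choices are illustrative assumptions, not part of the skill:

```python
def voice_turn(client, audio_path, out_path="reply.mp3"):
    """One assistant round trip: transcribe, answer, speak the answer."""
    # 1. Speech-to-text
    with open(audio_path, "rb") as f:
        user_text = client.audio.transcriptions.create(
            model="whisper-1", file=f
        ).text

    # 2. LLM reply
    chat = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": user_text}],
    )
    reply = chat.choices[0].message.content

    # 3. Text-to-speech, streamed straight to a file
    with client.audio.speech.with_streaming_response.create(
        model="tts-1", voice="alloy", input=reply
    ) as speech:
        speech.stream_to_file(out_path)
    return reply

if __name__ == "__main__":
    from openai import OpenAI  # requires OPENAI_API_KEY in the environment
    print(voice_turn(OpenAI(), "question.mp3"))
```

Injecting the client keeps the loop testable with a stub and makes it easy to swap Whisper for Deepgram, or the TTS stage for ElevenLabs, later.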

Scope

| Area | Coverage |
|---|---|
| Vision | GPT-4o Vision, Gemini Vision, image understanding |
| Image Generation | DALL-E 3, Midjourney, Stable Diffusion |
| Video Generation | Sora, Runway, Pika |
| Speech-to-Text | Whisper, Deepgram, AssemblyAI |
| Text-to-Speech | OpenAI TTS, ElevenLabs, Google TTS |
| Voice | Real-time voice, voice cloning |

Quick Start

| Task | Files |
|---|---|
| Vision API | vision-basics.md → vision-applications.md |
| Image generation | img-gen-basics.md → img-gen-tools.md |
| Video generation | video-gen-basics.md → video-gen-tools.md |
| Speech-to-text | speech-to-text-basics.md → speech-to-text-advanced.md |
| Text-to-speech | tts-basics.md → tts-implementation.md |
| Voice synthesis | voice-basics.md → voice-implementation.md |

Methodologies (12)

Vision (2):

  • vision-basics: Image understanding, OCR, scene analysis
  • vision-applications: Use cases, production patterns

Image Generation (2):

  • img-gen-basics: Prompt engineering, models
  • img-gen-tools: DALL-E 3, Midjourney, Stable Diffusion

Video Generation (2):

  • video-gen-basics: Fundamentals, prompting
  • video-gen-tools: Sora, Runway, Pika, Luma

Speech-to-Text (2):

  • speech-to-text-basics: Whisper API, real-time
  • speech-to-text-advanced: Diarization, timestamps

Text-to-Speech (2):

  • tts-basics: Voice selection, SSML
  • tts-implementation: Production patterns, streaming

Voice (2):

  • voice-basics: Real-time voice, cloning
  • voice-implementation: Integration patterns

Code Examples

GPT-4o Vision

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://..."}}
        ]
    }]
)

print(response.choices[0].message.content)
```
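
The snippet above passes a public https URL; a local file has to be inlined as a base64 data URL instead. A small stdlib helper (the name `image_to_data_url` is ours, not part of the API):

```python
import base64
import mimetypes

def image_to_data_url(path):
    """Inline a local image as a data: URL for the image_url field."""
    mime = mimetypes.guess_type(path)[0] or "image/png"
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{encoded}"

# Usage in the message payload above:
# {"type": "image_url", "image_url": {"url": image_to_data_url("receipt.jpg")}}
```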

DALL-E 3 Image Generation

```python
from openai import OpenAI

client = OpenAI()

response = client.images.generate(
    model="dall-e-3",
    prompt="A futuristic city with flying cars",
    size="1024x1024",
    quality="hd",
    n=1
)

image_url = response.data[0].url
```

Whisper Speech-to-Text

```python
from openai import OpenAI

client = OpenAI()

# Context manager ensures the file handle is closed after upload.
with open("speech.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",  # required for word timestamps
        timestamp_granularities=["word"]
    )

print(transcription.text)

OpenAI TTS

```python
from openai import OpenAI

client = OpenAI()

# stream_to_file on a plain response is deprecated; use the
# streaming-response variant instead.
with client.audio.speech.with_streaming_response.create(
    model="tts-1-hd",
    voice="alloy",
    input="Hello, this is a test of text to speech."
) as response:
    response.stream_to_file("speech.mp3")
```

Gemini Vision

```python
import google.generativeai as genai
import PIL.Image  # pillow

genai.configure(api_key="...")
model = genai.GenerativeModel('gemini-pro-vision')

image = PIL.Image.open("image.jpg")
response = model.generate_content([
    "Describe this image in detail",
    image
])

print(response.text)
```

Model Comparison

Vision Models

| Model | Best For | Max Image Size |
|---|---|---|
| GPT-4o | General vision, OCR | 20MB |
| Gemini Pro Vision | High-res images | 20MB |
| Claude Sonnet 4 | Document analysis | 5MB |

Image Generation

| Model | Best For | Cost |
|---|---|---|
| DALL-E 3 | Photorealistic, text | $$$ |
| Midjourney | Artistic, creative | $$ |
| Stable Diffusion | Custom, open-source | Free/$ |

Speech-to-Text

| Service | Best For | Languages |
|---|---|---|
| Whisper | General, multilingual | 99 |
| Deepgram | Real-time, low latency | 30+ |
| AssemblyAI | Features, diarization | 10+ |

Text-to-Speech

| Service | Best For | Voices |
|---|---|---|
| OpenAI TTS | Quality, variety | 6 |
| ElevenLabs | Cloning, realism | Custom |
| Google TTS | Languages, SSML | 400+ |

Use Cases

| Use Case | Modalities |
|---|---|
| Document analysis | Vision → Text |
| Video narration | Video → Speech → TTS |
| Voice assistant | Speech → LLM → TTS |
| Content generation | Text → Images/Video |
| Accessibility | Vision → TTS, Speech → Text |
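
The accessibility row (Vision → TTS) chains two of the code examples above. A hedged sketch; the `describe_and_speak` helper, the prompt, and the model/voice choices are illustrative assumptions:

```python
def describe_and_speak(client, image_url, out_path="description.mp3"):
    """Describe an image with GPT-4o, then voice the description."""
    vision = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image for a blind user."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    description = vision.choices[0].message.content

    # Speak the description and stream the audio to a file.
    with client.audio.speech.with_streaming_response.create(
        model="tts-1", voice="alloy", input=description
    ) as speech:
        speech.stream_to_file(out_path)
    return description

if __name__ == "__main__":
    from openai import OpenAI  # requires OPENAI_API_KEY
    print(describe_and_speak(OpenAI(), "https://example.com/photo.jpg"))
```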

Related Skills

| Skill | Relationship |
|---|---|
| faion-llm-integration | Provides vision APIs |
| faion-ai-agents | Multimodal agents |

Multimodal AI v1.0 | 12 methodologies

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
