Voice Generation Skill

Generate realistic speech with AI text-to-speech APIs (Google Gemini TTS, ElevenLabs, OpenAI TTS).

Prerequisites

At least one API key is required:

GOOGLE_API_KEY - For Google Gemini TTS (same key as video/image/music) ✅
ELEVENLABS_API_KEY - For ElevenLabs high-quality voice synthesis
OPENAI_API_KEY - For OpenAI TTS voices

Available APIs

Google Gemini TTS (Recommended - Same API Key)

Best for: Podcasts, dialogues, audiobooks with style control
Voices: 30 voices with natural language style control
Multi-speaker: Up to 2 speakers for dialogues ✅
Languages: 24 languages (auto-detected)
Features: Control style, accent, pace via prompts
Output: 24kHz WAV
API Key: Same GOOGLE_API_KEY as video/image/music ✅
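If the audio arrives as raw PCM rather than a finished file, it has to be wrapped in a WAV container before most players will open it. The sketch below assumes 16-bit mono samples at 24 kHz; only the sample rate comes from the list above, and the sample width and channel count are assumptions to verify against the API response. The helper name `pcm_to_wav_bytes` is hypothetical.

```python
import io
import wave


def pcm_to_wav_bytes(pcm, rate=24000, channels=1, sample_width=2):
    """Wrap raw PCM samples in a WAV container and return the file bytes.

    Defaults assume 24 kHz, mono, 16-bit audio (sample_width is in bytes).
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)
        wf.setframerate(rate)
        wf.writeframes(pcm)
    return buf.getvalue()


if __name__ == "__main__":
    # One second of silence stands in for real speech data.
    silence = b"\x00\x00" * 24000
    with open("silence.wav", "wb") as f:
        f.write(pcm_to_wav_bytes(silence))
```

Writing to an in-memory buffer keeps the helper easy to test; pass a file path to `wave.open` instead if the audio should go straight to disk.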

ElevenLabs (Best Quality)

Best for: Natural-sounding voices, voice cloning, long-form content
Voices: 100+ pre-made voices + custom voice cloning
Languages: 29+ languages
Models: Eleven Multilingual v2, Eleven Turbo v2

OpenAI TTS (Simplest)

Best for: Quick, reliable text-to-speech with consistent quality
Voices: alloy, echo, fable, onyx, nova, shimmer
Models: tts-1 (fast), tts-1-hd (high quality)
Output: MP3, Opus, AAC, FLAC

Workflow

Step 1: Understand the Request

Parse the user's voice request for:

Text content: What should be spoken?
Voice type: Male, female, specific character?
Tone: Professional, casual, dramatic, cheerful?
Use case: Narration, voiceover, audiobook, notification?
Language: English, Spanish, other?
Speed: Normal, slow, fast?

Step 2: Select Voice and API

Choose based on requirements:

Use Case	Recommended API	Reason
Default / Same key as video	Gemini TTS	Same `GOOGLE_API_KEY` ✅
Multi-speaker dialogue	Gemini TTS	Up to 2 speakers built-in
Style/accent control	Gemini TTS	Natural language prompts
Voice cloning	ElevenLabs	Only API of the three with voice cloning
100+ voice options	ElevenLabs	Widest selection
Audiobook/podcast	ElevenLabs or Gemini	Both excellent for long content
Quick narration	OpenAI TTS	Fast, reliable
Budget-conscious	OpenAI TTS	Lower cost
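The table above can be read as a small decision function. This is only a sketch: the function name, its parameters, and the order in which the rules are checked are assumptions, not part of the skill's scripts.

```python
def recommend_api(voice_cloning=False, speakers=1, style_control=False,
                  quick=False, budget=False):
    """Pick an API following the selection table (rule order is an assumption)."""
    if voice_cloning:
        # Only ElevenLabs offers cloning, so this requirement wins outright.
        return "ElevenLabs"
    if speakers > 2:
        # Gemini TTS handles at most 2 speakers; more need separate calls.
        return "ElevenLabs"
    if speakers == 2 or style_control:
        return "Gemini TTS"
    if quick or budget:
        return "OpenAI TTS"
    # Default: same GOOGLE_API_KEY as video/image/music.
    return "Gemini TTS"


if __name__ == "__main__":
    print(recommend_api(speakers=2))      # Gemini TTS
    print(recommend_api(budget=True))     # OpenAI TTS
```

Hard requirements (cloning, speaker count) are checked before preferences (speed, cost), so a request never lands on an API that cannot fulfil it.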

Step 3: Prepare the Text

Optimize text for speech:

Add pauses: Use commas, periods for natural rhythm
Spell out numbers: "1,234" → "one thousand two hundred thirty-four" (if needed)
Handle acronyms: "NASA" vs "N.A.S.A." depending on pronunciation
Mark emphasis: Some APIs support emphasis markers

Example transformation:

Original: "The Q4 2024 results show a 15% YoY increase."
Optimized: "The Q4 2024 results show a fifteen percent year-over-year increase."
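A minimal sketch of how the transformation above could be automated. It covers only the two changes in the example (whole-number percentages below 100 and a lookup of abbreviations); the function names and the abbreviation table are hypothetical, and real text will need more rules.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

# Illustrative table; extend it with the abbreviations your content uses.
ABBREVIATIONS = {"YoY": "year-over-year"}


def small_number_to_words(n):
    """Spell out an integer from 0 to 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")


def prepare_for_speech(text):
    """Expand simple percentages and known abbreviations for TTS input."""
    def expand_percent(match):
        return small_number_to_words(int(match.group(1))) + " percent"

    # Match 1-2 digit percentages that are not part of a longer number
    # such as "115%" or "1.5%"; those are left untouched.
    text = re.sub(r"(?<![\d.,])(\d{1,2})%", expand_percent, text)
    for abbr, spoken in ABBREVIATIONS.items():
        text = re.sub(rf"\b{re.escape(abbr)}\b", spoken, text)
    return text


if __name__ == "__main__":
    print(prepare_for_speech("The Q4 2024 results show a 15% YoY increase."))
```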

Step 4: Generate the Audio

Execute the appropriate script from ${CLAUDE_PLUGIN_ROOT}/skills/voice-generation/scripts/:

For Google Gemini TTS (single speaker):

python3 ${CLAUDE_PLUGIN_ROOT}/skills/voice-generation/scripts/gemini_tts.py \
  --text "Welcome to our podcast!" \
  --voice "Charon"

Gemini TTS with style direction:

python3 ${CLAUDE_PLUGIN_ROOT}/skills/voice-generation/scripts/gemini_tts.py \
  --text "Have a wonderful day!" \
  --voice "Puck" \
  --style "Say cheerfully with a British accent:"

Gemini TTS multi-speaker (dialogue):

python3 ${CLAUDE_PLUGIN_ROOT}/skills/voice-generation/scripts/gemini_tts.py \
  --multi \
  --speaker "Host:Charon" \
  --speaker "Guest:Aoede" \
  --text "Host: Welcome to the show!
Guest: Thanks for having me!"

For ElevenLabs:

python3 ${CLAUDE_PLUGIN_ROOT}/skills/voice-generation/scripts/elevenlabs.py \
  --text "Your text here" \
  --voice "Rachel" \
  --model "eleven_multilingual_v2"

For OpenAI TTS:

python3 ${CLAUDE_PLUGIN_ROOT}/skills/voice-generation/scripts/openai_tts.py \
  --text "Your text here" \
  --voice "nova" \
  --model "tts-1-hd"

List Gemini voices:

python3 ${CLAUDE_PLUGIN_ROOT}/skills/voice-generation/scripts/gemini_tts.py --list-voices
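When the commands above are launched from Python rather than typed into a shell, building an argument list avoids quoting problems with text that contains quotes or newlines. The sketch below uses only the `--text`, `--voice`, and `--style` flags shown above; the helper name `gemini_command` and the fallback to the current directory when `CLAUDE_PLUGIN_ROOT` is unset are assumptions.

```python
import os
from pathlib import Path

# Falls back to the current directory if CLAUDE_PLUGIN_ROOT is not set.
SCRIPTS = (Path(os.environ.get("CLAUDE_PLUGIN_ROOT", "."))
           / "skills" / "voice-generation" / "scripts")


def gemini_command(text, voice, style=None):
    """Build the argument list for a single-speaker gemini_tts.py call."""
    cmd = ["python3", str(SCRIPTS / "gemini_tts.py"),
           "--text", text, "--voice", voice]
    if style:
        cmd += ["--style", style]
    return cmd


if __name__ == "__main__":
    # Pass the list to subprocess.run(cmd, check=True) to execute it.
    print(gemini_command("Have a wonderful day!", "Puck",
                         "Say cheerfully with a British accent:"))
```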

Step 5: Deliver the Result

Provide the generated audio file path
Mention the voice and settings used
Offer to:
- Try a different voice
- Adjust speed or tone
- Use a different API
- Generate in a different format

Error Handling

Missing API key: Inform the user which key is needed:

Gemini TTS: Same GOOGLE_API_KEY as video/image - https://aistudio.google.com/apikey
ElevenLabs: https://elevenlabs.io
OpenAI: https://platform.openai.com/api-keys
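A minimal sketch of the check behind that message, using the key names and URLs listed above. The function name `missing_key_message` and the message wording are hypothetical.

```python
import os

# API label -> (environment variable, where to get a key)
KEY_INFO = {
    "Gemini TTS": ("GOOGLE_API_KEY", "https://aistudio.google.com/apikey"),
    "ElevenLabs": ("ELEVENLABS_API_KEY", "https://elevenlabs.io"),
    "OpenAI TTS": ("OPENAI_API_KEY", "https://platform.openai.com/api-keys"),
}


def missing_key_message(api, env=None):
    """Return a user-facing message if the key for `api` is unset, else None."""
    env = os.environ if env is None else env
    key, url = KEY_INFO[api]
    if env.get(key, "").strip():
        return None
    return f"{api} needs {key}. Get one at {url}"


if __name__ == "__main__":
    for api in KEY_INFO:
        message = missing_key_message(api)
        print(message or f"{api}: key found")
```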

Missing package: Gemini TTS requires the google-genai package. Install it with: pip install google-genai

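Text too long: Split the text into chunks that fit the API's limit (see Max length under API Comparison), generate each chunk, and concatenate the audio. Otherwise, suggest shorter text.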
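One way to do the splitting is to break at sentence boundaries and pack sentences greedily up to a character limit. This is a sketch only: the function name is hypothetical, the sentence detection is a simple punctuation rule, and limits counted in tokens rather than characters need a different measure.

```python
import re


def split_into_chunks(text, max_chars):
    """Split text into chunks of at most max_chars, preferring sentence ends."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        # A single sentence longer than the limit is cut at the limit.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    # 4096 is the OpenAI TTS character limit from the comparison table.
    for chunk in split_into_chunks("First sentence. Second one! Third?", 4096):
        print(repr(chunk))
```

Splitting at sentence ends keeps each chunk's intonation natural, which matters more for the joined audio than filling every chunk to the limit.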

Rate limit: Suggest waiting or trying a different API.
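Both suggestions can be combined: retry with a growing delay, then move to the next API. The sketch below is generic; `RateLimitError` is a hypothetical stand-in for whatever exception each client library actually raises, and each backend is any callable that takes text and returns audio.

```python
import time


class RateLimitError(Exception):
    """Hypothetical placeholder for a client library's rate-limit error."""


def generate_with_fallback(text, backends, retries=2, base_delay=1.0,
                           sleep=time.sleep):
    """Try each (name, synthesize) backend in order.

    On a rate limit, wait base_delay * 2**attempt seconds and retry up to
    `retries` times before moving to the next backend.
    """
    for name, synthesize in backends:
        for attempt in range(retries + 1):
            try:
                return name, synthesize(text)
            except RateLimitError:
                if attempt < retries:
                    sleep(base_delay * 2 ** attempt)
    raise RuntimeError("All backends are rate limited")


if __name__ == "__main__":
    def always_limited(text):
        raise RateLimitError()

    def fake_tts(text):
        return text.encode("utf-8")  # stands in for audio bytes

    backends = [("primary", always_limited), ("secondary", fake_tts)]
    print(generate_with_fallback("Hello", backends, base_delay=0.01))
```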

Unsupported language: Suggest an alternative API that supports the language.

Multi-speaker limit: Gemini TTS supports at most 2 speakers per request. For more speakers, generate each speaker's lines with separate calls (for example, via ElevenLabs) and concatenate the audio.
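Generating a dialogue with separate calls requires splitting the script by speaker first. The sketch below parses the "Name: line" format used in the multi-speaker example; the function names are hypothetical, and any line with a colon is treated as a new speaker turn, which will misfire on spoken text that itself starts with a word and a colon.

```python
def parse_dialogue(script):
    """Turn 'Speaker: text' lines into a list of (speaker, text) tuples.

    Lines without a 'Speaker:' prefix are appended to the previous turn.
    """
    segments = []
    for raw in script.splitlines():
        line = raw.strip()
        if not line:
            continue
        speaker, sep, spoken = line.partition(":")
        if sep and speaker.strip() and spoken.strip():
            segments.append((speaker.strip(), spoken.strip()))
        elif segments:
            prev_speaker, prev_text = segments[-1]
            segments[-1] = (prev_speaker, f"{prev_text} {line}")
        else:
            raise ValueError(f"No speaker given for line: {line!r}")
    return segments


def speakers_in(segments):
    """Return the distinct speakers in order of first appearance."""
    return list(dict.fromkeys(speaker for speaker, _ in segments))


if __name__ == "__main__":
    turns = parse_dialogue("Host: Welcome to the show!\n"
                           "Guest: Thanks for having me!")
    names = speakers_in(turns)
    # More than 2 speakers means one call per turn instead of --multi.
    print(names, "single request" if len(names) <= 2 else "separate calls")
```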

Voice Selection Guide

Google Gemini TTS Voices (30 voices)

Style	Voices	Best For
Bright/Upbeat	Zephyr, Puck, Aoede, Laomedeia	Marketing, cheerful content
Firm/Informative	Charon, Kore, Orus, Rasalgethi	News, tutorials, professional
Soft/Warm	Achernar, Sulafat, Vindemiatrix	Meditation, gentle narration
Smooth	Algieba, Despina, Callirrhoe	Audiobooks, storytelling
Clear	Erinome, Iapetus, Pulcherrima	Instructions, clarity
Character	Fenrir (excitable), Enceladus (breathy), Algenib (gravelly), Gacrux (mature)	Character voices, drama
Friendly	Achird, Zubenelgenubi (casual)	Casual, conversational

Gemini TTS Style Tips:

Use natural language: --style "Say angrily:" or --style "Whisper mysteriously:"
Specify accents: --style "Speak with a British accent from London:"
Control pace: --style "Speak slowly and deliberately:"
Combine: --style "Say excitedly with a Southern US accent:"

OpenAI TTS Voices

Voice	Description	Best For
alloy	Neutral, balanced	General purpose
echo	Warm, conversational	Podcasts, casual
fable	Expressive, British	Storytelling
onyx	Deep, authoritative	Narration, professional
nova	Friendly, upbeat	Marketing, tutorials
shimmer	Soft, gentle	Meditation, ASMR

ElevenLabs Popular Voices

Voice	Description	Best For
Rachel	Young female, American	Narration, audiobooks
Domi	Young female, energetic	Marketing, ads
Bella	Young female, soft	Storytelling
Antoni	Young male, well-rounded	Narration
Josh	Young male, deep	Audiobooks
Arnold	Mature male, authoritative	Documentary
Adam	Middle-aged male, deep	Narration
Sam	Young male, raspy	Character voices

Best Practices

For Narration

Use a consistent voice throughout
Add natural pauses between paragraphs
Consider pacing for the content type

For Dialogue

Use different voices for different characters
Match voice characteristics to character descriptions
Adjust speed for emotional scenes

For Accessibility

Use clear, well-paced speech
Avoid overly stylized voices
Test with screen readers if applicable

API Comparison

Feature	Gemini TTS	ElevenLabs	OpenAI TTS
API Key	`GOOGLE_API_KEY` ✅	`ELEVENLABS_API_KEY`	`OPENAI_API_KEY`
Voice quality	Excellent	Excellent	Very good
Voice variety	30 voices	100+ voices	6 voices
Multi-speaker	✅ Up to 2	❌ No	❌ No
Style control	✅ Natural language	Limited	❌ No
Voice cloning	❌ No	✅ Yes	❌ No
Languages	24	29+	50+
Speed control	Via prompts	Yes	Yes (0.25-4x)
Max length	32k tokens	5,000 chars	4,096 chars
Output format	WAV (24kHz)	MP3, WAV	MP3, Opus, AAC, FLAC
Same key as video/image	✅ Yes	❌ No	❌ No