Inworld AI
Text-to-Speech platform with voice cloning, audio markups, and timestamp alignment.
When to Use
- Text-to-speech audio generation
- Voice cloning from 5-15 seconds of audio
- Emotion-controlled speech ([happy], [sad], etc.)
- Word/phoneme timestamps for lip sync
- Custom pronunciation with IPA
Models
| Model | ID | Latency | Price |
|---|---|---|---|
| TTS 1.5 Max | inworld-tts-1.5-max | ~200ms | $10/1M chars |
| TTS 1.5 Mini | inworld-tts-1.5-mini | ~120ms | $5/1M chars |
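Since pricing is per character, a rough cost estimate follows directly from the table. The sketch below is illustrative only: `PRICE_PER_MILLION_CHARS` and `estimate_cost_usd` are hypothetical names, and it assumes every character counted by `len(text)` (including markups and whitespace) is billed, which this document does not confirm.

```python
# USD per 1M characters, copied from the Models table.
PRICE_PER_MILLION_CHARS = {
    "inworld-tts-1.5-max": 10.0,
    "inworld-tts-1.5-mini": 5.0,
}


def estimate_cost_usd(text: str, model_id: str) -> float:
    """Estimate the synthesis cost of `text` for the given model ID.

    Assumption: billing counts every character in `text`.
    Raises KeyError for a model ID that is not in the table.
    """
    price = PRICE_PER_MILLION_CHARS[model_id]
    return len(text) * price / 1_000_000


if __name__ == "__main__":
    script = "Hello! " * 1000  # 7,000 characters
    for model in PRICE_PER_MILLION_CHARS:
        print(model, round(estimate_cost_usd(script, model), 4))
```

At these prices the Mini model costs half as much per character, so the choice between the two mostly comes down to quality and latency needs.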
Minimal Example
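import base64
import os
import requests

# os.environ raises KeyError if INWORLD_API_KEY is not set, instead of sending "Basic None".
response = requests.post(
    "https://api.inworld.ai/tts/v1/voice",
    headers={"Authorization": f"Basic {os.environ['INWORLD_API_KEY']}"},
    json={"text": "Hello!", "voiceId": "Ashley", "modelId": "inworld-tts-1.5-max"},
    timeout=30,
)
response.raise_for_status()  # fail early on HTTP errors
# The response JSON carries the audio as a base64-encoded string.
audio = base64.b64decode(response.json()["audioContent"])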
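The decoding step above assumes the response body always contains a valid `audioContent` field. A slightly more defensive version of that step might look like the sketch below. `decode_audio_content` is a hypothetical helper, not part of any Inworld SDK, and it relies only on the `audioContent` field shown in the minimal example.

```python
import base64
import binascii


def decode_audio_content(payload: dict) -> bytes:
    """Return raw audio bytes from a parsed TTS response body.

    Raises ValueError if 'audioContent' is missing or is not valid base64.
    """
    encoded = payload.get("audioContent")
    if not isinstance(encoded, str) or not encoded:
        raise ValueError("response has no 'audioContent' field")
    try:
        # validate=True rejects characters outside the base64 alphabet.
        return base64.b64decode(encoded, validate=True)
    except binascii.Error as exc:
        raise ValueError("'audioContent' is not valid base64") from exc


if __name__ == "__main__":
    fake_response = {"audioContent": base64.b64encode(b"fake-audio").decode("ascii")}
    print(len(decode_audio_content(fake_response)), "bytes decoded")
```

In the minimal example this would replace the last line with `audio = decode_audio_content(response.json())`.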
Key Features
- 15 languages: en, zh, ja, ko, ru, it, es, pt, fr, de, pl, nl, hi, he, ar
- Instant cloning: 5-15 seconds of audio, no training
- Audio markups: [happy], [laughing], [sigh] (English only)
- Timestamps: word, phoneme, and viseme timing for lip sync
- Streaming: /voice:stream endpoint
Prohibitions
- Audio markups work only in English
- Use only ONE emotion markup, placed at the beginning of the text
- Match voice language to text language
- Instant cloning may not work for children's voices or unique accents
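The first three rules above can be checked before sending a request. The following is a minimal sketch, not an official validator: `check_tts_input` is a hypothetical helper, the language codes are the 15 listed under Key Features, and `EMOTION_TAGS` contains only the emotion tags named in this document, so the real set is probably larger.

```python
import re

SUPPORTED_LANGUAGES = {
    "en", "zh", "ja", "ko", "ru", "it", "es", "pt",
    "fr", "de", "pl", "nl", "hi", "he", "ar",
}
# Assumption: only the emotion tags mentioned in this document.
EMOTION_TAGS = {"happy", "sad"}
MARKUP_RE = re.compile(r"\[([a-z_]+)\]")


def check_tts_input(text: str, text_language: str, voice_language: str) -> list[str]:
    """Return a list of rule violations; an empty list means no issue was found."""
    problems = []
    if text_language not in SUPPORTED_LANGUAGES:
        problems.append(f"unsupported language: {text_language}")
    if text_language != voice_language:
        problems.append("voice language does not match text language")

    tags = list(MARKUP_RE.finditer(text))
    if tags and text_language != "en":
        problems.append("audio markups work only in English")

    emotions = [m for m in tags if m.group(1) in EMOTION_TAGS]
    if len(emotions) > 1:
        problems.append("more than one emotion markup")
    # Index of the first non-whitespace character in the text.
    first_char = len(text) - len(text.lstrip())
    if emotions and emotions[0].start() != first_char:
        problems.append("emotion markup is not at the beginning of the text")
    return problems


if __name__ == "__main__":
    print(check_tts_input("[happy] Great to see you! [laughing]", "en", "en"))
    print(check_tts_input("Well [sad] that is a shame.", "en", "ja"))
```

The cloning limitation in the last rule cannot be checked from text alone, so it is left out of the sketch.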