Pocket-TTS

Generate speech from text using Kyutai Pocket TTS - a lightweight, CPU-friendly, streaming TTS model with voice cloning. English only. ~6x real-time on an M4 MacBook Air.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Install the skill:

npx skills add leonaaardob/lb-pocket-tts-skill

Pocket TTS

Lightweight CPU-friendly text-to-speech with voice cloning. No GPU required.

When to Use

  • Generating speech from text on CPU without GPU
  • Voice cloning from audio samples
  • Streaming audio generation (low latency)
  • Local TTS without API dependencies
  • Real-time speech synthesis (generation runs ~6x faster than playback)

Key Features

  • 100M parameters - Small, efficient model
  • CPU-optimized - No GPU needed, uses only 2 cores
  • ~6x real-time - Fast generation on modern CPUs
  • ~200ms latency - To first audio chunk (streaming)
  • Voice cloning - From 3-10s audio samples
  • 24kHz mono WAV - High-quality output
  • English only - More languages planned

Installation

pip install pocket-tts
# or
uv add pocket-tts

CLI Commands

Generate Speech

# Basic generation (default voice)
pocket-tts generate --text "Hello world"

# Custom voice (local file, URL, or safetensors)
pocket-tts generate --voice ./my_voice.wav
pocket-tts generate --voice "hf://kyutai/tts-voices/alba-mackenna/casual.wav"
pocket-tts generate --voice ./voice.safetensors

# Quality tuning
pocket-tts generate --temperature 0.7 --lsd-decode-steps 3

See docs/generate.md for full CLI reference.

Start Web Server

# Start FastAPI server with web UI
pocket-tts serve

# Custom host/port
pocket-tts serve --host localhost --port 8080

See docs/serve.md for server options.

Export Voice Embeddings

Convert audio files to .safetensors for faster loading:

# Single file
pocket-tts export-voice voice.mp3 voice.safetensors

# Batch conversion
pocket-tts export-voice voices/ embeddings/ --truncate

See docs/export_voice.md for export options.


Python API

Basic Usage

from pocket_tts import TTSModel
import scipy.io.wavfile

# Load model
model = TTSModel.load_model()

# Get voice state
voice = model.get_state_for_audio_prompt(
    "hf://kyutai/tts-voices/alba-mackenna/casual.wav"
)

# Generate audio
audio = model.generate_audio(voice, "Hello world!")

# Save
scipy.io.wavfile.write("output.wav", model.sample_rate, audio.numpy())

Load Model

model = TTSModel.load_model(
    config="b6369a24",       # Model variant
    temp=0.7,                # Temperature (0.5-1.0)
    lsd_decode_steps=1,      # Generation steps (1-5)
    eos_threshold=-4.0       # End-of-sequence threshold
)

Voice State

# From audio file/URL
voice = model.get_state_for_audio_prompt("./voice.wav")
voice = model.get_state_for_audio_prompt("hf://kyutai/tts-voices/alba-mackenna/casual.wav")

# From safetensors (fast loading)
voice = model.get_state_for_audio_prompt("./voice.safetensors")

Streaming Generation

# Stream audio chunks
for chunk in model.generate_audio_stream(voice, "Long text..."):
    # Process/save/play each chunk as generated
    print(f"Chunk: {chunk.shape[0]} samples")
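When streaming, chunks can be appended to a WAV file as they arrive instead of buffering the whole clip. A minimal stdlib sketch, assuming each chunk is a float array in [-1, 1] (convert torch tensors with `.numpy()` first); `write_chunks_to_wav` is an illustrative helper, not part of pocket-tts:

```python
import wave

import numpy as np

SAMPLE_RATE = 24_000  # Pocket TTS outputs 24 kHz mono


def write_chunks_to_wav(chunks, path_or_file, sample_rate=SAMPLE_RATE):
    """Append float audio chunks to a 16-bit mono WAV as they arrive."""
    with wave.open(path_or_file, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit PCM
        wav.setframerate(sample_rate)
        for chunk in chunks:
            # Assumes float samples in [-1, 1]; clip and scale to int16.
            pcm = (np.clip(chunk, -1.0, 1.0) * 32767).astype(np.int16)
            wav.writeframes(pcm.tobytes())


# Driven by the streaming generator, this might look like:
# write_chunks_to_wav(
#     (c.numpy() for c in model.generate_audio_stream(voice, "Long text...")),
#     "stream.wav",
# )
```

Because the generator is consumed lazily, the file grows as chunks are produced rather than after the full clip is synthesized.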

Multi-Voice Management

# Preload multiple voices
voices = {
    "casual": model.get_state_for_audio_prompt("hf://kyutai/tts-voices/alba-mackenna/casual.wav"),
    "announcer": model.get_state_for_audio_prompt("./announcer.safetensors"),
}

# Use different voices
audio1 = model.generate_audio(voices["casual"], "Hey there!")
audio2 = model.generate_audio(voices["announcer"], "Breaking news!")

See docs/python-api.md for complete API reference.


Available Voices

Pre-made voices from hf://kyutai/tts-voices/:

  • alba-mackenna/casual.wav (default, female)
  • jessica-jian/casual.wav (female)
  • voice-donations/Selfie.wav (male, marius)
  • voice-donations/Butter.wav (male, javert)
  • ears/p010/freeform_speech_01.wav (male, jean)
  • vctk/p244_023.wav (female, fantine)
  • vctk/p262_023.wav (female, eponine)
  • vctk/p303_023.wav (female, azelma)

Or clone any voice from your own audio samples.


Voice Cloning Tips

  • Clean audio - Remove background noise (e.g., with Adobe Podcast Enhance)
  • Length - 3-10 seconds of speech is ideal
  • Quality - Input quality affects output quality
  • Format - WAV, MP3, or any common audio format supported
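
For the length tip above, a WAV prompt can be cut down to the recommended 3-10 seconds with the standard library alone. A minimal sketch (WAV input only; `trim_wav` is a hypothetical helper name, not part of pocket-tts):

```python
import wave


def trim_wav(src, dst, max_seconds=10.0):
    """Copy at most the first max_seconds of a WAV file to dst."""
    with wave.open(src, "rb") as reader:
        params = reader.getparams()
        # Read only as many frames as fit in the target duration.
        frames = reader.readframes(int(max_seconds * reader.getframerate()))
    with wave.open(dst, "wb") as writer:
        writer.setparams(params)  # nframes is corrected on close
        writer.writeframes(frames)


# trim_wav("long_recording.wav", "voice_prompt.wav", max_seconds=10.0)
```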

Performance Tips

  • CPU-only - GPU provides no speedup (model too small, batch size 1)
  • 2 cores - Uses only 2 CPU cores efficiently
  • Streaming - Low latency (<200ms to first chunk)
  • Safetensors - Pre-process voices to .safetensors for instant loading

Output Format

All commands output WAV files:

  • Sample rate: 24 kHz
  • Channels: Mono
  • Bit depth: 16-bit PCM
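
The format above can be checked with Python's standard wave module. A small sketch (`describe_wav` is an illustrative helper, not part of pocket-tts):

```python
import wave


def describe_wav(path):
    """Return (sample_rate, channels, bits) for a WAV file."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate(), wav.getnchannels(), wav.getsampwidth() * 8


# For a Pocket TTS output this should report (24000, 1, 16).
```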

