qwen3-tts

Build text-to-speech applications using Qwen3-TTS from Alibaba Qwen. Reference the local repository at D:\code\qwen3-tts for source code and examples.

Install skill "qwen3-tts" with this command: npx skills add jarmen423/skills/jarmen423-skills-qwen3-tts

Quick Reference

| Task | Model | Method |
|------|-------|--------|
| Custom voice with preset speakers | CustomVoice | generate_custom_voice() |
| Design new voice via description | VoiceDesign | generate_voice_design() |
| Clone voice from audio sample | Base | generate_voice_clone() |
| Encode/decode audio | Tokenizer | encode() / decode() |

Environment Setup

    # Create a fresh environment
    conda create -n qwen3-tts python=3.12 -y
    conda activate qwen3-tts

    # Install the package
    pip install -U qwen-tts

    # Optional: FlashAttention 2 for reduced GPU memory
    pip install -U flash-attn --no-build-isolation
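Because FlashAttention 2 is optional, code that loads a model may want a fallback when the package is absent. The sketch below checks for the `flash_attn` module using only the standard library and picks a value for `attn_implementation`. The helper name `pick_attn_implementation` is hypothetical, and `"sdpa"` is the name Hugging Face Transformers uses for PyTorch's built-in attention; whether `Qwen3TTSModel.from_pretrained` accepts it is an assumption to verify against the upstream repository.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Return "flash_attention_2" if flash-attn is importable, else "sdpa".

    Hypothetical helper: the "sdpa" fallback is an assumption, not something
    the Qwen3-TTS documentation confirms.
    """
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


if __name__ == "__main__":
    # The result could be passed as attn_implementation=... to from_pretrained().
    print(pick_attn_implementation())
```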

Available Models

| Model | Features |
|-------|----------|
| Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice | 9 preset speakers, instruction control |
| Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign | Create voices from natural language descriptions |
| Qwen/Qwen3-TTS-12Hz-1.7B-Base | Voice cloning, fine-tuning base |
| Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice | Smaller custom voice model |
| Qwen/Qwen3-TTS-12Hz-0.6B-Base | Smaller base model for cloning/fine-tuning |
| Qwen/Qwen3-TTS-Tokenizer-12Hz | Audio encoder/decoder |

Task Workflows

  1. Custom Voice Generation

Use preset speakers with optional style instructions.

    import torch
    import soundfile as sf
    from qwen_tts import Qwen3TTSModel

    model = Qwen3TTSModel.from_pretrained(
        "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
        device_map="cuda:0",
        dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )

    # Single generation
    wavs, sr = model.generate_custom_voice(
        text="Hello, how are you today?",
        language="English",  # Or "Auto" for auto-detection
        speaker="Ryan",
        instruct="Speak with enthusiasm",  # Optional style control
    )
    sf.write("output.wav", wavs[0], sr)

    # Batch generation
    wavs, sr = model.generate_custom_voice(
        text=["First sentence.", "Second sentence."],
        language=["English", "English"],
        speaker=["Ryan", "Aiden"],
        instruct=["Happy tone", "Calm tone"],
    )
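The batch call takes parallel lists, which are easy to get out of step when edited by hand. A minimal sketch of one way to build them from per-utterance records follows. The helper `to_batch_kwargs` is hypothetical, and the requirement that every list has the same length is an assumption drawn from the example above rather than from the API reference.

```python
def to_batch_kwargs(items):
    """Transpose per-utterance dicts into parallel keyword lists.

    Hypothetical helper: it only reshapes data and never calls the model.
    """
    if not items:
        raise ValueError("items must not be empty")
    keys = list(items[0])
    for index, item in enumerate(items):
        if set(item) != set(keys):
            raise ValueError(f"item {index} has keys {sorted(item)}, expected {sorted(keys)}")
    return {key: [item[key] for item in items] for key in keys}


batch = to_batch_kwargs([
    {"text": "First sentence.", "language": "English", "speaker": "Ryan", "instruct": "Happy tone"},
    {"text": "Second sentence.", "language": "English", "speaker": "Aiden", "instruct": "Calm tone"},
])
# With a loaded model, the result could be unpacked into the call:
# wavs, sr = model.generate_custom_voice(**batch)
```

Keeping one record per utterance means a line can be added or removed without touching four separate lists.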

Available Speakers:

| Speaker | Description | Native Language |
|---------|-------------|-----------------|
| Vivian | Bright, edgy young female | Chinese |
| Serena | Warm, gentle young female | Chinese |
| Uncle_Fu | Low, mellow mature male | Chinese |
| Dylan | Youthful Beijing male | Chinese (Beijing) |
| Eric | Lively Chengdu male | Chinese (Sichuan) |
| Ryan | Dynamic male with rhythmic drive | English |
| Aiden | Sunny American male | English |
| Ono_Anna | Playful Japanese female | Japanese |
| Sohee | Warm Korean female | Korean |
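Speaker names such as `Uncle_Fu` and `Ono_Anna` are easy to mistype, so it can help to check a name before loading a multi-gigabyte model. The sketch below copies the table above into a dictionary and suggests close matches for unknown names. `PRESET_SPEAKERS` and `check_speaker` are hypothetical names, and treating speaker names as case-sensitive is an assumption.

```python
import difflib

# Speaker -> native language, copied from the table above.
PRESET_SPEAKERS = {
    "Vivian": "Chinese",
    "Serena": "Chinese",
    "Uncle_Fu": "Chinese",
    "Dylan": "Chinese (Beijing)",
    "Eric": "Chinese (Sichuan)",
    "Ryan": "English",
    "Aiden": "English",
    "Ono_Anna": "Japanese",
    "Sohee": "Korean",
}


def check_speaker(name: str) -> str:
    """Return the name unchanged if it is a preset speaker, else raise ValueError."""
    if name in PRESET_SPEAKERS:
        return name
    # Offer near matches to make typos easier to spot.
    hints = difflib.get_close_matches(name, list(PRESET_SPEAKERS))
    suffix = f" Did you mean: {', '.join(hints)}?" if hints else ""
    raise ValueError(f"Unknown speaker {name!r}.{suffix}")
```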

  2. Voice Design

Create new voices from natural language descriptions.

    model = Qwen3TTSModel.from_pretrained(
        "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
        device_map="cuda:0",
        dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )

    wavs, sr = model.generate_voice_design(
        text="Welcome to our presentation today.",
        language="English",
        instruct="Professional male voice, warm baritone, confident and clear",
    )
    sf.write("designed_voice.wav", wavs[0], sr)

  3. Voice Cloning

Clone a voice from a reference audio sample (3+ seconds recommended).

    model = Qwen3TTSModel.from_pretrained(
        "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
        device_map="cuda:0",
        dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )

    # Direct cloning
    wavs, sr = model.generate_voice_clone(
        text="This is the cloned voice speaking.",
        language="English",
        ref_audio="path/to/reference.wav",  # Or URL or (numpy_array, sr) tuple
        ref_text="Transcript of the reference audio.",
    )
    sf.write("cloned.wav", wavs[0], sr)

    # Reusable clone prompt (for multiple generations)
    prompt = model.create_voice_clone_prompt(
        ref_audio="path/to/reference.wav",
        ref_text="Transcript of the reference audio.",
    )
    wavs, sr = model.generate_voice_clone(
        text="Another sentence with the same voice.",
        language="English",
        voice_clone_prompt=prompt,
    )
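Since a reference clip of 3+ seconds is recommended, a quick duration check before cloning can catch clips that are too short. The sketch below uses only the standard-library `wave` module, so it handles uncompressed PCM WAV files only. The helper names and the choice to raise an error (rather than warn) are assumptions for illustration.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def check_reference_clip(path: str, min_seconds: float = 3.0) -> float:
    """Raise ValueError if the clip is shorter than min_seconds, else return its duration."""
    duration = wav_duration_seconds(path)
    if duration < min_seconds:
        raise ValueError(
            f"{path} is {duration:.2f}s long; at least {min_seconds:.0f}s is recommended"
        )
    return duration
```

A call such as `check_reference_clip("path/to/reference.wav")` would then run before `generate_voice_clone()` or `create_voice_clone_prompt()`.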

  4. Voice Design + Clone Workflow

Design a voice, then reuse it across multiple generations.

    # Step 1: Design the voice
    design_model = Qwen3TTSModel.from_pretrained(
        "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
        device_map="cuda:0",
        dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )

    ref_text = "Sample text for the reference audio."
    ref_wavs, sr = design_model.generate_voice_design(
        text=ref_text,
        language="English",
        instruct="Young energetic male, tenor range",
    )

    # Step 2: Create reusable clone prompt
    clone_model = Qwen3TTSModel.from_pretrained(
        "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
        device_map="cuda:0",
        dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )

    prompt = clone_model.create_voice_clone_prompt(
        ref_audio=(ref_wavs[0], sr),
        ref_text=ref_text,
    )

    # Step 3: Generate multiple outputs with consistent voice
    for sentence in ["First line.", "Second line.", "Third line."]:
        wavs, sr = clone_model.generate_voice_clone(
            text=sentence,
            language="English",
            voice_clone_prompt=prompt,
        )
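The Step 3 loop reassigns `wavs` on every iteration, so each result needs to be written out before the next one is generated. One possible approach, sketched below, is to precompute a numbered output path per sentence. The helper `numbered_wav_paths` and the `line_001.wav` naming scheme are hypothetical.

```python
from pathlib import Path


def numbered_wav_paths(out_dir, count, stem="line"):
    """Return `count` zero-padded .wav paths such as out_dir/line_001.wav."""
    width = max(3, len(str(count)))
    return [Path(out_dir) / f"{stem}_{i:0{width}d}.wav" for i in range(1, count + 1)]


sentences = ["First line.", "Second line.", "Third line."]
paths = numbered_wav_paths("outputs", len(sentences))

# With clone_model and prompt from the steps above, each result could be saved:
# for path, sentence in zip(paths, sentences):
#     wavs, sr = clone_model.generate_voice_clone(
#         text=sentence, language="English", voice_clone_prompt=prompt
#     )
#     path.parent.mkdir(parents=True, exist_ok=True)
#     sf.write(str(path), wavs[0], sr)
```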

  5. Audio Tokenization

Encode and decode audio for transport or processing.

    from qwen_tts import Qwen3TTSTokenizer
    import soundfile as sf

    tokenizer = Qwen3TTSTokenizer.from_pretrained(
        "Qwen/Qwen3-TTS-Tokenizer-12Hz",
        device_map="cuda:0",
    )

    # Encode audio (accepts path, URL, numpy array, or base64)
    enc = tokenizer.encode("path/to/audio.wav")

    # Decode back to waveform
    wavs, sr = tokenizer.decode(enc)
    sf.write("reconstructed.wav", wavs[0], sr)

Generation Parameters

Common parameters for all generate_* methods:

    wavs, sr = model.generate_custom_voice(
        text="...",
        language="Auto",
        speaker="Ryan",
        max_new_tokens=2048,
        do_sample=True,
        top_k=50,
        top_p=1.0,
        temperature=0.9,
        repetition_penalty=1.05,
    )
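To make the sampling parameters less abstract, the sketch below shows how `temperature`, `top_k`, and `top_p` are commonly applied to a vector of logits in autoregressive samplers. It is a generic illustration in plain Python, not code from Qwen3-TTS, and it assumes the library follows the usual Hugging Face-style semantics. The function name `filter_distribution` is hypothetical, and `repetition_penalty` is not covered.

```python
import math


def filter_distribution(logits, temperature=0.9, top_k=50, top_p=1.0):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Temperature scaling: values below 1.0 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Top-k: keep only the k most likely token indices.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = order[:top_k] if top_k and top_k > 0 else order
    kept_mass = sum(probs[i] for i in kept)
    # Top-p (nucleus): keep the smallest prefix whose renormalised mass reaches top_p.
    nucleus, mass = [], 0.0
    for i in kept:
        nucleus.append(i)
        mass += probs[i] / kept_mass
        if mass >= top_p:
            break
    # Renormalise over the surviving tokens.
    final_mass = sum(probs[i] for i in nucleus)
    return {i: probs[i] / final_mass for i in nucleus}
```

Under these semantics, `top_p=1.0` leaves the nucleus step inactive, so with the values shown above only `top_k` and `temperature` shape the candidate set.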

Web UI Demo

Launch a local Gradio demo:

    # CustomVoice demo
    qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --ip 0.0.0.0 --port 8000

    # VoiceDesign demo
    qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign --ip 0.0.0.0 --port 8000

    # Base (voice clone) demo - requires HTTPS for microphone
    qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-Base --ip 0.0.0.0 --port 8000 \
        --ssl-certfile cert.pem --ssl-keyfile key.pem --no-ssl-verify

Supported Languages

Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian

Pass language="Auto" for automatic detection, or specify explicitly for best quality.
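When the language comes from user input or a config file, it may help to normalise it before passing it on. The sketch below maps a case-insensitive name onto the list above and falls back to `"Auto"` otherwise. `SUPPORTED_LANGUAGES` and `normalize_language` are hypothetical names, and silently falling back to `"Auto"` (instead of raising) is a design assumption.

```python
# Language names copied from the supported-languages list above.
SUPPORTED_LANGUAGES = (
    "Chinese", "English", "Japanese", "Korean", "German",
    "French", "Russian", "Portuguese", "Spanish", "Italian",
)


def normalize_language(name=None):
    """Return the canonical language name, or "Auto" if it is missing or unsupported."""
    if name:
        wanted = name.strip().casefold()
        for language in SUPPORTED_LANGUAGES:
            if language.casefold() == wanted:
                return language
    return "Auto"
```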

References

  • Fine-tuning guide: See references/finetuning.md for training custom speakers

  • API details: See references/api-reference.md for complete method signatures

  • Local repo: D:\code\qwen3-tts contains source code and examples
