Qwen-TTS Voice Cloning Pipeline
Full ML lifecycle for voice cloning on Apple Silicon (MLX). The qwen-tts package is installed via pip/uv and provides both a CLI and Python API.
Installation
# Install the package (includes mlx-audio, librosa, etc.)
uv pip install qwen-tts
# Additional tools for data collection & cleaning
brew install yt-dlp ffmpeg # if not already installed
uv pip install google-genai python-dotenv # for eval step
Quick Reference
qwen-tts split <audio> -o ./clips # ASR-based split + transcribe
qwen-tts prepare --data <dir> -o train.jsonl
qwen-tts train --name <speaker> --data train.jsonl -o ./<speaker>_voice
qwen-tts generate -m custom-voice --voice-model <dir> --speaker <name> -p "text" -o out.wav
qwen-tts check -g <generated> -r <reference> -S "<Speaker Name>"
qwen-tts voices [dir]
qwen-tts speakers --voice-model <dir>
Always start with the 0.6B model. It trains faster, uses less memory, and produces good results. Only try 1.7B if 0.6B quality is insufficient.
Pipeline Overview
1. Collect → yt-dlp + ffmpeg
2. Split → qwen-tts split (ASR sentence-aligned splitting + transcription)
3. Prepare → qwen-tts prepare (encodes audio → codec tokens)
4. Train → qwen-tts train (LoRA SFT, ~15 min on M-series)
5. Generate → qwen-tts generate
6. Evaluate → qwen-tts check (Gemini-as-judge)
7. Iterate → adjust data / hyperparams based on eval
Step 1: Data Collection
Target: ~10 minutes of clean speech
# Download audio from YouTube
yt-dlp -x --audio-format wav -o "raw_audio/%(title)s.%(ext)s" "https://youtube.com/watch?v=VIDEO_ID"
# Convert to 24kHz mono WAV (required format)
ffmpeg -i "raw_audio/source.wav" -ar 24000 -ac 1 "raw_audio/source_24k.wav"
Multiple YouTube sources
for url in "URL1" "URL2" "URL3"; do
yt-dlp -x --audio-format wav -o "raw_audio/%(id)s.%(ext)s" "$url"
done
# Combine into one file
ffmpeg -f concat -safe 0 -i <(for f in raw_audio/*.wav; do echo "file '$f'"; done) \
-ar 24000 -ac 1 "raw_audio/combined_24k.wav"
Guidelines:
- 24kHz mono WAV is the required format
- Prefer clear speech (interviews, lectures, speeches)
- Remove music, background noise, other speakers before splitting
Step 2: Split + Transcribe (ASR-based)
Uses parakeet ASR to split audio on sentence boundaries and generate transcripts in one step. No more blind fixed-duration chopping.
qwen-tts split raw_audio/source_24k.wav -o ./clips --max-dur 8
This will:
- Run parakeet ASR on the full audio → sentence-level timestamps
- Merge short sentences into clips (3–8s target range)
- Slice audio at sentence boundaries
- Write clips +
transcript.txt
Options
--min-dur 3.0— minimum clip duration (seconds)--max-dur 8.0— maximum clip duration (seconds). Keep ≤10s — longer clips use more memory during training and can cause OOM with gradient accumulation.--pad 0.1— padding around clip boundaries (seconds)--asr-model mlx-community/parakeet-tdt-0.6b-v3— ASR model
After splitting, review transcript.txt and:
- Fix transcription errors (they become training labels)
- Remove clips with multiple speakers or heavy background noise
- Ensure proper punctuation and capitalization
transcript.txt format
clip_000.wav|The stage that we have reached in Singapore. It was colorful, it was moving.
clip_001.wav|These two need not run counter to each other. But in practice, they tend to...
clip_002.wav|And the human knowledge approach tends to complicate methods in ways that...
One line per clip: filename|transcription. Filename is relative to the data directory.
Step 3: Prepare Training Data
Encodes each audio clip through the speech tokenizer → codec tokens. Takes ~1-2 min for 75 clips.
qwen-tts prepare --data ./clips -o ./clips/train.jsonl --model-id mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16
This will:
- Read
clips/transcript.txtautomatically (or pass--transcript) - Use first clip as reference audio (or pass
--ref-audio) - Encode each clip → 16-codebook tokens at 12Hz
- Write
train.jsonl
train.jsonl format
Each line is a JSON object:
{
"audio": "clips/clip_000.wav",
"text": "The stage that we have reached in Singapore.",
"ref_audio": "./clips/clip_000.wav",
"audio_codes": [[1995, 431, 581, ...], [215, 431, 321, ...], ...]
}
audio_codes: 2D array[time_frames, 16]— 16 codebooks at 12Hz
Directory structure after prepare
clips/
clip_000.wav ... clip_074.wav
transcript.txt
train.jsonl
Step 4: Train (LoRA SFT)
qwen-tts train \
--name <speaker_name> \
--data ./clips/train.jsonl \
--output ./<speaker_name>_voice \
--model-id mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16 \
--epochs 3 \
--lr 2e-5 \
--grad-accum 4 \
--lora-rank 8
Hyperparameters
| Parameter | Default | Notes |
|---|---|---|
--epochs | 3 | 2-5 typical. More risks overfitting with small data. |
--lr | 2e-5 | Lower (1e-5) for clean data, higher (5e-5) if underfitting. |
--grad-accum | 4 | Effective batch = batch_size × grad_accum. Use 2 if clips are long (>8s) or training hangs — reduces peak memory from holding gradients. |
--lora-rank | 8 | 4 for subtle shifts, 16 for stronger adaptation. |
--lora-scale | 20.0 | Higher = stronger LoRA effect. |
--batch-size | 1 | Keep at 1 unless >64GB RAM. |
What to watch
- Loss should drop from ~4-6 to ~2-3 over 3 epochs
- Loss plateaus above 4 → data quality issue or lr too low
- Loss below 1 → overfitting, reduce epochs
- Epoch 0 is slow (~15 min) due to one-time Metal JIT shader compilation; subsequent epochs are fast (~40-60s each)
- Peak memory: ~5-7GB
Step 5: Generate
qwen-tts generate \
-m custom-voice \
--voice-model ./<speaker_name>_voice \
--speaker <speaker_name> \
--size 0.6B \
-p "Text to synthesize." \
-o output.wav
Batch generate for evaluation
mkdir -p eval_clips
prompts=(
"The future depends on what we do today."
"We must be pragmatic and face reality as it is."
"Education is the key to progress and prosperity."
)
for i in "${!prompts[@]}"; do
qwen-tts generate -m custom-voice --voice-model ./<speaker_name>_voice \
--speaker <speaker_name> --size 0.6B \
-p "${prompts[$i]}" \
-o "eval_clips/gen_$(printf '%02d' $i).wav"
done
Options
--seed N— reproducible output--play— play audio after generation- Generation speed: ~0.5x real-time on M-series
Step 6: Evaluate
Two complementary checks. Run ASR first (free, local, fast) to catch intelligibility issues, then Gemini (API calls) for speaker similarity.
6a: ASR intelligibility check (local, no API key)
Runs parakeet on generated clips, compares transcription against the prompt text via word error rate (WER). Catches garbled speech, missing words, mispronunciations.
# ASR-only — no API key needed
qwen-tts check \
-g ./eval_clips \
--asr-only \
--expected-text \
"gen_00.wav=The future depends on what we do today." \
"gen_01.wav=We must be pragmatic and face reality as it is."
6b: Gemini speaker similarity check (API calls)
Requires GEMINI_API_KEY in environment or .env file. If not set, ask the user to provide one.
qwen-tts check \
-g ./eval_clips \
-r ./clips \
-S "Speaker Name" \
--max-clips 3 \
--pairs 3
6c: Both together (recommended)
qwen-tts check \
-g ./eval_clips \
-r ./clips \
-S "Speaker Name" \
--max-clips 3 \
--pairs 3 \
--expected-text \
"gen_00.wav=The future depends on what we do today." \
"gen_01.wav=We must be pragmatic and face reality as it is."
# Add --json for machine-readable output
How it works
ASR check (local, 0 API calls): runs mlx_audio.stt.generate with parakeet-tdt-0.6b-v3, computes WER against expected text. Only runs for clips with --expected-text.
Gemini check — total API calls = max_clips + pairs (report computed locally):
-
Single clip eval (1 call/clip):
{ "speaker_match": "yes|no|uncertain", "confidence": "low|medium|high", "natural": true, "audio_quality": "poor|fair|good|excellent", "issues": [] } -
Pair comparison (1 call/pair): sends ref + gen clip side by side:
{ "same_speaker": "yes|no|uncertain", "similarity_score": 9, "accent_match": true, "cadence_match": true, "tone_match": true, "issues": [] }
Report (computed locally from above):
{
"overall_pass": true,
"avg_wer": 0.0,
"intelligibility_rate": 1.0,
"speaker_match_rate": 1.0,
"naturalness_rate": 1.0,
"avg_similarity": 9.0
}
Pass criteria
- Intelligibility rate ≥ 80% (WER < 30% per clip)
- Speaker match rate ≥ 80% (Gemini)
- Naturalness rate ≥ 80% (Gemini)
- Avg similarity ≥ 6/10 (Gemini pair comparison)
Python API
from qwen_tts.check import CheckConfig, run_check
result = run_check(CheckConfig(
generated="./eval_clips",
reference="./clips",
speaker="Speaker Name",
max_clips=3,
pairs=3,
expected_texts={"gen_00.wav": "The future depends on what we do today."},
))
# result["report"]["overall_pass"] → True/False
# result["asr_results"] → per-clip WER + transcription
# result["single_results"] → Gemini per-clip evals
# result["pair_results"] → Gemini pair comparisons
Step 7: Iterate
Common eval failures → fixes
| Issue | Fix |
|---|---|
| High WER / unintelligible | Bad training data quality, or overfitting — reduce epochs |
| Low speaker match | More/better data, increase epochs or lora-rank |
| Robotic / not natural | Reduce epochs (overfitting), remove noisy clips |
| Wrong accent | Remove clips with different accents |
| Audio glitches | Check source audio quality, remove bad clips |
| Low pair similarity | Increase lora-rank (8→16), add more data |
Full automated loop
qwen-tts prepare --data ./clips -o ./clips/train.jsonl --model-id mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16
qwen-tts train --name spk --data ./clips/train.jsonl -o ./spk_voice --model-id mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16
mkdir -p eval_clips
qwen-tts generate -m custom-voice --voice-model ./spk_voice --speaker spk --size 0.6B \
-p "Test sentence one." -o eval_clips/gen_01.wav
qwen-tts generate -m custom-voice --voice-model ./spk_voice --speaker spk --size 0.6B \
-p "Test sentence two." -o eval_clips/gen_02.wav
# Quick local check first (no API key)
qwen-tts check -g eval_clips --asr-only \
--expected-text "gen_01.wav=Test sentence one." "gen_02.wav=Test sentence two."
# Full check with Gemini
qwen-tts check -g eval_clips -r clips -S "Speaker Name" \
--max-clips 2 --pairs 2 \
--expected-text "gen_01.wav=Test sentence one." "gen_02.wav=Test sentence two."
# If FAIL → adjust data/hyperparams and re-run
Data Requirements Summary
| Requirement | Value |
|---|---|
| Audio format | WAV, 24kHz, mono, 16-bit PCM |
| Clip duration | 3-8 seconds (5-8s ideal). Keep ≤10s — longer clips cause OOM during gradient accumulation. |
| Total duration | ~8-10 minutes (50-80 clips). ~7,500 total codec frames is the sweet spot. |
| Content | Single speaker, clean speech, minimal noise |
| Transcript | Accurate text matching the speech |
Troubleshooting
| Problem | Solution |
|---|---|
No module named 'dotenv' | uv pip install python-dotenv |
No module named 'google.genai' | uv pip install google-genai |
speech_tokenizer encoder not found | Use the base model for prepare |
| OOM / hang during training | Reduce --grad-accum to 2, trim clips to ≤8s, reduce total to ~50-60 clips. Long clips + high grad-accum holds too many gradients in memory. |
OOM during split (ASR) | Source audio too long (>~40 min). Split into chunks with ffmpeg first, run qwen-tts split on each, then merge clips + transcripts into one directory. See below. |
| OOM during generation | Close other apps, or use --size 0.6B |
incorrect regex pattern warning | Harmless — ignore |
model of type qwen3_tts warning | Harmless — ignore |
Chunking long source audio for split
If qwen-tts split OOMs on a long file (>~40 min), chunk it with ffmpeg first:
mkdir -p raw_audio/chunks
ffmpeg -i raw_audio/source_24k.wav -f segment -segment_time 600 -c copy raw_audio/chunks/chunk_%02d.wav
for chunk in raw_audio/chunks/chunk_*.wav; do
qwen-tts split "$chunk" -o "clips_tmp/$(basename "$chunk" .wav)"
done
# Merge into one directory
mkdir -p clips && idx=0
for dir in clips_tmp/chunk_*/; do
while IFS='|' read -r f t; do
new=$(printf "clip_%03d.wav" $idx)
cp "${dir}${f}" "clips/${new}"
echo "${new}|${t}" >> clips/transcript.txt
idx=$((idx + 1))
done < "${dir}transcript.txt"
done
Available Models
| Key | Size | HF Repo |
|---|---|---|
| base | 1.7B | mlx-community/Qwen3-TTS-12Hz-1.7B-Base-bf16 |
| base | 0.6B | mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16 |
| custom-voice | 1.7B | mlx-community/Qwen3-TTS-12Hz-1.7B-CustomVoice-bf16 |
| custom-voice | 0.6B | mlx-community/Qwen3-TTS-12Hz-0.6B-CustomVoice-bf16 |
| voice-design | 1.7B | mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-bf16 |