qwen-tts-voice-cloning

End-to-end voice cloning pipeline using qwen-tts on Apple Silicon. Covers data collection (yt-dlp), audio cleaning (ASR with parakeet-tdt-0.6b-v3), data preparation, LoRA fine-tuning, generation, and evaluation via ASR intelligibility check (local, no API) + Gemini-as-judge (qwen-tts check). Use when the user wants to clone a voice, train a TTS model, prepare voice training data, evaluate generated speech quality, or run the full collect→clean→prepare→train→generate→eval pipeline.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "qwen-tts-voice-cloning" with this command: npx skills add hewliyang/qwen-tts/hewliyang-qwen-tts-qwen-tts-voice-cloning

Qwen-TTS Voice Cloning Pipeline

Full ML lifecycle for voice cloning on Apple Silicon (MLX). The qwen-tts package is installed via pip/uv and provides both a CLI and Python API.

Installation

# Install the package (includes mlx-audio, librosa, etc.)
uv pip install qwen-tts

# Additional tools for data collection & cleaning
brew install yt-dlp ffmpeg  # if not already installed
uv pip install google-genai python-dotenv  # for eval step

Quick Reference

qwen-tts split     <audio> -o ./clips              # ASR-based split + transcribe
qwen-tts prepare   --data <dir> -o train.jsonl
qwen-tts train     --name <speaker> --data train.jsonl -o ./<speaker>_voice
qwen-tts generate  -m custom-voice --voice-model <dir> --speaker <name> -p "text" -o out.wav
qwen-tts check     -g <generated> -r <reference> -S "<Speaker Name>"
qwen-tts voices    [dir]
qwen-tts speakers  --voice-model <dir>

Always start with the 0.6B model. It trains faster, uses less memory, and produces good results. Only try 1.7B if 0.6B quality is insufficient.

Pipeline Overview

1. Collect     →  yt-dlp + ffmpeg
2. Split       →  qwen-tts split (ASR sentence-aligned splitting + transcription)
3. Prepare     →  qwen-tts prepare (encodes audio → codec tokens)
4. Train       →  qwen-tts train (LoRA SFT, ~15 min on M-series)
5. Generate    →  qwen-tts generate
6. Evaluate    →  qwen-tts check (Gemini-as-judge)
7. Iterate     →  adjust data / hyperparams based on eval

Step 1: Data Collection

Target: ~10 minutes of clean speech

# Download audio from YouTube
yt-dlp -x --audio-format wav -o "raw_audio/%(title)s.%(ext)s" "https://youtube.com/watch?v=VIDEO_ID"

# Convert to 24kHz mono WAV (required format)
ffmpeg -i "raw_audio/source.wav" -ar 24000 -ac 1 "raw_audio/source_24k.wav"

Multiple YouTube sources

for url in "URL1" "URL2" "URL3"; do
  yt-dlp -x --audio-format wav -o "raw_audio/%(id)s.%(ext)s" "$url"
done

# Combine into one file
ffmpeg -f concat -safe 0 -i <(for f in raw_audio/*.wav; do echo "file '$f'"; done) \
  -ar 24000 -ac 1 "raw_audio/combined_24k.wav"

Guidelines:

  • 24kHz mono WAV is the required format
  • Prefer clear speech (interviews, lectures, speeches)
  • Remove music, background noise, other speakers before splitting

Step 2: Split + Transcribe (ASR-based)

Uses parakeet ASR to split audio on sentence boundaries and generate transcripts in one step. No more blind fixed-duration chopping.

qwen-tts split raw_audio/source_24k.wav -o ./clips --max-dur 8

This will:

  1. Run parakeet ASR on the full audio → sentence-level timestamps
  2. Merge short sentences into clips (3–8s target range)
  3. Slice audio at sentence boundaries
  4. Write clips + transcript.txt

Options

  • --min-dur 3.0 — minimum clip duration (seconds)
  • --max-dur 8.0 — maximum clip duration (seconds). Keep ≤10s — longer clips use more memory during training and can cause OOM with gradient accumulation.
  • --pad 0.1 — padding around clip boundaries (seconds)
  • --asr-model mlx-community/parakeet-tdt-0.6b-v3 — ASR model

After splitting, review transcript.txt and:

  • Fix transcription errors (they become training labels)
  • Remove clips with multiple speakers or heavy background noise
  • Ensure proper punctuation and capitalization

transcript.txt format

clip_000.wav|The stage that we have reached in Singapore. It was colorful, it was moving.
clip_001.wav|These two need not run counter to each other. But in practice, they tend to...
clip_002.wav|And the human knowledge approach tends to complicate methods in ways that...

One line per clip: filename|transcription. Filename is relative to the data directory.


Step 3: Prepare Training Data

Encodes each audio clip through the speech tokenizer → codec tokens. Takes ~1-2 min for 75 clips.

qwen-tts prepare --data ./clips -o ./clips/train.jsonl --model-id mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16

This will:

  1. Read clips/transcript.txt automatically (or pass --transcript)
  2. Use first clip as reference audio (or pass --ref-audio)
  3. Encode each clip → 16-codebook tokens at 12Hz
  4. Write train.jsonl

train.jsonl format

Each line is a JSON object:

{
  "audio": "clips/clip_000.wav",
  "text": "The stage that we have reached in Singapore.",
  "ref_audio": "./clips/clip_000.wav",
  "audio_codes": [[1995, 431, 581, ...], [215, 431, 321, ...], ...]
}
  • audio_codes: 2D array [time_frames, 16] — 16 codebooks at 12Hz

Directory structure after prepare

clips/
  clip_000.wav ... clip_074.wav
  transcript.txt
  train.jsonl

Step 4: Train (LoRA SFT)

qwen-tts train \
  --name <speaker_name> \
  --data ./clips/train.jsonl \
  --output ./<speaker_name>_voice \
  --model-id mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16 \
  --epochs 3 \
  --lr 2e-5 \
  --grad-accum 4 \
  --lora-rank 8

Hyperparameters

ParameterDefaultNotes
--epochs32-5 typical. More risks overfitting with small data.
--lr2e-5Lower (1e-5) for clean data, higher (5e-5) if underfitting.
--grad-accum4Effective batch = batch_size × grad_accum. Use 2 if clips are long (>8s) or training hangs — reduces peak memory from holding gradients.
--lora-rank84 for subtle shifts, 16 for stronger adaptation.
--lora-scale20.0Higher = stronger LoRA effect.
--batch-size1Keep at 1 unless >64GB RAM.

What to watch

  • Loss should drop from ~4-6 to ~2-3 over 3 epochs
  • Loss plateaus above 4 → data quality issue or lr too low
  • Loss below 1 → overfitting, reduce epochs
  • Epoch 0 is slow (~15 min) due to one-time Metal JIT shader compilation; subsequent epochs are fast (~40-60s each)
  • Peak memory: ~5-7GB

Step 5: Generate

qwen-tts generate \
  -m custom-voice \
  --voice-model ./<speaker_name>_voice \
  --speaker <speaker_name> \
  --size 0.6B \
  -p "Text to synthesize." \
  -o output.wav

Batch generate for evaluation

mkdir -p eval_clips
prompts=(
  "The future depends on what we do today."
  "We must be pragmatic and face reality as it is."
  "Education is the key to progress and prosperity."
)
for i in "${!prompts[@]}"; do
  qwen-tts generate -m custom-voice --voice-model ./<speaker_name>_voice \
    --speaker <speaker_name> --size 0.6B \
    -p "${prompts[$i]}" \
    -o "eval_clips/gen_$(printf '%02d' $i).wav"
done

Options

  • --seed N — reproducible output
  • --play — play audio after generation
  • Generation speed: ~0.5x real-time on M-series

Step 6: Evaluate

Two complementary checks. Run ASR first (free, local, fast) to catch intelligibility issues, then Gemini (API calls) for speaker similarity.

6a: ASR intelligibility check (local, no API key)

Runs parakeet on generated clips, compares transcription against the prompt text via word error rate (WER). Catches garbled speech, missing words, mispronunciations.

# ASR-only — no API key needed
qwen-tts check \
  -g ./eval_clips \
  --asr-only \
  --expected-text \
    "gen_00.wav=The future depends on what we do today." \
    "gen_01.wav=We must be pragmatic and face reality as it is."

6b: Gemini speaker similarity check (API calls)

Requires GEMINI_API_KEY in environment or .env file. If not set, ask the user to provide one.

qwen-tts check \
  -g ./eval_clips \
  -r ./clips \
  -S "Speaker Name" \
  --max-clips 3 \
  --pairs 3

6c: Both together (recommended)

qwen-tts check \
  -g ./eval_clips \
  -r ./clips \
  -S "Speaker Name" \
  --max-clips 3 \
  --pairs 3 \
  --expected-text \
    "gen_00.wav=The future depends on what we do today." \
    "gen_01.wav=We must be pragmatic and face reality as it is."

# Add --json for machine-readable output

How it works

ASR check (local, 0 API calls): runs mlx_audio.stt.generate with parakeet-tdt-0.6b-v3, computes WER against expected text. Only runs for clips with --expected-text.

Gemini check — total API calls = max_clips + pairs (report computed locally):

  • Single clip eval (1 call/clip):

    {
      "speaker_match": "yes|no|uncertain",
      "confidence": "low|medium|high",
      "natural": true,
      "audio_quality": "poor|fair|good|excellent",
      "issues": []
    }
    
  • Pair comparison (1 call/pair): sends ref + gen clip side by side:

    {
      "same_speaker": "yes|no|uncertain",
      "similarity_score": 9,
      "accent_match": true,
      "cadence_match": true,
      "tone_match": true,
      "issues": []
    }
    

Report (computed locally from above):

{
  "overall_pass": true,
  "avg_wer": 0.0,
  "intelligibility_rate": 1.0,
  "speaker_match_rate": 1.0,
  "naturalness_rate": 1.0,
  "avg_similarity": 9.0
}

Pass criteria

  • Intelligibility rate ≥ 80% (WER < 30% per clip)
  • Speaker match rate ≥ 80% (Gemini)
  • Naturalness rate ≥ 80% (Gemini)
  • Avg similarity ≥ 6/10 (Gemini pair comparison)

Python API

from qwen_tts.check import CheckConfig, run_check

result = run_check(CheckConfig(
    generated="./eval_clips",
    reference="./clips",
    speaker="Speaker Name",
    max_clips=3,
    pairs=3,
    expected_texts={"gen_00.wav": "The future depends on what we do today."},
))
# result["report"]["overall_pass"] → True/False
# result["asr_results"] → per-clip WER + transcription
# result["single_results"] → Gemini per-clip evals
# result["pair_results"] → Gemini pair comparisons

Step 7: Iterate

Common eval failures → fixes

IssueFix
High WER / unintelligibleBad training data quality, or overfitting — reduce epochs
Low speaker matchMore/better data, increase epochs or lora-rank
Robotic / not naturalReduce epochs (overfitting), remove noisy clips
Wrong accentRemove clips with different accents
Audio glitchesCheck source audio quality, remove bad clips
Low pair similarityIncrease lora-rank (8→16), add more data

Full automated loop

qwen-tts prepare --data ./clips -o ./clips/train.jsonl --model-id mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16
qwen-tts train --name spk --data ./clips/train.jsonl -o ./spk_voice --model-id mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16
mkdir -p eval_clips
qwen-tts generate -m custom-voice --voice-model ./spk_voice --speaker spk --size 0.6B \
  -p "Test sentence one." -o eval_clips/gen_01.wav
qwen-tts generate -m custom-voice --voice-model ./spk_voice --speaker spk --size 0.6B \
  -p "Test sentence two." -o eval_clips/gen_02.wav

# Quick local check first (no API key)
qwen-tts check -g eval_clips --asr-only \
  --expected-text "gen_01.wav=Test sentence one." "gen_02.wav=Test sentence two."

# Full check with Gemini
qwen-tts check -g eval_clips -r clips -S "Speaker Name" \
  --max-clips 2 --pairs 2 \
  --expected-text "gen_01.wav=Test sentence one." "gen_02.wav=Test sentence two."
# If FAIL → adjust data/hyperparams and re-run

Data Requirements Summary

RequirementValue
Audio formatWAV, 24kHz, mono, 16-bit PCM
Clip duration3-8 seconds (5-8s ideal). Keep ≤10s — longer clips cause OOM during gradient accumulation.
Total duration~8-10 minutes (50-80 clips). ~7,500 total codec frames is the sweet spot.
ContentSingle speaker, clean speech, minimal noise
TranscriptAccurate text matching the speech

Troubleshooting

ProblemSolution
No module named 'dotenv'uv pip install python-dotenv
No module named 'google.genai'uv pip install google-genai
speech_tokenizer encoder not foundUse the base model for prepare
OOM / hang during trainingReduce --grad-accum to 2, trim clips to ≤8s, reduce total to ~50-60 clips. Long clips + high grad-accum holds too many gradients in memory.
OOM during split (ASR)Source audio too long (>~40 min). Split into chunks with ffmpeg first, run qwen-tts split on each, then merge clips + transcripts into one directory. See below.
OOM during generationClose other apps, or use --size 0.6B
incorrect regex pattern warningHarmless — ignore
model of type qwen3_tts warningHarmless — ignore

Chunking long source audio for split

If qwen-tts split OOMs on a long file (>~40 min), chunk it with ffmpeg first:

mkdir -p raw_audio/chunks
ffmpeg -i raw_audio/source_24k.wav -f segment -segment_time 600 -c copy raw_audio/chunks/chunk_%02d.wav

for chunk in raw_audio/chunks/chunk_*.wav; do
  qwen-tts split "$chunk" -o "clips_tmp/$(basename "$chunk" .wav)"
done

# Merge into one directory
mkdir -p clips && idx=0
for dir in clips_tmp/chunk_*/; do
  while IFS='|' read -r f t; do
    new=$(printf "clip_%03d.wav" $idx)
    cp "${dir}${f}" "clips/${new}"
    echo "${new}|${t}" >> clips/transcript.txt
    idx=$((idx + 1))
  done < "${dir}transcript.txt"
done

Available Models

KeySizeHF Repo
base1.7Bmlx-community/Qwen3-TTS-12Hz-1.7B-Base-bf16
base0.6Bmlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16
custom-voice1.7Bmlx-community/Qwen3-TTS-12Hz-1.7B-CustomVoice-bf16
custom-voice0.6Bmlx-community/Qwen3-TTS-12Hz-0.6B-CustomVoice-bf16
voice-design1.7Bmlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-bf16

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

General

qwen-tts

No summary provided by upstream source.

Repository SourceNeeds Review
General

spreadjs

No summary provided by upstream source.

Repository SourceNeeds Review
General

qwen-tts

No summary provided by upstream source.

Repository SourceNeeds Review
General

PayPilot by AGMS

Process payments, send invoices, issue refunds, manage subscriptions, and detect fraud via a secure payment gateway proxy. Use when a user asks to charge som...

Registry SourceRecently Updated