EchoForge Moss Voice

Use this skill to run voice interactions in the user's preferred timbre.

Required runtime config

Always send:

Collect:

text (required, what to speak)
Voice source (one of):
- voice_id (preferred when available), or
- reference_audio (public URL), or
- local audio path (upload first, then clone voice)

Optional:

expected_duration_sec
sampling_params:
- max_new_tokens (default 512)
- temperature (default 1.7)
- top_p (default 0.8)
- top_k (default 25)
meta_info (default false)
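
As a minimal sketch, the optional fields and their defaults could be assembled as below. The helper name build_optional_fields is hypothetical, and nesting the four sampling values under a sampling_params object is an assumption read from the list layout above, not a confirmed request schema.

```python
from typing import Any, Dict, Optional

# Defaults copied from the list above.
DEFAULT_SAMPLING: Dict[str, Any] = {
    "max_new_tokens": 512,
    "temperature": 1.7,
    "top_p": 0.8,
    "top_k": 25,
}


def build_optional_fields(
    expected_duration_sec: Optional[float] = None,
    sampling_overrides: Optional[Dict[str, Any]] = None,
    meta_info: bool = False,
) -> Dict[str, Any]:
    """Hypothetical helper: merge caller overrides over the documented defaults."""
    overrides = sampling_overrides or {}
    unknown = set(overrides) - set(DEFAULT_SAMPLING)
    if unknown:
        raise ValueError(f"unknown sampling params: {sorted(unknown)}")

    fields: Dict[str, Any] = {
        # Assumption: sampling values travel as one nested object.
        "sampling_params": {**DEFAULT_SAMPLING, **overrides},
        "meta_info": meta_info,
    }
    # Only send expected_duration_sec when the caller supplied it.
    if expected_duration_sec is not None:
        fields["expected_duration_sec"] = expected_duration_sec
    return fields
```

Rejecting unknown keys early keeps a typo such as "temprature" from being silently ignored.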

Resolve voice source.
- If voice_id is available, use it directly.
- If only local audio path is available:
  - Upload file: POST /api/v1/files/upload with multipart field file.
  - Clone voice: POST /api/v1/voice/clone with file_id (or url).
  - If the returned voice status is not ACTIVE, poll GET /api/v1/voices/{voice_id} until it is ACTIVE or a timeout is reached.
- If reference_audio URL is available, use it directly in TTS.
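
The resolution order above can be sketched as follows. The HTTP calls are passed in as callables (upload, clone, fetch_status), which are hypothetical wrappers around the three endpoints. Their return shapes (a file_id string, a (voice_id, status) pair, a status string) and the timeout and interval defaults are assumptions, since the response formats are not described here. Status is compared case-insensitively.

```python
import time
from typing import Callable, Dict, Optional, Tuple


def wait_until_active(
    voice_id: str,
    fetch_status: Callable[[str], str],
    timeout_s: float = 60.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll the voice status until it is ACTIVE, or raise TimeoutError."""
    deadline = time.monotonic() + timeout_s
    while True:
        if fetch_status(voice_id).upper() == "ACTIVE":
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"voice {voice_id} not ACTIVE after {timeout_s}s")
        sleep(interval_s)


def resolve_voice_source(
    voice_id: Optional[str] = None,
    reference_audio: Optional[str] = None,
    local_path: Optional[str] = None,
    *,
    upload: Callable[[str], str],
    clone: Callable[[str], Tuple[str, str]],
    fetch_status: Callable[[str], str],
) -> Dict[str, str]:
    """Return the single voice field to include in the TTS payload."""
    if voice_id:
        return {"voice_id": voice_id}  # preferred when available
    if reference_audio:
        return {"reference_audio": reference_audio}  # public URL, used directly
    if local_path:
        file_id = upload(local_path)           # POST /api/v1/files/upload
        new_id, status = clone(file_id)        # POST /api/v1/voice/clone
        if status.upper() != "ACTIVE":
            wait_until_active(new_id, fetch_status)  # GET /api/v1/voices/{voice_id}
        return {"voice_id": new_id}
    raise ValueError("need voice_id, reference_audio, or a local audio path")
```

Injecting the callables keeps the branching logic testable without a live server.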
Run TTS: POST /v1/audio/tts.
- Required payload:
  - model: "moss-tts"
  - text
  - one of voice_id or reference_audio
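
A minimal sketch of the request, using only the Python standard library. The function names and the base_url argument are hypothetical. A JSON body is an assumption, and authentication headers are omitted because they are not described here. When both voice fields are supplied, the sketch sends voice_id only, following the "preferred when available" note.

```python
import json
import urllib.request
from typing import Any, Dict, Optional


def build_tts_payload(
    text: str,
    voice_id: Optional[str] = None,
    reference_audio: Optional[str] = None,
) -> Dict[str, Any]:
    """Build the required TTS fields: model, text, and one voice source."""
    if not text:
        raise ValueError("text is required")
    payload: Dict[str, Any] = {"model": "moss-tts", "text": text}
    if voice_id:
        payload["voice_id"] = voice_id
    elif reference_audio:
        payload["reference_audio"] = reference_audio
    else:
        raise ValueError("one of voice_id or reference_audio is required")
    return payload


def post_tts(base_url: str, payload: Dict[str, Any], timeout_s: float = 60.0) -> Dict[str, Any]:
    """POST the payload to /v1/audio/tts and return the parsed JSON response."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/v1/audio/tts",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumption: JSON body
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))
```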
Parse response:
- Base64-decode audio_data and save the bytes as a WAV file.
- Read duration_s and usage when present.
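
Response parsing might look like the sketch below. It assumes audio_data, duration_s, and usage are top-level keys of the JSON response, which is not confirmed here. The helper name save_tts_audio is hypothetical.

```python
import base64
from pathlib import Path
from typing import Any, Dict


def save_tts_audio(response: Dict[str, Any], out_path: str) -> Dict[str, Any]:
    """Decode base64 audio_data to a file and collect optional metadata."""
    # validate=True rejects characters outside the base64 alphabet.
    audio_bytes = base64.b64decode(response["audio_data"], validate=True)
    Path(out_path).write_bytes(audio_bytes)
    return {
        "path": out_path,
        # Both fields are optional, so .get() returns None when they are absent.
        "duration_s": response.get("duration_s"),
        "usage": response.get("usage"),
    }
```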
Return a concise result:
- voice_id used
- output file path
- duration
- brief status message
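
One possible shape for the concise result, as a sketch. The function name format_result and the exact wording of the message are illustrative, not prescribed by this skill.

```python
from typing import Optional


def format_result(
    voice_id: Optional[str],
    output_path: str,
    duration_s: Optional[float],
    ok: bool = True,
) -> str:
    """Summarize the run in one line: status, voice, file, duration."""
    status = "TTS completed" if ok else "TTS failed"
    voice = voice_id if voice_id else "n/a (reference_audio used)"
    duration = f"{duration_s:.2f}s" if duration_s is not None else "unknown"
    return f"{status}: voice_id={voice}, file={output_path}, duration={duration}"
```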