together-audio

Text-to-speech (TTS) and speech-to-text (STT) via Together AI. TTS models include Orpheus, Kokoro, Cartesia Sonic, Rime, and MiniMax, with REST, streaming, and WebSocket support. STT models include Whisper and Voxtral. Use when users need voice synthesis, audio generation, speech recognition, transcription, or real-time voice applications.


Install this skill:

npx skills add zainhas/togetherai-skills/zainhas-togetherai-skills-together-audio

Together Audio (TTS & STT)

Overview

Together AI provides text-to-speech and speech-to-text capabilities.

TTS — Generate speech from text via REST, streaming, or WebSocket:

  • Endpoint: /v1/audio/speech
  • WebSocket: wss://api.together.xyz/v1/audio/speech/websocket

STT — Transcribe audio to text:

  • Endpoint: /v1/audio/transcriptions

Installation

# Python (recommended)
uv init  # optional, if starting a new project
uv add together
# or with pip
pip install together
# TypeScript / JavaScript
npm install together-ai

Set your API key:

export TOGETHER_API_KEY=<your-api-key>
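The SDKs read the key from the environment. A small guard like the following can fail fast with a clear error instead of a 401 deep inside a request (`require_api_key` is a sketch for illustration, not part of the Together SDK):

```python
import os

def require_api_key(env=os.environ) -> str:
    """Return the Together API key, failing fast with a clear error if it is unset."""
    key = env.get("TOGETHER_API_KEY", "")
    if not key:
        raise RuntimeError(
            "TOGETHER_API_KEY is not set; run `export TOGETHER_API_KEY=<your-api-key>`"
        )
    return key
```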

TTS Quick Start

Basic Speech Generation

# Python
from together import Together
client = Together()

response = client.audio.speech.create(
    model="canopylabs/orpheus-3b-0.1-ft",
    input="Today is a wonderful day to build something people love!",
    voice="tara",
    response_format="mp3",
)
response.stream_to_file("speech.mp3")

// TypeScript / JavaScript
import Together from "together-ai";
import { Readable } from "stream";
import { createWriteStream } from "fs";

const together = new Together();

async function generateAudio() {
  const res = await together.audio.create({
    input: "Today is a wonderful day to build something people love!",
    voice: "tara",
    response_format: "mp3",
    sample_rate: 44100,
    stream: false,
    model: "canopylabs/orpheus-3b-0.1-ft",
  });

  if (res.body) {
    const nodeStream = Readable.from(res.body as ReadableStream);
    const fileStream = createWriteStream("./speech.mp3");
    nodeStream.pipe(fileStream);
  }
}

generateAudio();

# cURL
curl -X POST "https://api.together.xyz/v1/audio/speech" \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"canopylabs/orpheus-3b-0.1-ft","input":"Hello world","voice":"tara","response_format":"mp3"}' \
  --output speech.mp3
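If the request fails, the API's JSON error body ends up in `speech.mp3`, so it can help to sanity-check the bytes before playing them. This heuristic (`looks_like_mp3` is a hypothetical helper, not part of any SDK) checks for an ID3 tag or an MPEG frame-sync header:

```python
def looks_like_mp3(data: bytes) -> bool:
    """Heuristic: MP3 files start with an 'ID3' tag or an MPEG frame sync
    (a 0xFF byte followed by a byte whose top three bits are set)."""
    if data[:3] == b"ID3":
        return True
    return len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0
```

A JSON error body starts with `{`, so it fails both checks.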

Streaming Audio (Low Latency)

# Python
response = client.audio.speech.create(
    model="canopylabs/orpheus-3b-0.1-ft",
    input="The quick brown fox jumps over the lazy dog",
    voice="tara",
    stream=True,
    response_format="raw",
    response_encoding="pcm_s16le",
)
response.stream_to_file("speech.wav", response_format="wav")

// TypeScript / JavaScript
import Together from "together-ai";

const together = new Together();

async function streamAudio() {
  const response = await together.audio.speech.create({
    model: "canopylabs/orpheus-3b-0.1-ft",
    input: "The quick brown fox jumps over the lazy dog",
    voice: "tara",
    stream: true,
    response_format: "raw",
    response_encoding: "pcm_s16le",
  });

  const chunks = [];
  for await (const chunk of response) {
    chunks.push(chunk);
  }

  console.log("Streaming complete!");
}

streamAudio();

WebSocket (Lowest Latency)

import asyncio
import base64
import json
import os

import websockets

async def generate_speech() -> bytes:
    api_key = os.environ["TOGETHER_API_KEY"]
    url = "wss://api.together.xyz/v1/audio/speech/websocket?model=hexgrad/Kokoro-82M&voice=af_alloy"
    headers = {"Authorization": f"Bearer {api_key}"}

    async with websockets.connect(url, additional_headers=headers) as ws:
        session = json.loads(await ws.recv())  # first message describes the session
        await ws.send(json.dumps({"type": "input_text_buffer.append", "text": "Hello!"}))
        await ws.send(json.dumps({"type": "input_text_buffer.commit"}))

        audio_data = bytearray()
        async for msg in ws:
            data = json.loads(msg)
            if data["type"] == "conversation.item.audio_output.delta":
                audio_data.extend(base64.b64decode(data["delta"]))
            elif data["type"] == "conversation.item.audio_output.done":
                break
        return bytes(audio_data)

audio = asyncio.run(generate_speech())

TTS Models

| Model | API String | Endpoints | Price |
|---|---|---|---|
| Orpheus 3B | canopylabs/orpheus-3b-0.1-ft | REST, Streaming, WebSocket | $15/1M chars |
| Kokoro | hexgrad/Kokoro-82M | REST, Streaming, WebSocket | $4/1M chars |
| Cartesia Sonic 2 | cartesia/sonic-2 | REST | - |
| Cartesia Sonic | cartesia/sonic | REST | - |
| Rime Arcana v3 Turbo | rime-labs/rime-arcana-v3-turbo | REST, Streaming, WebSocket | DE only |
| MiniMax Speech 2.6 | minimax/speech-2.6-turbo | REST, Streaming, WebSocket | DE only |

TTS Parameters

| Parameter | Type | Description | Default |
|---|---|---|---|
| model | string | TTS model (required) | - |
| input | string | Text to synthesize (required) | - |
| voice | string | Voice ID (required) | - |
| response_format | string | mp3, wav, raw, mulaw | wav |
| stream | bool | Enable streaming (raw format only) | false |
| response_encoding | string | pcm_f32le, pcm_s16le, pcm_mulaw, pcm_alaw (raw format only) | - |
| language | string | Language of input text: en, de, fr, es, hi, it, ja, ko, nl, pl, pt, ru, sv, tr, zh | en |
| sample_rate | int | Audio sample rate (e.g., 44100) | - |
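Raw `pcm_s16le` output has no container header, so most players will not open it directly. Python's standard `wave` module can wrap it in a WAV container; the mono/44100 defaults below are assumptions for illustration, so match them to the sample rate you requested:

```python
import wave

def pcm_s16le_to_wav(pcm: bytes, path: str, sample_rate: int = 44100, channels: int = 1) -> None:
    """Wrap headerless little-endian 16-bit PCM bytes in a WAV container."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(2)          # 2 bytes per sample = 16-bit
        wf.setframerate(sample_rate)
        wf.writeframes(pcm)
```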

List Available Voices

response = client.audio.voices.list()
for model_voices in response.data:
    print(f"Model: {model_voices.model}")
    for voice in model_voices.voices:
        print(f"  - {voice.name}")

Key voices: Orpheus: tara, leah, leo, dan, mia, zac. Kokoro: af_alloy, af_bella, am_adam, am_echo. See references/tts-models.md for complete voice lists.

STT Quick Start

Transcribe Audio

# Python
with open("audio.mp3", "rb") as f:
    response = client.audio.transcriptions.create(
        model="openai/whisper-large-v3",
        file=f,
    )
print(response.text)

// TypeScript / JavaScript
import Together from "together-ai";

const together = new Together();

const transcription = await together.audio.transcriptions.create({
  file: "path/to/audio.mp3",
  model: "openai/whisper-large-v3",
  language: "en",
});
console.log(transcription.text);

# cURL
curl -X POST "https://api.together.xyz/v1/audio/transcriptions" \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -F model="openai/whisper-large-v3" \
  -F file=@audio.mp3

STT Models

| Model | API String |
|---|---|
| Whisper Large v3 | openai/whisper-large-v3 |
| Voxtral Mini 3B | mistralai/Voxtral-Mini-3B-2507 |

Delivery Method Guide

  • REST: Batch processing, complete audio files
  • Streaming: Real-time apps where TTFB matters
  • WebSocket: Interactive/conversational apps, lowest latency
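As a rule of thumb, the guide above can be encoded as a small dispatcher (a sketch for illustration, not an SDK function):

```python
def pick_delivery_method(interactive: bool, latency_sensitive: bool) -> str:
    """Map a use case to a TTS delivery method, following the guide above."""
    if interactive:
        return "websocket"   # conversational apps, lowest latency
    if latency_sensitive:
        return "streaming"   # low time-to-first-byte for one-shot synthesis
    return "rest"            # batch jobs, complete audio files
```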
