AssemblyAI Streaming & Live Transcription Skill
Overview
Use this skill to build and maintain code that talks to AssemblyAI's:

- Streaming Speech-to-Text (STT) via WebSockets (wss://streaming.assemblyai.com/v3/ws)
- Async / pre-recorded STT via REST (https://api.assemblyai.com/v2/transcript)
- LLM Gateway for applying Claude/GPT/Gemini-style models to transcripts (https://llm-gateway.assemblyai.com)
The emphasis is on streaming/live transcription, meeting notetakers, and voice agents, while still covering async workflows and post-processing.
This skill assumes a Claude Code environment with access to Python (preferred) and Bash.
When to Use
Use this skill when:

- Implementing real-time transcription from a microphone, telephony stream, or audio file.
- Building a live meeting notetaker (Zoom/Teams/Meet), especially with summaries, action items, and highlights.
- Implementing a voice agent where latency and natural turn-taking matter.
- Migrating from other STT providers (OpenAI/Deepgram/Google/AWS/etc.) to AssemblyAI.
- Applying LLMs to audio via LLM Gateway for summaries, Q&A, topic tagging, or custom prompts.

Do not use this skill when:

- The task is generic HTTP client usage with no AssemblyAI-specific logic.
- The request clearly targets a different STT vendor.
- The environment cannot safely store or use an API key.
AssemblyAI Mental Model
- Products to care about
  - Pre-recorded Speech-to-Text (Async)
    - REST API: POST /v2/transcript → GET /v2/transcript/{id}
    - Designed for files from URLs, uploads, S3, etc.
    - Supports extra models: summarization, topic detection, sentiment, PII redaction, chapters, etc.
  - Streaming Speech-to-Text
    - WebSocket: wss://streaming.assemblyai.com/v3/ws
    - Low-latency, immutable transcripts (~300 ms).
    - Turn detection built in; fits voice agents and live captioning.
  - LLM Gateway
    - REST API: POST /v1/chat/completions at https://llm-gateway.assemblyai.com
    - Unified access to multiple LLMs (Claude, GPT, Gemini, etc.).
    - Designed for "LLM over transcripts" workflows.
- Key model knobs (Async)
  - speech_model: "slam-1", "universal", etc.
    - Slam-1: best English accuracy plus keyterms_prompt support; good for medical/technical conversations.
    - Universal: multilingual coverage; a good default if the language is unknown.
  - language_code vs language_detection:
    - Use language_code when the language is known.
    - Use language_detection: true when unknown; optionally set language_confidence_threshold.
  - keyterms_prompt: domain words/phrases to boost (medical terms, product names, etc.).
  - Extra intelligence: summarization, iab_categories, content_safety, entity_detection, auto_chapters, sentiment_analysis, speaker_labels, auto_highlights, redact_pii, etc. A request sketch follows this list.
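A minimal sketch of the async flow using these knobs via raw REST (requests assumed installed; the audio URL is a placeholder):

```python
import time
import requests

API_KEY = "<YOUR_API_KEY>"
BASE = "https://api.assemblyai.com/v2"
HEADERS = {"authorization": API_KEY}

# Submit the job. The audio URL is a placeholder for illustration.
payload = {
    "audio_url": "https://example.com/meeting.mp3",
    "speech_model": "slam-1",                     # or "universal"
    "keyterms_prompt": ["AssemblyAI", "Slam-1"],  # boost domain terms
    # "language_detection": True,                 # instead of a fixed language_code
}
job = requests.post(f"{BASE}/transcript", json=payload, headers=HEADERS).json()

# Poll until the transcript completes or errors.
while True:
    result = requests.get(f"{BASE}/transcript/{job['id']}", headers=HEADERS).json()
    if result["status"] in ("completed", "error"):
        break
    time.sleep(3)

if result["status"] == "error":
    raise RuntimeError(result["error"])
print(result["text"])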
- Key model knobs (Streaming)

Connection URL:

- US: wss://streaming.assemblyai.com/v3/ws
- EU: wss://streaming.eu.assemblyai.com/v3/ws

Important query parameters:

- sample_rate (required): e.g. 16000.
- format_turns (bool): return formatted final transcripts; avoid for low-latency voice agents.
- speech_model: universal-streaming-english (default) or universal-streaming-multi.
- keyterms_prompt: JSON-encoded list of terms, e.g. ["AssemblyAI", "Slam-1", "Keanu Reeves"].

Turn detection:

- end_of_turn_confidence_threshold (0.0–1.0, default ~0.4)
- min_end_of_turn_silence_when_confident (ms, default ~400)
- max_turn_silence (ms, default ~1280)
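As a sketch, these parameters can be assembled into a connection URL like so (all values illustrative):

```python
import json
from urllib.parse import urlencode

params = {
    "sample_rate": 16000,
    "format_turns": "false",  # keep finals unformatted for lower latency
    "speech_model": "universal-streaming-english",
    "keyterms_prompt": json.dumps(["AssemblyAI", "Slam-1"]),  # JSON-encoded list
    "end_of_turn_confidence_threshold": 0.4,
}
url = f"wss://streaming.assemblyai.com/v3/ws?{urlencode(params)}"
```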
Headers:

- Use either Authorization: <API_KEY> or a short-lived token query parameter issued by your backend.
Messages:

- Client sends:
  - Binary audio chunks (50–1000 ms each).
  - Optional JSON messages: {"type": "UpdateConfig", ...}, {"type": "Terminate"}, {"type": "ForceEndpoint"}.
- Server sends:
  - Begin event with id and expires_at.
  - Turn events with:
    - transcript (immutable partials/finals),
    - utterance (complete semantic chunk),
    - end_of_turn (bool),
    - turn_is_formatted (bool),
    - a words array with timestamps/confidences.
  - Termination event with summary stats.
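A schematic sketch of this message flow over a raw WebSocket (websocket-client assumed installed; read_pcm_chunks is a hypothetical helper, and real code should handle reconnects and backpressure):

```python
import json
import threading
import websocket  # pip install websocket-client

def read_pcm_chunks(path="audio.raw", chunk_bytes=1600):
    # Hypothetical helper: yields 50 ms chunks of 16 kHz, 16-bit mono PCM.
    with open(path, "rb") as f:
        while chunk := f.read(chunk_bytes):
            yield chunk

url = "wss://streaming.assemblyai.com/v3/ws?sample_rate=16000"
ws = websocket.create_connection(url, header={"Authorization": "<YOUR_API_KEY>"})

def reader():
    # Print final turns as they arrive; stop on the Termination event.
    while True:
        msg = json.loads(ws.recv())
        if msg.get("type") == "Turn" and msg.get("end_of_turn"):
            print(msg.get("transcript"))
        elif msg.get("type") == "Termination":
            break

t = threading.Thread(target=reader, daemon=True)
t.start()

for chunk in read_pcm_chunks():
    ws.send_binary(chunk)

ws.send(json.dumps({"type": "Terminate"}))
t.join()
ws.close()
```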
- Regions and data residency
  - Async:
  - Streaming:
    - US: wss://streaming.assemblyai.com/v3/ws
    - EU: wss://streaming.eu.assemblyai.com/v3/ws

Always keep base URLs consistent per project; don't mix US/EU endpoints for the same data.
Security & API Keys
- Always require an AssemblyAI API key and keep it out of source in Claude Code output:
  - Use environment variables: ASSEMBLYAI_API_KEY.
  - Or placeholders ("<YOUR_API_KEY>") in snippets.
- For browser/client code:
  - Do not embed the API key.
  - Instruct the user to generate temporary streaming tokens on their backend and pass only the token into the WebSocket connection (see the sketch below).
- Never print real keys in logs or comments.
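A backend sketch for minting short-lived streaming tokens. It assumes a v3 token endpoint at /v3/token with an expires_in_seconds parameter; verify both against the current docs:

```python
import os
import requests

def mint_streaming_token(expires_in_seconds: int = 60) -> str:
    """Runs on the backend; the browser only ever sees the short-lived token."""
    resp = requests.get(
        "https://streaming.assemblyai.com/v3/token",
        params={"expires_in_seconds": expires_in_seconds},
        headers={"Authorization": os.environ["ASSEMBLYAI_API_KEY"]},
    )
    resp.raise_for_status()
    return resp.json()["token"]

# The client then connects with ?token=<value> instead of an Authorization header.
```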
High-Level Workflow Patterns
Decision tree
Is the audio live?

- Yes → Use Streaming STT.
- No → Use Async STT.

Is latency critical (<1 s) for responses?

- Yes → Streaming with format_turns=false and careful turn detection.
- No → Async, then Summarization/Chapters/etc.

Do transcripts leave the backend?

- Yes → Consider redact_pii (and optionally redact_pii_audio) before sharing.
- No → Use raw transcripts as needed.

Need LLM-based processing (Q&A, structured summaries)?

- Yes → Pipe transcripts into LLM Gateway via chat/completions (see the sketch below).
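A sketch of that last step, assuming an OpenAI-compatible chat/completions payload and a Bearer-style header; the model name is a placeholder:

```python
import os
import requests

transcript_text = "..."  # output of async or streaming STT

resp = requests.post(
    "https://llm-gateway.assemblyai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['ASSEMBLYAI_API_KEY']}"},
    json={
        "model": "<MODEL_NAME>",  # placeholder; pick a model the gateway exposes
        "messages": [
            {"role": "system", "content": "Summarize the meeting and list action items."},
            {"role": "user", "content": transcript_text},
        ],
    },
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```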
How Claude Should Work with This Skill
General principles
- Prefer official AssemblyAI SDKs (Python/JS) when available; fall back to requests/websocket-client only if the SDK cannot be installed.
- Always:
  - Validate HTTP responses and WebSocket status.
  - Surface useful error messages (the status and error fields in transcript JSON).
  - Respect documented min/max chunk sizes (50–1000 ms of audio per binary message); see the sizing sketch below.
- For voice-agent code, optimize for:
  - Immutable partials (transcript) and the utterance field.
  - Minimal latency; avoid extra formatting passes.
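For the chunk-size rule, the arithmetic is simple (sketch assumes 16 kHz, 16-bit mono PCM):

```python
SAMPLE_RATE = 16000   # Hz
BYTES_PER_SAMPLE = 2  # 16-bit PCM
CHUNK_MS = 50         # keep within the 50–1000 ms window

CHUNK_BYTES = SAMPLE_RATE * BYTES_PER_SAMPLE * CHUNK_MS // 1000  # 1600 bytes per message
```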
Recipe 1 – Minimal Streaming from Microphone (Python SDK)

Goal: Stream mic audio to AssemblyAI and print transcripts in real time.

Use this when the environment has Python with assemblyai and pyaudio installed, and the user wants a quick streaming demo.
```python
import assemblyai as aai
from assemblyai.streaming import v3 as aai_stream
import pyaudio

API_KEY = "<YOUR_API_KEY>"
aai.settings.api_key = API_KEY

SAMPLE_RATE = 16000
CHUNK_MS = 50
FRAMES_PER_BUFFER = int(SAMPLE_RATE * (CHUNK_MS / 1000.0))


def main():
    client = aai_stream.StreamingClient(
        aai_stream.StreamingClientOptions(
            api_key=API_KEY,
            api_host="streaming.assemblyai.com",  # or "streaming.eu.assemblyai.com"
        )
    )

    def on_begin(_client, event: aai_stream.BeginEvent):
        print(f"Session started: {event.id}, expires at {event.expires_at}")

    def on_turn(_client, event: aai_stream.TurnEvent):
        # Use immutable transcript text
        text = (event.transcript or "").strip()
        if not text:
            return
        # Use formatted finals only for display; keep unformatted for LLMs
        if event.turn_is_formatted:
            print(f"[FINAL] {text}")
        else:
            print(f"[PARTIAL] {text}", end="\r")

    def on_terminated(_client, event: aai_stream.TerminationEvent):
        print(f"\nTerminated. Audio duration={event.audio_duration_seconds}s")

    def on_error(_client, error: aai_stream.StreamingError):
        print(f"\nStreaming error: {error}")

    client.on(aai_stream.StreamingEvents.Begin, on_begin)
    client.on(aai_stream.StreamingEvents.Turn, on_turn)
    client.on(aai_stream.StreamingEvents.Termination, on_terminated)
    client.on(aai_stream.StreamingEvents.Error, on_error)

    client.connect(
        aai_stream.StreamingParameters(
            sample_rate=SAMPLE_RATE,
            format_turns=False,  # better latency for voice agents
        )
    )

    pa = pyaudio.PyAudio()
    stream = pa.open(
        format=pyaudio.paInt16,
        channels=1,
        rate=SAMPLE_RATE,
        input=True,
        frames_per_buffer=FRAMES_PER_BUFFER,
    )

    try:
        print("Speak into your microphone (Ctrl+C to stop)...")

        def audio_gen():
            while True:
                yield stream.read(FRAMES_PER_BUFFER, exception_on_overflow=False)

        client.stream(audio_gen())
    except KeyboardInterrupt:
        pass
    finally:
        client.disconnect(terminate=True)
        stream.stop_stream()
        stream.close()
        pa.terminate()


if __name__ == "__main__":
    main()
```
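To try the recipe, install the dependencies (pip install assemblyai pyaudio) and prefer reading the key from the ASSEMBLYAI_API_KEY environment variable over hard-coding it. Note that format_turns=False trades formatted finals for lower latency, which is usually the right default for voice agents.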