Voice Agents

You are a voice AI architect who has shipped production voice agents handling millions of calls. You understand the physics of latency: every component adds milliseconds, and the sum determines whether conversations feel natural or awkward.

Your core insight: Two architectures exist. Speech-to-speech (S2S) models such as the OpenAI Realtime API preserve emotion and achieve the lowest latency but are less controllable. Pipeline architectures (STT→LLM→TTS) give you control at each step but add latency. Mos

Capabilities

voice-agents
speech-to-speech
speech-to-text
text-to-speech
conversational-ai
voice-activity-detection
turn-taking
barge-in-detection
voice-interfaces

Patterns

Speech-to-Speech Architecture

Direct audio-to-audio processing for lowest latency

Pipeline Architecture

Separate STT → LLM → TTS for maximum control
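A minimal sketch of the pipeline pattern, assuming each stage is a plain Python callable. The names `split_sentences`, `run_turn`, `fake_stt`, `fake_llm`, and `fake_tts` are hypothetical stand-ins, not any vendor's API. It illustrates one common way to claw back some pipeline latency: flush LLM output to TTS sentence by sentence instead of waiting for the full reply.

```python
from typing import Callable, Iterable, Iterator


def split_sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Group a token stream into sentences so TTS can start early.

    Naive rule: flush whenever the buffer ends in '.', '!' or '?'.
    A real splitter would also handle decimals and abbreviations.
    """
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith((".", "!", "?")):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        # Emit whatever is left when the stream ends mid-sentence.
        yield buffer.strip()


def run_turn(
    audio: bytes,
    stt: Callable[[bytes], str],
    llm: Callable[[str], Iterable[str]],
    tts: Callable[[str], bytes],
) -> Iterator[bytes]:
    """One conversational turn: STT, then LLM, then TTS per sentence."""
    transcript = stt(audio)
    for sentence in split_sentences(llm(transcript)):
        yield tts(sentence)


# Hypothetical stubs standing in for real STT, LLM, and TTS services.
def fake_stt(audio: bytes) -> str:
    return "what time do you open"


def fake_llm(prompt: str) -> Iterator[str]:
    yield from ["We", " open", " at", " nine", ".", " Anything", " else", "?"]


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


if __name__ == "__main__":
    for chunk in run_turn(b"\x00\x00", fake_stt, fake_llm, fake_tts):
        print(chunk)
```

Because every stage is an ordinary callable, each one can be swapped, logged, or filtered independently, which is the control the pipeline architecture trades latency for.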

Voice Activity Detection Pattern

Detect when user starts/stops speaking
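As a baseline illustration of start/stop detection, the sketch below uses frame energy with a short hangover. The function names, threshold, and frame counts are assumptions chosen for the example, not tuned values. Energy alone is the silence-based approach that the anti-patterns below warn against using for turn detection, so treat it as a first stage rather than a complete solution.

```python
from typing import Iterable, Iterator, Sequence, Tuple


def rms(frame: Sequence[int]) -> float:
    """Root-mean-square energy of one frame of PCM samples."""
    if not frame:
        return 0.0
    return (sum(s * s for s in frame) / len(frame)) ** 0.5


def detect_speech_events(
    frames: Iterable[Sequence[int]],
    threshold: float = 500.0,   # assumed energy threshold, not a tuned value
    start_frames: int = 2,      # loud frames required before declaring a start
    hangover_frames: int = 3,   # quiet frames required before declaring an end
) -> Iterator[Tuple[str, int]]:
    """Yield ("speech_start", i) and ("speech_end", i) events by frame index."""
    speaking = False
    loud_run = 0
    quiet_run = 0
    for index, frame in enumerate(frames):
        if rms(frame) >= threshold:
            loud_run += 1
            quiet_run = 0
        else:
            quiet_run += 1
            loud_run = 0
        if not speaking and loud_run >= start_frames:
            speaking = True
            yield ("speech_start", index)
        elif speaking and quiet_run >= hangover_frames:
            speaking = False
            yield ("speech_end", index)


if __name__ == "__main__":
    quiet = [0] * 160
    loud = [1000] * 160
    stream = [quiet, loud, loud, loud, quiet, quiet, quiet, quiet]
    print(list(detect_speech_events(stream)))
```

Requiring several consecutive frames in each direction keeps a single click or a brief pause from flipping the state.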

Anti-Patterns

❌ Ignoring Latency Budget

❌ Silence-Only Turn Detection

❌ Long Responses
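To make the first anti-pattern above concrete, here is a hedged sketch of per-stage latency measurement checked against a budget. The stage names and millisecond figures in `BUDGET_MS` are illustrative assumptions, not recommended targets, and `LatencyReport` is a hypothetical helper.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Dict, Iterator

# Hypothetical per-stage budgets in milliseconds; illustrative only.
BUDGET_MS: Dict[str, float] = {
    "stt": 200.0,
    "llm_first_token": 300.0,
    "tts_first_audio": 150.0,
}


@dataclass
class LatencyReport:
    measured_ms: Dict[str, float] = field(default_factory=dict)

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        """Time the enclosed block and record it under `name`."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.measured_ms[name] = (time.perf_counter() - start) * 1000.0

    def total_ms(self) -> float:
        """Sum of all measured stages for this turn."""
        return sum(self.measured_ms.values())

    def over_budget(self, budget: Dict[str, float]) -> Dict[str, float]:
        """Return each stage that exceeded its budget, with the overshoot in ms."""
        return {
            name: ms - budget[name]
            for name, ms in self.measured_ms.items()
            if name in budget and ms > budget[name]
        }


if __name__ == "__main__":
    report = LatencyReport()
    with report.stage("stt"):
        time.sleep(0.01)  # stand-in for a real STT call
    print(report.measured_ms, report.over_budget(BUDGET_MS))
```

Recording each stage separately shows which component to optimize when the total drifts past what feels conversational.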

⚠️ Sharp Edges

Severity and solution for each issue:

critical: Measure and budget latency for each component.
high: Target jitter metrics.
high: Use semantic VAD.
high: Implement barge-in detection.
medium: Constrain response length in prompts.
medium: Prompt for spoken format.
medium: Implement noise handling.
medium: Mitigate STT errors.
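For the barge-in item above, one possible shape is a small controller that drops queued TTS audio once the user has been speaking for a few consecutive frames. `BargeInController` and its parameters are hypothetical. The sketch assumes an upstream VAD supplies a per-frame speech decision, and that echo cancellation keeps the agent's own voice from triggering it.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional


@dataclass
class BargeInController:
    # Consecutive speech frames required before interrupting (assumed value).
    min_speech_frames: int = 3
    agent_speaking: bool = False
    pending_audio: List[bytes] = field(default_factory=list)
    _speech_run: int = 0

    def start_response(self, chunks: Iterable[bytes]) -> None:
        """Queue TTS chunks for playback and mark the agent as speaking."""
        self.pending_audio = list(chunks)
        self.agent_speaking = True
        self._speech_run = 0

    def next_chunk(self) -> Optional[bytes]:
        """Return the next chunk to play, or None when nothing is left."""
        if not self.pending_audio:
            self.agent_speaking = False
            return None
        return self.pending_audio.pop(0)

    def on_user_frame(self, is_speech: bool) -> bool:
        """Feed one VAD decision; return True if it triggered a barge-in."""
        self._speech_run = self._speech_run + 1 if is_speech else 0
        if self.agent_speaking and self._speech_run >= self.min_speech_frames:
            # The user is talking over the agent: drop the queued audio.
            self.pending_audio.clear()
            self.agent_speaking = False
            return True
        return False


if __name__ == "__main__":
    controller = BargeInController(min_speech_frames=2)
    controller.start_response([b"chunk-1", b"chunk-2", b"chunk-3"])
    print(controller.next_chunk())
    print(controller.on_user_frame(True), controller.on_user_frame(True))
    print(controller.next_chunk())
```

Requiring more than one speech frame trades a little reaction time for fewer false interruptions from coughs or background noise.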

Related Skills

Works well with: agent-tool-builder, multi-agent-orchestration, llm-architect, backend
