audio-transcribe

Transcribes audio to text with timestamps and optional speaker identification. Use when you need to convert speech to text, create subtitles, transcribe meetings, or process voice recordings.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install the "audio-transcribe" skill with this command:

npx skills add agntswrm/agent-media/agntswrm-agent-media-audio-transcribe

Audio Transcribe

Transcribes audio files to text with timestamps. Supports automatic language detection, speaker identification (diarization), and outputs structured JSON with segment-level timing.

Command

npx agent-media@latest audio transcribe --in <path> [options]

Inputs

Option      Required  Description
--in        Yes       Input audio file path or URL (supports mp3, wav, m4a, ogg)
--diarize   No        Enable speaker identification
--language  No        Language code (auto-detected if not provided)
--speakers  No        Number of speakers hint for diarization
--out       No        Output path, filename or directory (default: ./)
--provider  No        Provider to use (local, fal, replicate, runpod)
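
When the command is launched from a script, it can help to assemble the argument list from the options in the table rather than concatenating strings. The following is a minimal sketch in Python: the function name build_transcribe_args and its parameter names are hypothetical, and only the flags listed above are emitted.

```python
from typing import List, Optional

# Provider names taken from the --provider row of the options table.
PROVIDERS = {"local", "fal", "replicate", "runpod"}


def build_transcribe_args(
    audio: str,
    diarize: bool = False,
    language: Optional[str] = None,
    speakers: Optional[int] = None,
    out: Optional[str] = None,
    provider: Optional[str] = None,
) -> List[str]:
    """Return an argv list for `audio transcribe` (hypothetical helper)."""
    if provider is not None and provider not in PROVIDERS:
        raise ValueError(f"unknown provider: {provider}")
    args = ["npx", "agent-media@latest", "audio", "transcribe", "--in", audio]
    if diarize:
        args.append("--diarize")
    if language:
        args += ["--language", language]
    if speakers is not None:
        args += ["--speakers", str(speakers)]
    if out:
        args += ["--out", out]
    if provider:
        args += ["--provider", provider]
    return args


# Reproduces the "specific language and speaker count" invocation.
print(" ".join(build_transcribe_args("podcast.mp3", diarize=True, language="en", speakers=3)))
```

The list form can be passed directly to subprocess.run without shell quoting concerns.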

Output

Returns a JSON object with transcription data:

{
  "ok": true,
  "media_type": "audio",
  "action": "transcribe",
  "provider": "fal",
  "output_path": "transcription_123_abc.json",
  "transcription": {
    "text": "Full transcription text...",
    "language": "en",
    "segments": [
      { "start": 0.0, "end": 2.5, "text": "Hello.", "speaker": "SPEAKER_0" },
      { "start": 2.5, "end": 5.0, "text": "Hi there.", "speaker": "SPEAKER_1" }
    ]
  }
}
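
Because each segment carries start and end times in seconds, the result maps naturally onto subtitle formats. The sketch below converts the sample object above into SRT text. The helper names srt_timestamp and segments_to_srt are hypothetical, and the code assumes the "speaker" key may be absent when diarization is off.

```python
import json


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(result: dict) -> str:
    """Turn a transcription result into SRT cues, one per segment."""
    if not result.get("ok"):
        raise ValueError("transcription result is not ok")
    blocks = []
    for index, seg in enumerate(result["transcription"]["segments"], start=1):
        # Assumption: "speaker" is only present when diarization was enabled.
        speaker = seg.get("speaker")
        text = f"{speaker}: {seg['text']}" if speaker else seg["text"]
        timing = f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
        blocks.append(f"{index}\n{timing}\n{text}\n")
    return "\n".join(blocks)


sample = json.loads("""
{
  "ok": true,
  "transcription": {
    "text": "Hello. Hi there.",
    "language": "en",
    "segments": [
      { "start": 0.0, "end": 2.5, "text": "Hello.", "speaker": "SPEAKER_0" },
      { "start": 2.5, "end": 5.0, "text": "Hi there.", "speaker": "SPEAKER_1" }
    ]
  }
}
""")
print(segments_to_srt(sample))
```

In practice the input would be loaded from the file named in "output_path" rather than from an inline string.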

Examples

Basic transcription (auto-detect language):

npx agent-media@latest audio transcribe --in interview.mp3

Transcription with speaker identification:

npx agent-media@latest audio transcribe --in meeting.wav --diarize

Transcription with a specific language and speaker count:

npx agent-media@latest audio transcribe --in podcast.mp3 --diarize --language en --speakers 3

Use a specific provider:

npx agent-media@latest audio transcribe --in audio.wav --provider replicate

Extracting Audio from Video

To transcribe a video file, first extract the audio:

# Step 1: Extract audio from video
npx agent-media@latest audio extract --in video.mp4 --format mp3

# Step 2: Transcribe the extracted audio (replace extracted_xxx.mp3 with the file produced in step 1)
npx agent-media@latest audio transcribe --in extracted_xxx.mp3
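
The two steps can be chained in a script so the extracted filename never has to be copied by hand. This is a sketch under two loud assumptions that the source does not confirm: that the CLI prints its JSON result to stdout, and that the extract step reports "ok" and "output_path" in the same shape as the transcribe output. The names run_cli and transcribe_video are hypothetical.

```python
import json
import subprocess
from typing import Callable, List


def run_cli(args: List[str]) -> dict:
    """Run agent-media and parse stdout as JSON.

    Assumption: the result object is the only content on stdout.
    """
    proc = subprocess.run(
        ["npx", "agent-media@latest", *args],
        capture_output=True,
        text=True,
    )
    return json.loads(proc.stdout)


def transcribe_video(video: str, run: Callable[[List[str]], dict] = run_cli) -> dict:
    """Extract audio from a video, then transcribe the extracted file."""
    extracted = run(["audio", "extract", "--in", video, "--format", "mp3"])
    if not extracted.get("ok"):
        raise RuntimeError("audio extraction failed")
    # Assumption: the extract step names its file in "output_path".
    audio_path = extracted["output_path"]
    return run(["audio", "transcribe", "--in", audio_path])


# Usage (requires Node.js and network access, so it is left commented out):
# result = transcribe_video("video.mp4")
# print(result["transcription"]["text"])
```

The run parameter exists so the chaining logic can be exercised with a stub instead of the real CLI.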

Providers

local

Runs locally on CPU using Transformers.js; no API key required.

  • Uses Moonshine model (5x faster than Whisper)
  • Models downloaded on first use (~100MB)
  • Does NOT support diarization; use fal or replicate for speaker identification
  • You may see a "mutex lock failed" error; it can be ignored, because the output is correct as long as the result contains "ok": true

npx agent-media@latest audio transcribe --in audio.mp3 --provider local
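
In a wrapper script, the note about the mutex error suggests judging success by the "ok" field instead of by error text. A minimal sketch follows, assuming the JSON result arrives on stdout and the error message on stderr (neither is confirmed by the source); parse_result is a hypothetical name.

```python
import json


def parse_result(stdout: str, stderr: str = "") -> dict:
    """Return the parsed result if "ok" is true, ignoring stderr noise."""
    result = json.loads(stdout)
    if result.get("ok") is not True:
        # Only a result without "ok": true counts as a failure here.
        raise RuntimeError(f"transcription failed: {stderr.strip() or result}")
    return result


# A spurious error message does not matter when the result says ok.
ok = parse_result('{"ok": true, "provider": "local"}', "mutex lock failed")
print(ok["provider"])
```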

fal

  • Requires FAL_API_KEY
  • Uses wizper model for fast transcription (2x faster) when diarization is disabled
  • Uses whisper model when diarization is enabled (native support)

replicate

  • Requires REPLICATE_API_TOKEN
  • Uses whisper-diarization model with Whisper Large V3 Turbo
  • Native diarization support with word-level timestamps

runpod

  • Requires RUNPOD_API_KEY
  • Uses pruna/whisper-v3-large model (Whisper Large V3)
  • Does NOT support diarization (speaker identification); use fal or replicate for diarization

npx agent-media@latest audio transcribe --in audio.mp3 --provider runpod
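
The provider notes above can be summarized as a small lookup: which environment variable each provider needs, and whether it supports diarization. The sketch below picks a value for --provider from that table. The function pick_provider and its preference order (hosted providers first, local last) are illustrative assumptions, not behavior of the CLI, which may apply its own default.

```python
import os
from typing import Mapping, Optional

# Required environment variable and diarization support, per the provider notes.
PROVIDERS = {
    "local": {"env": None, "diarize": False},
    "fal": {"env": "FAL_API_KEY", "diarize": True},
    "replicate": {"env": "REPLICATE_API_TOKEN", "diarize": True},
    "runpod": {"env": "RUNPOD_API_KEY", "diarize": False},
}


def pick_provider(diarize: bool, env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the first usable provider name, or None if none qualifies."""
    # Hypothetical preference order; adjust to taste.
    for name in ("fal", "replicate", "runpod", "local"):
        info = PROVIDERS[name]
        if diarize and not info["diarize"]:
            continue  # provider cannot identify speakers
        if info["env"] and not env.get(info["env"]):
            continue  # required credential is missing
        return name
    return None


print(pick_provider(diarize=False, env={}))
print(pick_provider(diarize=True, env={"REPLICATE_API_TOKEN": "example"}))
```

With no credentials set, a request for diarization yields None, which a caller can turn into a clear message about the missing FAL_API_KEY or REPLICATE_API_TOKEN.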
