Transcribe Skill

Production-grade speech-to-text transcription with intelligent file handling, multiple output formats, and parallel processing.

When to Use

✅ USE this skill when:

Transcribing audio recordings to text
Creating subtitles for video content
Converting speech to searchable text
Needing word-level timestamps
Processing podcasts or meeting recordings
Transcribing interviews
Converting audio notes to text
Creating transcripts for video editing

❌ DON'T use this skill when:

Transcribing YouTube videos → Use youtube-transcript (faster, no API cost)
Real-time transcription → Use streaming tools
Already have captions → Use youtube-transcript
Need video-specific processing → Use ffmpeg-tools first

Prerequisites

1. Get Groq API key

Visit: https://console.groq.com/

Create an API key

2. Set environment variable

export GROQ_API_KEY="gsk_your_api_key_here"

3. Install FFmpeg (for audio processing)

brew install ffmpeg        # macOS
sudo apt install ffmpeg    # Ubuntu/Debian

4. Verify

node --version     # Should print the Node.js version
ffmpeg -version    # Should print the FFmpeg version

Commands

Basic Usage

Basic transcription (outputs plain text)

{baseDir}/transcribe.js audio.m4a

Transcribe with specific output format

{baseDir}/transcribe.js audio.mp3 --format srt --output subtitles.srt
{baseDir}/transcribe.js meeting.wav --format json --output result.json

Specify language for better accuracy

{baseDir}/transcribe.js spanish.mp3 --language es --format text
{baseDir}/transcribe.js audio.mp3 --language de --format vtt

Output Formats

Plain text (default)

{baseDir}/transcribe.js audio.mp3 --format text

The output is the transcript text, without timestamps.

JSON with detailed data

{baseDir}/transcribe.js audio.mp3 --format json

{
  "text": "Transcription text...",
  "duration": 123.45,
  "language": "en",
  "words": [{"word": "Transcription", "start": 0.0, "end": 0.5}, ...]
}

SRT subtitles

{baseDir}/transcribe.js audio.mp3 --format srt --output subtitles.srt

1
00:00:00,000 --> 00:00:05,500
Transcription of the audio begins here

2
00:00:05,500 --> 00:00:11,200
And continues in the next segment

VTT subtitles

{baseDir}/transcribe.js audio.mp3 --format vtt --output captions.vtt

WEBVTT

00:00.000 --> 00:05.500
Transcription of the audio begins here

Word timings TSV

{baseDir}/transcribe.js audio.mp3 --format tsv

start\tend\tword
0.000\t0.450\tTranscription
0.450\t0.820\tof
0.820\t1.240\tthe

Word timings CSV

{baseDir}/transcribe.js audio.mp3 --format csv

start,end,word
0.000,0.450,"Transcription"
0.450,0.820,"of"
0.820,1.240,"the"

Format Comparison:

| Format | Use Case | Word Timestamps | File Size |
|---|---|---|---|
| text | General use | ❌ | Small |
| json | API integration | ✅ | Large |
| srt | Subtitles | ⚠️ Phrases | Medium |
| vtt | Web captions | ⚠️ Phrases | Medium |
| tsv | Spreadsheet | ✅ | Medium |
| csv | Database import | ✅ | Medium |
| word_timings | Analysis | ✅ | Large |
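The per-word formats can be derived from the `words` array of the JSON output. The following is a minimal sketch, assuming the field names shown in the JSON example (`word`, `start`, `end`) and three-decimal timestamps. The helper names `toTsv` and `toCsv` are hypothetical and are not part of transcribe.js.

```javascript
// Hypothetical helpers, not part of transcribe.js.
// Input shape assumed from the JSON example: [{ word, start, end }, ...]
function toTsv(words) {
  const rows = words.map(
    (w) => `${w.start.toFixed(3)}\t${w.end.toFixed(3)}\t${w.word}`
  );
  return ["start\tend\tword", ...rows].join("\n");
}

function toCsv(words) {
  // Quote the word column and double any embedded quotes.
  const rows = words.map(
    (w) =>
      `${w.start.toFixed(3)},${w.end.toFixed(3)},"${w.word.replace(/"/g, '""')}"`
  );
  return ["start,end,word", ...rows].join("\n");
}

const sampleWords = [
  { word: "Transcription", start: 0.0, end: 0.45 },
  { word: "of", start: 0.45, end: 0.82 },
];
console.log(toTsv(sampleWords));
console.log(toCsv(sampleWords));
```

Quoting the word column in CSV keeps commas inside a token from breaking the row.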

Language Selection

Auto-detect (default)

{baseDir}/transcribe.js audio.mp3

Specify language for better accuracy

{baseDir}/transcribe.js audio.mp3 --language en # English
{baseDir}/transcribe.js audio.mp3 --language es # Spanish
{baseDir}/transcribe.js audio.mp3 --language fr # French
{baseDir}/transcribe.js audio.mp3 --language de # German
{baseDir}/transcribe.js audio.mp3 --language ja # Japanese

Supported Languages: All 99 languages supported by Whisper

Large File Processing

Files >25MB are automatically segmented

{baseDir}/transcribe.js long-recording.mp3

Progress shown for segmented files

⏳ Transcribing: Segment 3/12 (25.0%) | Elapsed: 45.2s

Output combined automatically

Cache Control

Use cache (default) - instant for previously transcribed

{baseDir}/transcribe.js audio.mp3

Force fresh transcription

{baseDir}/transcribe.js audio.mp3 --no-cache

API Provider Selection

Use Groq (default) - faster, cheaper

{baseDir}/transcribe.js audio.mp3 --provider groq

Use OpenAI Whisper (requires OPENAI_API_KEY)

{baseDir}/transcribe.js audio.mp3 --provider openai

Supported Audio Formats

| Format | Extension | Notes |
|---|---|---|
| MP3 | .mp3 | Best compatibility |
| MP4 | .mp4, .m4a | iOS recordings |
| WAV | .wav | Uncompressed, large files |
| OGG | .ogg, .oga, .ogv | Open format |
| FLAC | .flac | Lossless compression |
| WebM | .webm | Web audio/video |
| AAC | .aac | Apple format |
| WMA | .wma | Windows format |

Audio Preprocessing:

Unsupported formats are auto-converted to MP3
Sample rate normalized to 16kHz (Whisper optimal)
Mono channel for better accuracy
Bitrate: 192kbps MP3

Features

Automatic Segmentation

Large audio files are automatically split for processing:

Audio File >25MB
      ↓
FFmpeg Convert to MP3 (16kHz, mono)
      ↓
Split into 10-minute segments
      ↓
Transcribe segments in parallel
      ↓
Merge results with adjusted timestamps

Segmentation Benefits:

✓ Handles recordings up to 2 hours
✓ Respects API rate limits
✓ Parallel processing for speed
✓ Seamless results (timestamps adjusted)
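
The split step can be pictured as computing a list of start offsets and durations. The following is a minimal sketch, assuming fixed 10-minute segments with a shorter final segment. `planSegments` is a hypothetical helper, not a function exported by transcribe.js.

```javascript
// Hypothetical helper: plan fixed-length segments for a recording.
const SEGMENT_SECONDS = 600; // 10 minutes, as documented

function planSegments(durationSeconds, segmentSeconds = SEGMENT_SECONDS) {
  const segments = [];
  for (let start = 0; start < durationSeconds; start += segmentSeconds) {
    segments.push({
      index: segments.length,
      start, // offset into the original audio, in seconds
      duration: Math.min(segmentSeconds, durationSeconds - start),
    });
  }
  return segments;
}

// A 25-minute recording yields two full segments and one 5-minute remainder.
console.log(planSegments(1500));
```

Each planned segment could then be cut with FFmpeg's `-ss` (start) and `-t` (duration) options, the same flags used in the manual-split example under Common Errors.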

Word-Level Timestamps

Each word includes start and end timestamps:

{
  "words": [
    {"word": "Hello", "start": 0.000, "end": 0.320},
    {"word": "and", "start": 0.320, "end": 0.560},
    {"word": "welcome", "start": 0.560, "end": 0.980},
    {"word": "everyone", "start": 0.980, "end": 1.420}
  ]
}

Uses for Timestamps:

Jump to specific words in audio
Create perfectly synced subtitles
Search within transcripts
Edit audio at transcript points
Analyze speech patterns
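
As an illustration of the subtitle use case, word timestamps can be grouped into SRT cues. The following is a sketch, assuming cues of five words each (the Notes mention phrases of about five words). `srtTime` and `wordsToSrt` are hypothetical helpers, and the grouping transcribe.js really applies may be different.

```javascript
// Hypothetical helpers, not part of transcribe.js.
// Format seconds as an SRT timestamp: HH:MM:SS,mmm
function srtTime(seconds) {
  const ms = Math.round(seconds * 1000);
  const pad = (n, width) => String(n).padStart(width, "0");
  const h = Math.floor(ms / 3600000);
  const m = Math.floor((ms % 3600000) / 60000);
  const s = Math.floor((ms % 60000) / 1000);
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(ms % 1000, 3)}`;
}

// Group words into cues; each cue spans its first word's start
// to its last word's end.
function wordsToSrt(words, wordsPerCue = 5) {
  const cues = [];
  for (let i = 0; i < words.length; i += wordsPerCue) {
    const group = words.slice(i, i + wordsPerCue);
    const range = `${srtTime(group[0].start)} --> ${srtTime(group[group.length - 1].end)}`;
    const text = group.map((w) => w.word).join(" ");
    cues.push(`${cues.length + 1}\n${range}\n${text}`);
  }
  return cues.join("\n\n") + "\n";
}

const greeting = [
  { word: "Hello", start: 0.0, end: 0.32 },
  { word: "and", start: 0.32, end: 0.56 },
  { word: "welcome", start: 0.56, end: 0.98 },
  { word: "everyone", start: 0.98, end: 1.42 },
];
console.log(wordsToSrt(greeting, 2));
```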

Intelligent Caching

Cache Location: /tmp/transcribe-cache/
TTL: 24 hours
Cache Key: File hash + language + model + output format

First time: ~10-60 seconds

{baseDir}/transcribe.js audio.mp3 --format json

Second time: ~1 second (cache hit)

{baseDir}/transcribe.js audio.mp3 --format json

Force fresh: ~10-60 seconds

{baseDir}/transcribe.js audio.mp3 --format json --no-cache
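
The cache behaviour described above can be sketched as a content hash combined with the request options, plus a freshness check against the 24-hour TTL. This is an illustration under assumptions: SHA-256 as the hash, and the output format as part of the key (as the Notes indicate). `cacheKey` and `isFresh` are hypothetical names.

```javascript
const crypto = require("node:crypto");

const CACHE_TTL_MS = 24 * 60 * 60 * 1000; // 24 hours, as documented

// Hypothetical: derive a cache key from file contents plus options.
// SHA-256 is an assumption; the real hash function is not documented.
function cacheKey(fileBuffer, { language = "auto", model, format }) {
  return crypto
    .createHash("sha256")
    .update(fileBuffer)
    .update(`|${language}|${model}|${format}`)
    .digest("hex");
}

// Hypothetical: an entry is usable while it is younger than the TTL.
function isFresh(createdAtMs, nowMs = Date.now()) {
  return nowMs - createdAtMs < CACHE_TTL_MS;
}

const audio = Buffer.from("example audio bytes");
console.log(cacheKey(audio, { language: "en", model: "whisper-large-v3-turbo", format: "json" }));
```

Hashing the file contents rather than the path means a renamed copy of the same recording still hits the cache.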

Rate Limiting

Built-in protection against API limits:

Max 60 requests per minute
Automatic delays between requests
Sequential processing for safety
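
One simple way to stay under a per-minute limit is to enforce a minimum gap between requests. The following is a sketch, assuming fixed-interval pacing derived from the documented limit of 60 requests per minute. `delayBeforeNext` is a hypothetical helper, and the tool may use a different strategy.

```javascript
// Hypothetical pacing helper, not part of transcribe.js.
const MAX_REQUESTS_PER_MINUTE = 60;
const MIN_INTERVAL_MS = 60000 / MAX_REQUESTS_PER_MINUTE; // 1000 ms between requests

// How long to wait before sending the next request, given when the
// previous one was sent. Pass null if no request has been sent yet.
function delayBeforeNext(lastRequestMs, nowMs) {
  if (lastRequestMs === null) return 0; // first request goes out immediately
  return Math.max(0, lastRequestMs + MIN_INTERVAL_MS - nowMs);
}

console.log(delayBeforeNext(null, 5000)); // 0
console.log(delayBeforeNext(5000, 5300)); // 700
console.log(delayBeforeNext(5000, 7000)); // 0
```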

Cost Optimization:

Groq Whisper Turbo: Free tier available
Cached results cost nothing
Segmented files use 1 request per segment

Error Handling

Error Codes

| Code | Name | Description |
|---|---|---|
| 0 | SUCCESS | Transcription complete |
| 1 | INVALID_INPUT | Bad parameters |
| 2 | FILE_NOT_FOUND | Audio file missing |
| 3 | FILE_TOO_LARGE | Exceeds 2 hours |
| 4 | UNSUPPORTED_FORMAT | Can't process format |
| 5 | API_KEY_MISSING | GROQ_API_KEY not set |
| 6 | API_ERROR | Request failed |
| 7 | RATE_LIMITED | API throttling |
| 8 | NETWORK_ERROR | Connection issue |
| 9 | TIMEOUT | Request took too long |
| 10 | AUDIO_PROCESSING_ERROR | FFmpeg failed |
| 11 | SEGMENTATION_ERROR | Splitting failed |
| 12 | INTERRUPTED | User cancelled |
| 99 | UNKNOWN | Unexpected error |
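
A calling script can translate the process exit status into one of the documented names. The following is a small sketch; the lookup table mirrors the error codes listed above, while `exitCodeName` is a hypothetical helper.

```javascript
// Exit codes as documented in the Error Codes table.
const EXIT_CODES = Object.freeze({
  0: "SUCCESS",
  1: "INVALID_INPUT",
  2: "FILE_NOT_FOUND",
  3: "FILE_TOO_LARGE",
  4: "UNSUPPORTED_FORMAT",
  5: "API_KEY_MISSING",
  6: "API_ERROR",
  7: "RATE_LIMITED",
  8: "NETWORK_ERROR",
  9: "TIMEOUT",
  10: "AUDIO_PROCESSING_ERROR",
  11: "SEGMENTATION_ERROR",
  12: "INTERRUPTED",
  99: "UNKNOWN",
});

// Hypothetical helper: map an exit status to its documented name,
// falling back to UNKNOWN for anything not in the table.
function exitCodeName(status) {
  return EXIT_CODES[status] ?? "UNKNOWN";
}

console.log(exitCodeName(7));
console.log(exitCodeName(42));
```

In Node.js, the status would typically come from the `status` field of the result returned by `child_process.spawnSync`.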

Common Errors

"API key not found"

Solution: Set the environment variable

export GROQ_API_KEY="gsk_your_key"
echo "export GROQ_API_KEY=gsk_your_key" >> ~/.zshrc # Persist

"File too large"

Video duration exceeds 2 hours

Solution: Split manually first

ffmpeg -i long.mp4 -ss 0 -t 7200 first.mp4
ffmpeg -i long.mp4 -ss 7200 -t 7200 second.mp4

"Rate limited"

Too many requests

Solution: Wait 1 minute, try again

Or add delay between batch operations

Technical Details

Processing Pipeline

1. Validate Input
   ├── Check file exists
   ├── Check format supported
   ├── Probe audio metadata
   └── Validate size/duration
2. Check Cache
   └── Return cached if available
3. Preprocess (if needed)
   ├── Convert to MP3
   ├── Set sample rate to 16kHz
   └── Normalize to mono
4. Split (if >25MB)
   └── Create 10-minute segments
5. Transcribe
   ├── Rate-limited requests
   ├── Word-level timestamps
   └── Progress tracking
6. Merge (if segmented)
   └── Adjust timestamps
7. Format Output
   └── Apply selected format
8. Cache Result
   └── Store for 24 hours

API Configuration

Groq (Default):

Endpoint: api.groq.com/openai/v1/audio/transcriptions
Model: whisper-large-v3-turbo
Max file size: 25MB per request
Word-level timestamps: Yes
Cost: Free tier available; otherwise $0.0013/minute

OpenAI (Optional):

Endpoint: api.openai.com/v1/audio/transcriptions
Model: whisper-1
Max file size: 25MB per request
Word-level timestamps: Yes
Cost: $0.006/minute

Timestamp Adjustment

For segmented files, timestamps are adjusted:

Segment 1: [0:00 - 10:00] → [0:00 - 10:00]
Segment 2: [0:00 - 10:00] → [10:00 - 20:00]
Segment 3: [0:00 - 10:00] → [20:00 - 30:00]

Example:

Segment 2 word: "discussion", start: 5:30
Adjusted timestamp: 5:30 + 10:00 = 15:30
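
The adjustment amounts to adding each segment's offset to every word it contains. The following is a minimal sketch, assuming fixed 10-minute (600-second) segments and word objects with `start` and `end` in seconds. `adjustWords` and `mergeSegments` are hypothetical names.

```javascript
// Hypothetical helpers, not part of transcribe.js.
const SEGMENT_LENGTH_SECONDS = 600; // 10-minute segments, as documented

// Shift one segment's word timestamps by that segment's offset.
function adjustWords(segmentWords, segmentIndex, segmentSeconds = SEGMENT_LENGTH_SECONDS) {
  const offset = segmentIndex * segmentSeconds;
  return segmentWords.map((w) => ({
    ...w,
    start: w.start + offset,
    end: w.end + offset,
  }));
}

// Merge per-segment word lists (in segment order) into one timeline.
function mergeSegments(segments) {
  return segments.flatMap((words, index) => adjustWords(words, index));
}

// "discussion" starts 5:30 (330 s) into segment 2, which has index 1.
const merged = mergeSegments([
  [{ word: "welcome", start: 12, end: 12.5 }],
  [{ word: "discussion", start: 330, end: 330.6 }],
]);
console.log(merged[1].start); // 930 seconds, i.e. 15:30
```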

Examples

Transcribe Meeting Recording

#!/bin/bash
MEETING="meeting-$(date +%Y%m%d).mp3"

echo "Transcribing meeting..."
{baseDir}/transcribe.js "$MEETING" --format text --output "$MEETING.txt"
{baseDir}/transcribe.js "$MEETING" --format srt --output "$MEETING.srt"
{baseDir}/transcribe.js "$MEETING" --format json --output "$MEETING.json"

echo "Done: $MEETING.{txt,srt,json}"

Batch Transcribe Directory

#!/bin/bash
mkdir -p transcripts

for audio in *.mp3 *.m4a *.wav; do
  [ -f "$audio" ] || continue

  echo "Processing: $audio"
  base="${audio%.*}"

  {baseDir}/transcribe.js "$audio" --format srt --output "transcripts/${base}.srt" 2>/dev/null

  if [ $? -eq 0 ]; then
    echo " ✓ Created transcripts/${base}.srt"
  else
    echo " ✗ Failed"
  fi

  sleep 1 # Rate limit protection
done

Create Searchable Meeting Archive

#!/bin/bash
INPUT="meeting.mp3"

# Transcribe with word timings
{baseDir}/transcribe.js "$INPUT" --format json --output meeting.json

# Extract every word with its start time (truncated to two decimals)
jq -r '
  .words[] |
  "\(.start | tostring | split(".") | .[0] + "." + .[1][:2])\t\(.word)"
' meeting.json > meeting-by-words.txt

# Create time-indexed file
echo "Meeting transcript indexed by time" > index.txt
while IFS=$'\t' read -r time word; do
  echo "$time: $word" >> index.txt
done < meeting-by-words.txt

echo "Archive created: index.txt"

Subtitle Synchronization

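#!/bin/bash
VIDEO="video.mp4"
AUDIO="video.m4a" # Extracted audio

# Get word-level transcription
{baseDir}/transcribe.js "$AUDIO" --format json --output transcription.json

# Create SRT with one cue per group of up to 7 words
jq -r '
  def pad2: ("0" + tostring)[-2:];
  def pad3: ("00" + tostring)[-3:];
  # Format seconds as HH:MM:SS,mmm
  def format_srt_time:
    (. * 1000 | round) as $ms
    | "\($ms / 3600000 | floor | pad2):\($ms % 3600000 / 60000 | floor | pad2):\($ms % 60000 / 1000 | floor | pad2),\($ms % 1000 | pad3)";
  .words as $w
  | [range(0; $w | length; 7) as $i | $w[$i:$i + 7]]
  | to_entries[]
  | "\(.key + 1)",
    "\(.value[0].start | format_srt_time) --> \(.value[-1].end | format_srt_time)",
    (.value | map(.word) | join(" ")),
    ""
' transcription.json > subtitles.srt

echo "SRT subtitles created: subtitles.srt"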

Extract Keywords with Timestamps

#!/bin/bash
AUDIO="recording.mp3"
KEYWORDS=("budget" "timeline" "decision")

# Transcribe
{baseDir}/transcribe.js "$AUDIO" --format json --output data.json

# Find keywords with timestamps (${kw,,} lowercases the keyword; requires bash 4+)
echo "Keyword timestamps:"
for kw in "${KEYWORDS[@]}"; do
  jq -r --arg kw "${kw,,}" '.words[] | select(.word | ascii_downcase | contains($kw)) | "\(.word) at \(.start)s"' data.json
done

Performance Tips

Use Cache

First time (slow)

{baseDir}/transcribe.js audio.mp3

Second time (fast)

{baseDir}/transcribe.js audio.mp3

Same file, different format - different cache

{baseDir}/transcribe.js audio.mp3 --format srt # New cache entry

Specify Language

Auto-detect (slower first pass)

{baseDir}/transcribe.js spanish.mp3

Specify language (faster, more accurate)

{baseDir}/transcribe.js spanish.mp3 --language es

Pre-extract Audio

Slower: video with embedded audio

{baseDir}/transcribe.js video.mp4

Faster: pre-extracted audio

ffmpeg -i video.mp4 -vn -c:a libmp3lame -b:a 192k audio.mp3
{baseDir}/transcribe.js audio.mp3

Batch Processing

Process multiple files

for f in *.mp3; do
  {baseDir}/transcribe.js "$f" &
done
wait

Parallel Segments

Large files process segments in parallel

30-minute file with 3 segments

Elapsed time: ~60 seconds (3x faster than sequential)

Notes

Maximum file duration: 2 hours
Maximum file size for direct upload: 25MB
Caching includes format in key (different formats = different caches)
API rate limits: 60 requests/minute
Segment size: 10 minutes (configurable in code)
Output format affects cache (srt and json cached separately)
Word timestamps provide ~50ms precision
SRT/VTT formats group words into phrases (~5 words)
TSV/CSV provide per-word timestamps
JSON includes all metadata and word-level data
Audio preprocessing preserves quality while optimizing for Whisper
FFmpeg required for format conversion and segmentation
Network errors retry up to 3 times with exponential backoff
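
The retry behaviour in the last note can be sketched as follows. This assumes a 1-second base delay that doubles on each attempt; the real base delay is not documented. `backoffDelays` and `withRetry` are hypothetical names.

```javascript
// Hypothetical helpers, not part of transcribe.js.
// Delays for exponential backoff: base, 2x base, 4x base, ...
function backoffDelays(maxRetries = 3, baseMs = 1000) {
  return Array.from({ length: maxRetries }, (_, attempt) => baseMs * 2 ** attempt);
}

// Run an async operation, retrying up to maxRetries times after failures.
async function withRetry(operation, maxRetries = 3, baseMs = 1000) {
  const delays = backoffDelays(maxRetries, baseMs);
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (err) {
      if (attempt >= maxRetries) throw err; // retries exhausted
      await new Promise((resolve) => setTimeout(resolve, delays[attempt]));
    }
  }
}

console.log(backoffDelays()); // [ 1000, 2000, 4000 ]
```

With three retries the operation runs at most four times: the initial attempt plus one attempt after each delay.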
