Video Clipping Expert Knowledge
Cross-Platform Notes
All tools (ffmpeg, ffprobe, yt-dlp, whisper) use identical CLI flags on Windows, macOS, and Linux. The differences are only in shell syntax:
Feature                macOS / Linux       Windows (cmd.exe)
Suppress stderr        2>/dev/null         2>NUL
Filter output          | grep pattern      | findstr pattern
Delete files           rm file1 file2      del file1 file2
Null output device     -f null -           -f null - (same)
ffmpeg subtitle paths  subtitles=clip.srt  subtitles=clip.srt (relative OK, absolute needs C\:/path)
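IMPORTANT: ffmpeg filter paths (-vf "subtitles=...") always need forward slashes. On Windows with absolute paths, also escape the drive-letter colon. Unquoted, that takes a double backslash: subtitles=C\\:/Users/me/clip.srt. Wrapped in single quotes, a single backslash is enough: subtitles='C\:/Users/me/clip.srt'
Prefer the file_write tool for creating SRT/text files over shell echo/heredoc, whose quoting rules differ between shells.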
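The path rule above can be applied programmatically. The following is a minimal sketch, assuming the single-quoted form of the filter argument; the helper name subtitles_filter_arg is hypothetical, and paths that themselves contain single quotes are not handled.

```python
def subtitles_filter_arg(path: str) -> str:
    """Build a subtitles= filter argument from a Windows or POSIX path.

    Hypothetical helper: assumes the path contains no single quotes.
    """
    # ffmpeg filter arguments expect forward slashes on every platform.
    forward = path.replace("\\", "/")
    # Escape the drive-letter colon so it is not read as an option separator.
    escaped = forward.replace(":", "\\:")
    # Single quotes keep the backslash intact through filtergraph parsing.
    return f"subtitles='{escaped}'"


if __name__ == "__main__":
    print(subtitles_filter_arg(r"C:\Users\me\clip.srt"))
    print(subtitles_filter_arg("clip.srt"))
```

The result is meant to be passed as the value of -vf, for example inside double quotes on the command line.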
yt-dlp Reference
Download with Format Selection
Best video up to 1080p + best audio, merged
yt-dlp -f "bv[height<=1080]+ba/b[height<=1080]" --restrict-filenames -o "source.%(ext)s" "URL"
720p max (smaller, faster)
yt-dlp -f "bv[height<=720]+ba/b[height<=720]" --restrict-filenames -o "source.%(ext)s" "URL"
Audio only (for transcription-only workflows)
yt-dlp -x --audio-format wav --restrict-filenames -o "audio.%(ext)s" "URL"
Metadata Inspection
Get full metadata as JSON (duration, title, chapters, available subs)
yt-dlp --dump-json "URL"
Key fields: duration, title, description, chapters, subtitles, automatic_captions
YouTube Auto-Subtitles
Download auto-generated subtitles in json3 format (word-level timing)
yt-dlp --write-auto-subs --sub-lang en --sub-format json3 --skip-download --restrict-filenames -o "source" "URL"
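Download manual subtitles if available (YouTube does not normally serve SRT itself, so convert after download; conversion needs ffmpeg)
yt-dlp --write-subs --sub-lang en --sub-format "srt/best" --convert-subs srt --skip-download --restrict-filenames -o "source" "URL"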
List available subtitle languages
yt-dlp --list-subs "URL"
Useful Flags
- --restrict-filenames: safe ASCII filenames (no spaces/special chars), important on all platforms
- --no-playlist: download single video even if URL is in a playlist
- -o "template.%(ext)s": output template (%(ext)s auto-detects format)
- --cookies-from-browser chrome: use browser cookies for age-restricted content
- --extract-audio / -x: extract audio only
- --audio-format wav: convert audio to wav (for whisper)
Whisper Transcription Reference
Audio Extraction for Whisper
Extract mono 16kHz WAV (whisper's preferred input format)
ffmpeg -i source.mp4 -vn -ar 16000 -ac 1 -y audio.wav
Basic Transcription
Standard transcription with word-level timestamps (the openai-whisper CLI expects the capitalized literals True/False)
whisper audio.wav --model small --output_format json --word_timestamps True --language en
Faster alternative (same flags, up to about 4x faster)
whisper-ctranslate2 audio.wav --model small --output_format json --word_timestamps True --language en
Model Sizes
Model     VRAM   Speed    Quality  Use When
tiny      ~1GB   Fastest  Rough    Quick previews, testing pipeline
base      ~1GB   Fast     OK       Short clips, clear speech
small     ~2GB   Good     Good     Default: best balance
medium    ~5GB   Slow     Better   Important content, accented speech
large-v3  ~10GB  Slowest  Best     Final production, multiple languages
Note: On macOS Apple Silicon, consider mlx-whisper as a faster native alternative.
JSON Output Structure
{
  "text": "full transcript text...",
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 4.52,
      "text": " Hello everyone, welcome back.",
      "words": [
        {"word": " Hello", "start": 0.0, "end": 0.32, "probability": 0.95},
        {"word": " everyone,", "start": 0.32, "end": 0.78, "probability": 0.91},
        {"word": " welcome", "start": 0.78, "end": 1.14, "probability": 0.98},
        {"word": " back.", "start": 1.14, "end": 1.52, "probability": 0.97}
      ]
    }
  ]
}
- segments[].words[] gives word-level timing when --word_timestamps True is set
- probability indicates confidence (< 0.5 = likely wrong)
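As an illustration of how the probability field might be used, here is a small sketch that walks the structure shown above and collects words below a confidence threshold. The function name low_confidence_words and the review workflow are assumptions, not part of Whisper itself.

```python
def low_confidence_words(transcript: dict, threshold: float = 0.5) -> list[dict]:
    """Return words whose probability is below threshold (hypothetical helper)."""
    flagged = []
    for segment in transcript.get("segments", []):
        # "words" is only present when word timestamps were requested.
        for word in segment.get("words", []):
            if word.get("probability", 1.0) < threshold:
                flagged.append(
                    {
                        "word": word["word"].strip(),  # Whisper words carry a leading space
                        "start": word["start"],
                        "end": word["end"],
                        "probability": word["probability"],
                    }
                )
    return flagged


if __name__ == "__main__":
    sample = {
        "segments": [
            {
                "words": [
                    {"word": " Hello", "start": 0.0, "end": 0.32, "probability": 0.95},
                    {"word": " evryone", "start": 0.32, "end": 0.78, "probability": 0.31},
                ]
            }
        ]
    }
    print(low_confidence_words(sample))
```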
YouTube json3 Subtitle Parsing
Format Structure
{
  "events": [
    {
      "tStartMs": 1230,
      "dDurationMs": 5000,
      "segs": [
        {"utf8": "hello ", "tOffsetMs": 0},
        {"utf8": "world ", "tOffsetMs": 200},
        {"utf8": "how ", "tOffsetMs": 450},
        {"utf8": "are you", "tOffsetMs": 700}
      ]
    }
  ]
}
Extracting Word Timing
For each event and each segment within it:
- word_start_ms = event.tStartMs + seg.tOffsetMs
- word_start_secs = word_start_ms / 1000.0
- word_text = seg.utf8.trim()
Events without segs are line breaks or formatting — skip them. Events with segs containing only "\n" are newlines — skip them.
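The extraction rules above can be sketched in Python as follows. The function name json3_words is hypothetical, and treating a missing tOffsetMs as 0 is an assumption about the format rather than something the text above states.

```python
def json3_words(data: dict) -> list[tuple[float, str]]:
    """Return (start_seconds, text) pairs from a parsed json3 subtitle file."""
    words = []
    for event in data.get("events", []):
        segs = event.get("segs")
        if not segs:
            continue  # formatting-only event, nothing to read
        base_ms = event.get("tStartMs", 0)
        for seg in segs:
            text = seg.get("utf8", "").strip()
            if not text:
                continue  # newline-only or empty segment
            # Assumption: a segment without tOffsetMs starts with the event.
            start_ms = base_ms + seg.get("tOffsetMs", 0)
            words.append((start_ms / 1000.0, text))
    return words


if __name__ == "__main__":
    import json

    raw = '{"events": [{"tStartMs": 1230, "segs": [{"utf8": "hello ", "tOffsetMs": 0}]}]}'
    print(json3_words(json.loads(raw)))
```

json3 gives only start offsets per segment, so an end time for each word has to be inferred, for example from the next word's start.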
SRT Generation from Transcript
SRT Format
1
00:00:00,000 --> 00:00:02,500
First line of caption text

2
00:00:02,500 --> 00:00:05,100
Second line of caption text
Rules for Building Good SRT
- Group words into cues of ~8-12 words (2-3 seconds per cue)
- Break at natural pause points (periods, commas, clause boundaries)
- Keep each displayed line under 42 characters for readability on mobile; wrap a longer cue onto two lines
- Adjust timestamps relative to clip start (subtract clip start time from all timestamps)
- Timestamp format: HH:MM:SS,mmm (comma separator, not dot)
- Each entry: index line, timestamp line, text line(s), blank line
- Use file_write tool to create the SRT file; it works identically on all platforms
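The rules above can be sketched as a small generator. This is a simplified illustration, not a reference implementation: build_srt and srt_timestamp are hypothetical names, words are assumed to be (start, end, text) tuples in source-video seconds, and a new cue is started at sentence-ending punctuation or when the word or character limit would be exceeded.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm (comma before milliseconds)."""
    ms = max(0, round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def build_srt(words, clip_start, max_words=10, max_chars=42):
    """Group (start, end, text) words into cues and render SRT text."""
    cues, current = [], []
    for word in words:
        candidate = " ".join(w[2] for w in current + [word])
        # Start a new cue when adding this word would break a limit.
        if current and (len(current) >= max_words or len(candidate) > max_chars):
            cues.append(current)
            current = []
        current.append(word)
        # Sentence-ending punctuation is treated as a natural break.
        if word[2].endswith((".", "?", "!")):
            cues.append(current)
            current = []
    if current:
        cues.append(current)

    blocks = []
    for index, cue in enumerate(cues, start=1):
        # Shift timestamps so the clip starts at 00:00:00,000.
        start = srt_timestamp(cue[0][0] - clip_start)
        end = srt_timestamp(cue[-1][1] - clip_start)
        text = " ".join(w[2] for w in cue)
        blocks.append(f"{index}\n{start} --> {end}\n{text}\n")
    return "\n".join(blocks)


if __name__ == "__main__":
    demo = [(90.0, 90.32, "Hello"), (90.32, 90.78, "everyone,"),
            (90.78, 91.14, "welcome"), (91.14, 91.52, "back.")]
    print(build_srt(demo, clip_start=90.0))
```

The returned string can be written to disk with whichever file-writing tool is in use.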
Styled Captions with ASS Format
For animated/styled captions, use ASS subtitle format instead of SRT:
ffmpeg -i clip.mp4 -vf "subtitles=clip.ass:force_style='FontSize=22,FontName=Arial,Bold=1,PrimaryColour=&H00FFFFFF,OutlineColour=&H00000000,Outline=2,Shadow=1,Alignment=2,MarginV=40'" -c:a copy output.mp4
Key ASS style properties:
- PrimaryColour=&H00FFFFFF: white text (AABBGGRR format)
- OutlineColour=&H00000000: black outline
- Outline=2: outline thickness
- Alignment=2: bottom center
- MarginV=40: margin from bottom edge
- FontSize=22: good size for 1080x1920 vertical
FFmpeg Video Processing
Scene Detection
ffmpeg -i input.mp4 -filter:v "select='gt(scene,0.3)',showinfo" -f null - 2>&1
- Threshold 0.1 = very sensitive, 0.5 = only major cuts
- Parse the pts_time: field from the showinfo output to get timestamps
- On macOS/Linux pipe through grep showinfo, on Windows pipe through findstr showinfo
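As a platform-neutral alternative to grep or findstr, the captured output can be parsed directly. The sketch below assumes the ffmpeg output has already been captured as text; scene_times is a hypothetical name and the sample lines are illustrative, not real ffmpeg output.

```python
import re

# showinfo prints one line per selected frame, including "pts_time:<seconds>".
PTS_TIME = re.compile(r"pts_time:\s*(-?\d+(?:\.\d+)?)")


def scene_times(log: str) -> list[float]:
    """Extract scene-change timestamps (seconds) from captured ffmpeg output."""
    times = []
    for line in log.splitlines():
        if "showinfo" not in line:
            continue  # skip progress lines and other filter output
        match = PTS_TIME.search(line)
        if match:
            times.append(float(match.group(1)))
    return times


if __name__ == "__main__":
    sample = (
        "[Parsed_showinfo_1 @ 0x55d] n:   0 pts:  61440 pts_time:4.8 pos: 1234\n"
        "[Parsed_showinfo_1 @ 0x55d] n:   1 pts: 153600 pts_time:12.0125 pos: 5678\n"
    )
    print(scene_times(sample))
```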
Silence Detection
ffmpeg -i input.mp4 -af "silencedetect=noise=-30dB:d=1.5" -f null - 2>&1
- noise=-30dB = audio quieter than -30 dB counts as silence
- d=1.5 = minimum 1.5 seconds of silence
- Look for silence_start and silence_end in output
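The silence_start and silence_end markers can be paired into intervals once the output is captured. This is a minimal sketch under the assumption that each silence_end follows its silence_start in the log; silence_intervals is a hypothetical name and the sample lines are illustrative.

```python
import re

START = re.compile(r"silence_start:\s*(-?\d+(?:\.\d+)?)")
END = re.compile(r"silence_end:\s*(-?\d+(?:\.\d+)?)")


def silence_intervals(log: str) -> list[tuple[float, float]]:
    """Pair silence_start/silence_end markers into (start, end) tuples."""
    intervals = []
    start = None
    for line in log.splitlines():
        found = START.search(line)
        if found:
            start = float(found.group(1))
            continue
        found = END.search(line)
        if found and start is not None:
            intervals.append((start, float(found.group(1))))
            start = None  # wait for the next silence_start
    # A trailing silence_start without an end is dropped.
    return intervals


if __name__ == "__main__":
    sample = (
        "[silencedetect @ 0x1] silence_start: 12.5\n"
        "[silencedetect @ 0x1] silence_end: 15.25 | silence_duration: 2.75\n"
    )
    print(silence_intervals(sample))
```

The gaps between these intervals are candidate speech regions for clip boundaries.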
Clip Extraction
Re-encoded (accurate cuts)
ffmpeg -ss 00:01:30 -to 00:02:15 -i input.mp4 -c:v libx264 -c:a aac -preset fast -crf 23 -movflags +faststart -y clip.mp4
Lossless copy (fast but may have keyframe alignment issues)
ffmpeg -ss 00:01:30 -to 00:02:15 -i input.mp4 -c copy -y clip.mp4
- -ss before -i = fast seek (recommended for extraction)
- -to = end timestamp, -t = duration
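Since -to takes an end timestamp and -t takes a duration, converting between the two is simple arithmetic. The helper names below are hypothetical; the sketch accepts SS, MM:SS, or HH:MM:SS strings.

```python
def hms_to_seconds(timestamp: str) -> float:
    """Convert 'HH:MM:SS', 'MM:SS', or 'SS' (fractions allowed) to seconds."""
    seconds = 0.0
    for part in timestamp.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def duration_for_t(start: str, end: str) -> float:
    """Duration to pass to -t for a clip running from start to end."""
    duration = hms_to_seconds(end) - hms_to_seconds(start)
    if duration <= 0:
        raise ValueError("end must be after start")
    return duration


if __name__ == "__main__":
    # The clip from the commands above: 00:01:30 to 00:02:15.
    print(duration_for_t("00:01:30", "00:02:15"))
```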
Vertical Video (9:16 for Shorts/Reels/TikTok)
Center crop (when source is 16:9)
ffmpeg -i input.mp4 -vf "crop=ih*9/16:ih:(iw-ih*9/16)/2:0,scale=1080:1920" -c:a copy output.mp4
Scale with letterbox padding (preserves full frame)
ffmpeg -i input.mp4 -vf "scale=1080:1920:force_original_aspect_ratio=decrease,pad=1080:1920:(ow-iw)/2:(oh-ih)/2:black" -c:a copy output.mp4
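To make the center-crop arithmetic concrete: the crop width is the source height times 9/16, and the x offset centers that window. The sketch below computes the values for a given resolution; center_crop_9x16 is a hypothetical name, and rounding down to an even width is an assumption (ffmpeg applies its own rounding, so results may differ by a pixel).

```python
def center_crop_9x16(width: int, height: int) -> tuple[int, int, int, int]:
    """Return (crop_w, crop_h, x, y) for a centered 9:16 crop."""
    crop_w = int(height * 9 / 16)
    crop_w -= crop_w % 2  # assumption: keep the width even for 4:2:0 video
    if crop_w > width:
        raise ValueError("source is narrower than 9:16; pad instead of cropping")
    x = (width - crop_w) // 2
    return crop_w, height, x, 0


if __name__ == "__main__":
    # A 1920x1080 source keeps a 606-pixel-wide column from the middle.
    print(center_crop_9x16(1920, 1080))
```

When the source is already narrower than 9:16, the letterbox command is the appropriate choice instead.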
Caption Burn-in
SRT subtitles with styling (use relative path or forward-slash absolute path)
ffmpeg -i input.mp4 -vf "subtitles=subs.srt:force_style='FontSize=22,FontName=Arial,PrimaryColour=&H00FFFFFF,OutlineColour=&H00000000,Outline=2,Alignment=2,MarginV=40'" -c:a copy output.mp4
Simple text overlay
ffmpeg -i input.mp4 -vf "drawtext=text='Caption':fontsize=48:fontcolor=white:borderw=3:bordercolor=black:x=(w-text_w)/2:y=h-th-40" output.mp4
Windows path escaping: subtitles=C\\:/Users/me/subs.srt (double backslash before the colon), or quote the path and use a single backslash: subtitles='C\:/Users/me/subs.srt'
Thumbnail Generation
At specific time (2 seconds in)
ffmpeg -i input.mp4 -ss 2 -frames:v 1 -q:v 2 -y thumb.jpg
First keyframe (I-frame) in the file
ffmpeg -i input.mp4 -vf "select='eq(pict_type,I)',scale=1280:720" -frames:v 1 thumb.jpg
Contact sheet (one frame every 10 seconds, 4x4 grid, first sheet only)
ffmpeg -i input.mp4 -vf "fps=1/10,scale=320:-1,tile=4x4" -frames:v 1 -y contact.jpg
Video Analysis
Full metadata (JSON)
ffprobe -v quiet -print_format json -show_format -show_streams input.mp4
Duration only
ffprobe -v error -show_entries format=duration -of csv=p=0 input.mp4
Resolution
ffprobe -v error -select_streams v:0 -show_entries stream=width,height -of csv=p=0 input.mp4
API-Based STT Reference
Groq Whisper API
Fastest cloud STT — uses whisper-large-v3 on Groq hardware. Free tier available.
curl -s -X POST "https://api.groq.com/openai/v1/audio/transcriptions" \
  -H "Authorization: Bearer $GROQ_API_KEY" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@audio.wav" \
  -F "model=whisper-large-v3" \
  -F "response_format=verbose_json" \
  -F "timestamp_granularities[]=word" \
  -o transcript_raw.json
Response: {"text": "...", "words": [{"word": "hello", "start": 0.0, "end": 0.32}]}
- Max file size: 25MB. For longer audio, split with ffmpeg first.
- timestamp_granularities[]=word is required for word-level timing.
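For the 16 kHz mono WAV produced earlier, the 25MB limit translates into a maximum chunk length that can be computed up front. The sketch below assumes 16-bit PCM with a 44-byte header and reads 25MB as 25,000,000 bytes with a 5% safety margin; wav_chunk_plan is a hypothetical name.

```python
import math


def wav_chunk_plan(duration_s, limit_bytes=25_000_000, sample_rate=16000,
                   channels=1, bytes_per_sample=2, header_bytes=44, safety=0.95):
    """Return (start, end) second ranges so each WAV chunk stays under the limit."""
    bytes_per_second = sample_rate * channels * bytes_per_sample  # 32,000 B/s here
    max_chunk_s = math.floor((limit_bytes * safety - header_bytes) / bytes_per_second)
    count = math.ceil(duration_s / max_chunk_s)
    return [
        (i * max_chunk_s, min((i + 1) * max_chunk_s, duration_s))
        for i in range(count)
    ]


if __name__ == "__main__":
    # A 30-minute recording splits into three chunks of at most 742 seconds.
    for start, end in wav_chunk_plan(1800):
        print(f"ffmpeg -ss {start} -to {end} -i audio.wav -c copy -y chunk_{start}.wav")
```

Word timestamps returned for a later chunk are relative to that chunk, so the chunk start has to be added back when merging transcripts. Fixed boundaries can also fall mid-word; cutting at detected silences avoids that.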
OpenAI Whisper API
curl -s -X POST "https://api.openai.com/v1/audio/transcriptions" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@audio.wav" \
  -F "model=whisper-1" \
  -F "response_format=verbose_json" \
  -F "timestamp_granularities[]=word" \
  -o transcript_raw.json
The response format is the same as Groq's. Max file size: 25MB.
Deepgram Nova-2
curl -s -X POST "https://api.deepgram.com/v1/listen?model=nova-2&smart_format=true&utterances=true&punctuate=true" \
  -H "Authorization: Token $DEEPGRAM_API_KEY" \
  -H "Content-Type: audio/wav" \
  --data-binary @audio.wav \
  -o transcript_raw.json
Response: {"results": {"channels": [{"alternatives": [{"words": [{"word": "hello", "start": 0.0, "end": 0.32, "confidence": 0.99}]}]}]}}
- Supports streaming, but for clips use batch mode.
- smart_format=true adds punctuation and casing.
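Because the Groq/OpenAI and Deepgram responses nest their word lists differently, downstream steps are simpler if both are flattened to one shape first. The sketch below relies only on the response shapes shown above; normalize_words is a hypothetical name and only the first channel and first alternative are read.

```python
def normalize_words(response: dict) -> list[dict]:
    """Flatten a Groq/OpenAI or Deepgram response to word/start/end/confidence dicts."""
    if "results" in response:
        # Deepgram shape: results.channels[0].alternatives[0].words
        alternative = response["results"]["channels"][0]["alternatives"][0]
        source = alternative.get("words", [])
    else:
        # Groq / OpenAI verbose_json shape: top-level "words"
        source = response.get("words", [])
    return [
        {
            "word": w["word"],
            "start": w["start"],
            "end": w["end"],
            "confidence": w.get("confidence"),  # None when the provider omits it
        }
        for w in source
    ]


if __name__ == "__main__":
    groq_like = {"text": "hello", "words": [{"word": "hello", "start": 0.0, "end": 0.32}]}
    print(normalize_words(groq_like))
```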
TTS Reference
Edge TTS (free, no API key needed)
List available voices
edge-tts --list-voices
Generate speech
edge-tts --text "Your caption text here" --voice en-US-AriaNeural --write-media tts_output.mp3
Other good voices: en-US-GuyNeural, en-GB-SoniaNeural, en-AU-NatashaNeural
Install: pip install edge-tts
OpenAI TTS
curl -s -X POST "https://api.openai.com/v1/audio/speech" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"tts-1","input":"Your text here","voice":"alloy"}' \
  --output tts_output.mp3
Voices: alloy , echo , fable , onyx , nova , shimmer
Models: tts-1 (fast), tts-1-hd (quality)
ElevenLabs
curl -s -X POST "https://api.elevenlabs.io/v1/text-to-speech/21m00Tcm4TlvDq8ikWAM" \
  -H "xi-api-key: $ELEVENLABS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"text":"Your text here","model_id":"eleven_monolingual_v1"}' \
  --output tts_output.mp3
Voice ID 21m00Tcm4TlvDq8ikWAM = Rachel (default). List voices: GET /v1/voices
Audio Merging (TTS + Original)
Mix TTS over original audio (original at 30% volume, TTS at 100%)
ffmpeg -i clip.mp4 -i tts.mp3 \
  -filter_complex "[0:a]volume=0.3[orig];[1:a]volume=1.0[tts];[orig][tts]amix=inputs=2:duration=first[out]" \
  -map 0:v -map "[out]" -c:v copy -c:a aac -y clip_voiced.mp4
Replace audio entirely (no original audio)
ffmpeg -i clip.mp4 -i tts.mp3 -map 0:v -map 1:a -c:v copy -c:a aac -shortest -y clip_voiced.mp4
Quality & Performance Tips
- Use -preset ultrafast for quick previews, -preset slow for final output
- Use -crf 23 for good quality (18 = high quality, 28 = low quality; lower values give bigger files)
- Add -movflags +faststart for web-friendly MP4
- Use -threads 0 to auto-detect CPU cores
- Pass -y to overwrite existing outputs without prompting; otherwise ffmpeg waits for confirmation and stalls in non-interactive runs
Telegram Bot API Reference
sendVideo — Upload and send a video to a chat/channel
curl -s -X POST "https://api.telegram.org/bot<BOT_TOKEN>/sendVideo" \
  -F "chat_id=<CHAT_ID>" \
  -F "video=@clip_N_final.mp4" \
  -F "caption=Clip title here" \
  -F "parse_mode=HTML" \
  -F "supports_streaming=true"
Parameters
Parameter           Required  Description
chat_id             Yes       Channel (-100XXXXXXXXXX or @channelname), group, or user numeric ID
video               Yes       @filepath for upload (max 50MB) or a Telegram file_id for re-send
caption             No        Text caption, up to 1024 characters
parse_mode          No        HTML or MarkdownV2 for styled captions
supports_streaming  No        true enables progressive playback
Success Response
{"ok": true, "result": {"message_id": 1234, "video": {"file_id": "BAACAgI...", "file_size": 5242880}}}
Error Response
{"ok": false, "error_code": 400, "description": "Bad Request: chat not found"}
Common Errors
Error Code  Description               Fix
400         Chat not found            Verify chat_id; bot must be added to the channel/group
401         Unauthorized              Bot token is invalid or revoked; regenerate via @BotFather
413         Request entity too large  File exceeds 50MB; re-encode it smaller (see File Size Limit)
429         Too many requests         Rate limited; wait the retry_after seconds from the response
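The error table can be turned into a small decision helper for automated uploads. This is a sketch, not part of the Bot API: telegram_next_step and its action labels are hypothetical, and it assumes retry_after is delivered under the response's parameters object.

```python
def telegram_next_step(response: dict) -> tuple[str, float]:
    """Map a parsed sendVideo response to (action, wait_seconds)."""
    if response.get("ok"):
        return ("done", 0.0)
    code = response.get("error_code")
    if code == 429:
        # Assumption: the wait time is in response["parameters"]["retry_after"].
        wait = response.get("parameters", {}).get("retry_after", 1)
        return ("retry", float(wait))
    if code == 413:
        return ("reencode", 0.0)  # file over the upload limit
    # 400 and 401 need a configuration fix, not a retry.
    return ("fail", 0.0)


if __name__ == "__main__":
    limited = {"ok": False, "error_code": 429,
               "description": "Too Many Requests: retry after 7",
               "parameters": {"retry_after": 7}}
    print(telegram_next_step(limited))
```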
File Size Limit
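Telegram allows up to 50MB for video uploads via Bot API. If a clip exceeds this, re-encode it at a higher CRF. Note that -fs 49M is only a hard cap: ffmpeg stops writing once the output reaches that size, so a clip that is still too large at this CRF is cut short rather than compressed further. Check the output duration afterwards, and raise -crf or lower the resolution if it was truncated.
ffmpeg -i clip_N_final.mp4 -fs 49M -c:v libx264 -crf 28 -preset fast -c:a aac -movflags +faststart -y clip_N_tg.mp4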
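An alternative to guessing a CRF is to derive a video bitrate from the size limit and the clip duration. The sketch below is a rough estimate under stated assumptions: a 128 kbps audio track, 3% container overhead, and the limit read in decimal bytes; target_video_kbps is a hypothetical name.

```python
def target_video_kbps(duration_s: float, limit_bytes: int,
                      audio_kbps: int = 128, overhead: float = 0.03) -> int:
    """Estimate the video bitrate (kbps) that keeps a clip under limit_bytes."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    total_kbps = limit_bytes * 8 / duration_s / 1000
    # Reserve room for container overhead and the audio track.
    video_kbps = total_kbps * (1 - overhead) - audio_kbps
    if video_kbps <= 0:
        raise ValueError("clip is too long for this size limit at this audio bitrate")
    return int(video_kbps)


if __name__ == "__main__":
    # A 2-minute clip under a 49,000,000-byte cap.
    print(target_video_kbps(120, 49_000_000))
```

The result can replace -crf with a bitrate target such as -b:v 3040k -b:a 128k. Single-pass bitrate targeting is approximate, so the output size is still worth checking.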
WhatsApp Business Cloud API Reference
Two-Step Flow: Upload Media → Send Message
WhatsApp Cloud API requires uploading the video first to get a media_id , then sending a message referencing that ID.
Step 1 — Upload Media
curl -s -X POST "https://graph.facebook.com/v21.0/<PHONE_NUMBER_ID>/media" \
  -H "Authorization: Bearer <ACCESS_TOKEN>" \
  -F "file=@clip_N_final.mp4" \
  -F "type=video/mp4" \
  -F "messaging_product=whatsapp"
Success response:
{"id": "1234567890"}
Step 2 — Send Video Message
curl -s -X POST "https://graph.facebook.com/v21.0/<PHONE_NUMBER_ID>/messages" \
  -H "Authorization: Bearer <ACCESS_TOKEN>" \
  -H "Content-Type: application/json" \
-d '{
"messaging_product": "whatsapp",
"to": "<RECIPIENT_PHONE>",
"type": "video",
"video": {
"id": "<MEDIA_ID>",
"caption": "Clip title here"
}
}'
Success response:
{"messaging_product": "whatsapp", "contacts": [{"wa_id": "14155551234"}], "messages": [{"id": "wamid.HBgL..."}]}
File Size Limit
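WhatsApp allows up to 16MB for video uploads. If a clip exceeds this, re-encode it at a higher CRF. Note that -fs 15M is only a hard cap: ffmpeg stops writing once the output reaches that size, so a clip that is still too large at this CRF is cut short rather than compressed further. Check the output duration afterwards, and raise -crf or lower the resolution if it was truncated.
ffmpeg -i clip_N_final.mp4 -fs 15M -c:v libx264 -crf 30 -preset fast -c:a aac -movflags +faststart -y clip_N_wa.mp4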
24-Hour Messaging Window
WhatsApp requires the recipient to have messaged you within the last 24 hours (for non-template messages). If you get a "template required" error, either:
- Ask the recipient to send any message to the business number first
- Use a pre-approved message template instead of a free-form video message
Common Errors
Error Code  Description                                Fix
100         Invalid parameter                          Check phone_number_id and recipient format (no + prefix, no spaces)
190         Invalid/expired access token               Regenerate token in Meta Business Settings; temporary tokens expire in 24h
131030      Recipient not in allowed list              In test mode, add recipient to allowed numbers in Meta Developer Portal
131047      Re-engagement message / template required  Recipient hasn't messaged within 24h; use a template or ask them to message first
131053      Media upload failed                        File too large or unsupported format; re-encode as MP4 under 16MB
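Error 100 is often a formatting problem with the recipient number, so normalizing it before building the payload can help. The sketch below follows the "no + prefix, no spaces" rule from the table and builds the Step 2 message body; both function names are hypothetical, and the number is assumed to already include its country code.

```python
import json
import re


def normalize_recipient(phone: str) -> str:
    """Strip '+', spaces, and punctuation, leaving digits only."""
    digits = re.sub(r"\D", "", phone)
    if not digits:
        raise ValueError("recipient contains no digits")
    return digits


def video_message_payload(recipient: str, media_id: str, caption: str) -> str:
    """Build the JSON body for the Step 2 /messages request."""
    return json.dumps(
        {
            "messaging_product": "whatsapp",
            "to": normalize_recipient(recipient),
            "type": "video",
            "video": {"id": media_id, "caption": caption},
        }
    )


if __name__ == "__main__":
    print(video_message_payload("+1 (415) 555-1234", "1234567890", "Clip title here"))
```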