video-understand

Understand video content locally using ffmpeg for frame extraction and Whisper for transcription. Fully offline, no API keys required.

Prerequisites

ffmpeg and ffprobe (required): brew install ffmpeg

openai-whisper (optional, for transcription): pip install openai-whisper
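Before running the script, it can help to confirm that the prerequisites are actually on the machine. The following is a minimal sketch, not part of the skill itself: the helper name check_prerequisites is hypothetical, and it assumes the openai-whisper package is importable under the module name whisper.

```python
import importlib.util
import shutil


def check_prerequisites():
    """Return a dict mapping each dependency name to True/False availability.

    Hypothetical helper for illustration; the skill does not ship it.
    """
    return {
        # ffmpeg and ffprobe are looked up on PATH.
        "ffmpeg": shutil.which("ffmpeg") is not None,
        "ffprobe": shutil.which("ffprobe") is not None,
        # The openai-whisper package is imported as `whisper`.
        "whisper": importlib.util.find_spec("whisper") is not None,
    }


if __name__ == "__main__":
    for name, found in check_prerequisites().items():
        print(f"{name}: {'found' if found else 'missing'}")
```

If whisper is reported missing, frame extraction should still be usable with --no-transcribe, since transcription is the only optional part.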

Commands

Scene detection + transcribe (default)

python3 skills/video-understand/scripts/understand_video.py video.mp4

Keyframe extraction

python3 skills/video-understand/scripts/understand_video.py video.mp4 -m keyframe

Regular interval extraction

python3 skills/video-understand/scripts/understand_video.py video.mp4 -m interval

Limit frames extracted

python3 skills/video-understand/scripts/understand_video.py video.mp4 --max-frames 10

Use a larger Whisper model

python3 skills/video-understand/scripts/understand_video.py video.mp4 --whisper-model small

Frames only, skip transcription

python3 skills/video-understand/scripts/understand_video.py video.mp4 --no-transcribe

Quiet mode (JSON only, no progress)

python3 skills/video-understand/scripts/understand_video.py video.mp4 -q

Output to file

python3 skills/video-understand/scripts/understand_video.py video.mp4 -o result.json

CLI Options

Flag               Description
video              Input video file (positional, required)
-m, --mode         Extraction mode: scene (default), keyframe, interval
--max-frames       Maximum frames to keep (default: 20)
--whisper-model    Whisper model size: tiny, base, small, medium, large (default: base)
--no-transcribe    Skip audio transcription, extract frames only
-o, --output       Write result JSON to file instead of stdout
-q, --quiet        Suppress progress messages, output only JSON
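To make the defaults and allowed values concrete, here is a sketch of how these flags could be declared with argparse. It is an illustration reconstructed from the option descriptions, not the actual source of understand_video.py; the function name build_parser and the help strings are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(prog="understand_video.py")
    parser.add_argument("video", help="input video file")
    parser.add_argument(
        "-m", "--mode",
        choices=["scene", "keyframe", "interval"],
        default="scene",
        help="extraction mode",
    )
    parser.add_argument("--max-frames", type=int, default=20,
                        help="maximum frames to keep")
    parser.add_argument(
        "--whisper-model",
        choices=["tiny", "base", "small", "medium", "large"],
        default="base",
        help="Whisper model size",
    )
    parser.add_argument("--no-transcribe", action="store_true",
                        help="skip audio transcription")
    parser.add_argument("-o", "--output", default=None,
                        help="write result JSON to this file")
    parser.add_argument("-q", "--quiet", action="store_true",
                        help="suppress progress messages")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["video.mp4", "-m", "keyframe", "--max-frames", "10"])
    print(args.mode, args.max_frames, args.whisper_model)
```

Declaring the modes and model sizes as choices means an unsupported value is rejected before any video processing starts.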

Extraction Modes

Mode       How it works                                               Best for
scene      Detects scene changes via ffmpeg select='gt(scene,0.3)'    Most videos, varied content
keyframe   Extracts I-frames (codec keyframes)                        Encoded video with natural keyframe placement
interval   Evenly spaced frames based on duration and max-frames      Fixed sampling, predictable output

If scene mode detects no scene changes, it automatically falls back to interval mode.
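The interval spacing and the scene-to-interval fallback can be sketched as follows. This is an illustration under stated assumptions, not the script's actual logic: the function names are hypothetical, sampling at the midpoint of each equal slice is one plausible reading of "evenly spaced", and thinning an over-long list of scene changes down to max-frames is likewise assumed.

```python
def interval_timestamps(duration, max_frames):
    """Evenly spaced timestamps in seconds.

    Assumption: the video is cut into max_frames equal slices and the
    midpoint of each slice is sampled.
    """
    if duration <= 0 or max_frames <= 0:
        return []
    step = duration / max_frames
    return [round(step * (i + 0.5), 3) for i in range(max_frames)]


def pick_timestamps(scene_timestamps, duration, max_frames):
    """Return (mode_used, timestamps), falling back to interval sampling
    when scene detection found nothing."""
    if not scene_timestamps:
        return "interval", interval_timestamps(duration, max_frames)
    if len(scene_timestamps) <= max_frames:
        return "scene", list(scene_timestamps)
    # Assumption: too many scene changes are thinned evenly to max_frames.
    step = len(scene_timestamps) / max_frames
    return "scene", [scene_timestamps[int(i * step)] for i in range(max_frames)]


if __name__ == "__main__":
    # No scene changes detected, so interval sampling is used instead.
    print(pick_timestamps([], 10.0, 4))
    # -> ('interval', [1.25, 3.75, 6.25, 8.75])
```

Returning the mode alongside the timestamps lets a caller report which mode was really used when the fallback fires.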

Output

The script writes JSON to stdout (or to a file with -o). See references/output-format.md for the full schema.

{
  "video": "video.mp4",
  "duration": 18.076,
  "resolution": {"width": 1224, "height": 1080},
  "mode": "scene",
  "frames": [
    {"path": "/abs/path/frame_0001.jpg", "timestamp": 0.0, "timestamp_formatted": "00:00"}
  ],
  "frame_count": 12,
  "transcript": [
    {"start": 0.0, "end": 2.5, "text": "Hello and welcome..."}
  ],
  "text": "Full transcript...",
  "note": "Use the Read tool to view frame images for visual understanding."
}

Use the Read tool on frame image paths to visually inspect extracted frames.
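Once the JSON is available, each frame can be paired with whatever was being said at its timestamp. The sketch below uses only the fields shown in the example output; the helper names text_at and summarize are hypothetical, and treating a missing transcript key as "no transcript" (for example after --no-transcribe) is an assumption rather than documented behavior.

```python
import json


def text_at(transcript, timestamp):
    """Return the text of the first segment covering timestamp, else None."""
    for segment in transcript:
        if segment["start"] <= timestamp <= segment["end"]:
            return segment["text"]
    return None


def summarize(result):
    """List (formatted time, frame path, spoken text or None) per frame."""
    # Assumption: the transcript key may be absent when transcription is skipped.
    transcript = result.get("transcript", [])
    return [
        (frame["timestamp_formatted"], frame["path"],
         text_at(transcript, frame["timestamp"]))
        for frame in result["frames"]
    ]


if __name__ == "__main__":
    sample = json.loads("""
    {
      "frames": [
        {"path": "/abs/path/frame_0001.jpg", "timestamp": 0.0,
         "timestamp_formatted": "00:00"}
      ],
      "transcript": [
        {"start": 0.0, "end": 2.5, "text": "Hello and welcome..."}
      ]
    }
    """)
    for row in summarize(sample):
        print(row)
```

For a real run, replace the inline sample with json.load on the file written by -o result.json.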

References

references/output-format.md -- Full JSON output schema documentation
