video-understand

Understand video content locally using ffmpeg for frame extraction and Whisper for transcription. Fully offline, no API keys required.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "video-understand" with this command: npx skills add heygen-com/skills/heygen-com-skills-video-understand

Prerequisites

  • ffmpeg and ffprobe (required): brew install ffmpeg
  • openai-whisper (optional, for transcription): pip install openai-whisper
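Before running the script, it can help to confirm that the prerequisites are present. The following is a minimal sketch, not part of the skill itself: check_prerequisites is a hypothetical helper name, and the check assumes that ffmpeg and ffprobe are expected on PATH and that the openai-whisper package is importable as the module whisper.

```python
import importlib.util
import shutil


def check_prerequisites():
    """Return a dict mapping each dependency name to True if it appears available."""
    return {
        # ffmpeg and ffprobe are looked up on PATH
        "ffmpeg": shutil.which("ffmpeg") is not None,
        "ffprobe": shutil.which("ffprobe") is not None,
        # the openai-whisper package installs an importable module named "whisper"
        "whisper": importlib.util.find_spec("whisper") is not None,
    }


if __name__ == "__main__":
    for name, found in check_prerequisites().items():
        print(f"{name}: {'found' if found else 'missing'}")
```

If whisper is reported missing, frames can still be extracted by passing --no-transcribe.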

Commands

Scene detection + transcribe (default)

python3 skills/video-understand/scripts/understand_video.py video.mp4

Keyframe extraction

python3 skills/video-understand/scripts/understand_video.py video.mp4 -m keyframe

Regular interval extraction

python3 skills/video-understand/scripts/understand_video.py video.mp4 -m interval

Limit frames extracted

python3 skills/video-understand/scripts/understand_video.py video.mp4 --max-frames 10

Use a larger Whisper model

python3 skills/video-understand/scripts/understand_video.py video.mp4 --whisper-model small

Frames only, skip transcription

python3 skills/video-understand/scripts/understand_video.py video.mp4 --no-transcribe

Quiet mode (JSON only, no progress)

python3 skills/video-understand/scripts/understand_video.py video.mp4 -q

Output to file

python3 skills/video-understand/scripts/understand_video.py video.mp4 -o result.json

CLI Options

| Flag | Description |
|------|-------------|
| video | Input video file (positional, required) |
| -m, --mode | Extraction mode: scene (default), keyframe, interval |
| --max-frames | Maximum frames to keep (default: 20) |
| --whisper-model | Whisper model size: tiny, base, small, medium, large (default: base) |
| --no-transcribe | Skip audio transcription, extract frames only |
| -o, --output | Write result JSON to file instead of stdout |
| -q, --quiet | Suppress progress messages, output only JSON |
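The flags above map naturally onto an argparse definition. The sketch below is an illustration of that mapping, assuming the defaults listed in this section; build_parser is a hypothetical name, and the actual understand_video.py may define its arguments differently.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags (a sketch, not the real script)."""
    parser = argparse.ArgumentParser(prog="understand_video.py")
    parser.add_argument("video", help="Input video file")
    parser.add_argument("-m", "--mode",
                        choices=["scene", "keyframe", "interval"],
                        default="scene", help="Extraction mode")
    parser.add_argument("--max-frames", type=int, default=20,
                        help="Maximum frames to keep")
    parser.add_argument("--whisper-model",
                        choices=["tiny", "base", "small", "medium", "large"],
                        default="base", help="Whisper model size")
    parser.add_argument("--no-transcribe", action="store_true",
                        help="Skip audio transcription")
    parser.add_argument("-o", "--output", default=None,
                        help="Write result JSON to this file")
    parser.add_argument("-q", "--quiet", action="store_true",
                        help="Suppress progress messages")
    return parser


if __name__ == "__main__":
    # Parse a sample command line rather than sys.argv so the sketch runs standalone.
    args = build_parser().parse_args(["video.mp4", "-m", "interval", "--max-frames", "10"])
    print(args)
```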

Extraction Modes

| Mode | How it works | Best for |
|------|--------------|----------|
| scene | Detects scene changes via ffmpeg select='gt(scene,0.3)' | Most videos, varied content |
| keyframe | Extracts I-frames (codec keyframes) | Encoded video with natural keyframe placement |
| interval | Evenly spaced frames based on duration and max-frames | Fixed sampling, predictable output |

If scene mode detects no scene changes, it automatically falls back to interval mode.
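The fallback from scene to interval mode can be sketched as follows. This is an illustration under stated assumptions, not the script's actual logic: the function names are hypothetical, interval timestamps are assumed to sit at the midpoint of each equal slice of the duration, and the --max-frames cap is assumed to thin scene frames evenly.

```python
def interval_timestamps(duration, max_frames):
    """Evenly spaced timestamps (seconds), one at the midpoint of each equal slice."""
    if duration <= 0 or max_frames <= 0:
        return []
    step = duration / max_frames
    return [round(step * (i + 0.5), 3) for i in range(max_frames)]


def thin(items, limit):
    """Keep at most `limit` items, picked at even strides (assumed capping strategy)."""
    if limit <= 0:
        return []
    if len(items) <= limit:
        return list(items)
    step = len(items) / limit
    return [items[int(i * step)] for i in range(limit)]


def pick_frames(scene_timestamps, duration, max_frames):
    """Use scene-change timestamps when any were found, otherwise fall back to interval."""
    if scene_timestamps:
        return "scene", thin(scene_timestamps, max_frames)
    return "interval", interval_timestamps(duration, max_frames)


if __name__ == "__main__":
    # No scene changes detected: the sketch falls back to interval sampling.
    print(pick_frames([], duration=10.0, max_frames=5))
```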

Output

The script writes JSON to stdout (or to a file with -o). See references/output-format.md for the full schema.

{
  "video": "video.mp4",
  "duration": 18.076,
  "resolution": {"width": 1224, "height": 1080},
  "mode": "scene",
  "frames": [
    {"path": "/abs/path/frame_0001.jpg", "timestamp": 0.0, "timestamp_formatted": "00:00"}
  ],
  "frame_count": 12,
  "transcript": [
    {"start": 0.0, "end": 2.5, "text": "Hello and welcome..."}
  ],
  "text": "Full transcript...",
  "note": "Use the Read tool to view frame images for visual understanding."
}

Use the Read tool on frame image paths to visually inspect extracted frames.
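To consume the result programmatically, the JSON can be loaded and walked field by field. The sketch below uses only the fields shown in the sample output; summarize is a hypothetical helper, and it assumes that frames or transcript may be absent (for example when --no-transcribe is used), which the full schema in references/output-format.md should confirm.

```python
import json

# Sample taken from the example output in this section.
SAMPLE = json.loads("""
{
  "video": "video.mp4",
  "duration": 18.076,
  "mode": "scene",
  "frames": [
    {"path": "/abs/path/frame_0001.jpg", "timestamp": 0.0, "timestamp_formatted": "00:00"}
  ],
  "frame_count": 12,
  "transcript": [
    {"start": 0.0, "end": 2.5, "text": "Hello and welcome..."}
  ]
}
""")


def summarize(result):
    """Return readable lines describing the frames and transcript segments."""
    lines = [f"{result['video']}: {result['frame_count']} frames, mode={result['mode']}"]
    for frame in result.get("frames", []):
        lines.append(f"frame {frame['timestamp_formatted']} -> {frame['path']}")
    # "transcript" is read defensively in case transcription was skipped.
    for seg in result.get("transcript", []):
        lines.append(f"[{seg['start']:.1f}-{seg['end']:.1f}] {seg['text']}")
    return lines


if __name__ == "__main__":
    print("\n".join(summarize(SAMPLE)))
```

For a real run, replace SAMPLE with the parsed contents of the file written via -o, for example json.load(open("result.json")).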

References

  • references/output-format.md -- Full JSON output schema documentation

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
