video-understand

Understand video content locally using ffmpeg frame extraction and Whisper transcription. No API keys needed. Use when: (1) Understanding what a video contains, (2) Transcribing video audio locally, (3) Extracting key frames for visual analysis, (4) Getting video content without API keys.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Installation

npx skills add heygen-com/skills/heygen-com-skills-video-understand

Understand video content locally using ffmpeg for frame extraction and Whisper for transcription. Fully offline, no API keys required.

Prerequisites

  • ffmpeg + ffprobe (required): brew install ffmpeg (macOS; on other platforms use your package manager, e.g. apt install ffmpeg)
  • openai-whisper (optional, for transcription): pip install openai-whisper
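Before running the script, it can help to confirm both required tools are on PATH. A minimal sketch (the `have_tool` helper is illustrative, not part of the skill):

```python
import shutil

def have_tool(name: str) -> bool:
    """Return True if an executable with this name is found on PATH."""
    return shutil.which(name) is not None

# ffprobe ships with ffmpeg, but check both in case of a partial install
missing = [t for t in ("ffmpeg", "ffprobe") if not have_tool(t)]
if missing:
    print("Missing required tools:", ", ".join(missing))
```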

Commands

# Scene detection + transcribe (default)
python3 skills/video-understand/scripts/understand_video.py video.mp4

# Keyframe extraction
python3 skills/video-understand/scripts/understand_video.py video.mp4 -m keyframe

# Regular interval extraction
python3 skills/video-understand/scripts/understand_video.py video.mp4 -m interval

# Limit frames extracted
python3 skills/video-understand/scripts/understand_video.py video.mp4 --max-frames 10

# Use a larger Whisper model
python3 skills/video-understand/scripts/understand_video.py video.mp4 --whisper-model small

# Frames only, skip transcription
python3 skills/video-understand/scripts/understand_video.py video.mp4 --no-transcribe

# Quiet mode (JSON only, no progress)
python3 skills/video-understand/scripts/understand_video.py video.mp4 -q

# Output to file
python3 skills/video-understand/scripts/understand_video.py video.mp4 -o result.json

CLI Options

| Flag | Description |
| --- | --- |
| `video` | Input video file (positional, required) |
| `-m, --mode` | Extraction mode: `scene` (default), `keyframe`, `interval` |
| `--max-frames` | Maximum frames to keep (default: 20) |
| `--whisper-model` | Whisper model size: `tiny`, `base`, `small`, `medium`, `large` (default: `base`) |
| `--no-transcribe` | Skip audio transcription; extract frames only |
| `-o, --output` | Write result JSON to a file instead of stdout |
| `-q, --quiet` | Suppress progress messages; output only JSON |
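When driving the script from another program, the flags above can be assembled into an argv list for subprocess use. A hypothetical convenience helper, not part of the skill itself:

```python
def build_cmd(video, mode=None, max_frames=None, whisper_model=None,
              transcribe=True, output=None, quiet=False):
    """Assemble an argv list for understand_video.py from its CLI options."""
    cmd = ["python3", "skills/video-understand/scripts/understand_video.py", video]
    if mode:
        cmd += ["-m", mode]
    if max_frames is not None:
        cmd += ["--max-frames", str(max_frames)]
    if whisper_model:
        cmd += ["--whisper-model", whisper_model]
    if not transcribe:
        cmd.append("--no-transcribe")
    if output:
        cmd += ["-o", output]
    if quiet:
        cmd.append("-q")
    return cmd
```

For example, `subprocess.run(build_cmd("video.mp4", quiet=True), capture_output=True, text=True)` yields clean JSON on stdout.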

Extraction Modes

| Mode | How it works | Best for |
| --- | --- | --- |
| `scene` | Detects scene changes via ffmpeg `select='gt(scene,0.3)'` | Most videos, varied content |
| `keyframe` | Extracts I-frames (codec keyframes) | Encoded video with natural keyframe placement |
| `interval` | Evenly spaced frames based on duration and `--max-frames` | Fixed sampling, predictable output |

If scene mode detects no scene changes, it automatically falls back to interval mode.
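The three modes correspond to standard ffmpeg `-vf` filter expressions. A sketch of those filters (the exact invocations inside understand_video.py may differ):

```python
def frame_filter(mode: str, duration: float = 0.0, max_frames: int = 20) -> str:
    """Return an ffmpeg -vf filter expression for each extraction mode."""
    if mode == "scene":
        return "select='gt(scene,0.3)'"    # keep frames whose scene-change score > 0.3
    if mode == "keyframe":
        return "select='eq(pict_type,I)'"  # keep codec I-frames only
    if mode == "interval":
        fps = max_frames / duration if duration else 1.0
        return f"fps={fps:.4f}"            # evenly spaced sampling across the video
    raise ValueError(f"unknown mode: {mode}")

# used roughly as: ffmpeg -i video.mp4 -vf "<filter>" -vsync vfr frame_%04d.jpg
```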

Output

The script outputs JSON to stdout (or file with -o). See references/output-format.md for the full schema.

{
  "video": "video.mp4",
  "duration": 18.076,
  "resolution": {"width": 1224, "height": 1080},
  "mode": "scene",
  "frames": [
    {"path": "/abs/path/frame_0001.jpg", "timestamp": 0.0, "timestamp_formatted": "00:00"}
  ],
  "frame_count": 12,
  "transcript": [
    {"start": 0.0, "end": 2.5, "text": "Hello and welcome..."}
  ],
  "text": "Full transcript...",
  "note": "Use the Read tool to view frame images for visual understanding."
}

Use the Read tool on frame image paths to visually inspect extracted frames.
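A typical consumer parses the JSON and walks the frames and transcript segments. A sketch with an inline sample shaped like the output above (a real result would come from `json.load(open("result.json"))`):

```python
def summarize(result: dict) -> str:
    """Produce a short text summary from the script's JSON output."""
    lines = [f"{result['video']}: {result['frame_count']} frames, "
             f"{result['duration']:.1f}s, mode={result['mode']}"]
    for seg in result.get("transcript", []):
        lines.append(f"[{seg['start']:.1f}-{seg['end']:.1f}] {seg['text']}")
    return "\n".join(lines)

sample = {"video": "video.mp4", "duration": 18.076, "mode": "scene",
          "frame_count": 12,
          "transcript": [{"start": 0.0, "end": 2.5, "text": "Hello and welcome..."}]}
print(summarize(sample))
```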

References

  • references/output-format.md -- Full JSON output schema documentation

