remotion-word-highlight-subtitles

Add word-level highlighted subtitles to local short videos using Whisper word timestamps and Remotion rendering.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "remotion-word-highlight-subtitles" with this command: npx skills add 0x00000003/remotion-word-highlight-subtitles

Remotion Word Highlight Subtitles

Overview

This skill turns a local video into a subtitled video using the reusable "fine version": Whisper word timestamps plus Remotion-rendered current-word highlighting. Use this instead of plain SRT burn-in unless the user explicitly asks for simple static subtitles.

Detailed usage guide, README, and effect preview: https://github.com/0x00000003/remotion-word-highlight-subtitles

Trigger Phrases

Use this skill for requests like:

  • "给这个视频添加字幕:/path/to/video.mp4"
  • "给这个视频加逐词高亮字幕"
  • "按之前 Remotion 那个字幕方案处理这个视频"
  • "重新转写 word timestamps,然后加当前词高亮字幕"

If the user only gives one video path, write the output next to the source video.

Defaults

  • Source: local .mp4, .mov, .m4v, or audio/video file with usable audio.
  • Output: <source-stem>_remotion逐词高亮字幕.mp4 in the same directory unless the user names another target.
  • Transcription: Whisper with word timestamps, Chinese by default: --language zh --word_timestamps True --output_format json.
  • Transcript QA: mandatory before rendering. Build a tiny glossary from the user prompt, folder/file names, visible context, and topic; correct obvious ASR mistakes in product/model/person/place names, English terms, numbers, and homophones.
  • Caption data: Remotion caption chunks built from Whisper words, with token-level startMs and endMs. Long Whisper segments must be split by punctuation, timing gaps, duration, or length before rendering.
  • Caption position: keep all captions in a screenshot-friendly lower band, slightly above the bottom UI area. Start around height * 0.28 bottom padding, then adjust only if the source framing clearly needs it.
  • Visual style: bold Chinese UI font, white base text, current token yellow, optional sparse keyword accent, and black shadow/outline for readability.
  • Verification: check the rendered file exists, keeps audio, matches the source duration closely, and visually inspect stills before accepting the style.

Workflow

  1. Inspect the source with ffprobe for width, height, fps, duration, and audio presence.
  2. Build a small domain glossary before transcription. Include likely product names, model names, people, places, mixed Chinese/English terms, numbers, and terms hinted by the folder or filename.
  3. Run Whisper word timestamp transcription. Prefer a cached/local model that works on the machine; turbo is a good default when available. Use --initial_prompt when the glossary contains high-risk names or English terms.
  4. Do a mandatory transcript QA pass before rendering. Read every segment, compare it with the video's topic/context, and create a correction map for obvious ASR mistakes. Do not render while known mistakes remain.
  5. Convert the Whisper JSON to public/captions.json using scripts/whisper_json_to_captions.py, passing corrections with --replace or --replace-phrase. If Whisper returns long paragraphs, keep the helper's caption splitting enabled.
  6. Build or reuse a small Remotion project in the video's folder. Copy the video to public/input.mp4 or encode an H.264 compatibility copy as public/input-h264.mp4.
  7. Set the Remotion composition to the exact source width, height, fps, and duration in frames.
  8. Before the full render, render and inspect at least two stills from the Remotion composition: one long/two-line caption and one dense keyword/current-token caption. Fix muddy text, excessive outline, large dark backing, bad line breaks, poor placement, or remaining typo/wrong-character subtitles before rendering the final video.
  9. Render with Remotion to the output path next to the source.
  10. Verify the final output with ffprobe; extract and inspect a still from the rendered video to confirm the final encoded file kept the approved subtitle look.

Whisper Command

Use this shape, adjusting model and paths as needed:

whisper "/absolute/path/video.mp4" --model turbo --language zh --word_timestamps True --output_format json --output_dir "/absolute/path"

For names and mixed Chinese/English topics, add a short prompt rather than relying on Whisper to infer them:

whisper "/absolute/path/video.mp4" \
  --model turbo \
  --language zh \
  --word_timestamps True \
  --output_format json \
  --initial_prompt "本视频可能出现这些词:Cursor、Kimi 2.5、马斯克、AI大模型、转推、套壳。" \
  --output_dir "/absolute/path"

If Whisper fails on the video container, extract audio first:

ffmpeg -y -i "/absolute/path/video.mp4" -vn -ac 1 -ar 16000 "/absolute/path/video_audio.wav"

Then run Whisper on the WAV and keep the output naming clear.

Mandatory Transcript QA

After Whisper finishes, print or read the segment transcript before building captions. Treat this as a required gate, not a nice-to-have.

Check especially:

  • product, model, company, person, and place names
  • mixed Chinese/English terms such as Cursor, Kimi 2.5, ChatGPT, Claude, OpenAI
  • numbers, dates, model versions, and acronyms
  • Chinese homophones that are plausible but wrong in context, such as "却" versus "圈" or "死腿" versus "转推"
  • low-confidence words if the Whisper JSON includes probabilities

Create a short correction map and apply it before rendering. Use --replace for single-token fixes and --replace-phrase for words split across adjacent Whisper tokens.

Example:

python3 scripts/whisper_json_to_captions.py \
  "/absolute/path/transcript.json" \
  "/absolute/path/remotion-project/public/captions.json" \
  --replace-phrase "科舍=Cursor" \
  --replace-phrase "KMI 2.5=Kimi 2.5" \
  --replace-phrase "Kimi 2.5=Kimi 2.5" \
  --replace-phrase "AI却=AI圈" \
  --replace-phrase "死腿=转推" \
  --merge-term "Cursor" \
  --merge-term "Kimi 2.5" \
  --keyword "AI" \
  --keyword "Kimi 2.5"

If a correction is uncertain, prefer rerunning Whisper with a better --initial_prompt or inspect the relevant audio/video moment before deciding. Report the correction map in the final response.

Caption JSON

Remotion should load public/captions.json with this shape:

[
  {
    "text": "我们用手机随便拍张照片",
    "startMs": 0,
    "endMs": 1600,
    "tokens": [
      { "text": "我们", "startMs": 0, "endMs": 300, "keyword": false },
      { "text": "手机", "startMs": 440, "endMs": 700, "keyword": true }
    ]
  }
]

Use the helper script:

python3 scripts/whisper_json_to_captions.py \
  "/absolute/path/transcript.json" \
  "/absolute/path/remotion-project/public/captions.json" \
  --keyword "提示词" \
  --keyword "Codex" \
  --replace-phrase "错识别词=正确词" \
  --max-caption-chars 28 \
  --max-caption-duration-ms 4200 \
  --split-gap-ms 260 \
  --min-punctuation-caption-ms 900

Run the transcript QA pass before this conversion command. If any correction changes adjacent tokens into one display term, also pass that final term with --merge-term or a same-text --replace-phrase so the highlight appears as a clean word instead of broken characters.

The helper splits long Whisper segments by visible length, duration, punctuation, and word timing gaps. Keep this behavior on for short-video subtitles; otherwise a single Whisper paragraph can become an unreadable multi-line caption.

Remotion Caption Layer Requirements

The caption layer should:

  • Use OffthreadVideo for the source video.
  • Load captions.json with delayRender, continueRender, and staticFile.
  • Find the active caption by currentMs >= startMs && currentMs < endMs.
  • Highlight the active token when currentMs is within the token's timing.
  • Use keyword coloring only as a secondary accent; the current spoken token is the main effect.
  • Keep letterSpacing: 0.
  • Keep all captions at the normal position; do not include special screenshot-sentence placement unless the user explicitly asks.
  • Do not use WebkitTextStroke as a thick Chinese subtitle outline. It easily eats the white fill at small resolutions. Prefer multi-direction textShadow; if WebkitTextStroke is used at all, keep it at or below 1.5px and verify a still.
  • Do not use a large semi-transparent rounded black rectangle behind the whole caption by default. If the footage truly needs backing, use a very subtle per-line backing, opacity <= 0.16, tight padding, and verify it does not look like a dark banner.

Reject and revise any still where the caption has muddy/gray text, a thick black halo, a large black box, clipped words, awkward wrapping, or placement over the mouth/chin in a talking-head video.

Use these style constants as the baseline:

const captionBottom = Math.round(height * 0.28);
const captionFontSize = Math.round(height * 0.032);
const captionMaxWidth = Math.round(width * 0.88);
const activeColor = "#FFE45C";
const keywordColor = "#D6FFF8";
const outlinePx = Math.min(3, Math.max(1.5, captionFontSize * 0.055));

Use a clean shadow outline rather than a heavy stroke:

const textShadow = [
  `${outlinePx}px 0 0 rgba(0, 0, 0, 0.96)`,
  `-${outlinePx}px 0 0 rgba(0, 0, 0, 0.96)`,
  `0 ${outlinePx}px 0 rgba(0, 0, 0, 0.96)`,
  `0 -${outlinePx}px 0 rgba(0, 0, 0, 0.96)`,
  `0 ${outlinePx * 1.4}px ${outlinePx * 1.4}px rgba(0, 0, 0, 0.72)`,
].join(", ");

For a 720x1280 vertical video, this is roughly paddingBottom: 358, fontSize: 41, maxWidth: 634, and a 2px outline. For the original 2160x2974 source that inspired this skill, the equivalent bottom padding was about 830.

Keyword Handling

If the user does not specify keywords, infer a small set from the transcript:

  • product/tool names: Codex, Remotion, Whisper
  • content objects: 提示词, 照片, 手机, 封面
  • place names or nouns that carry the video's hook
  • numbers or outcomes that are important to the promise

Keep keyword highlighting sparse. Too many colored words makes the active yellow word less clear.

Output Naming

Prefer:

<source-stem>_remotion逐词高亮字幕.mp4

If iterating for the same source, add _v2, _v3, etc. Do not overwrite the user's source file.

Final Response

Tell the user the output path, mention that it used the reusable Remotion + Whisper word timestamp flow, and include the transcript correction map or say no corrections were needed. If any verification step could not be run, say that plainly.

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

General

Download Tool

支持下载 YouTube、TikTok、小红书、抖音等平台的视频。

Registry SourceRecently Updated
2450Profile unavailable
General

Jimeng Video Generator

即梦AI视频生成工具(带声音版本),通过火山引擎API自动生成带音频的高质量视频。支持文生视频、图生视频,适用于短视频内容创作。

Registry SourceRecently Updated
2K3Profile unavailable
General

Video Intelligence

Download videos and get transcripts, summaries, or metadata from YouTube, TikTok, Instagram, and X (Twitter). Use when the user shares a video URL and wants...

Registry SourceRecently Updated
4050Profile unavailable
General

Gettr Transcribe

Download audio from a GETTR post or streaming page and transcribe it locally with MLX Whisper on Apple Silicon (with timestamps via VTT). Use when given a GE...

Registry SourceRecently Updated
5360Profile unavailable