Remotion Word Highlight Subtitles
Overview
This skill turns a local video into a subtitled video using the reusable "fine version": Whisper word timestamps plus Remotion-rendered current-word highlighting. Use this instead of plain SRT burn-in unless the user explicitly asks for simple static subtitles.
Detailed usage guide, README, and effect preview: https://github.com/0x00000003/remotion-word-highlight-subtitles
Trigger Phrases
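Use this skill for requests like:
- "给这个视频添加字幕:/path/to/video.mp4" (add subtitles to this video)
- "给这个视频加逐词高亮字幕" (add word-by-word highlighted subtitles to this video)
- "按之前 Remotion 那个字幕方案处理这个视频" (process this video with the earlier Remotion subtitle approach)
- "重新转写 word timestamps,然后加当前词高亮字幕" (re-transcribe with word timestamps, then add current-word highlighted subtitles)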
If the user only gives one video path, write the output next to the source video.
Defaults
- Source: local .mp4, .mov, .m4v, or audio/video file with usable audio.
- Output: <source-stem>_remotion逐词高亮字幕.mp4 in the same directory unless the user names another target.
- Transcription: Whisper with word timestamps, Chinese by default: --language zh --word_timestamps True --output_format json.
- Transcript QA: mandatory before rendering. Build a tiny glossary from the user prompt, folder/file names, visible context, and topic; correct obvious ASR mistakes in product/model/person/place names, English terms, numbers, and homophones.
- Caption data: Remotion caption chunks built from Whisper words, with token-level startMs and endMs. Long Whisper segments must be split by punctuation, timing gaps, duration, or length before rendering.
- Caption position: keep all captions in a screenshot-friendly lower band, slightly above the bottom UI area. Start around height * 0.28 bottom padding, then adjust only if the source framing clearly needs it.
- Visual style: bold Chinese UI font, white base text, current token yellow, optional sparse keyword accent, and black shadow/outline for readability.
- Verification: check the rendered file exists, keeps audio, matches the source duration closely, and visually inspect stills before accepting the style.
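The mechanical part of the verification default can be scripted. The following is a minimal sketch, assuming ffprobe is on PATH; the function names and the 0.1 second tolerance are illustrative choices, not values defined by this skill. It does not replace the visual inspection of stills.

```python
import os
import subprocess


def ffprobe_value(path: str, *args: str) -> str:
    """Run ffprobe with the given selection arguments and return its bare output."""
    cmd = ["ffprobe", "-v", "error", *args, "-of", "csv=p=0", path]
    done = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return done.stdout.strip()


def duration_seconds(path: str) -> float:
    """Container duration in seconds."""
    return float(ffprobe_value(path, "-show_entries", "format=duration"))


def has_audio(path: str) -> bool:
    """True when ffprobe lists at least one audio stream."""
    out = ffprobe_value(
        path, "-select_streams", "a", "-show_entries", "stream=codec_type"
    )
    return bool(out)


def render_problems(source_s, output_s, output_has_audio, tolerance_s=0.1):
    """Compare already-probed values; tolerance_s is an assumed threshold."""
    problems = []
    if not output_has_audio:
        problems.append("output has no audio stream")
    if abs(source_s - output_s) > tolerance_s:
        problems.append(f"duration differs by {abs(source_s - output_s):.2f}s")
    return problems


def verify_render(source: str, output: str) -> list[str]:
    """Return a list of problems; an empty list means the checks passed."""
    if not os.path.isfile(output):
        return ["output file does not exist"]
    return render_problems(
        duration_seconds(source), duration_seconds(output), has_audio(output)
    )
```

An empty list from verify_render(source, output) means the file exists, has audio, and matches the source duration within the assumed tolerance.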
Workflow
- Inspect the source with ffprobe for width, height, fps, duration, and audio presence.
- Build a small domain glossary before transcription. Include likely product names, model names, people, places, mixed Chinese/English terms, numbers, and terms hinted by the folder or filename.
- Run Whisper word timestamp transcription. Prefer a cached/local model that works on the machine; turbo is a good default when available. Use --initial_prompt when the glossary contains high-risk names or English terms.
- Do a mandatory transcript QA pass before rendering. Read every segment, compare it with the video's topic/context, and create a correction map for obvious ASR mistakes. Do not render while known mistakes remain.
- Convert the Whisper JSON to public/captions.json using scripts/whisper_json_to_captions.py, passing corrections with --replace or --replace-phrase. If Whisper returns long paragraphs, keep the helper's caption splitting enabled.
- Build or reuse a small Remotion project in the video's folder. Copy the video to public/input.mp4 or encode an H.264 compatibility copy as public/input-h264.mp4.
- Set the Remotion composition to the exact source width, height, fps, and duration in frames.
- Before the full render, render and inspect at least two stills from the Remotion composition: one long/two-line caption and one dense keyword/current-token caption. Fix muddy text, excessive outline, large dark backing, bad line breaks, poor placement, or remaining typo/wrong-character subtitles before rendering the final video.
- Render with Remotion to the output path next to the source.
- Verify the final output with ffprobe; extract and inspect a still from the rendered video to confirm the final encoded file kept the approved subtitle look.
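The first and seventh steps (inspect the source, then set the composition to matching values) can be sketched as follows. This is one possible approach, assuming ffprobe is on PATH; the function names and the returned key names are illustrative and not part of the skill's helper scripts.

```python
import json
import subprocess
from fractions import Fraction


def probe(path: str) -> dict:
    """Run ffprobe and return its JSON report (format plus streams)."""
    cmd = [
        "ffprobe", "-v", "error",
        "-print_format", "json",
        "-show_format", "-show_streams",
        path,
    ]
    out = subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
    return json.loads(out)


def composition_settings(report: dict) -> dict:
    """Derive width, height, fps, duration in frames, and audio presence."""
    streams = report.get("streams", [])
    video = next(s for s in streams if s.get("codec_type") == "video")
    has_audio = any(s.get("codec_type") == "audio" for s in streams)

    # ffprobe reports frame rates as fractions such as "30000/1001".
    rate = video.get("avg_frame_rate") or ""
    if rate in ("", "0/0"):
        rate = video["r_frame_rate"]
    fps = Fraction(rate)

    duration_s = float(report["format"]["duration"])
    return {
        "width": int(video["width"]),
        "height": int(video["height"]),
        "fps": float(fps),
        "durationInFrames": round(duration_s * fps),
        "hasAudio": has_audio,
    }
```

Calling composition_settings(probe("/absolute/path/video.mp4")) gives the values to copy into the composition. If hasAudio is false, stop before transcription, since Whisper needs usable audio.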
Whisper Command
Use this shape, adjusting model and paths as needed:
whisper "/absolute/path/video.mp4" --model turbo --language zh --word_timestamps True --output_format json --output_dir "/absolute/path"
For names and mixed Chinese/English topics, add a short prompt rather than relying on Whisper to infer them:
whisper "/absolute/path/video.mp4" \
--model turbo \
--language zh \
--word_timestamps True \
--output_format json \
--initial_prompt "本视频可能出现这些词:Cursor、Kimi 2.5、马斯克、AI大模型、转推、套壳。" \
--output_dir "/absolute/path"
If Whisper fails on the video container, extract audio first:
ffmpeg -y -i "/absolute/path/video.mp4" -vn -ac 1 -ar 16000 "/absolute/path/video_audio.wav"
Then run Whisper on the WAV and keep the output naming clear.
Mandatory Transcript QA
After Whisper finishes, print or read the segment transcript before building captions. Treat this as a required gate, not a nice-to-have.
Check especially:
- product, model, company, person, and place names
- mixed Chinese/English terms such as Cursor, Kimi 2.5, ChatGPT, Claude, OpenAI
- numbers, dates, model versions, and acronyms
- Chinese homophones that are plausible but wrong in context, such as "却" versus "圈" or "死腿" versus "转推"
- low-confidence words if the Whisper JSON includes probabilities
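To make the read-through faster, the segment transcript can be printed with low-probability words flagged. This is a rough sketch, assuming the Whisper JSON was produced with word timestamps so each segment carries a words list with word and probability fields; the 0.6 threshold and the function name are assumptions for illustration only.

```python
import json


def review_lines(transcript: dict, min_probability: float = 0.6) -> list[str]:
    """Return one printable line per segment, flagging low-probability words."""
    lines = []
    for seg in transcript.get("segments", []):
        flagged = [
            f'{w["word"].strip()}({w["probability"]:.2f})'
            for w in seg.get("words", [])
            # Words without a probability are treated as confident.
            if w.get("probability", 1.0) < min_probability
        ]
        line = f'[{seg["start"]:7.2f}-{seg["end"]:7.2f}] {seg["text"].strip()}'
        if flagged:
            line += "  <- check: " + ", ".join(flagged)
        lines.append(line)
    return lines


def print_review(path: str) -> None:
    """Load a Whisper JSON file and print the review lines."""
    with open(path, encoding="utf-8") as f:
        print("\n".join(review_lines(json.load(f))))
```

A flagged word is only a hint. Every segment still needs to be read against the glossary, because confident homophone errors are not flagged.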
Create a short correction map and apply it before rendering. Use --replace for single-token fixes and --replace-phrase for words split across adjacent Whisper tokens.
Example:
python3 scripts/whisper_json_to_captions.py \
"/absolute/path/transcript.json" \
"/absolute/path/remotion-project/public/captions.json" \
--replace-phrase "科舍=Cursor" \
--replace-phrase "KMI 2.5=Kimi 2.5" \
--replace-phrase "Kimi 2.5=Kimi 2.5" \
--replace-phrase "AI却=AI圈" \
--replace-phrase "死腿=转推" \
--merge-term "Cursor" \
--merge-term "Kimi 2.5" \
--keyword "AI" \
--keyword "Kimi 2.5"
If a correction is uncertain, prefer rerunning Whisper with a better --initial_prompt or inspecting the relevant audio/video moment before deciding. Report the correction map in the final response.
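To illustrate why a phrase-level replacement differs from a single-token one, here is a hypothetical sketch of merging adjacent tokens whose joined text equals the misrecognized phrase. It is not the helper script's actual implementation; the function name and token shape (text, startMs, endMs) are assumptions, and it only matches phrases that line up with token boundaries.

```python
def replace_phrase(tokens, wrong, right):
    """Merge adjacent tokens that spell `wrong` into one token with text `right`.

    The merged token keeps the first token's startMs and the last token's endMs,
    so the highlight covers the whole corrected term.
    """
    if not wrong:
        return list(tokens)
    out, i = [], 0
    while i < len(tokens):
        joined, j = "", i
        # Extend the window until it is at least as long as the target phrase.
        while j < len(tokens) and len(joined) < len(wrong):
            joined += tokens[j]["text"]
            j += 1
        if joined == wrong:
            out.append({
                "text": right,
                "startMs": tokens[i]["startMs"],
                "endMs": tokens[j - 1]["endMs"],
            })
            i = j
        else:
            out.append(tokens[i])
            i += 1
    return out
```

Under this sketch, a same-text pair such as "Kimi 2.5=Kimi 2.5" changes no characters but still merges the split tokens into one display term.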
Caption JSON
Remotion should load public/captions.json with this shape:
[
  {
    "text": "我们用手机随便拍张照片",
    "startMs": 0,
    "endMs": 1600,
    "tokens": [
      { "text": "我们", "startMs": 0, "endMs": 300, "keyword": false },
      { "text": "手机", "startMs": 440, "endMs": 700, "keyword": true }
    ]
  }
]
Use the helper script:
python3 scripts/whisper_json_to_captions.py \
"/absolute/path/transcript.json" \
"/absolute/path/remotion-project/public/captions.json" \
--keyword "提示词" \
--keyword "Codex" \
--replace-phrase "错识别词=正确词" \
--max-caption-chars 28 \
--max-caption-duration-ms 4200 \
--split-gap-ms 260 \
--min-punctuation-caption-ms 900
Run the transcript QA pass before this conversion command. If any correction changes adjacent tokens into one display term, also pass that final term with --merge-term or a same-text --replace-phrase so the highlight appears as a clean word instead of broken characters.
The helper splits long Whisper segments by visible length, duration, punctuation, and word timing gaps. Keep this behavior on for short-video subtitles; otherwise a single Whisper paragraph can become an unreadable multi-line caption.
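The splitting behavior described above can be pictured with a simplified sketch. This is an illustration of the idea only, not the helper's real code: the function name, the punctuation set, and the exact rule order are assumptions, and the defaults simply mirror the flag values in the example command. It assumes Chinese text where tokens are joined without spaces.

```python
PUNCTUATION = "，。！？；、,.!?;"


def split_tokens(tokens, max_chars=28, max_duration_ms=4200,
                 split_gap_ms=260, min_punct_ms=900):
    """Group word tokens (dicts with text, startMs, endMs) into caption chunks."""
    captions, current = [], []

    def flush():
        # Close the running chunk and emit it as one caption.
        if current:
            captions.append({
                "text": "".join(t["text"] for t in current),
                "startMs": current[0]["startMs"],
                "endMs": current[-1]["endMs"],
                "tokens": list(current),
            })
            current.clear()

    for tok in tokens:
        if current:
            gap = tok["startMs"] - current[-1]["endMs"]
            chars = sum(len(t["text"]) for t in current) + len(tok["text"])
            duration = tok["endMs"] - current[0]["startMs"]
            # Start a new caption on a pause, or when adding this token
            # would exceed the length or duration limit.
            if gap >= split_gap_ms or chars > max_chars or duration > max_duration_ms:
                flush()
        current.append(tok)
        # Break after punctuation once the chunk has been on screen long enough.
        elapsed = current[-1]["endMs"] - current[0]["startMs"]
        if tok["text"] and tok["text"][-1] in PUNCTUATION and elapsed >= min_punct_ms:
            flush()
    flush()
    return captions
```

The minimum-duration guard on punctuation breaks keeps very short fragments from flashing by as separate captions.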
Remotion Caption Layer Requirements
The caption layer should:
- Use OffthreadVideo for the source video.
- Load captions.json with delayRender, continueRender, and staticFile.
- Find the active caption by currentMs >= startMs && currentMs < endMs.
- Highlight the active token when currentMs is within the token's timing.
- Use keyword coloring only as a secondary accent; the current spoken token is the main effect.
- Keep letterSpacing: 0.
- Keep all captions at the normal position; do not include special screenshot-sentence placement unless the user explicitly asks.
- Do not use WebkitTextStroke as a thick Chinese subtitle outline. It easily eats the white fill at small resolutions. Prefer multi-direction textShadow; if WebkitTextStroke is used at all, keep it at or below 1.5px and verify a still.
- Do not use a large semi-transparent rounded black rectangle behind the whole caption by default. If the footage truly needs backing, use a very subtle per-line backing, opacity <= 0.16, tight padding, and verify it does not look like a dark banner.
Reject and revise any still where the caption has muddy/gray text, a thick black halo, a large black box, clipped words, awkward wrapping, or placement over the mouth/chin in a talking-head video.
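When choosing which frames to render as stills, it helps to know which caption and token will be active at a given frame. The sketch below mirrors the timing predicate from the list above in Python for offline checks against captions.json; the Remotion component itself would be TypeScript. It assumes currentMs is derived as frame / fps * 1000 and that tokens use the same half-open interval as captions; both are assumptions, as is the function name.

```python
def active_at(captions, frame, fps):
    """Return (caption, token) active at the given frame, or None for either."""
    current_ms = frame / fps * 1000
    caption = next(
        (c for c in captions if c["startMs"] <= current_ms < c["endMs"]),
        None,
    )
    if caption is None:
        return None, None
    # Between two tokens no token is active, so nothing is highlighted.
    token = next(
        (t for t in caption["tokens"] if t["startMs"] <= current_ms < t["endMs"]),
        None,
    )
    return caption, token
```

With the sample caption from the Caption JSON section at 30 fps, frame 15 (500 ms) falls inside the "手机" token, while frame 10 (about 333 ms) falls in the gap between tokens and highlights nothing.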
Use these style constants as the baseline:
const captionBottom = Math.round(height * 0.28);
const captionFontSize = Math.round(height * 0.032);
const captionMaxWidth = Math.round(width * 0.88);
const activeColor = "#FFE45C";
const keywordColor = "#D6FFF8";
const outlinePx = Math.min(3, Math.max(1.5, captionFontSize * 0.055));
Use a clean shadow outline rather than a heavy stroke:
const textShadow = [
`${outlinePx}px 0 0 rgba(0, 0, 0, 0.96)`,
`-${outlinePx}px 0 0 rgba(0, 0, 0, 0.96)`,
`0 ${outlinePx}px 0 rgba(0, 0, 0, 0.96)`,
`0 -${outlinePx}px 0 rgba(0, 0, 0, 0.96)`,
`0 ${outlinePx * 1.4}px ${outlinePx * 1.4}px rgba(0, 0, 0, 0.72)`,
].join(", ");
For a 720x1280 vertical video, this is roughly paddingBottom: 358, fontSize: 41, maxWidth: 634, and a 2px outline. For the original 2160x2974 source that inspired this skill, the equivalent bottom padding was about 830.
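The figures above can be reproduced for any source size before touching the Remotion project. This is a small sketch that mirrors the baseline constants in Python; the function names are illustrative, and js_round imitates JavaScript's Math.round (round half up) because Python's built-in round uses banker's rounding.

```python
import math


def js_round(x: float) -> int:
    """Round half up, as Math.round does for positive values."""
    return math.floor(x + 0.5)


def caption_style(width: int, height: int) -> dict:
    """Compute the baseline caption constants for a given frame size."""
    font_size = js_round(height * 0.032)
    return {
        "captionBottom": js_round(height * 0.28),
        "captionFontSize": font_size,
        "captionMaxWidth": js_round(width * 0.88),
        # Clamp the outline between 1.5px and 3px.
        "outlinePx": min(3, max(1.5, font_size * 0.055)),
    }
```

For 720x1280 this yields captionBottom 358, captionFontSize 41, captionMaxWidth 634, and an outline of about 2.3px. For 2160x2974 the bottom padding comes out at 833 and the outline hits the 3px cap.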
Keyword Handling
If the user does not specify keywords, infer a small set from the transcript:
- product/tool names: Codex, Remotion, Whisper
- content objects: 提示词, 照片, 手机, 封面
- place names or nouns that carry the video's hook
- numbers or outcomes that are important to the promise
Keep keyword highlighting sparse. Too many colored words make the active yellow word less clear.
Output Naming
Prefer:
<source-stem>_remotion逐词高亮字幕.mp4
If iterating for the same source, add _v2, _v3, etc. Do not overwrite the user's source file.
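The naming rule can be applied mechanically. The following is a minimal sketch; the function name is hypothetical, and it assumes that "iterating" means an output with the preferred name already exists next to the source.

```python
from pathlib import Path

SUFFIX = "_remotion逐词高亮字幕"


def next_output_path(source: str) -> Path:
    """Return the first unused output path next to the source video."""
    src = Path(source)
    candidate = src.with_name(f"{src.stem}{SUFFIX}.mp4")
    version = 2
    # Never reuse an existing name; step through _v2, _v3, and so on.
    while candidate.exists():
        candidate = src.with_name(f"{src.stem}{SUFFIX}_v{version}.mp4")
        version += 1
    return candidate
```

Because the suffix is always appended to the stem, the result can never collide with the source file itself.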
Final Response
Tell the user the output path, mention that it used the reusable Remotion + Whisper word timestamp flow, and include the transcript correction map or say no corrections were needed. If any verification step could not be run, say that plainly.