YouTube Transcript

Overview

Extract YouTube video transcripts, metadata, and chapters using yt-dlp. The output is formatted as Markdown with YAML frontmatter and saved to ~/Brains/brain/ (an Obsidian vault).

Quick Start

To extract a transcript from a YouTube video:

python scripts/extract_transcript.py <youtube_url>

Optionally, specify a custom output filename as a second argument:

python scripts/extract_transcript.py <youtube_url> custom_filename.md

Output Format

YAML Frontmatter

The generated Markdown begins with a YAML frontmatter block containing the following fields:

title - Video title
channel - Channel name
url - YouTube URL
upload_date - Upload date (YYYY-MM-DD)
duration - Video duration (HH:MM:SS)
description - Video description (truncated to 500 chars)
tags - Array of video tags
view_count - View count
like_count - Like count
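
As an illustration of how these fields could be derived, here is a minimal sketch that maps a yt-dlp info dictionary to the frontmatter values listed above. It assumes the script reads the standard yt-dlp fields (`upload_date` as `YYYYMMDD`, `duration` in seconds, `webpage_url` for the URL); the function names `format_duration`, `format_upload_date`, and `build_metadata` are hypothetical and may not match what `extract_transcript.py` actually does.

```python
from datetime import datetime


def format_duration(seconds):
    """Convert a duration in seconds to HH:MM:SS."""
    seconds = int(seconds or 0)
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def format_upload_date(raw):
    """Convert a YYYYMMDD string (as yt-dlp reports it) to YYYY-MM-DD."""
    return datetime.strptime(raw, "%Y%m%d").strftime("%Y-%m-%d")


def build_metadata(info, max_description=500):
    """Build the frontmatter fields from a yt-dlp info dictionary.

    The field mapping is an assumption made for this sketch.
    """
    description = (info.get("description") or "")[:max_description]
    return {
        "title": info.get("title"),
        "channel": info.get("channel"),
        "url": info.get("webpage_url"),
        "upload_date": format_upload_date(info["upload_date"]),
        "duration": format_duration(info.get("duration")),
        "description": description,
        "tags": info.get("tags") or [],
        "view_count": info.get("view_count"),
        "like_count": info.get("like_count"),
    }
```

The resulting dictionary can then be serialized to YAML with whichever writer the script uses.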

Body Structure

The transcript is organized by video chapters, if available:

## Chapter Title

**00:05:23** Transcript text for this segment.

**00:05:45** Next segment text.

If no chapters exist, all content appears under a single "## Transcript" heading.

Timestamps are formatted as HH:MM:SS for consistency.
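
The body layout described above can be sketched as follows. This is an illustrative example, not the script's actual code: it assumes segments arrive as `(start_seconds, text)` pairs and chapters as dictionaries with `start_time` and `title` keys (the shape yt-dlp uses for chapters). The names `format_timestamp` and `render_body` are hypothetical.

```python
def format_timestamp(seconds):
    """Format a position in seconds as HH:MM:SS."""
    seconds = int(seconds)
    return f"{seconds // 3600:02d}:{seconds % 3600 // 60:02d}:{seconds % 60:02d}"


def render_body(segments, chapters=None):
    """Render transcript segments as Markdown, grouped under chapter headings."""
    if not chapters:
        # No chapters: everything goes under one "Transcript" heading.
        chapters = [{"start_time": 0, "title": "Transcript"}]
    chapters = sorted(chapters, key=lambda c: c["start_time"])

    grouped = [[] for _ in chapters]
    for start, text in segments:
        # Pick the last chapter that begins at or before this segment.
        # Segments before the first chapter fall into the first chapter.
        index = 0
        for i, chapter in enumerate(chapters):
            if chapter["start_time"] <= start:
                index = i
        grouped[index].append(f"**{format_timestamp(start)}** {text}")

    blocks = []
    for chapter, entries in zip(chapters, grouped):
        if entries:  # skip chapters that have no transcript text
            blocks.append("\n\n".join([f"## {chapter['title']}"] + entries))
    return "\n\n".join(blocks)
```

Skipping empty chapters is a design choice made for this sketch; the script may instead emit a heading for every chapter.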

Workflow

Extract metadata and subtitles using yt-dlp
Parse VTT subtitle format to extract timestamps and text
Group transcript segments by video chapters (if present)
Format as Markdown with YAML frontmatter
Save to ~/Brains/brain/ with sanitized filename based on video title
Clean up temporary subtitle files
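
The VTT parsing step in the workflow above might look roughly like the sketch below. It is a simplified illustration under stated assumptions: cues are separated by blank lines, inline tags such as `<c>` are stripped, and only the start time of each cue is kept. The name `parse_vtt` is hypothetical, and the real script may handle more edge cases.

```python
import re

# Matches the start time of a cue timing line, e.g. "00:05:23.000 --> ...".
# The hours component is optional in WebVTT.
CUE_TIMING = re.compile(r"(?:(\d+):)?(\d{2}):(\d{2})\.\d{3}\s+-->")
# Matches inline markup such as <c> or <00:05:23.500>.
TAG = re.compile(r"<[^>]+>")


def parse_vtt(text):
    """Return a list of (start_seconds, text) tuples from VTT content."""
    segments = []
    start, buffer = None, []
    # The trailing empty string flushes the final cue.
    for raw in text.splitlines() + [""]:
        line = raw.strip()
        match = CUE_TIMING.match(line)
        if match or not line:
            # A new timing line or a blank line ends the current cue.
            if start is not None and buffer:
                segments.append((start, " ".join(buffer)))
            start, buffer = None, []
            if match:
                hours, minutes, secs = match.groups()
                start = int(hours or 0) * 3600 + int(minutes) * 60 + int(secs)
        elif start is not None:
            # Cue payload: drop inline tags and keep the visible text.
            cleaned = TAG.sub("", line).strip()
            if cleaned:
                buffer.append(cleaned)
    return segments
```

Header lines such as `WEBVTT` and cue identifiers are ignored because they appear before any timing line has been seen.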

Deduplication

To remove duplicates from existing transcript files:

python scripts/deduplicate_transcript.py <markdown_file>

This removes transcript entries that are prefixes of subsequent entries (common in VTT files where subtitles accumulate).
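
A minimal sketch of the prefix rule described above, assuming each entry is compared only with the entry that immediately follows it (the actual script may look further ahead or normalize whitespace first). The name `drop_prefix_entries` is hypothetical.

```python
def drop_prefix_entries(entries):
    """Drop every entry that is a prefix of the entry right after it."""
    kept = []
    for index, text in enumerate(entries):
        following = entries[index + 1] if index + 1 < len(entries) else None
        if following is not None and following.startswith(text):
            # The next entry already contains this text, so skip it.
            continue
        kept.append(text)
    return kept
```

With accumulating subtitles, only the longest version of each run survives; exact consecutive duplicates collapse to a single entry as well.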

Requirements

Ensure yt-dlp is installed:

pip install yt-dlp

Limitations

Tries English subtitles first and falls back to Russian if English is unavailable
Requires the video to have subtitles (auto-generated or manual)
Does not download video or audio files
Description truncated to 500 characters in frontmatter
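
The English-then-Russian fallback noted in the limitations could be expressed as in the sketch below. It assumes the yt-dlp info dictionary exposes `subtitles` (manual) and `automatic_captions` (auto-generated) keyed by language code, and that manual subtitles are preferred over auto-generated ones within a language; that preference is an assumption of this example, not something the source states. The name `pick_subtitle_language` is hypothetical.

```python
def pick_subtitle_language(info, preferred=("en", "ru")):
    """Return (language, kind) for the first available preferred language.

    Returns None when the video has no subtitles in any preferred language.
    """
    manual = info.get("subtitles") or {}
    automatic = info.get("automatic_captions") or {}
    for lang in preferred:
        if lang in manual:
            return lang, "manual"
        if lang in automatic:
            return lang, "auto"
    return None
```

A `None` result corresponds to the case where the video cannot be processed because no usable subtitles exist.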