HN Podcast Transcriber
Fetch new episodes from the Hacker News Morning Brief podcast RSS feed, transcribe with Whisper, and archive as searchable markdown.
Prerequisites
- whisper CLI installed (pip install openai-whisper)
- ffmpeg on PATH (required by whisper; download from https://ffmpeg.org)
- python3 with standard library (no extra deps for the fetch script)
- Disk space for audio files (~5-10 MB per episode)
Quick Start
Run the main script to fetch and transcribe all new episodes:
bash scripts/fetch_and_transcribe.sh --archive ~/hn-podcast-archive
First run processes all episodes. Subsequent runs only process new ones (tracked via state.json).
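The incremental behaviour can be sketched as follows. This is a minimal illustration, not the script's actual code: the real state.json schema is described in references/archive-layout.md, and the `processed` key and the helper names here are assumptions made for the example.

```python
import json
from pathlib import Path


def load_processed(state_path: Path) -> set[str]:
    """Return the episode IDs already recorded, or an empty set on first run."""
    if not state_path.exists():
        return set()
    data = json.loads(state_path.read_text(encoding="utf-8"))
    # "processed" is an assumed key name, not the documented schema.
    return set(data.get("processed", []))


def save_processed(state_path: Path, processed: set[str]) -> None:
    """Write to a temp file, then rename, so an interrupted run leaves the old file intact."""
    tmp = state_path.with_suffix(".tmp")
    tmp.write_text(json.dumps({"processed": sorted(processed)}, indent=2), encoding="utf-8")
    tmp.replace(state_path)


def select_new(episode_ids: list[str], processed: set[str]) -> list[str]:
    """Keep feed order, dropping anything already processed."""
    return [eid for eid in episode_ids if eid not in processed]
```

On a first run `load_processed` returns an empty set, so `select_new` passes every episode through. Later runs only see IDs that are missing from the file.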
Options
| Flag | Default | Description |
|---|---|---|
| --feed URL | HN Morning Brief RSS | Podcast RSS feed URL |
| --archive DIR | ./hn-podcast-archive | Archive root directory |
| --model MODEL | turbo | Whisper model (tiny/base/small/medium/large/turbo) |
| --limit N | 0 (all) | Max new episodes to process per run |
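For readers who want the same interface from Python, the flags above map naturally onto argparse. This is a sketch only: the shipped entry point is a bash script, and `DEFAULT_FEED` is a placeholder because the real default feed URL is not stated here.

```python
import argparse

# Placeholder: the real default feed URL is not given in this document.
DEFAULT_FEED = "https://example.com/hn-morning-brief/feed.xml"
MODELS = ["tiny", "base", "small", "medium", "large", "turbo"]


def build_parser() -> argparse.ArgumentParser:
    """Mirror the documented flags and defaults."""
    parser = argparse.ArgumentParser(description="Fetch and transcribe podcast episodes")
    parser.add_argument("--feed", default=DEFAULT_FEED, metavar="URL",
                        help="Podcast RSS feed URL")
    parser.add_argument("--archive", default="./hn-podcast-archive", metavar="DIR",
                        help="Archive root directory")
    parser.add_argument("--model", default="turbo", choices=MODELS,
                        help="Whisper model")
    parser.add_argument("--limit", type=int, default=0, metavar="N",
                        help="Max new episodes to process per run (0 = all)")
    return parser


def apply_limit(episodes: list, limit: int) -> list:
    """A limit of 0 means no cap, matching the '0 (all)' default."""
    return episodes if limit <= 0 else episodes[:limit]
```

Treating 0 as "no cap" keeps the default run unbounded while letting `--limit 3` bound a long first run.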
Custom Feeds
Point at any podcast RSS feed:
bash scripts/fetch_and_transcribe.sh --feed "https://example.com/podcast/feed.xml" --archive ./my-podcast-archive
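Any feed works as long as its `<item>` entries carry an audio `<enclosure>`. The sketch below shows one way to extract what the pipeline needs using only the standard library. The function name and the returned dictionary keys are illustrative, and falling back to the audio URL when `<guid>` is missing is an assumption rather than the script's documented behaviour.

```python
import xml.etree.ElementTree as ET


def parse_episodes(rss_xml: str) -> list[dict]:
    """Extract id, title, date and audio URL from each <item> that has an enclosure."""
    root = ET.fromstring(rss_xml)
    episodes = []
    for item in root.iter("item"):
        enclosure = item.find("enclosure")
        if enclosure is None or not enclosure.get("url"):
            continue  # no audio attached, so nothing to transcribe
        audio_url = enclosure.get("url")
        episodes.append({
            # Assumed fallback: use the audio URL as the ID when <guid> is absent or empty.
            "id": (item.findtext("guid") or audio_url).strip(),
            "title": (item.findtext("title") or "Untitled").strip(),
            "date": (item.findtext("pubDate") or "").strip(),
            "audio_url": audio_url,
        })
    return episodes
```

Items without an enclosure (for example, text-only announcements) are skipped rather than treated as errors.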
Scheduling
Set up an OpenClaw cron job for daily checks:
- Create an isolated cron job that runs the script
- Or add a heartbeat check in HEARTBEAT.md
Archive Structure
See references/archive-layout.md for directory layout and state.json schema.
Workflow Summary
- Download RSS feed → parse <item> entries
- Skip already-processed episodes (state.json lookup)
- Download audio (mp3/m4a) to episode directory
- Run whisper to produce .txt transcript
- Generate cleaned transcript.md with title + date header
- Update state.json with processed episode ID
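The transcription and cleanup steps above could look roughly like this in Python. It is a hedged sketch: the whisper flags (`--model`, `--output_format`, `--output_dir`) are real CLI options, but the helper names, the header layout of transcript.md, and the "strip blank lines" cleanup rule are assumptions made for illustration.

```python
from pathlib import Path


def whisper_command(audio: Path, model: str = "turbo") -> list[str]:
    """Build a whisper CLI call that writes <audio stem>.txt next to the audio file."""
    return [
        "whisper", str(audio),
        "--model", model,
        "--output_format", "txt",
        "--output_dir", str(audio.parent),
    ]


def render_markdown(title: str, date: str, raw_text: str) -> str:
    """Wrap a raw transcript in a title + date header, dropping blank lines."""
    lines = [line.strip() for line in raw_text.splitlines() if line.strip()]
    # Assumed header layout; the actual transcript.md format may differ.
    return f"# {title}\n\nDate: {date}\n\n" + "\n".join(lines) + "\n"


# Typical use (requires whisper and ffmpeg to be installed):
#   import subprocess
#   subprocess.run(whisper_command(Path("ep1/audio.mp3")), check=True)
#   raw = Path("ep1/audio.txt").read_text(encoding="utf-8")
#   Path("ep1/transcript.md").write_text(render_markdown("Ep 1", "2024-01-01", raw), encoding="utf-8")
```

Passing the command as a list avoids shell quoting problems when episode paths contain spaces.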
Notes
- Whisper models cache to ~/.cache/whisper after first download
- Use --model tiny for speed, --model large for best accuracy
- Average episode (~6 min) takes ~1-2 min with turbo model on CPU
- For GPU acceleration, install a CUDA-enabled PyTorch build; whisper runs on the GPU when PyTorch detects one (ffmpeg is only used to decode audio)
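Whisper runs on PyTorch, so whether a GPU will be used can be checked before starting a long run. The helper below is a small sketch with a hypothetical name; it falls back to CPU when PyTorch is missing or reports no CUDA device. The result can be passed to the whisper CLI's `--device` option.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch  # installed as a dependency of openai-whisper
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(f"whisper would run on: {pick_device()}")
```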