# HN Podcast Transcribe & Archive

Automatically download, transcribe, and archive Hacker News podcast episodes into a searchable local archive.
## Default Podcast

Hacker News Recap by Wondercraft.ai, a daily AI-generated recap of top HN posts.

- RSS: https://rss.buzzsprout.com/2170103.rss

Override with the `HN_PODCAST_RSS` env var to point at any other podcast RSS feed.
## Workflow

### 1. Fetch new episodes

    python3 scripts/fetch_episodes.py [--rss URL] [--archive DIR] [--limit N] [--no-download]

- Parses the podcast RSS feed
- Compares against the existing archive to skip already-processed episodes
- Downloads audio (mp3/m4a/wav) for each new episode
- Saves metadata as JSON alongside the audio
- Default archive: `./hn-podcast-archive/`
- `--no-download`: save metadata only, skip the audio download
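The fetch step can be sketched roughly as follows. This is a minimal illustration, not the actual `fetch_episodes.py`: the `parse_feed` and `save_new_episodes` helpers and the slug scheme are hypothetical (the real script also prefixes episode directories with the publish date, as shown in the archive structure below).

```python
import json
import xml.etree.ElementTree as ET
from pathlib import Path

def parse_feed(rss_xml: str) -> list[dict]:
    """Extract title, publish date, and audio URL from each RSS <item>."""
    episodes = []
    for item in ET.fromstring(rss_xml).iter("item"):
        enclosure = item.find("enclosure")  # the audio attachment
        episodes.append({
            "title": item.findtext("title"),
            "pub_date": item.findtext("pubDate"),
            "audio_url": enclosure.get("url") if enclosure is not None else None,
        })
    return episodes

def save_new_episodes(episodes: list[dict], archive: Path) -> list[Path]:
    """Create one directory per unseen episode and write its metadata."""
    created = []
    for ep in episodes:
        # Hypothetical slug: lowercase the title, replace non-alphanumerics.
        slug = "".join(c if c.isalnum() else "-" for c in ep["title"].lower()).strip("-")
        ep_dir = archive / slug
        if ep_dir.exists():  # already processed: skip
            continue
        ep_dir.mkdir(parents=True)
        (ep_dir / "episode.json").write_text(json.dumps(ep, indent=2))
        created.append(ep_dir)
    return created
```

Re-running is idempotent: episodes whose directory already exists are skipped.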
Download strategies (tried in order):

1. Direct HTTP download (works for most podcast CDNs)
2. `yt-dlp` fallback (handles some Cloudflare-protected hosts)

If both fail, the episode directory is created with metadata only; place the audio file there manually.
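The fallback chain amounts to trying each strategy in order and stopping at the first success. A sketch under that assumption (the function name and callable-based design are illustrative, not the script's actual internals):

```python
from typing import Callable

def download_with_fallback(url: str, dest: str,
                           strategies: list[Callable[[str, str], bool]]) -> bool:
    """Try each download strategy in order; stop at the first success.

    Each strategy takes (url, dest) and returns True on success.
    Returns False if every strategy fails, leaving the caller to
    create the episode directory with metadata only.
    """
    for strategy in strategies:
        try:
            if strategy(url, dest):
                return True
        except Exception:
            continue  # e.g. HTTP 403 from a Cloudflare-protected host
    return False
```

In practice the first strategy would wrap `requests.get` and the second a `yt-dlp` subprocess call.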
Cloudflare note: some hosts (e.g. Buzzsprout) block automated downloads. If the direct download fails:

1. Use `--no-download` to create the directory structure
2. Download the audio manually via a browser or podcast app
3. Place the file as `audio.mp3` in the episode directory
4. Re-run the transcribe step
### 2. Transcribe audio

    python3 scripts/transcribe_episodes.py [--archive DIR] [--model MODEL] [--format FORMAT]

- Finds episodes with audio but no transcript
- Runs Whisper locally (no API key needed)
- Output formats: `txt`, `srt`, `vtt`, or `json` (default: `txt`)
- Default model: `turbo` (fast, good accuracy)
- Supported audio formats: mp3, m4a, wav, ogg, flac
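The "audio but no transcript" selection can be sketched as a directory scan; this is a hypothetical helper, not the script's actual code, and it assumes transcripts are written as `transcript.<format>` next to the audio:

```python
from pathlib import Path

# Audio extensions the transcribe step accepts (from the list above).
AUDIO_EXTS = {".mp3", ".m4a", ".wav", ".ogg", ".flac"}

def episodes_needing_transcription(archive: Path) -> list[Path]:
    """Return episode directories that have audio but no transcript yet."""
    pending = []
    for ep_dir in sorted(p for p in archive.iterdir() if p.is_dir()):
        has_audio = any(f.suffix in AUDIO_EXTS for f in ep_dir.iterdir())
        has_transcript = any(ep_dir.glob("transcript.*"))
        if has_audio and not has_transcript:
            pending.append(ep_dir)
    return pending
```

Each pending episode is then fed to Whisper's Python API, along the lines of `whisper.load_model("turbo").transcribe(str(audio_path))`.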
### 3. Generate archive index

    python3 scripts/build_index.py [--archive DIR]

- Creates `archive_index.json` with all episodes, dates, titles, and transcript paths
- Enables fast search across the archive
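The indexing step boils down to collecting every `episode.json` into one file. A minimal sketch, assuming the metadata keys `title` and `pub_date` and a `txt` transcript (the real `build_index.py` may record more fields):

```python
import json
from pathlib import Path

def build_index(archive: Path) -> dict:
    """Collect every episode.json into a single archive_index.json."""
    entries = []
    for meta_path in sorted(archive.glob("*/episode.json")):
        meta = json.loads(meta_path.read_text())
        transcript = meta_path.parent / "transcript.txt"
        entries.append({
            "dir": meta_path.parent.name,
            "title": meta.get("title"),
            "pub_date": meta.get("pub_date"),
            # Record the path only if the episode is actually transcribed.
            "transcript": str(transcript) if transcript.exists() else None,
        })
    index = {"episodes": entries}
    (archive / "archive_index.json").write_text(json.dumps(index, indent=2))
    return index
```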
### 4. Search the archive

    python3 scripts/search_archive.py [--archive DIR] "search query"

- Full-text search across all transcribed episodes
- Returns matching episodes with context snippets
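The search step can be approximated by a case-insensitive scan over the transcripts, returning a snippet of surrounding text for each hit (an illustrative sketch; `search_archive.py` likely returns more matches per file and richer metadata):

```python
from pathlib import Path

def search_archive(archive: Path, query: str, context: int = 40) -> list[dict]:
    """Case-insensitive full-text search with a context snippet per hit."""
    hits = []
    q = query.lower()
    for transcript in sorted(archive.glob("*/transcript.txt")):
        text = transcript.read_text()
        pos = text.lower().find(q)
        if pos == -1:
            continue  # no match in this episode
        # Slice `context` characters on each side of the match.
        start, end = max(0, pos - context), pos + len(query) + context
        hits.append({"episode": transcript.parent.name,
                     "snippet": text[start:end].strip()})
    return hits
```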
## One-shot: Full Pipeline

    python3 scripts/pipeline.py [--rss URL] [--archive DIR] [--model MODEL] [--limit N]

Runs fetch → transcribe → index in sequence.
## Cron Integration

Set up periodic processing with OpenClaw cron:

    # Daily at 6am — process new HN Recap episodes
    cron add --name "hn-podcast-digest" --schedule "0 6 * * *" --payload '{"kind":"agentTurn","message":"Run the HN podcast transcription pipeline: python3 scripts/pipeline.py --limit 3"}'
## Archive Structure

    hn-podcast-archive/
    ├── archive_index.json
    ├── 2026-05-10_hardware-attestation-as-monopoly-enabler/
    │   ├── episode.json
    │   ├── audio.mp3
    │   └── transcript.txt
    ├── 2026-05-09_a-recent-experience-with-chatgpt-5-5-pro/
    │   ├── episode.json
    │   ├── audio.mp3
    │   └── transcript.txt
    └── ...
## Configuration

| Env Var | Default | Description |
|---|---|---|
| `HN_PODCAST_RSS` | Buzzsprout HN Recap feed | Podcast RSS feed URL |
| `HN_ARCHIVE_DIR` | `./hn-podcast-archive` | Archive directory |
| `WHISPER_MODEL` | `turbo` | Whisper model name |
| `WHISPER_FORMAT` | `txt` | Transcript output format |
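Resolving these settings is a straightforward environment lookup with fallbacks. A sketch of how the scripts might read them (the `get_config` helper is hypothetical; the defaults mirror the table above):

```python
import os

# Defaults from the configuration table; env vars override them.
DEFAULTS = {
    "HN_PODCAST_RSS": "https://rss.buzzsprout.com/2170103.rss",
    "HN_ARCHIVE_DIR": "./hn-podcast-archive",
    "WHISPER_MODEL": "turbo",
    "WHISPER_FORMAT": "txt",
}

def get_config() -> dict:
    """Resolve each setting from the environment, falling back to defaults."""
    return {key: os.environ.get(key, default) for key, default in DEFAULTS.items()}
```

Command-line flags such as `--rss` and `--model` would take precedence over both.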
## Requirements

- Python 3.10+
- `openai-whisper` (`pip install openai-whisper`)
- `requests` (`pip install requests`)
- `static-ffmpeg` (`pip install static-ffmpeg`): provides ffmpeg automatically
- `yt-dlp` (optional, for fallback downloads)
- Whisper models auto-download to `~/.cache/whisper` on first use