Web to Markdown
Overview
Captures web pages using headless Playwright browser automation (handles JavaScript-rendered content), converts HTML to clean markdown via Turndown library, and saves all URLs from a single request into one timestamped file.
Key features:
-
Headless browser (no visible window)
-
Handles JavaScript-rendered content (SPAs, React, Vue, etc.)
-
Batch processing multiple URLs → single file
-
Self-contained (no MCP dependency)
Prerequisites
-
Node.js/pnpm for running TypeScript scripts
-
Dependencies installed: cd skills/web-to-markdown && pnpm install
-
Playwright browsers will auto-install on first run
Usage
When user provides URLs and asks to:
-
"capture these pages as markdown"
-
"save web content for documentation"
-
"fetch and convert these webpages"
-
"get markdown from these sites"
-
"download and convert to markdown"
Output: docs/web-captures/YYYYMMDD_HHMMSS.md containing all pages.
Workflow
Single Command
cd skills/web-to-markdown pnpm tsx scripts/scrape-and-convert.ts <url1> [url2] [url3] ...
That's it! Script handles:
-
Creates timestamped output file with header
-
Launches headless Playwright browser
-
Scrapes each URL sequentially
-
Converts HTML → markdown (Turndown)
-
Appends to output file with formatted headers
-
Closes browser and reports summary
Examples
Single URL:
cd skills/web-to-markdown pnpm tsx scripts/scrape-and-convert.ts https://example.com/docs
Output: docs/web-captures/20251103_143052.md
Multiple URLs (Batch):
cd skills/web-to-markdown
pnpm tsx scripts/scrape-and-convert.ts
https://example.com/guide
https://example.com/api
https://example.com/faq
Output: docs/web-captures/20251103_143052.md (all 3 pages)
From project root:
pnpm --filter @skills/web-to-markdown tsx scripts/scrape-and-convert.ts <urls...>
Output Format
Web Captures - YYYY-MM-DD HH:MM:SS
Generated: YYYYMMDD_HHMMSS URLs: N
📄 https://example.com/page1
[Converted markdown content...]
📄 https://example.com/page2
[Converted markdown content...]
Implementation Notes
TypeScript + FP Patterns:
-
Pure functions (no classes except custom errors)
-
Explicit error handling with typed errors: BrowserError , FileError , HtmlConversionError
-
Small, focused functions
-
Side effects isolated at edges
-
CLI-style logging with chalk/ora
File Structure:
skills/web-to-markdown/ ├── SKILL.md # This file (workflow instructions) ├── package.json # pnpm workspace config ├── tsconfig.json # TypeScript config ├── scripts/ │ ├── scrape-and-convert.ts # Main CLI (Playwright + Turndown) │ ├── html-to-markdown.ts # Pure conversion function (Turndown wrapper) │ └── convert-and-append.ts # Legacy CLI (deprecated, kept for reference) └── tests/ └── html-to-markdown.test.ts # Unit tests
Error Handling
-
Browser launch failure: Check Playwright installation, run pnpm exec playwright install chromium
-
Page not found (404): Logs error, continues with other URLs
-
Timeout (>30s): Reports slow page, continues to next URL
-
Navigation error: Logs error, continues to next URL
-
Conversion failure: Reports malformed HTML, skips page
Configuration
Default timeout: 30 seconds per page
To customize, edit DEFAULT_CONFIG in scripts/scrape-and-convert.ts :
const DEFAULT_CONFIG = { outputDir: 'docs/web-captures', timeout: 30000, // milliseconds };
Performance
-
Headless mode (no GUI overhead)
-
Sequential processing (one URL at a time for stability)
-
Browser reuse across URLs (faster than launching per-page)
-
~2-5 seconds per page (depends on site complexity)
Comparison with Alternatives
Feature web-to-markdown scratchpad-fetch Jina AI Reader
Transport Playwright (headless) curl (HTTP) Cloud API
JavaScript ✅ Full rendering ❌ No ✅ Server-side
Conversion ✅ Turndown ❌ Raw HTML ✅ LLM-powered
Self-hosted ✅ Yes ✅ Yes ❌ Cloud only
Setup pnpm install None API key
Speed Medium (2-5s/page) Fast (<1s) Fast (~2s)
Visible browser ❌ No (headless) N/A N/A
Troubleshooting
"Executable doesn't exist" error:
cd skills/web-to-markdown pnpm exec playwright install chromium
Pages timing out:
-
Increase timeout in DEFAULT_CONFIG
-
Check network connectivity
-
Some sites may block automated browsers (use Jina AI Reader alternative)
Empty markdown output:
-
Site may use heavy client-side rendering
-
Try waiting longer (increase timeout)
-
Check if site blocks headless browsers (User-Agent detection)
Notes
-
One request = one file (all URLs aggregated)
-
Handles JavaScript-rendered content (React, Vue, Angular, etc.)
-
Headless by default (no visible browser window)
-
Browser auto-installs on first run
-
Ideal for documentation scraping and archival
-
No external API dependencies