Web to Markdown

Overview

Captures web pages using headless Playwright browser automation (handles JavaScript-rendered content), converts HTML to clean markdown via Turndown library, and saves all URLs from a single request into one timestamped file.

Key features:

Headless browser (no visible window)
Handles JavaScript-rendered content (SPAs, React, Vue, etc.)
Batch processing multiple URLs → single file
Self-contained (no MCP dependency)

Prerequisites

Node.js/pnpm for running TypeScript scripts
Dependencies installed: cd skills/web-to-markdown && pnpm install
Playwright browsers will auto-install on first run

Usage

When user provides URLs and asks to:

"capture these pages as markdown"
"save web content for documentation"
"fetch and convert these webpages"
"get markdown from these sites"
"download and convert to markdown"

Output: docs/web-captures/YYYYMMDD_HHMMSS.md containing all pages.

Workflow

Single Command

cd skills/web-to-markdown pnpm tsx scripts/scrape-and-convert.ts <url1> [url2] [url3] ...

That's it! Script handles:

Creates timestamped output file with header
Launches headless Playwright browser
Scrapes each URL sequentially
Converts HTML → markdown (Turndown)
Appends to output file with formatted headers
Closes browser and reports summary

Examples

Single URL:

cd skills/web-to-markdown pnpm tsx scripts/scrape-and-convert.ts https://example.com/docs

Output: docs/web-captures/20251103_143052.md

Multiple URLs (Batch):

cd skills/web-to-markdown pnpm tsx scripts/scrape-and-convert.ts
https://example.com/guide
https://example.com/api
https://example.com/faq

Output: docs/web-captures/20251103_143052.md (all 3 pages)

From project root:

pnpm --filter @skills/web-to-markdown tsx scripts/scrape-and-convert.ts <urls...>

Output Format

Web Captures - YYYY-MM-DD HH:MM:SS

Generated: YYYYMMDD_HHMMSS URLs: N

📄 https://example.com/page1

[Converted markdown content...]

📄 https://example.com/page2

[Converted markdown content...]

Implementation Notes

TypeScript + FP Patterns:

Pure functions (no classes except custom errors)
Explicit error handling with typed errors: BrowserError , FileError , HtmlConversionError
Small, focused functions
Side effects isolated at edges
CLI-style logging with chalk/ora

File Structure:

skills/web-to-markdown/ ├── SKILL.md # This file (workflow instructions) ├── package.json # pnpm workspace config ├── tsconfig.json # TypeScript config ├── scripts/ │ ├── scrape-and-convert.ts # Main CLI (Playwright + Turndown) │ ├── html-to-markdown.ts # Pure conversion function (Turndown wrapper) │ └── convert-and-append.ts # Legacy CLI (deprecated, kept for reference) └── tests/ └── html-to-markdown.test.ts # Unit tests

Error Handling

Browser launch failure: Check Playwright installation, run pnpm exec playwright install chromium
Page not found (404): Logs error, continues with other URLs
Timeout (>30s): Reports slow page, continues to next URL
Navigation error: Logs error, continues to next URL
Conversion failure: Reports malformed HTML, skips page

Configuration

Default timeout: 30 seconds per page

To customize, edit DEFAULT_CONFIG in scripts/scrape-and-convert.ts :

const DEFAULT_CONFIG = { outputDir: 'docs/web-captures', timeout: 30000, // milliseconds };

Performance

Headless mode (no GUI overhead)
Sequential processing (one URL at a time for stability)
Browser reuse across URLs (faster than launching per-page)
~2-5 seconds per page (depends on site complexity)

Comparison with Alternatives

Feature web-to-markdown scratchpad-fetch Jina AI Reader

Transport Playwright (headless) curl (HTTP) Cloud API

JavaScript ✅ Full rendering ❌ No ✅ Server-side

Conversion ✅ Turndown ❌ Raw HTML ✅ LLM-powered

Self-hosted ✅ Yes ✅ Yes ❌ Cloud only

Setup pnpm install None API key

Speed Medium (2-5s/page) Fast (<1s) Fast (~2s)

Visible browser ❌ No (headless) N/A N/A

Troubleshooting

"Executable doesn't exist" error:

cd skills/web-to-markdown pnpm exec playwright install chromium

Pages timing out:

Increase timeout in DEFAULT_CONFIG
Check network connectivity
Some sites may block automated browsers (use Jina AI Reader alternative)

Empty markdown output:

Site may use heavy client-side rendering
Try waiting longer (increase timeout)
Check if site blocks headless browsers (User-Agent detection)

Notes

One request = one file (all URLs aggregated)
Handles JavaScript-rendered content (React, Vue, Angular, etc.)
Headless by default (no visible browser window)
Browser auto-installs on first run
Ideal for documentation scraping and archival
No external API dependencies

web-to-markdown

Safety Notice

Copy this and send it to your AI assistant to learn

Output: docs/web-captures/20251103_143052.md

Output: docs/web-captures/20251103_143052.md (all 3 pages)

Web Captures - YYYY-MM-DD HH:MM:SS

📄 https://example.com/page1

📄 https://example.com/page2

Source Transparency

Related Skills

fix-eslint

readwise-api

timestamp

claude-permissions