web-to-markdown

Captures web pages using headless Playwright browser automation (handles JavaScript-rendered content), converts HTML to clean markdown via Turndown library, and saves all URLs from a single request into one timestamped file.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "web-to-markdown" with this command: npx skills add otrebu/agents/otrebu-agents-web-to-markdown

Web to Markdown

Overview

Captures web pages using headless Playwright browser automation (handles JavaScript-rendered content), converts HTML to clean markdown via Turndown library, and saves all URLs from a single request into one timestamped file.

Key features:

  • Headless browser (no visible window)

  • Handles JavaScript-rendered content (SPAs, React, Vue, etc.)

  • Batch processing multiple URLs → single file

  • Self-contained (no MCP dependency)

Prerequisites

  • Node.js/pnpm for running TypeScript scripts

  • Dependencies installed: cd skills/web-to-markdown && pnpm install

  • Playwright browsers will auto-install on first run

Usage

When user provides URLs and asks to:

  • "capture these pages as markdown"

  • "save web content for documentation"

  • "fetch and convert these webpages"

  • "get markdown from these sites"

  • "download and convert to markdown"

Output: docs/web-captures/YYYYMMDD_HHMMSS.md containing all pages.

Workflow

Single Command

cd skills/web-to-markdown pnpm tsx scripts/scrape-and-convert.ts <url1> [url2] [url3] ...

That's it! Script handles:

  • Creates timestamped output file with header

  • Launches headless Playwright browser

  • Scrapes each URL sequentially

  • Converts HTML → markdown (Turndown)

  • Appends to output file with formatted headers

  • Closes browser and reports summary

Examples

Single URL:

cd skills/web-to-markdown pnpm tsx scripts/scrape-and-convert.ts https://example.com/docs

Output: docs/web-captures/20251103_143052.md

Multiple URLs (Batch):

cd skills/web-to-markdown pnpm tsx scripts/scrape-and-convert.ts
https://example.com/guide
https://example.com/api
https://example.com/faq

Output: docs/web-captures/20251103_143052.md (all 3 pages)

From project root:

pnpm --filter @skills/web-to-markdown tsx scripts/scrape-and-convert.ts <urls...>

Output Format

Web Captures - YYYY-MM-DD HH:MM:SS

Generated: YYYYMMDD_HHMMSS URLs: N


📄 https://example.com/page1

[Converted markdown content...]


📄 https://example.com/page2

[Converted markdown content...]


Implementation Notes

TypeScript + FP Patterns:

  • Pure functions (no classes except custom errors)

  • Explicit error handling with typed errors: BrowserError , FileError , HtmlConversionError

  • Small, focused functions

  • Side effects isolated at edges

  • CLI-style logging with chalk/ora

File Structure:

skills/web-to-markdown/ ├── SKILL.md # This file (workflow instructions) ├── package.json # pnpm workspace config ├── tsconfig.json # TypeScript config ├── scripts/ │ ├── scrape-and-convert.ts # Main CLI (Playwright + Turndown) │ ├── html-to-markdown.ts # Pure conversion function (Turndown wrapper) │ └── convert-and-append.ts # Legacy CLI (deprecated, kept for reference) └── tests/ └── html-to-markdown.test.ts # Unit tests

Error Handling

  • Browser launch failure: Check Playwright installation, run pnpm exec playwright install chromium

  • Page not found (404): Logs error, continues with other URLs

  • Timeout (>30s): Reports slow page, continues to next URL

  • Navigation error: Logs error, continues to next URL

  • Conversion failure: Reports malformed HTML, skips page

Configuration

Default timeout: 30 seconds per page

To customize, edit DEFAULT_CONFIG in scripts/scrape-and-convert.ts :

const DEFAULT_CONFIG = { outputDir: 'docs/web-captures', timeout: 30000, // milliseconds };

Performance

  • Headless mode (no GUI overhead)

  • Sequential processing (one URL at a time for stability)

  • Browser reuse across URLs (faster than launching per-page)

  • ~2-5 seconds per page (depends on site complexity)

Comparison with Alternatives

Feature web-to-markdown scratchpad-fetch Jina AI Reader

Transport Playwright (headless) curl (HTTP) Cloud API

JavaScript ✅ Full rendering ❌ No ✅ Server-side

Conversion ✅ Turndown ❌ Raw HTML ✅ LLM-powered

Self-hosted ✅ Yes ✅ Yes ❌ Cloud only

Setup pnpm install None API key

Speed Medium (2-5s/page) Fast (<1s) Fast (~2s)

Visible browser ❌ No (headless) N/A N/A

Troubleshooting

"Executable doesn't exist" error:

cd skills/web-to-markdown pnpm exec playwright install chromium

Pages timing out:

  • Increase timeout in DEFAULT_CONFIG

  • Check network connectivity

  • Some sites may block automated browsers (use Jina AI Reader alternative)

Empty markdown output:

  • Site may use heavy client-side rendering

  • Try waiting longer (increase timeout)

  • Check if site blocks headless browsers (User-Agent detection)

Notes

  • One request = one file (all URLs aggregated)

  • Handles JavaScript-rendered content (React, Vue, Angular, etc.)

  • Headless by default (no visible browser window)

  • Browser auto-installs on first run

  • Ideal for documentation scraping and archival

  • No external API dependencies

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

Automation

fix-eslint

No summary provided by upstream source.

Repository SourceNeeds Review
Automation

readwise-api

No summary provided by upstream source.

Repository SourceNeeds Review
Automation

timestamp

No summary provided by upstream source.

Repository SourceNeeds Review
Automation

claude-permissions

No summary provided by upstream source.

Repository SourceNeeds Review