documentation-scraper

Use when needing to scrape documentation websites into markdown for AI context. Triggers on "scrape docs", "download documentation", "get docs for [library]", or creating local copies of online documentation. CRITICAL - always analyze sitemap first before scraping.

Install skill "documentation-scraper" with this command: npx skills add ratacat/claude-skills/ratacat-claude-skills-documentation-scraper

Documentation Scraper with slurp-ai

Overview

slurp-ai scrapes documentation websites and compiles them into a single markdown file optimized for AI agent context. It is lightweight, fast, and deterministic: it does NOT use AI to scrape; its output is FOR AI consumption.

CRITICAL: Run Outside Sandbox

All commands in this skill MUST be run outside the sandbox. Use dangerouslyDisableSandbox: true for all Bash commands including:

  • which slurp (installation check)
  • node analyze-sitemap.js (sitemap analysis)
  • slurp (scraping)
  • File inspection commands (wc, head, cat, etc.)

The sandbox blocks network access and file operations required for web scraping.

Pre-Flight: Check Installation

Before scraping, verify slurp-ai is installed:

which slurp || echo "NOT INSTALLED"

If not installed, ask the user to run:

npm install -g slurp-ai

Requires: Node.js v20+

Do NOT proceed with scraping until slurp-ai is confirmed installed.
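
If helpful, the installation check and the Node.js version requirement can be combined into one small pre-flight snippet. This is a sketch, not part of slurp-ai; the version parsing is illustrative.

# Sketch: verify slurp-ai and Node.js v20+ before scraping
if ! command -v slurp >/dev/null 2>&1; then
  echo "slurp-ai NOT INSTALLED. Run: npm install -g slurp-ai"
  exit 1
fi
node_major=$(node -v | sed 's/^v\([0-9]*\).*/\1/')
if [ "${node_major:-0}" -lt 20 ]; then
  echo "Node.js v20+ required (found $(node -v))"
  exit 1
fi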

Commands

Command                        | Purpose
------------------------------ | ------------------------------------
slurp <url>                    | Fetch and compile in one step
slurp fetch <url> [version]    | Download docs to partials only
slurp compile                  | Compile partials into a single file
slurp read <package> [version] | Read local documentation

Output: Creates slurp_compiled/compiled_docs.md from partials in slurp_partials/.
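
For example, the one-step command and the two-step flow below cover the same work (URLs are placeholders):

# One step: fetch and compile in a single run
slurp https://docs.example.com/docs/

# Two steps: download partials first, then compile them separately
slurp fetch https://docs.example.com/docs/
slurp compile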

CRITICAL: Analyze Sitemap First

Before running slurp, ALWAYS analyze the sitemap. This reveals the complete site structure and informs your --base-path and --max decisions.

Step 1: Run Sitemap Analysis

Use the included analyze-sitemap.js script:

node analyze-sitemap.js https://docs.example.com

This outputs:

  • Total page count (informs --max)
  • URLs grouped by section (informs --base-path)
  • Suggested slurp commands with appropriate flags
  • Sample URLs to understand naming patterns
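
If the script is not available, a rough shell approximation can produce similar counts, assuming the site publishes a standard sitemap.xml (a sketch, not part of the skill):

# Sketch: count sitemap URLs by top-level section (assumes a standard sitemap.xml)
curl -s https://docs.example.com/sitemap.xml \
  | grep -o '<loc>[^<]*</loc>' \
  | sed 's/<loc>//; s/<\/loc>//' > /tmp/sitemap-urls.txt
echo "Total URLs: $(wc -l < /tmp/sitemap-urls.txt)"

# Pages per top-level path segment (URLs with no path segment pass through unchanged)
sed -E 's#https?://[^/]+(/[^/]*).*#\1#' /tmp/sitemap-urls.txt | sort | uniq -c | sort -rn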

Step 2: Interpret the Output

Example output:

📊 Total URLs in sitemap: 247

📁 URLs by top-level section:
   /docs                          182 pages
   /api                            45 pages
   /blog                           20 pages

🎯 Suggested --base-path options:
   https://docs.example.com/docs/guides/     (67 pages)
   https://docs.example.com/docs/reference/  (52 pages)
   https://docs.example.com/api/             (45 pages)

💡 Recommended slurp commands:

   # Just "/docs/guides" section (67 pages)
   slurp https://docs.example.com/docs/guides/ --base-path https://docs.example.com/docs/guides/ --max 80

Step 3: Choose Scope Based on Analysis

Sitemap Shows    | Action
---------------- | ------------------------------------------------------------
< 50 pages total | Scrape the entire site: slurp <url> --max 60
50-200 pages     | Scope to the relevant section with --base-path
200+ pages       | Must scope down; pick a specific subsection
No sitemap found | Start with --max 30, inspect partials, adjust (example below)
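
For the no-sitemap case, a cautious first probe might look like this (placeholder URL):

# No sitemap: start with a small cap, inspect what was fetched, then adjust --max
slurp https://docs.example.com/ --max 30
ls slurp_partials/ | head -20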

Step 4: Frame the Slurp Command

With sitemap data, you can now set accurate parameters:

# From sitemap: /docs/api has 45 pages
slurp https://docs.example.com/docs/api/intro \
  --base-path https://docs.example.com/docs/api/ \
  --max 55

Key insight: the starting URL is where crawling begins; the base path filters which links get followed. They can differ (useful when the base path itself returns a 404).

Common Scraping Patterns

Library Documentation (versioned)

# Express.js 4.x docs
slurp https://expressjs.com/en/4x/api.html --base-path https://expressjs.com/en/4x/

# React docs (latest)
slurp https://react.dev/learn --base-path https://react.dev/learn

API Reference Only

slurp https://docs.example.com/api/introduction --base-path https://docs.example.com/api/

Full Documentation Site

slurp https://docs.example.com/

CLI Options

Flag               | Default          | Purpose
------------------ | ---------------- | ------------------------------
--max <n>          | 20               | Maximum pages to scrape
--concurrency <n>  | 5                | Parallel page requests
--headless <bool>  | true             | Use headless browser
--base-path <url>  | start URL        | Filter links to this prefix
--output <dir>     | ./slurp_partials | Output directory for partials
--retry-count <n>  | 3                | Retries for failed requests
--retry-delay <ms> | 1000             | Delay between retries
--yes              | -                | Skip confirmation prompts
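
As an illustration, several of these flags combined into one command (URL and values are placeholders):

# Illustrative combination of scraping flags
slurp https://docs.example.com/docs/ \
  --base-path https://docs.example.com/docs/ \
  --max 100 \
  --concurrency 3 \
  --output ./slurp_partials \
  --yes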

Compile Options

Flag                | Default                           | Purpose
------------------- | --------------------------------- | ----------------------------------------
--input <dir>       | ./slurp_partials                  | Input directory
--output <file>     | ./slurp_compiled/compiled_docs.md | Output file
--preserve-metadata | true                              | Keep metadata blocks
--remove-navigation | true                              | Strip nav elements
--remove-duplicates | true                              | Eliminate duplicates
--exclude <json>    | -                                 | JSON array of regex patterns to exclude
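
A compile run using these options might look like this; the exclude patterns are placeholders:

# Illustrative compile invocation; --exclude takes a JSON array of regex patterns
slurp compile \
  --input ./slurp_partials \
  --output ./slurp_compiled/compiled_docs.md \
  --exclude '["changelog", "/blog/"]'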

When to Disable Headless Mode

Use --headless false for:

  • Static HTML documentation sites
  • Faster scraping when JS rendering is not needed

The default is headless (true), which works for most modern doc sites, including SPAs.
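
For a static HTML site, disabling the headless browser looks like this (placeholder URL):

# Static HTML docs: skip the headless browser for faster scraping
slurp https://docs.example.com/ --headless false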

Output Structure

slurp_partials/              # Intermediate files
  ├── page1.md
  └── page2.md
slurp_compiled/              # Final output
  └── compiled_docs.md       # Compiled result

Quick Reference

# 1. ALWAYS analyze sitemap first
node analyze-sitemap.js https://docs.example.com

# 2. Scrape with informed parameters (from sitemap analysis)
slurp https://docs.example.com/docs/ --base-path https://docs.example.com/docs/ --max 80

# 3. Skip prompts for automation
slurp https://docs.example.com/ --yes

# 4. Check output
head -100 slurp_compiled/compiled_docs.md

Common Issues

Problem                     | Cause                    | Solution
--------------------------- | ------------------------ | ---------------------------------------------------------
Wrong --max value           | Guessing the page count  | Run analyze-sitemap.js first
Too few pages scraped       | --max limit (default 20) | Set --max based on sitemap analysis
Missing content             | JS not rendering         | Ensure --headless true (the default)
Crawl stuck or slow         | Rate limiting            | Reduce concurrency, e.g. --concurrency 3 (example below)
Duplicate sections          | Similar content          | Use --remove-duplicates (the default)
Wrong pages included        | Base path too broad      | Use the sitemap to find the correct --base-path
Prompts blocking automation | Interactive mode         | Add the --yes flag
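
For the rate-limiting case, lowering concurrency and spacing out retries is usually enough (values are illustrative):

# Back off when a crawl stalls due to rate limiting
slurp https://docs.example.com/docs/ --concurrency 3 --retry-delay 2000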

Post-Scrape Usage

The output markdown is designed for AI context injection:

# Check file size (context budget)
wc -c slurp_compiled/compiled_docs.md

# Preview structure
grep "^#" slurp_compiled/compiled_docs.md | head -30

# Use with Claude Code - reference in prompt or via @file
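
A rough token estimate helps with context budgeting; this assumes roughly 4 bytes per token, a heuristic rather than an exact figure:

# Approximate token count (assumes ~4 bytes per token; heuristic only)
echo "$(( $(wc -c < slurp_compiled/compiled_docs.md) / 4 )) tokens (approx.)"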

When NOT to Use

  • API specs in OpenAPI/Swagger: Use dedicated parsers instead
  • GitHub READMEs: Fetch directly via raw.githubusercontent.com (see the example after this list)
  • npm package docs: Often better to read source + README
  • Frequently updated docs: Consider a caching strategy
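
For the GitHub README case, a direct fetch is simpler than a crawl (owner, repo, and branch are placeholders):

# Fetch a README directly instead of scraping it
curl -s https://raw.githubusercontent.com/OWNER/REPO/main/README.md > README-local.md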
