documentation-scraper

Use when needing to scrape documentation websites into markdown for AI context. Triggers on "scrape docs", "download documentation", "get docs for [library]", or creating local copies of online documentation. CRITICAL - always analyze sitemap first before scraping.

Install skill "documentation-scraper" with this command: npx skills add ratacat/claude-skills/ratacat-claude-skills-documentation-scraper

Documentation Scraper with slurp-ai

Overview

slurp-ai scrapes documentation websites and compiles them into a single markdown file optimized for AI agent context. It is lightweight, fast, and deterministic: it does NOT use AI to scrape; its output is FOR AI consumption.

CRITICAL: Run Outside Sandbox

All commands in this skill MUST be run outside the sandbox. Use dangerouslyDisableSandbox: true for all Bash commands including:

  • which slurp (installation check)
  • node analyze-sitemap.js (sitemap analysis)
  • slurp (scraping)
  • File inspection commands (wc, head, cat, etc.)

The sandbox blocks network access and file operations required for web scraping.

Pre-Flight: Check Installation

Before scraping, verify slurp-ai is installed:

which slurp || echo "NOT INSTALLED"

If not installed, ask the user to run:

npm install -g slurp-ai

Requires: Node.js v20+

Do NOT proceed with scraping until slurp-ai is confirmed installed.
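
If helpful, the installation check and the Node.js version requirement can be combined into one small pre-flight snippet. This is a sketch, not part of slurp-ai; the version parsing is illustrative.

# Sketch: verify slurp-ai and Node.js v20+ before scraping
if ! command -v slurp >/dev/null 2>&1; then
  echo "slurp-ai NOT INSTALLED. Run: npm install -g slurp-ai"
  exit 1
fi
node_major=$(node -v | sed 's/^v\([0-9]*\).*/\1/')
if [ "${node_major:-0}" -lt 20 ]; then
  echo "Node.js v20+ required (found $(node -v))"
  exit 1
fi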

Commands

Command                        | Purpose
------------------------------ | ------------------------------------
slurp <url>                    | Fetch and compile in one step
slurp fetch <url> [version]    | Download docs to partials only
slurp compile                  | Compile partials into a single file
slurp read <package> [version] | Read local documentation

Output: Creates slurp_compiled/compiled_docs.md from partials in slurp_partials/.
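
For example, the one-step command and the two-step flow below cover the same work (URLs are placeholders):

# One step: fetch and compile in a single run
slurp https://docs.example.com/docs/

# Two steps: download partials first, then compile them separately
slurp fetch https://docs.example.com/docs/
slurp compile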

CRITICAL: Analyze Sitemap First

Before running slurp, ALWAYS analyze the sitemap. This reveals the complete site structure and informs your --base-path and --max decisions.

Step 1: Run Sitemap Analysis

Use the included analyze-sitemap.js script:

node analyze-sitemap.js https://docs.example.com

This outputs:

  • Total page count (informs --max)
  • URLs grouped by section (informs --base-path)
  • Suggested slurp commands with appropriate flags
  • Sample URLs to understand naming patterns
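
If the script is not available, a rough shell approximation can produce similar counts, assuming the site publishes a standard sitemap.xml (a sketch, not part of the skill):

# Sketch: count sitemap URLs by top-level section (assumes a standard sitemap.xml)
curl -s https://docs.example.com/sitemap.xml \
  | grep -o '<loc>[^<]*</loc>' \
  | sed 's/<loc>//; s/<\/loc>//' > /tmp/sitemap-urls.txt
echo "Total URLs: $(wc -l < /tmp/sitemap-urls.txt)"

# Pages per top-level path segment (URLs with no path segment pass through unchanged)
sed -E 's#https?://[^/]+(/[^/]*).*#\1#' /tmp/sitemap-urls.txt | sort | uniq -c | sort -rn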

Step 2: Interpret the Output

Example output:

📊 Total URLs in sitemap: 247

📁 URLs by top-level section:
   /docs                          182 pages
   /api                            45 pages
   /blog                           20 pages

🎯 Suggested --base-path options:
   https://docs.example.com/docs/guides/     (67 pages)
   https://docs.example.com/docs/reference/  (52 pages)
   https://docs.example.com/api/             (45 pages)

💡 Recommended slurp commands:

   # Just "/docs/guides" section (67 pages)
   slurp https://docs.example.com/docs/guides/ --base-path https://docs.example.com/docs/guides/ --max 80

Step 3: Choose Scope Based on Analysis

Sitemap Shows    | Action
---------------- | ------------------------------------------------------------
< 50 pages total | Scrape the entire site: slurp <url> --max 60
50-200 pages     | Scope to the relevant section with --base-path
200+ pages       | Must scope down; pick a specific subsection
No sitemap found | Start with --max 30, inspect partials, adjust (example below)
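
For the no-sitemap case, a cautious first probe might look like this (placeholder URL):

# No sitemap: start with a small cap, inspect what was fetched, then adjust --max
slurp https://docs.example.com/ --max 30
ls slurp_partials/ | head -20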

Step 4: Frame the Slurp Command

With sitemap data, you can now set accurate parameters:

# From sitemap: /docs/api has 45 pages
slurp https://docs.example.com/docs/api/intro \
  --base-path https://docs.example.com/docs/api/ \
  --max 55

Key insight: the starting URL is where crawling begins; the base path filters which links get followed. They can differ (useful when the base path itself returns a 404).

Common Scraping Patterns

Library Documentation (versioned)

# Express.js 4.x docs
slurp https://expressjs.com/en/4x/api.html --base-path https://expressjs.com/en/4x/

# React docs (latest)
slurp https://react.dev/learn --base-path https://react.dev/learn

API Reference Only

slurp https://docs.example.com/api/introduction --base-path https://docs.example.com/api/

Full Documentation Site

slurp https://docs.example.com/

CLI Options

Flag               | Default          | Purpose
------------------ | ---------------- | ------------------------------
--max <n>          | 20               | Maximum pages to scrape
--concurrency <n>  | 5                | Parallel page requests
--headless <bool>  | true             | Use headless browser
--base-path <url>  | start URL        | Filter links to this prefix
--output <dir>     | ./slurp_partials | Output directory for partials
--retry-count <n>  | 3                | Retries for failed requests
--retry-delay <ms> | 1000             | Delay between retries
--yes              | -                | Skip confirmation prompts
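
As an illustration, several of these flags combined into one command (URL and values are placeholders):

# Illustrative combination of scraping flags
slurp https://docs.example.com/docs/ \
  --base-path https://docs.example.com/docs/ \
  --max 100 \
  --concurrency 3 \
  --output ./slurp_partials \
  --yes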

Compile Options

Flag                | Default                           | Purpose
------------------- | --------------------------------- | ----------------------------------------
--input <dir>       | ./slurp_partials                  | Input directory
--output <file>     | ./slurp_compiled/compiled_docs.md | Output file
--preserve-metadata | true                              | Keep metadata blocks
--remove-navigation | true                              | Strip nav elements
--remove-duplicates | true                              | Eliminate duplicates
--exclude <json>    | -                                 | JSON array of regex patterns to exclude
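
A compile run using these options might look like this; the exclude patterns are placeholders:

# Illustrative compile invocation; --exclude takes a JSON array of regex patterns
slurp compile \
  --input ./slurp_partials \
  --output ./slurp_compiled/compiled_docs.md \
  --exclude '["changelog", "/blog/"]'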

When to Disable Headless Mode

Use --headless false for:

  • Static HTML documentation sites
  • Faster scraping when JS rendering is not needed

The default is headless (true), which works for most modern doc sites, including SPAs.
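
For a static HTML site, disabling the headless browser looks like this (placeholder URL):

# Static HTML docs: skip the headless browser for faster scraping
slurp https://docs.example.com/ --headless false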

Output Structure

slurp_partials/              # Intermediate files
  ├── page1.md
  └── page2.md
slurp_compiled/              # Final output
  └── compiled_docs.md       # Compiled result

Quick Reference

# 1. ALWAYS analyze sitemap first
node analyze-sitemap.js https://docs.example.com

# 2. Scrape with informed parameters (from sitemap analysis)
slurp https://docs.example.com/docs/ --base-path https://docs.example.com/docs/ --max 80

# 3. Skip prompts for automation
slurp https://docs.example.com/ --yes

# 4. Check output
head -100 slurp_compiled/compiled_docs.md

Common Issues

Problem                     | Cause                    | Solution
--------------------------- | ------------------------ | ---------------------------------------------------------
Wrong --max value           | Guessing the page count  | Run analyze-sitemap.js first
Too few pages scraped       | --max limit (default 20) | Set --max based on sitemap analysis
Missing content             | JS not rendering         | Ensure --headless true (the default)
Crawl stuck or slow         | Rate limiting            | Reduce concurrency, e.g. --concurrency 3 (example below)
Duplicate sections          | Similar content          | Use --remove-duplicates (the default)
Wrong pages included        | Base path too broad      | Use the sitemap to find the correct --base-path
Prompts blocking automation | Interactive mode         | Add the --yes flag
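
For the rate-limiting case, lowering concurrency and spacing out retries is usually enough (values are illustrative):

# Back off when a crawl stalls due to rate limiting
slurp https://docs.example.com/docs/ --concurrency 3 --retry-delay 2000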

Post-Scrape Usage

The output markdown is designed for AI context injection:

# Check file size (context budget)
wc -c slurp_compiled/compiled_docs.md

# Preview structure
grep "^#" slurp_compiled/compiled_docs.md | head -30

# Use with Claude Code - reference in prompt or via @file
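
A rough token estimate helps with context budgeting; this assumes roughly 4 bytes per token, a heuristic rather than an exact figure:

# Approximate token count (assumes ~4 bytes per token; heuristic only)
echo "$(( $(wc -c < slurp_compiled/compiled_docs.md) / 4 )) tokens (approx.)"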

When NOT to Use

  • API specs in OpenAPI/Swagger: Use dedicated parsers instead
  • GitHub READMEs: Fetch directly via raw.githubusercontent.com (see the example after this list)
  • npm package docs: Often better to read source + README
  • Frequently updated docs: Consider a caching strategy
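
For the GitHub README case, a direct fetch is simpler than a crawl (owner, repo, and branch are placeholders):

# Fetch a README directly instead of scraping it
curl -s https://raw.githubusercontent.com/OWNER/REPO/main/README.md > README-local.md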
