# Website Crawler
High-performance web crawler with TypeScript/Bun frontend and Go backend for discovering and mapping website structure.
## When to Use

Use this skill when users ask to:

- Crawl a website or "spider a site"
- Map site structure or "discover all pages"
- Find all URLs on a website
- Generate a sitemap or site report
- Analyze link relationships between pages
- Audit website coverage or completeness
- Extract page metadata (titles, status codes)

**Keywords:** crawl, spider, map, discover pages, site structure, sitemap, all URLs, website audit
## Quick Start

Run the crawler from the scripts directory:

```bash
cd ~/.claude/scripts/crawler
bun src/index.ts <URL> [options]
```
## CLI Options

| Option | Short | Default | Description |
|---|---|---|---|
| `--depth` | `-D` | 2 | Maximum crawl depth |
| `--workers` | `-w` | 20 | Concurrent workers |
| `--rate` | `-r` | 2 | Rate limit (requests/second) |
| `--profile` | `-p` | | Use preset profile (`fast`/`deep`/`gentle`) |
| `--output` | `-o` | auto | Output directory |
| `--sitemap` | `-s` | true | Use sitemap.xml for discovery |
| `--domain` | `-d` | auto | Allowed domain (extracted from URL) |
| `--debug` | | false | Enable debug logging |
## Profiles

Three preset profiles for common use cases:

| Profile | Workers | Depth | Rate | Use Case |
|---|---|---|---|---|
| `fast` | 50 | 3 | 10 | Quick site mapping |
| `deep` | 20 | 10 | 3 | Thorough crawling |
| `gentle` | 5 | 5 | 1 | Respect server limits |
## Usage Examples

### Basic crawl

```bash
bun src/index.ts https://example.com
```

### Deep crawl with high concurrency

```bash
bun src/index.ts https://example.com --depth 5 --workers 30 --rate 5
```

### Using a profile

```bash
bun src/index.ts https://example.com --profile fast
```

### Gentle crawl (avoid rate limiting)

```bash
bun src/index.ts https://example.com --profile gentle
```
## Output

The crawler generates two files in the output directory:

- `results.json` - Structured crawl data with all discovered pages
- `index.html` - Dark-themed HTML report with statistics

### Results JSON Structure

```json
{
  "stats": {
    "pages_found": 150,
    "pages_crawled": 147,
    "external_links": 23,
    "errors": 3,
    "duration": 45.2
  },
  "results": [
    {
      "url": "https://example.com/page",
      "title": "Page Title",
      "status_code": 200,
      "depth": 1,
      "links": ["..."],
      "content_type": "text/html"
    }
  ]
}
```
## Features

- **Sitemap Discovery**: Automatically finds and parses sitemap.xml
- **Checkpoint/Resume**: Auto-saves progress every 30 seconds
- **Rate Limiting**: Token bucket algorithm prevents server overload (see the sketch after this list)
- **Concurrent Crawling**: Go worker pool for high performance
- **HTML Reports**: Dark-themed, mobile-responsive reports
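The rate limiter itself lives in the Go engine; the sketch below only illustrates the token bucket idea in TypeScript and is not the engine's code. Tokens refill continuously at the configured requests-per-second rate, and every outgoing request must take a token first, so bursts are capped and the average rate stays at the limit.

```typescript
// Minimal token-bucket sketch (illustrative; the real limiter is in the Go engine).
class TokenBucket {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(private ratePerSecond: number, private capacity: number = ratePerSecond) {
    this.tokens = capacity;
  }

  // Wait until a token is available, then consume it.
  async take(): Promise<void> {
    for (;;) {
      const now = Date.now();
      const elapsedSeconds = (now - this.lastRefill) / 1000;
      this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.ratePerSecond);
      this.lastRefill = now;
      if (this.tokens >= 1) {
        this.tokens -= 1;
        return;
      }
      // Sleep roughly until the next token is due.
      await new Promise((resolve) => setTimeout(resolve, 1000 / this.ratePerSecond));
    }
  }
}

// Usage: cap outgoing requests at 2/second, matching the default --rate.
const bucket = new TokenBucket(2);
async function politeFetch(url: string) {
  await bucket.take();
  return fetch(url);
}
```

Lowering `--rate` (or switching to the `gentle` profile) simply reduces the refill rate, which is why it is the first fix suggested under Troubleshooting for rate-limiting errors.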
## Troubleshooting

### Rate limiting errors

Reduce the rate limit or use the gentle profile:

```bash
bun src/index.ts <url> --rate 1
```

or

```bash
bun src/index.ts <url> --profile gentle
```
### Go binary not found

The TypeScript frontend auto-compiles the Go binary. If compilation fails, build it manually:

```bash
cd ~/.claude/scripts/crawler/engine
go build -o crawler main.go
```
### Timeout on large sites

Reduce depth or increase workers:

```bash
bun src/index.ts <url> --depth 1 --workers 50
```
## Architecture

For detailed architecture, Go engine specifications, and code conventions, see `reference.md`.
## Related Files

- Command: `plugins/crawler/commands/crawler.md`
- Reference: `plugins/crawler/skills/website-crawler/reference.md`
- Scripts: `plugins/crawler/skills/website-crawler/scripts/`
- Profiles: `plugins/crawler/skills/website-crawler/scripts/config/profiles/`