# URL Crawler
Crawls websites using the Tavily Crawl API and saves each page as a separate markdown file in a flat directory structure.
## Prerequisites
**Tavily API Key Required.** Get your key at https://tavily.com.
Add to `~/.claude/settings.json`:

```json
{
  "env": {
    "TAVILY_API_KEY": "tvly-your-api-key-here"
  }
}
```
Restart Claude Code after adding your API key.
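To confirm the key is actually visible before crawling, a quick stdlib-only check like the one below can help. This snippet is illustrative and not part of the skill itself:

```python
# Illustrative sanity check, not part of the skill.
import os

key = os.environ.get("TAVILY_API_KEY")
if key and key.startswith("tvly-"):
    print("TAVILY_API_KEY looks set")
else:
    print("TAVILY_API_KEY is missing or does not start with 'tvly-'")
```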
## When to Use
Use this skill when the user wants to:
- Crawl and extract content from a website
- Download API documentation, framework docs, or knowledge bases
- Save web content locally for offline access or analysis
## Usage
Execute the crawl script with a URL and an optional instruction:
```bash
python scripts/crawl_url.py <URL> [--instruction "guidance text"]
```
### Required Parameters
- `URL`: The website to crawl (e.g., https://docs.stripe.com/api)
### Optional Parameters
- `--instruction`, `-i`: Natural language guidance for the crawler (e.g., "Focus on API endpoints only")
- `--output`, `-o`: Output directory (default: `<repo_root>/crawled_context/<domain>`)
- `--depth`, `-d`: Max crawl depth (default: 2, range: 1-5)
- `--breadth`, `-b`: Max links per level (default: 50)
- `--limit`, `-l`: Max total pages to crawl (default: 50)
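For reference, here is a minimal sketch of how a CLI exposing these flags could be defined with Python's standard argparse module. This is an assumption about the interface, not the actual internals of `scripts/crawl_url.py`:

```python
# Hypothetical reconstruction of the crawl_url.py argument parser;
# the real script may be implemented differently.
import argparse

parser = argparse.ArgumentParser(
    description="Crawl a website via the Tavily Crawl API")
parser.add_argument("url", help="The website to crawl")
parser.add_argument("--instruction", "-i",
                    help="Natural language guidance for the crawler")
parser.add_argument("--output", "-o",
                    help="Output directory (defaults to <repo_root>/crawled_context/<domain>)")
parser.add_argument("--depth", "-d", type=int, default=2, choices=range(1, 6),
                    help="Max crawl depth (1-5)")
parser.add_argument("--breadth", "-b", type=int, default=50,
                    help="Max links per level")
parser.add_argument("--limit", "-l", type=int, default=50,
                    help="Max total pages to crawl")
args = parser.parse_args()
```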
## Output
The script creates a flat directory structure at `<repo_root>/crawled_context/<domain>/` with one markdown file per crawled page. Filenames are derived from URLs (e.g., `docs_stripe_com_api_authentication.md`).
Each markdown file includes:
- Frontmatter with the source URL and crawl timestamp
- The extracted content in markdown format
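For illustration, a crawled page might look like the sample below; the exact frontmatter field names and timestamp format are assumptions, since they depend on the script:

```markdown
---
source: https://docs.stripe.com/api/authentication
crawled_at: 2025-01-15T10:30:00Z
---

# Authentication

...extracted page content in markdown...
```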
## Examples
### Basic Crawl
```bash
python scripts/crawl_url.py https://docs.anthropic.com
```
Crawls the Anthropic docs with default settings and saves the results to `<repo_root>/crawled_context/docs_anthropic_com/`.
### With Instruction
```bash
python scripts/crawl_url.py https://react.dev --instruction "Focus on API reference pages and hooks documentation"
```
Uses a natural language instruction to steer the crawler toward specific content.
### Custom Output Directory
```bash
python scripts/crawl_url.py https://docs.stripe.com/api -o ./stripe-api-docs
```
Saves results to a custom directory.
### Adjust Crawl Parameters
```bash
python scripts/crawl_url.py https://nextjs.org/docs --depth 3 --breadth 100 --limit 200
```
Increases crawl depth, breadth, and page limit for more comprehensive coverage.
## Important Notes
- **API Key Required**: Set the `TAVILY_API_KEY` environment variable (the script loads it from `.env` if available)
- **Crawl Time**: Deeper crawls take longer (depth 3+ may take many minutes)
- **Filename Safety**: URLs are converted to safe filenames automatically (see the sketch after this list)
- **Flat Structure**: All files are saved in the `<repo_root>/crawled_context/<domain>/` directory, regardless of the original URL hierarchy
- **Duplicate Prevention**: Files are overwritten if URLs generate identical filenames
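As a rough illustration of the filename-safety and flat-structure behavior, a URL-to-filename conversion could look like the sketch below; the actual sanitization rules in `scripts/crawl_url.py` may differ:

```python
# Illustrative URL-to-filename conversion; the real sanitization
# rules in crawl_url.py may differ.
import re
from urllib.parse import urlparse

def url_to_filename(url: str) -> str:
    parsed = urlparse(url)
    # Join host and path, then collapse non-alphanumeric runs to "_"
    raw = f"{parsed.netloc}{parsed.path}".rstrip("/")
    safe = re.sub(r"[^A-Za-z0-9]+", "_", raw).strip("_")
    return f"{safe}.md"

print(url_to_filename("https://docs.stripe.com/api/authentication"))
# -> docs_stripe_com_api_authentication.md
```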