crawler

Converts any URL into clean markdown using a robust 3-tier fallback chain.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.


Install skill "crawler" with this command: npx skills add gn00678465/crawler-skill/gn00678465-crawler-skill-crawler

Crawler Skill


Quick start

uv run scripts/crawl.py --url https://example.com --output reports/example.md

Markdown is saved to the file specified by --output. Progress and errors go to stderr. The exit code is 0 on success, 1 if all scrapers fail.

How it works

The script tries each tier in order and returns the first success:

| Tier | Module | Requires |
|------|--------|----------|
| 1 | Firecrawl (firecrawl_scraper.py) | FIRECRAWL_API_KEY env var (optional; falls back if missing) |
| 2 | Jina Reader (jina_reader.py) | Nothing — free, no key needed |
| 3 | Scrapling (scrapling_scraper.py) | Local headless browser (auto-installs via pip) |
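The tier chain above can be sketched as follows. This is a minimal illustration of the fallback pattern; the function names and interface are assumptions, not the actual APIs of the scraper modules:

```python
def crawl_with_fallback(url, scrapers):
    """Try each scraper tier in order and return the first success.

    `scrapers` is an ordered list of callables that return markdown text
    or raise an exception on failure (illustrative interface only).
    """
    for scrape in scrapers:
        try:
            markdown = scrape(url)
            if markdown:  # a non-empty result counts as success
                return markdown
        except Exception:
            continue  # fall through to the next tier
    raise RuntimeError(f"All scrapers failed for {url}")
```

Because each tier only runs when the previous one fails, a missing FIRECRAWL_API_KEY simply skips Tier 1 rather than aborting the crawl.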

File layout

crawler-skill/
├── SKILL.md                     ← this file
├── scripts/
│   ├── crawl.py                 ← main CLI entry point (PEP 723 inline deps)
│   └── src/
│       ├── domain_router.py     ← URL-to-tier routing rules
│       ├── firecrawl_scraper.py ← Tier 1: Firecrawl API
│       ├── jina_reader.py       ← Tier 2: Jina r.jina.ai proxy
│       └── scrapling_scraper.py ← Tier 3: local headless scraper
└── tests/
    └── test_crawl.py            ← 70 pytest tests (all passing)

Usage examples

Basic fetch — tries Firecrawl, falls back to Jina, then Scrapling

Always prefer using --output to avoid terminal encoding issues

uv run scripts/crawl.py --url https://docs.python.org/3/ --output reports/python_docs.md

If no --output is provided, markdown goes to stdout (not recommended on Windows)

uv run scripts/crawl.py --url https://example.com

With a Firecrawl API key for best results

FIRECRAWL_API_KEY=fc-... uv run scripts/crawl.py --url https://example.com --output reports/example.md

URL requirements

Only http:// and https:// URLs are accepted. Passing any other scheme (ftp://, file://, javascript:, a bare path, etc.) exits with code 1 and prints a clear error — no scraping is attempted.
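A minimal sketch of that scheme check using only the standard library; the actual validation in crawl.py may be stricter:

```python
from urllib.parse import urlparse

def is_crawlable_url(url: str) -> bool:
    """Accept only http:// and https:// URLs; reject everything else."""
    return urlparse(url).scheme in ("http", "https")
```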

Saving Reports

When the user asks to save the crawled content or a summary to a file, ALWAYS use the --output argument and save the file into the reports/ directory at the project root (for example, {project_root}/reports). If the directory does not exist, the script will create it.

Example: If asked to "save to result.md", you should run: uv run scripts/crawl.py --url <URL> --output reports/result.md

Point at a self-hosted Firecrawl instance

FIRECRAWL_API_URL=http://localhost:3002 uv run scripts/crawl.py --url https://example.com

Content validation

Each scraper validates its output before returning success:

  • Minimum 100 characters of content (rejects empty/error pages)

  • Detection of CAPTCHA / bot-verification pages (Firecrawl)

  • Detection of Cloudflare interstitial pages (Scrapling — escalates to StealthyFetcher)

  • Detection of Jina error page indicators (Error: , Access Denied , etc.)
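The validation rules above amount to a length check plus an indicator scan. A sketch of that shape, where the indicator strings are examples only (the real lists live inside the individual scraper modules):

```python
# Example error-page markers; the actual lists are scraper-specific.
ERROR_INDICATORS = ("Error:", "Access Denied", "Verify you are human")

def looks_valid(markdown: str, min_length: int = 100) -> bool:
    """Reject output that is too short or matches known error-page text."""
    if len(markdown.strip()) < min_length:
        return False
    return not any(marker in markdown for marker in ERROR_INDICATORS)
```

Failing validation is treated like any other scraper failure, so the chain moves on to the next tier instead of returning a CAPTCHA page as "content".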

Domain routing

Certain hostnames bypass one or more scraper tiers to avoid known compatibility issues. The logic lives in scripts/src/domain_router.py.

| Domain | Skipped tiers | Active chain |
|--------|---------------|--------------|
| medium.com (and subdomains) | firecrawl | jina → scrapling |
| mp.weixin.qq.com | firecrawl + jina | scrapling only |
| everything else | — | firecrawl → jina → scrapling |

Sub-domain matching follows a suffix rule: blog.medium.com matches the medium.com rule because its hostname ends with .medium.com . An exact sub-domain like other.weixin.qq.com does not match mp.weixin.qq.com .
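The suffix rule described above can be sketched in a few lines; this mirrors the behaviour documented here, though domain_router.py's internals may differ:

```python
def matches_domain(hostname: str, rule: str) -> bool:
    """True if hostname equals the rule or is a sub-domain of it."""
    return hostname == rule or hostname.endswith("." + rule)
```

Matching on "." + rule is what prevents lookalike hosts such as notmedium.com from triggering the medium.com rule.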

Running tests

uv run pytest tests/ -v

All 70 tests use mocking — no network calls, no API keys required.
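A sketch of the mocking style such tests typically use, built on the standard library's unittest.mock. The function and test names here are hypothetical, not the actual contents of tests/test_crawl.py:

```python
from unittest import mock

def fetch_markdown(url):
    # Stand-in for a scraper that would normally hit the network.
    raise NotImplementedError("network call")

def test_scraper_is_mocked():
    # Patch the network-facing function so the test stays offline.
    with mock.patch(__name__ + ".fetch_markdown", return_value="# Title"):
        assert fetch_markdown("https://example.com") == "# Title"
```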

Dependencies (auto-installed by uv run)

  • firecrawl-py>=2.0 — Firecrawl Python SDK

  • httpx>=0.27 — HTTP client for Jina Reader

  • scrapling>=0.2 — Headless scraping with stealth support

  • html2text>=2024.2.26 — HTML-to-markdown conversion

When to invoke this skill

Invoke crawl.py whenever you need the text content of a web page:

import subprocess

result = subprocess.run(
    ["uv", "run", "scripts/crawl.py", "--url", url],
    capture_output=True,
    text=True,
)
if result.returncode == 0:
    markdown = result.stdout

Or simply run it directly from the terminal as shown in Quick start above.

