doc-scraper

Documentation Scraper Skill

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "doc-scraper" with this command: npx skills add jmagly/ai-writing-guide/jmagly-ai-writing-guide-doc-scraper

Purpose

Single responsibility: Convert documentation websites into organized, categorized reference files suitable for Claude skills or offline archives. (BP-4)

Grounding Checkpoint (Archetype 1 Mitigation)

Before executing, VERIFY:

  • Target URL is accessible (test with curl -I <target-url>)

  • Documentation structure is identifiable (inspect page for content selectors)

  • Output directory is writable

  • Rate limiting requirements are known (check robots.txt)

DO NOT proceed without verification. Inspect before scraping.
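Two of the checks above can be scripted once robots.txt has been fetched. The following is a minimal Python sketch using only the standard library; it is not part of skill-seekers, the function names `output_dir_is_writable` and `robots_rules` are hypothetical, and it assumes the robots.txt body is already available as text.

```python
import os
import tempfile
from urllib.robotparser import RobotFileParser


def output_dir_is_writable(path: str) -> bool:
    """Return True if a temporary file can be created inside `path`."""
    try:
        os.makedirs(path, exist_ok=True)
        with tempfile.TemporaryFile(dir=path):
            pass
        return True
    except OSError:
        return False


def robots_rules(robots_txt: str, url: str, agent: str = "*"):
    """Parse robots.txt text and return (allowed, crawl_delay_or_None) for `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url), parser.crawl_delay(agent)
```

If `robots_rules` returns `None` for the delay, the site has not declared one (or declared it in a form the standard parser does not read), which is one of the cases where asking the user is safer than guessing.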

Uncertainty Escalation (Archetype 2 Mitigation)

ASK USER instead of guessing when:

  • Content selector is ambiguous (multiple <article> or <main> elements)

  • URL patterns unclear (can't determine include/exclude rules)

  • Category mapping uncertain (content doesn't fit predefined categories)

  • Rate limiting unknown (no robots.txt, unclear ToS)

NEVER substitute missing configuration with assumptions.
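The first trigger, an ambiguous content selector, can be detected mechanically before asking the user. This is a rough sketch with Python's standard `html.parser`; the class and function names are hypothetical, and treating "more than one occurrence" as ambiguous is an assumption drawn from the bullet above.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how often each candidate container tag opens in a page."""

    def __init__(self, candidates=("article", "main")):
        super().__init__()
        self.candidates = set(candidates)
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        if tag in self.candidates:
            self.counts[tag] += 1


def ambiguous_containers(html: str) -> dict:
    """Return the candidate tags that occur more than once in `html`."""
    counter = TagCounter()
    counter.feed(html)
    return {tag: n for tag, n in counter.counts.items() if n > 1}
```

An empty result means each candidate tag appears at most once; a non-empty result is a signal to show the user the options rather than pick one.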

Context Scope (Archetype 3 Mitigation)

Context Type Included Excluded

RELEVANT Target URL, selectors, output path Unrelated documentation

PERIPHERAL Similar site examples for selector hints Historical scrape data

DISTRACTOR Other projects, unrelated URLs Previous failed attempts

Workflow Steps

Step 1: Verify Target (Grounding)

Test URL accessibility

curl -I <target-url>

Check robots.txt

curl <base-url>/robots.txt

Inspect page structure (use browser dev tools or fetch sample)

Step 2: Create Configuration

Generate scraper config based on inspection:

{
  "name": "skill-name",
  "description": "When to use this skill",
  "base_url": "https://docs.example.com/",
  "selectors": {
    "main_content": "article",
    "title": "h1",
    "code_blocks": "pre code"
  },
  "url_patterns": {
    "include": ["/docs", "/guide", "/api"],
    "exclude": ["/blog", "/changelog", "/releases"]
  },
  "categories": {
    "getting_started": ["intro", "quickstart", "installation"],
    "api_reference": ["api", "reference", "methods"],
    "guides": ["guide", "tutorial", "how-to"]
  },
  "rate_limit": 0.5,
  "max_pages": 500
}
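A config like the one above can be sanity-checked before it is handed to a scraper. The sketch below is illustrative only: `config_problems` is a hypothetical helper, and treating `name` and `base_url` as required is an assumption, not a documented rule of skill-seekers.

```python
import json

# Assumption: keys treated as mandatory for this sketch.
REQUIRED_KEYS = ("name", "base_url")


def config_problems(raw: str) -> list:
    """Return a list of human-readable problems found in a JSON config string."""
    config = json.loads(raw)
    problems = []
    for key in REQUIRED_KEYS:
        if not config.get(key):
            problems.append(f"missing required key: {key}")
    base_url = config.get("base_url", "")
    if base_url and not base_url.startswith(("http://", "https://")):
        problems.append("base_url must start with http:// or https://")
    rate_limit = config.get("rate_limit")
    if rate_limit is not None and (
        not isinstance(rate_limit, (int, float)) or rate_limit <= 0
    ):
        problems.append("rate_limit must be a positive number of seconds")
    max_pages = config.get("max_pages")
    if max_pages is not None and (not isinstance(max_pages, int) or max_pages < 1):
        problems.append("max_pages must be a positive integer")
    return problems
```

An empty list means the basic shape looks sane; anything else is worth resolving with the user before scraping starts.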

Step 3: Execute Scraping

Option A: With skill-seekers (if installed)

Verify skill-seekers is available

pip show skill-seekers

Run scraper

skill-seekers scrape --config config.json

For large docs, use async mode

skill-seekers scrape --config config.json --async --workers 8

Option B: Manual scraping guidance

  • Use sitemap.xml or crawl starting URL

  • Extract content using configured selectors

  • Categorize pages based on URL patterns and keywords

  • Save to organized directory structure
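The filtering and categorizing steps of the manual route might look like the sketch below. How skill-seekers itself matches patterns is not stated here, so prefix matching on the URL path, first-match-wins categorization, and the `"other"` fallback are all assumptions, and both function names are hypothetical.

```python
from urllib.parse import urlparse


def should_scrape(url: str, include: list, exclude: list) -> bool:
    """Keep a URL if its path starts with an include prefix and no exclude prefix."""
    path = urlparse(url).path
    if any(path.startswith(prefix) for prefix in exclude):
        return False
    return any(path.startswith(prefix) for prefix in include)


def categorize(url: str, title: str, categories: dict, default: str = "other") -> str:
    """Return the first category whose keyword appears in the URL path or title."""
    haystack = f"{urlparse(url).path} {title}".lower()
    for category, keywords in categories.items():
        if any(keyword in haystack for keyword in keywords):
            return category
    return default
```

Because the first matching category wins, the order of entries in `categories` matters; pages that land in the fallback are candidates for the "category mapping uncertain" escalation.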

Step 4: Validate Output

Check output structure

ls -la output/<skill-name>/

Verify content quality

head -50 output/<skill-name>/references/index.md

Count extracted pages

find output/<skill-name>_data/pages -name "*.json" | wc -l
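The page count can also be combined with a basic integrity check. This sketch assumes only what the listing states, that each page is stored as one JSON file; `page_report` is a hypothetical helper and makes no assumption about the fields inside each file.

```python
import json
from pathlib import Path


def page_report(pages_dir: str) -> dict:
    """Count page files under `pages_dir` and list those that fail to parse as JSON."""
    total, broken = 0, []
    for path in sorted(Path(pages_dir).rglob("*.json")):
        total += 1
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError):
            broken.append(path.name)
    return {"total": total, "broken": broken}
```

A non-empty `broken` list usually points at an interrupted write, which is worth checking before a resumed run.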

Recovery Protocol (Archetype 4 Mitigation)

On error:

  • PAUSE - Stop scraping, preserve already-fetched pages

  • DIAGNOSE - Check error type:

      • Connection error → Verify URL, check network

      • Selector not found → Re-inspect page structure

      • Rate limited → Increase delay, reduce workers

      • Memory/disk → Reduce batch size, clear temp files

  • ADAPT - Adjust configuration based on diagnosis

  • RETRY - Resume from checkpoint (max 3 attempts)

  • ESCALATE - Ask user for guidance
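The protocol above amounts to a bounded retry loop. The sketch below shows one way to express it; `RateLimited`, `adapt`, and `run_with_recovery` are hypothetical names, the doubling and halving factors are arbitrary choices, and only the rate-limit branch of the diagnosis is modeled.

```python
class RateLimited(Exception):
    """Hypothetical error raised when the server throttles requests."""


def adapt(settings: dict, error: Exception) -> dict:
    """ADAPT: return adjusted settings for the diagnosed error type."""
    adjusted = dict(settings)
    if isinstance(error, RateLimited):
        adjusted["rate_limit"] = settings["rate_limit"] * 2     # increase delay
        adjusted["workers"] = max(1, settings["workers"] // 2)  # reduce workers
    return adjusted


def run_with_recovery(step, settings: dict, max_attempts: int = 3):
    """RETRY `step` up to `max_attempts` times, then ESCALATE by raising."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return step(settings), settings
        except Exception as error:  # PAUSE: the failed attempt stops here
            last_error = error
            settings = adapt(settings, error)  # DIAGNOSE + ADAPT
    raise RuntimeError(
        f"escalate to user after {max_attempts} attempts"
    ) from last_error
```

Here `step` stands in for whatever performs the scrape with the given settings; the final `RuntimeError` is the point at which control goes back to the user.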

Checkpoint Support

State saved to: .aiwg/working/checkpoints/doc-scraper/

Resume interrupted scrape:

skill-seekers scrape --config config.json --resume

Clear checkpoint and start fresh:

skill-seekers scrape --config config.json --fresh

Output Structure

output/<skill-name>/
├── SKILL.md              # Main skill description
├── references/           # Categorized documentation
│   ├── index.md          # Category index
│   ├── getting_started.md
│   ├── api_reference.md
│   └── guides.md
├── scripts/              # (empty, for user additions)
└── assets/               # (empty, for user additions)

output/<skill-name>_data/
├── pages/                # Raw scraped JSON (one per page)
└── summary.json          # Scrape statistics

Configuration Templates

Minimal Config

{
  "name": "myframework",
  "base_url": "https://docs.example.com/",
  "max_pages": 100
}

Full Config

{
  "name": "myframework",
  "description": "MyFramework documentation for building web apps",
  "base_url": "https://docs.example.com/",
  "selectors": {
    "main_content": "article, main, div[role='main']",
    "title": "h1, .title",
    "code_blocks": "pre code, .highlight code",
    "navigation": "nav, .sidebar"
  },
  "url_patterns": {
    "include": ["/docs/", "/api/", "/guide/"],
    "exclude": ["/blog/", "/changelog/", "/v1/", "/v2/"]
  },
  "categories": {
    "getting_started": ["intro", "quickstart", "install", "setup"],
    "concepts": ["concept", "overview", "architecture"],
    "api": ["api", "reference", "method", "function"],
    "guides": ["guide", "tutorial", "how-to", "example"],
    "advanced": ["advanced", "internals", "customize"]
  },
  "rate_limit": 0.5,
  "max_pages": 1000,
  "checkpoint": {
    "enabled": true,
    "interval": 100
  }
}

Troubleshooting

Issue                | Diagnosis         | Solution
---------------------|-------------------|-------------------------------------------
No content extracted | Selector mismatch | Inspect page, update main_content selector
Wrong pages scraped  | URL pattern issue | Check include/exclude patterns
Rate limited         | Too aggressive    | Increase rate_limit to 1.0+ seconds
Memory issues        | Too many pages    | Add max_pages limit, enable checkpoints
Categories wrong     | Keyword mismatch  | Update category keywords in config
