data-scraper

Extract data from websites and APIs for analysis. Use when the user needs to collect product prices from e-commerce sites, gather news articles, extract structured data from web pages, build datasets from public sources, or automate data collection for research.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.


Install the "data-scraper" skill with this command: npx skills add dinghaibin/scraper-pro

Data Scraper

Extract structured data from websites and APIs.

Quick Start

# Basic page scrape
python scripts/scrape.py --url https://example.com --output data.json

Core Features

  • CSS/XPath selectors: Target specific elements
  • Multiple output formats: JSON, CSV, Markdown
  • Pagination support: Scrape multiple pages
  • Rate limiting: Respect server limits
  • Authentication: Handle login/sessions
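To illustrate the selector feature, a class-based extractor can be built on the Python standard library alone. This is a minimal sketch, not the skill's actual implementation: it matches elements by class name only rather than full CSS or XPath syntax, and the `ClassExtractor` name is made up for illustration.

```python
from html.parser import HTMLParser

class ClassExtractor(HTMLParser):
    """Collect the text content of every element carrying a given class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0        # >0 while inside a matching element
        self.items = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            if self.depth == 0:
                self.items.append("")  # start a new item
            self.depth += 1
        elif self.depth:
            self.depth += 1            # nested tag inside a match

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.items[-1] += data

html = (
    '<div class="product">Widget</div>'
    '<div class="ad">skip me</div>'
    '<div class="product">Gadget</div>'
)
parser = ClassExtractor("product")
parser.feed(html)
print(parser.items)  # ['Widget', 'Gadget']
```

A real scraper would typically lean on a parsing library with proper CSS support instead, but the control flow is the same: match elements, accumulate their text.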

Usage

python scripts/scrape.py [OPTIONS]

Options:
  --url TEXT          URL to scrape (required)
  --selector TEXT     CSS selector for data extraction
  --output PATH       Output file path
  --format FORMAT     Output format: json, csv, markdown
  --limit NUM         Maximum items to scrape
  --wait SECS         Wait between requests
  --login URL         Login URL for authenticated scraping

Examples

Product Price Collection

python scripts/scrape.py \
  --url "https://example.com/products" \
  --selector ".product" \
  --output prices.json \
  --format json

News Article Aggregation

python scripts/scrape.py \
  --url "https://news.example.com/latest" \
  --selector "article" \
  --output news.md \
  --format markdown
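Both examples write the same kind of item list, only serialized differently. A rough sketch of how one item list could be emitted in each supported format, using only the standard library (the `to_*` helper names are hypothetical, not part of the skill's scripts):

```python
import csv
import io
import json

def to_json(items):
    return json.dumps(items, indent=2)

def to_csv(items):
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=items[0].keys())
    writer.writeheader()
    writer.writerows(items)
    return buf.getvalue()

def to_markdown(items):
    cols = list(items[0].keys())
    lines = [
        "| " + " | ".join(cols) + " |",
        "| " + " | ".join("---" for _ in cols) + " |",
    ]
    for row in items:
        lines.append("| " + " | ".join(str(row[c]) for c in cols) + " |")
    return "\n".join(lines)

items = [
    {"title": "Widget", "price": "9.99"},
    {"title": "Gadget", "price": "19.99"},
]
print(to_markdown(items))
```

Keeping extraction and serialization separate like this is what lets a `--format` flag switch outputs without touching the scraping logic.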

Configuration File

Create scrape.yaml for complex scraping:

url: https://example.com/products
selectors:
  items: ".product-card"
  title: ".product-title"
  price: ".price::text"
  image: "img::attr(src)"
  link: "a::attr(href)"

pagination:
  type: click
  button: ".next-page"
  max_pages: 10

output:
  format: json
  file: products.json
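Parsed with a YAML library such as PyYAML's `yaml.safe_load` (an assumption; the skill's actual loader is not shown on this page), the file above becomes a plain dictionary. A sketch of the resulting structure, with a hypothetical `validate` step that fails fast on missing required fields:

```python
# The parsed equivalent of scrape.yaml above (what yaml.safe_load would return).
config = {
    "url": "https://example.com/products",
    "selectors": {
        "items": ".product-card",
        "title": ".product-title",
        "price": ".price::text",
        "image": "img::attr(src)",
        "link": "a::attr(href)",
    },
    "pagination": {"type": "click", "button": ".next-page", "max_pages": 10},
    "output": {"format": "json", "file": "products.json"},
}

def validate(cfg):
    """Hypothetical checks for fields a scraper cannot run without."""
    assert cfg.get("url", "").startswith("http"), "url must be absolute"
    assert "items" in cfg.get("selectors", {}), "selectors.items is required"
    assert cfg.get("pagination", {}).get("max_pages", 1) >= 1, "max_pages must be >= 1"
    return cfg

validate(config)
```

The `items` selector picks the repeating container; the remaining selectors are applied relative to each match, with `::text` and `::attr(...)` choosing what to extract from it.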

Best Practices

  1. Check robots.txt before scraping
  2. Add delays between requests
  3. Cache responses for development
  4. Handle errors gracefully
  5. Store raw HTML for debugging
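Practices 2 through 5 can be combined in one small fetch wrapper. A sketch under the assumption that the actual network call is injected as a callable, which also makes it testable offline; the `polite_fetch` name is made up for illustration and is not part of the skill's scripts:

```python
import hashlib
import tempfile
import time
from pathlib import Path

def polite_fetch(url, fetch, cache_dir, delay=1.0, retries=3):
    """Fetch a URL with on-disk caching, inter-request delay, and retries.

    `fetch` is any callable url -> str (e.g. a thin urllib wrapper).
    """
    cache_dir.mkdir(exist_ok=True)
    cache_file = cache_dir / hashlib.sha256(url.encode()).hexdigest()
    if cache_file.exists():               # 3. cache responses for development
        return cache_file.read_text()
    for attempt in range(retries):
        try:
            time.sleep(delay)             # 2. delay between requests
            body = fetch(url)
            cache_file.write_text(body)   # 5. store the raw response for debugging
            return body
        except OSError:                   # 4. handle transient errors gracefully
            if attempt == retries - 1:
                raise

cache = Path(tempfile.mkdtemp())
calls = []

def fake_fetch(url):
    calls.append(url)
    return "<html>ok</html>"

print(polite_fetch("https://example.com", fake_fetch, cache, delay=0))
print(polite_fetch("https://example.com", fake_fetch, cache, delay=0))  # cache hit
print(len(calls))  # 1
```

The second call never touches the network: during development the cache means you can iterate on selectors without re-hitting the target site.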

Legal Note

Ensure you have permission to scrape target websites. Check Terms of Service and robots.txt.
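Python's standard library can evaluate robots.txt rules directly via `urllib.robotparser`. In practice you would call `rp.read()` against the site's live `/robots.txt`; here the file's contents are inlined via `parse()` so the sketch runs offline:

```python
from urllib.robotparser import RobotFileParser

# Inlined robots.txt rules; a real check would fetch them with rp.read().
rules = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 5",
]
rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "https://example.com/products"))   # True
print(rp.can_fetch("*", "https://example.com/private/x"))  # False
print(rp.crawl_delay("*"))                                 # 5
```

The reported crawl delay is a natural value for the `--wait` option: respecting it keeps the scraper within what the site explicitly asks of crawlers.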

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

Research

  • AIsa Twitter Research Engage Relay: Run Twitter/X likes, follows, replies, and OAuth-gated posting through AIsa.
  • AIsa Multi Search Engine: Multi-source search engine powered by AIsa API. Combines Tavily web search, Scholar academic search, Smart hybrid search, and Perplexity deep research.
  • Multi Source Search: Confidence-scored multi-source retrieval across web, scholar, Tavily, and Perplexity-backed research.
  • Us Stock Analyst: Professional US stock analysis with financial data, news, social sentiment, and multi-model AI. Comprehensive reports at $0.02-0.10 per analysis.