data-scraper

Web page data collection and structured text extraction

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "data-scraper" with this command: npx skills add mupengi-bot/data-scraper

data-scraper

Web Data Scraper — Extract structured data from web pages using curl + parsing. Lightweight, no browser required. Supports HTML-to-text, table extraction, price monitoring, and batch scraping.

When to Use

  • Extract text content from web pages (articles, blogs, docs)
  • Scrape product prices, reviews, or listings
  • Monitor pages for changes (price drops, new content)
  • Batch-collect data from multiple URLs
  • Convert HTML tables to structured formats (JSON/CSV)

Quick Start

# Extract readable text from URL
data-scraper fetch "https://example.com/article"

# Extract specific elements
data-scraper extract "https://example.com" --selector "h2, .price"

# Monitor for changes
data-scraper watch "https://example.com/product" --interval 3600

Extraction Modes

Text Mode (default)

Fetches page and extracts readable content, stripping HTML tags, scripts, and styles. Similar to reader mode.

data-scraper fetch URL
# Output: clean markdown text

Selector Mode

Target specific CSS selectors for precise extraction.

data-scraper extract URL --selector ".product-title, .price, .rating"
# Output: matched elements as structured data

Table Mode

Extract HTML tables into structured formats.

data-scraper table URL --index 0
# Output: JSON array of row objects (header → value mapping)

Link Mode

Extract all links from a page with optional filtering.

data-scraper links URL --filter "*.pdf"
# Output: filtered list of absolute URLs

Batch Scraping

# Scrape multiple URLs
data-scraper batch urls.txt --output results/

# With rate limiting
data-scraper batch urls.txt --delay 2000 --output results/

urls.txt format:

https://site1.com/page
https://site2.com/page
https://site3.com/page

Change Monitoring

# Watch for changes, alert on diff
data-scraper watch URL --selector ".price" --interval 3600

# Compare with previous snapshot
data-scraper diff URL

Stores snapshots in data-scraper/snapshots/ with timestamps. Alerts via notification-hub when changes detected.

Output Formats

FormatFlagUse Case
Text--format textReading, summarization
JSON--format jsonData processing
CSV--format csvSpreadsheets
Markdown--format mdDocumentation

Headers & Auth

# Custom headers
data-scraper fetch URL --header "Authorization: Bearer TOKEN"

# Cookie-based auth
data-scraper fetch URL --cookie "session=abc123"

# User-Agent override
data-scraper fetch URL --ua "Mozilla/5.0..."

Rate Limiting & Ethics

  • Default: 1 request per second per domain
  • Respects robots.txt when --polite flag is set
  • Configurable delay between requests
  • Stops on 429 (Too Many Requests) and backs off

Error Handling

ErrorBehavior
404Log and skip
403/401Warn about auth requirement
429Exponential backoff (max 3 retries)
TimeoutRetry once with longer timeout
SSL errorWarn, option to proceed with --insecure

Integration

  • web-claude: Use as fallback when web_fetch isn't enough
  • competitor-watch: Feed scraped data into competitor analysis
  • seo-audit: Scrape competitor pages for SEO comparison
  • performance-tracker: Collect social metrics from public profiles

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

General

Content Refresher

Use when updating outdated content, fixing traffic/ranking decay, refreshing stats, adding new sections, or improving freshness signals. 内容更新/排名恢复

Registry SourceRecently Updated
General

AssemblyAI Transcriber

Transcribe audio files with speaker diarization (who speaks when). Supports 100+ languages, automatic language detection, and timestamps. Use for meetings, interviews, podcasts, or voice messages. Requires AssemblyAI API key.

Registry SourceRecently Updated
General

mac-node-snapshot

A robust, permission-friendly method to capture macOS screens via OpenClaw screen.record. Ideal for headless environments or ensuring capture reliability.

Registry SourceRecently Updated
1.4K0taozhe6
General

Amazon Asin Lookup Api Skill

This skill helps users extract structured product details from Amazon using a specific ASIN (Amazon Standard Identification Number). Use this skill when the...

Registry SourceRecently Updated
1.3K1phheng