# Data Scraper

Extract structured data from websites and APIs.

## Quick Start

```bash
# Basic page scrape
python scripts/scrape.py --url https://example.com --output data.json
```

## Core Features
- CSS/XPath selectors: Target specific elements
- Multiple output formats: JSON, CSV, Markdown
- Pagination support: Scrape multiple pages
- Rate limiting: Respect server limits
- Authentication: Handle login/sessions
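To illustrate what selector-based extraction looks like in practice, here is a minimal sketch using only the standard library's `html.parser` (the script itself would more likely use a library such as BeautifulSoup or lxml; the class and its behavior are illustrative, not `scrape.py`'s actual internals):

```python
from html.parser import HTMLParser

class ClassTextExtractor(HTMLParser):
    """Collect the text content of elements matching a class selector
    (the ".price"-style selectors used throughout this README)."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name  # e.g. "price" for the ".price" selector
        self.results = []             # text of each matching element
        self._capture = False         # True right after a matching start tag

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self._capture = True

    def handle_data(self, data):
        if self._capture and data.strip():
            self.results.append(data.strip())
            self._capture = False

    def handle_endtag(self, tag):
        self._capture = False

parser = ClassTextExtractor("price")
parser.feed('<li class="product"><span class="price">$9.99</span></li>')
print(parser.results)  # → ['$9.99']
```

This sketch only captures the first text node inside each match; a real selector engine also handles nesting, attributes (`::attr(src)`), and XPath.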
## Usage

```
python scripts/scrape.py [OPTIONS]

Options:
  --url TEXT        URL to scrape (required)
  --selector TEXT   CSS selector for data extraction
  --output PATH     Output file path
  --format FORMAT   Output format: json, csv, markdown
  --limit NUM       Maximum items to scrape
  --wait SECS       Wait between requests
  --login URL       Login URL for authenticated scraping
```
## Examples

### Product Price Collection

```bash
python scripts/scrape.py \
  --url "https://example.com/products" \
  --selector ".product" \
  --output prices.json \
  --format json
```

### News Article Aggregation

```bash
python scripts/scrape.py \
  --url "https://news.example.com/latest" \
  --selector "article" \
  --output news.md \
  --format markdown
```
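The `--format markdown` output in the second example might resemble a Markdown table. A minimal sketch of rendering scraped items that way (the function name and exact layout are assumptions, not the script's actual output format):

```python
# Sketch: render scraped items (a list of dicts) as a Markdown table.
# The real scrape.py may format its markdown output differently.
def to_markdown(items):
    if not items:
        return ""
    headers = list(items[0])  # column order taken from the first item
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for item in items:
        lines.append("| " + " | ".join(str(item.get(h, "")) for h in headers) + " |")
    return "\n".join(lines)

print(to_markdown([{"title": "Widget", "price": "$9.99"}]))
# | title | price |
# | --- | --- |
# | Widget | $9.99 |
```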
## Configuration File

Create a `scrape.yaml` for more complex scraping jobs:

```yaml
url: https://example.com/products
selectors:
  items: ".product-card"
  title: ".product-title"
  price: ".price::text"
  image: "img::attr(src)"
  link: "a::attr(href)"
pagination:
  type: click
  button: ".next-page"
  max_pages: 10
output:
  format: json
  file: products.json
```
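Once parsed (for example with PyYAML's `yaml.safe_load`), this config is a nested dict. A minimal validation sketch, written against a plain dict so it runs without dependencies (the key names mirror the YAML above; the required fields and defaults are assumptions about what `scrape.py` expects):

```python
# Sketch: validate a parsed scrape.yaml config dict.
# Required keys and defaults here are assumptions, not scrape.py's actual rules.
def validate_config(cfg):
    if "url" not in cfg:
        raise ValueError("config must include a 'url'")
    selectors = cfg.get("selectors", {})
    if "items" not in selectors:
        raise ValueError("selectors must include an 'items' selector")
    pagination = cfg.get("pagination", {})
    output = cfg.get("output", {})
    fmt = output.get("format", "json")
    if fmt not in ("json", "csv", "markdown"):
        raise ValueError(f"unsupported output format: {fmt}")
    return {
        "url": cfg["url"],
        "selectors": selectors,
        "max_pages": pagination.get("max_pages", 1),  # default: single page
        "format": fmt,
        "file": output.get("file", "output.json"),
    }

cfg = validate_config({
    "url": "https://example.com/products",
    "selectors": {"items": ".product-card", "title": ".product-title"},
    "pagination": {"type": "click", "button": ".next-page", "max_pages": 10},
    "output": {"format": "json", "file": "products.json"},
})
print(cfg["max_pages"])  # → 10
```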
## Best Practices
- Check robots.txt before scraping
- Add delays between requests
- Cache responses for development
- Handle errors gracefully
- Store raw HTML for debugging
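The first two practices can be sketched with the standard library alone — `urllib.robotparser` for `robots.txt` checks and a fixed delay between requests (the URLs and rules below are illustrative):

```python
import time
from urllib.robotparser import RobotFileParser

# Parse robots.txt rules directly from lines; in practice you would
# fetch https://<site>/robots.txt first. These rules are illustrative.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

def allowed(url, user_agent="*"):
    """Return True if robots.txt permits fetching this URL."""
    return rp.can_fetch(user_agent, url)

print(allowed("https://example.com/products"))   # → True
print(allowed("https://example.com/private/x"))  # → False

def polite_fetch(urls, wait=1.0):
    """Fetch each allowed URL with a fixed delay (the --wait option above)."""
    for url in urls:
        if not allowed(url):
            continue          # skip paths robots.txt disallows
        # ... fetch url here ...
        time.sleep(wait)      # pause before the next request
```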
## Legal Note

Ensure you have permission to scrape target websites. Check the site's Terms of Service and `robots.txt`.