data-base

Mental Model

Data acquisition converts unstructured web content into structured data. Choose the tool based on page complexity: JS-heavy pages → chrome-devtools MCP, static pages → Python requests.

Tool Selection

| Page Type | Tool | When to Use |
|-----------|------|-------------|
| Dynamic (JS-rendered, SPAs) | chrome-devtools MCP | React/Vue apps, infinite scroll, login gates |
| Static HTML | Python requests | Blogs, news sites, simple pages |
| Complex/reusable logic | Python script | Multi-step scraping, rate limiting, proxies |
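
When the page type is not obvious up front, a rough check on the raw HTML can hint at which tool fits. The sketch below is a heuristic, not part of this skill: the name `suggest_tool` and the 200-character threshold are assumptions chosen for illustration. It treats a page with scripts but almost no visible text as JS-rendered.

```python
from html.parser import HTMLParser


class _PageStats(HTMLParser):
    """Counts <script> tags and visible text characters in an HTML document."""

    def __init__(self):
        super().__init__()
        self.script_count = 0
        self.text_chars = 0
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_count += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only text outside <script>/<style> counts as visible content.
        if self._skip_depth == 0:
            self.text_chars += len(data.strip())


def suggest_tool(html, min_text_chars=200):
    """Return a tool name from the table above (hypothetical heuristic)."""
    stats = _PageStats()
    stats.feed(html)
    stats.close()
    # Scripts present but little server-rendered text: likely a JS-rendered page.
    if stats.text_chars < min_text_chars and stats.script_count > 0:
        return "chrome-devtools MCP"
    return "Python requests"
```

A page that passes this check can still hide content behind infinite scroll or a login gate, so the result is a starting point rather than a decision.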

Anti-Patterns (NEVER)

Don't scrape without checking robots.txt
Don't overload servers (default: 1 req/sec)
Don't scrape personal data without consent
Don't use Chinese characters in output filenames (ASCII only)
Don't forget to identify the bot with a User-Agent header
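
Several of these rules can be enforced in code rather than by memory. The following is a minimal sketch using only the standard library; `USER_AGENT`, `is_allowed`, and `RateLimiter` are hypothetical names, and the contact address is a placeholder. It checks a URL against robots.txt rules and spaces requests at the 1 req/sec default.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical bot identity; replace with a real name and contact.
USER_AGENT = "example-data-bot/0.1 (+mailto:ops@example.com)"


def is_allowed(robots_txt, url, user_agent=USER_AGENT):
    """Check a URL against the text of an already-downloaded robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class RateLimiter:
    """Enforces a minimum interval between calls (default: 1 request/second)."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._last = None  # monotonic time of the previous call

    def wait(self):
        """Sleep if needed and return the number of seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = time.monotonic() - self._last
            delay = max(0.0, self.min_interval - elapsed)
        if delay > 0:
            time.sleep(delay)
        self._last = time.monotonic()
        return delay
```

In use, a script would call `limiter.wait()` before each request and send `USER_AGENT` in the request headers, for example `requests.get(url, headers={"User-Agent": USER_AGENT})`.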

Output Format

JSON: Nested/hierarchical data
CSV: Tabular data
Filename: {source}_{timestamp}.{ext} (ASCII only, e.g., news_20250115.csv)
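
The naming and format rules above can be sketched as two small helpers. This is an illustration under assumptions: `build_filename` and `save_records` are hypothetical names, the timestamp uses the `YYYYMMDD` form from the example, and "nested" is taken to mean that some value is a dict or list.

```python
import csv
import json
import re
from datetime import datetime
from pathlib import Path


def build_filename(source, ext, when=None):
    """Build an ASCII-only name of the form {source}_{timestamp}.{ext}."""
    when = when or datetime.now()
    # Drop non-ASCII characters, then collapse everything else into underscores.
    ascii_source = source.encode("ascii", "ignore").decode("ascii")
    slug = re.sub(r"[^A-Za-z0-9]+", "_", ascii_source).strip("_").lower() or "data"
    return f"{slug}_{when:%Y%m%d}.{ext}"


def save_records(records, source, out_dir=".", when=None):
    """Write a list of dicts as CSV if flat, JSON if nested; return the path."""
    nested = any(isinstance(v, (dict, list)) for r in records for v in r.values())
    ext = "json" if nested else "csv"
    path = Path(out_dir) / build_filename(source, ext, when)
    if nested:
        path.write_text(json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8")
    else:
        # Preserve first-seen column order across all records.
        fieldnames = list(dict.fromkeys(k for r in records for k in r))
        with path.open("w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(records)
    return path
```

Only the filename is restricted to ASCII here; the file contents are written as UTF-8 so that scraped text in any language survives.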

Workflow

Ask: What data? Which sites? How much?
Select tool based on page type
Extract and save structured data
Deliver the file path to the user or pass it to the data-analysis skill
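
For a static page, the extraction step might look like the sketch below. It is an assumption-laden example, not the skill's own template: `LinkExtractor` and `extract_links` are hypothetical names, and it uses the standard-library parser so it runs without dependencies. The same result could be obtained with beautifulsoup4 once the environment is set up. It turns every link with visible text into a `{"title", "url"}` record.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects {"title", "url"} records from <a href="..."> elements."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.records = []
        self._href = None  # absolute URL of the link being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace in the link text.
            title = " ".join("".join(self._text).split())
            if title:
                self.records.append({"title": title, "url": self._href})
            self._href = None


def extract_links(html, base_url):
    """Parse already-fetched HTML and return a list of link records."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.records
```

The function takes HTML that has already been fetched, which keeps parsing separate from network access and makes it easy to test against saved pages.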

Python Environment

Auto-initialize virtual environment if needed, then execute:

cd skills/data-base

if [ ! -f ".venv/bin/python" ]; then
    echo "Creating Python environment..."
    ./setup.sh
fi

.venv/bin/python your_script.py

The setup script installs requests, beautifulsoup4, pandas, and other web scraping tools.

References (load on demand)

For detailed APIs and templates, load: references/REFERENCE.md, references/templates.md

Related Skills

data-analysis

weather

hotel