auto-scraping-to-csv

Scrape any webpage using text-based DOM manipulation and export structured data to CSV. Controls a local browser via Playwright + Alibaba Page-Agent. No external LLM needed.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Install skill "auto-scraping-to-csv" with this command:

npx skills add science-prof-robot/auto-scraping-to-csv

Auto Scraping to CSV — Page-Agent Bridge

Scrape any webpage using text-based DOM manipulation and export structured data to CSV. Controls a local browser via Playwright + Alibaba Page-Agent. No external LLM required — Claude acts as the host model.

When to Use

  • Data extraction: "Extract all product names and prices from the listing"
  • Table scraping: "Get the top 10 rows from the pricing table"
  • News aggregation: "Scrape latest blog posts with titles, dates, and URLs"
  • Form & workflow testing: "Fill the signup form with test@example.com and submit"
  • UI verification: "Verify the dashboard shows 3 items in the table"
  • End-to-end journeys: "Login → add item to cart → checkout → confirm order"
  • Regression testing: Re-run natural language test scripts after deploys

How It Works

Claude (Host Model)
    ↕  HTTP
Bridge Server (Node.js + Playwright)
    ↕  page.evaluate()
Browser (Chromium) ← Page-Agent injected
  1. Bridge launches a local Chromium browser via Playwright
  2. Page-Agent is injected into the target page as an IIFE script loaded from a CDN
  3. Page-Agent indexes the DOM and generates a simplified text representation of interactive elements with numeric indices:
    [5]<button>Submit</button>
    [12]<input placeholder="Email" type="email"/>
    
  4. Claude receives the text state, decides the next action, and instructs the bridge to execute it
  5. Loop continues until the task is complete or max steps reached
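
The five steps above can be sketched as a minimal Python driver (a sketch only, assuming the default port and the request/response shapes shown in the API reference; the session id `a1b2c3d4` is illustrative):

```python
import json
import urllib.request

BRIDGE = "http://localhost:9876"  # default bridge port

def act_payload(action, **params):
    """Build the JSON body for POST /sessions/:id/act."""
    return json.dumps({"action": action, "params": params})

def observe(session_id):
    """Fetch the simplified DOM text state for a session."""
    with urllib.request.urlopen(f"{BRIDGE}/sessions/{session_id}/state") as resp:
        return json.loads(resp.read())

def act(session_id, action, **params):
    """Execute one Page-Agent action via the bridge."""
    req = urllib.request.Request(
        f"{BRIDGE}/sessions/{session_id}/act",
        data=act_payload(action, **params).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# One turn of the loop (Claude performs the "think" step between the calls):
#   state = observe("a1b2c3d4")          # -> text like [5]<button>Submit</button>
#   act("a1b2c3d4", "clickElement", index=5)
```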

Key Design Decisions

| Decision | Rationale |
|----------|-----------|
| Text-based DOM | No screenshots, no vision model needed. Faster and cheaper. |
| Host model | Claude is the reasoning engine. No OpenAI/Qwen API key needed. |
| HTTP bridge | Playwright runs in Node.js; Claude communicates via simple HTTP. |
| Turn-based loop | Compatible with Claude Code's chat interaction model. |
| CDN injection | No npm install of page-agent needed; auto-updates to latest. |
| CSV export | Built-in workflow to convert scraped JSON data to CSV files. |

First-Time Setup

1. Install Playwright

npm install -D playwright
npx playwright install chromium

2. Place the Bridge Script

After installing this skill, copy the bundled bridge script to .claude/agents/:

cp .claude/skills/auto-scraping-to-csv/page-agent-bridge.mjs .claude/agents/

3. Start the Bridge

In a separate terminal (the bridge must stay running):

node .claude/agents/page-agent-bridge.mjs

Default port: 9876. Custom port:

node .claude/agents/page-agent-bridge.mjs 8888

You should see:

🚀  Page-Agent Bridge (Host Model) running on http://localhost:9876

4. Verify Health

curl http://localhost:9876/health

Expected: { "status": "ok", "sessions": 0, "maxSessions": 5 }


Workflow

Phase 1: Initialize Session

curl -X POST http://localhost:9876/sessions \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "headless": false}'

Response:

{ "id": "a1b2c3d4", "url": "https://example.com" }

Phase 2: Observe → Think → Act Loop

Step 2a. Observe (fetch DOM state)

curl http://localhost:9876/sessions/a1b2c3d4/state

Step 2b. Think (Claude decides)

Based on the content text, identify the target element index and choose an action.

Step 2c. Act (execute action)

curl -X POST http://localhost:9876/sessions/a1b2c3d4/act \
  -H "Content-Type: application/json" \
  -d '{"action": "clickElement", "params": {"index": 5}}'

Repeat observe → act until complete.

Phase 3: Close Session

curl -X DELETE http://localhost:9876/sessions/a1b2c3d4

Or stop the bridge:

curl -X POST http://localhost:9876/shutdown

Scraping to CSV Workflow

Step 1: Navigate and Get DOM State

Start a session on your target URL and fetch the DOM state:

curl -X POST http://localhost:9876/sessions \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/products", "headless": true}'

Step 2: Extract Structured Data via JavaScript

Use the executeJavascript action to extract data from the page:

cat > /tmp/extract.json << 'EOF'
{"action": "executeJavascript", "params": {"script": "const items = Array.from(document.querySelectorAll('.product')).map(el => ({name: (el.querySelector('.title')?.textContent || '').trim(), price: (el.querySelector('.price')?.textContent || '').trim(), url: el.querySelector('a')?.href || ''})); return JSON.stringify(items);"}}
EOF

curl -X POST http://localhost:9876/sessions/SESSION_ID/act \
  -H "Content-Type: application/json" \
  -d @/tmp/extract.json

Step 3: Convert JSON to CSV

Option A — Python (recommended)

python3 << 'PYEOF'
import json, csv, re, sys

# The bridge returns: "✅ Executed JavaScript. Result: [{...}, {...}]"
# Extract the JSON array from the message
msg = """PASTE_BRIDGE_RESPONSE_HERE"""
match = re.search(r'Result: (\[.*\])', msg, re.DOTALL)
if match:
    data = json.loads(match.group(1))
    with open('output.csv', 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=data[0].keys())
        writer.writeheader()
        writer.writerows(data)
    print(f"Wrote {len(data)} rows to output.csv")
else:
    sys.exit("No JSON array found in bridge response")
PYEOF
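
Note that fieldnames=data[0].keys() raises if a later row contains a key the first row lacks. A defensive variant that unions the keys across all rows (a standalone sketch, not part of the bridge):

```python
import csv
import io
import json

def rows_to_csv(rows):
    """Convert a list of flat dicts to CSV text; columns are the union of all keys."""
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    buf = io.StringIO()
    # restval="" fills cells for rows missing a key instead of raising
    writer = csv.DictWriter(buf, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

data = json.loads('[{"name": "A", "price": "$10"}, {"name": "B"}]')
print(rows_to_csv(data))
```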

Option B — csvkit

# Install csvkit: pip install csvkit
# Feed the extracted JSON array to in2csv
echo '[{"name":"A","price":"$10"},{"name":"B","price":"$20"}]' | \
  in2csv --format json > output.csv

Option C — Node.js (no extra deps)

// Read the extracted JSON array and emit quoted CSV
const fs = require('fs');
const data = JSON.parse(fs.readFileSync('data.json', 'utf8'));
const headers = Object.keys(data[0]);
const csv = [
  headers.join(','),
  // String(... ?? '') tolerates missing and non-string values; quotes are doubled per CSV rules
  ...data.map(row => headers.map(h => `"${String(row[h] ?? '').replace(/"/g, '""')}"`).join(','))
].join('\n');
fs.writeFileSync('output.csv', csv);

Complete Example: Scrape Anthropic News to CSV

# 1. Start bridge (in separate terminal)
node .claude/agents/page-agent-bridge.mjs

# 2. Create session
curl -s -X POST http://localhost:9876/sessions \
  -H "Content-Type: application/json" \
  -d '{"url": "https://www.anthropic.com/news", "headless": true}'
# → { "id": "abc123" }

# 3. Extract news data
cat > /tmp/extract.json << 'EOF'
{"action": "executeJavascript", "params": {"script": "const items = []; document.querySelectorAll('a[href*=\"/news/\"]').forEach(a => { const href = a.href; if (!href.includes('anthropic.com/news/')) return; const h2 = a.querySelector('h2, h3'); const title = h2 ? h2.textContent.trim() : ''; const time = a.querySelector('time'); const date = time ? time.textContent.trim() : ''; if (title && date) items.push({title, date, url: href}); }); return JSON.stringify(items.slice(0, 15));"}}
EOF

curl -s -X POST http://localhost:9876/sessions/abc123/act \
  -H "Content-Type: application/json" -d @/tmp/extract.json

# 4. Convert to CSV (see Python script above)

# 5. Close session
curl -X DELETE http://localhost:9876/sessions/abc123

Available Actions

| Action | Params | Description |
|--------|--------|-------------|
| getBrowserState | — | Refresh DOM tree and return full page state |
| clickElement | `{ index: number }` | Click the interactive element at index |
| inputText | `{ index: number, text: string }` | Click then type into input element |
| selectOption | `{ index: number, optionText: string }` | Select dropdown option by visible text |
| scroll | `{ down?, num_pages?, pixels?, index? }` | Scroll vertically |
| scrollHorizontally | `{ right?, pixels, index? }` | Scroll horizontally |
| executeJavascript | `{ script: string }` | Run arbitrary JS in page context (async/await supported) |
| wait | `{ seconds: number }` | Pause execution |
| cleanUpHighlights | — | Remove all Page-Agent visual highlights |
| updateTree | — | Re-index the DOM manually |
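
Hand-escaping an executeJavascript script inside a curl -d string is fragile (the heredoc trick in the workflow above exists for that reason); letting a JSON serializer do the quoting is safer. A Python sketch (the table-extraction script is illustrative):

```python
import json

# Multi-line JS with quotes and newlines; no manual escaping needed.
script = """
const rows = Array.from(document.querySelectorAll('table tr')).map(tr =>
  Array.from(tr.querySelectorAll('td, th')).map(td => td.textContent.trim()));
return JSON.stringify(rows);
"""

# json.dumps escapes the quotes and newlines so the script survives the HTTP round trip.
payload = json.dumps({"action": "executeJavascript", "params": {"script": script}})
# POST `payload` to /sessions/:id/act with Content-Type: application/json
```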

Natural Language Commands

When this skill is active, Claude accepts natural language commands:

/scrape-to-csv <url> <description>

General scraping task with CSV export.

/scrape-to-csv https://example.com/products
  "Extract all product names, prices, and availability. Save as CSV."

/scrape-table <url> <selector>

Extract a specific HTML table.

/scrape-table https://example.com/pricing ".pricing-table"

/scrape-news <url>

Extract news/blog articles with titles, dates, and URLs.

/scrape-news https://www.anthropic.com/news

/test-frontend <url> <task>

Test forms, workflows, or UI interactions.

/test-frontend https://staging.example.com/signup
  "Fill the form with test data, submit, and verify welcome page"

Output Format

Claude produces a structured markdown report:

## Scraping Report — example.com/products
**Session:** a1b2c3d4 | **Duration:** 4.2s | **Rows:** 12

### Task
Extract all product names, prices, and availability. Save as CSV.

### Execution Log

| Step | Action | Target | Result |
|------|--------|--------|--------|
| 1 | getBrowserState | — | 24 interactive elements found |
| 2 | executeJavascript | products | ✅ 12 items extracted |
| 3 | — | — | ✅ CSV written: 12 rows |

### Sample Data

| name | price | availability |
|------|-------|-------------|
| Widget A | $19.99 | In stock |
| Widget B | $29.99 | Out of stock |

### File
`./output.csv` — 12 rows, 3 columns

Troubleshooting

Bridge won't start

Error: Cannot find module 'playwright'

Fix: npm install -D playwright && npx playwright install chromium

Browser page is blank

Cause: Page didn't finish loading before Page-Agent injection.
Fix: The bridge already uses waitUntil: 'networkidle'. For SPAs, add a wait action after navigation.

Element index not found

Cause: DOM tree stale; element was added after last updateTree().
Fix: Call getBrowserState (which refreshes the tree) before acting.

CORS errors in browser console

Cause: Page-Agent IIFE loaded from CDN on a strict CSP page.
Fix: The bridge injects via page.addScriptTag({ url: CDN_URL }) which usually bypasses CSP.

Headless vs headed mode

  • Headed (headless: false): You can watch the browser. Good for debugging.
  • Headless (headless: true): Faster, good for CI.

Comparison with Other Tools

| Tool | DOM Type | LLM Required | Speed | Best For |
|------|----------|--------------|-------|----------|
| Page-Agent Bridge | Text | Host (Claude) | Fast | Precise UI tasks, forms, data extraction |
| /browse (gstack) | Visual + DOM | Host (Claude) | Medium | General QA, screenshots, visual checks |
| Playwright E2E | Code | None | Fastest | Repeatable CI suites, regression |
| Browser-Use | Text | External API | Medium | Complex multi-page research |
| Scrapy | Code | None | Fast | Large-scale crawling, pipelines |

Use this skill when:

  • You want natural language scraping commands
  • You don't want to write CSS selectors or XPath
  • You need quick one-off data extraction to CSV
  • You're iterating on frontend behavior and need verification
  • You want structured text evidence (DOM snapshots) instead of screenshots

Bridge API Reference

POST /sessions

Launch a new browser session.

Body:

{ "url": "https://example.com", "headless": false, "viewport": { "width": 1280, "height": 720 } }

Response: { "id": "abc123", "url": "https://example.com" }

GET /sessions/:id/state

Get current browser state including simplified DOM text.

Response: BrowserState object with url, title, header, content, footer.

POST /sessions/:id/act

Execute a Page-Agent action.

Body:

{ "action": "executeJavascript", "params": { "script": "return document.title;" } }

Response: { "success": true, "message": "✅ Executed JavaScript. Result: ..." }

POST /sessions/:id/navigate

Navigate to a new URL within the same session.

Body: { "url": "https://example.com/other" }

DELETE /sessions/:id

Close the browser tab and session.

POST /shutdown

Stop the bridge server and close all sessions.

GET /health

Health check. Returns { "status": "ok", "sessions": 0, "maxSessions": 5 }.
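
The endpoints above can be wrapped in a small client. A sketch using only the standard library (error handling and the /navigate and /shutdown endpoints omitted for brevity):

```python
import json
import urllib.request

class BridgeClient:
    """Thin wrapper over the bridge HTTP API (sketch; no retries or error handling)."""

    def __init__(self, base="http://localhost:9876"):
        self.base = base

    def _request(self, method, path, body=None):
        data = json.dumps(body).encode() if body is not None else None
        req = urllib.request.Request(
            self.base + path, data=data, method=method,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())

    def open_session(self, url, headless=True):
        return self._request("POST", "/sessions", {"url": url, "headless": headless})

    def state(self, sid):
        return self._request("GET", f"/sessions/{sid}/state")

    def act(self, sid, action, **params):
        return self._request("POST", f"/sessions/{sid}/act",
                             {"action": action, "params": params})

    def close(self, sid):
        return self._request("DELETE", f"/sessions/{sid}")
```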


Skill: auto-scraping-to-csv v1.0.0 | Bridge: page-agent-bridge.mjs | Powered by Alibaba Page-Agent + Playwright

