Auto Scraping to CSV — Page-Agent Bridge
Scrape any webpage using text-based DOM manipulation and export structured data to CSV. Controls a local browser via Playwright + Alibaba Page-Agent. No external LLM required — Claude acts as the host model.
When to Use
- Data extraction: "Extract all product names and prices from the listing"
- Table scraping: "Get the top 10 rows from the pricing table"
- News aggregation: "Scrape latest blog posts with titles, dates, and URLs"
- Form & workflow testing: "Fill the signup form with test@example.com and submit"
- UI verification: "Verify the dashboard shows 3 items in the table"
- End-to-end journeys: "Login → add item to cart → checkout → confirm order"
- Regression testing: Re-run natural language test scripts after deploys
How It Works
Claude (Host Model)
↕ HTTP
Bridge Server (Node.js + Playwright)
↕ page.evaluate()
Browser (Chromium) ← Page-Agent injected
- Bridge launches a local Chromium browser via Playwright
- Page-Agent is injected as an IIFE script from CDN into the target page
- Page-Agent indexes the DOM and generates a simplified text representation of interactive elements with numeric indices:
[5]<button>Submit</button> [12]<input placeholder="Email" type="email"/>
- Claude receives the text state, decides the next action, and instructs the bridge to execute it
- Loop continues until the task is complete or max steps reached
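The indexed element lines can be parsed mechanically. A minimal Python sketch, assuming the `[index]<tag ...>text</tag>` shape shown above (the real Page-Agent serialization may carry additional attributes):

```python
import re

# Matches lines like: [5]<button>Submit</button> or [12]<input type="email"/>
# Assumed format; adjust if the bridge emits extra decoration.
ELEMENT_RE = re.compile(r'\[(\d+)\]<(\w+)([^>]*?)(?:/>|>(.*?)</\2>)')

def parse_elements(state_text):
    """Return (index, tag, inner_text) tuples from the simplified DOM text."""
    return [(int(i), tag, (text or '').strip())
            for i, tag, _attrs, text in ELEMENT_RE.findall(state_text)]
```

This gives Claude (or any driver script) a structured view of which numeric indices are available to act on.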
Key Design Decisions
| Decision | Rationale |
|---|---|
| Text-based DOM | No screenshots, no vision model needed. Faster and cheaper. |
| Host model | Claude is the reasoning engine. No OpenAI/Qwen API key needed. |
| HTTP bridge | Playwright runs in Node.js; Claude communicates via simple HTTP. |
| Turn-based loop | Compatible with Claude Code's chat interaction model. |
| CDN injection | No npm install of page-agent needed; auto-updates to latest. |
| CSV export | Built-in workflow to convert scraped JSON data to CSV files. |
First-Time Setup
1. Install Playwright
npm install -D playwright
npx playwright install chromium
2. Place the Bridge Script
After installing this skill, copy the bundled bridge script to .claude/agents/:
cp .claude/skills/auto-scraping-to-csv/page-agent-bridge.mjs .claude/agents/
3. Start the Bridge
In a separate terminal (the bridge must stay running):
node .claude/agents/page-agent-bridge.mjs
Default port: 9876. Custom port:
node .claude/agents/page-agent-bridge.mjs 8888
You should see:
🚀 Page-Agent Bridge (Host Model) running on http://localhost:9876
4. Verify Health
curl http://localhost:9876/health
Expected: { "status": "ok", "sessions": 0, "maxSessions": 5 }
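For scripting, the health response can be checked before opening sessions. A small sketch using the field names from the response above:

```python
import json

def has_capacity(health_json):
    """True if the bridge reports healthy and is below its session limit."""
    h = json.loads(health_json)
    return h.get("status") == "ok" and h.get("sessions", 0) < h.get("maxSessions", 0)
```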
Workflow
Phase 1: Initialize Session
curl -X POST http://localhost:9876/sessions \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com", "headless": false}'
Response:
{ "id": "a1b2c3d4", "url": "https://example.com" }
Phase 2: Observe → Think → Act Loop
Step 2a. Observe (fetch DOM state)
curl http://localhost:9876/sessions/a1b2c3d4/state
Step 2b. Think (Claude decides)
Based on the content text, identify the target element index and choose an action.
Step 2c. Act (execute action)
curl -X POST http://localhost:9876/sessions/a1b2c3d4/act \
-H "Content-Type: application/json" \
-d '{"action": "clickElement", "params": {"index": 5}}'
Repeat observe → act until complete.
Phase 3: Close Session
curl -X DELETE http://localhost:9876/sessions/a1b2c3d4
Or stop the bridge:
curl -X POST http://localhost:9876/shutdown
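The three phases above can also be driven from Python instead of curl. A standard-library sketch; the endpoint paths follow the bridge API, while the `BRIDGE` constant and the minimal error handling are assumptions:

```python
import json
import urllib.request

BRIDGE = "http://localhost:9876"  # assumed default port

def act_payload(action, **params):
    """Build the JSON body for POST /sessions/:id/act."""
    return json.dumps({"action": action, "params": params}).encode()

def call(method, path, body=None):
    """Send one HTTP request to the bridge and decode the JSON response."""
    req = urllib.request.Request(BRIDGE + path, data=body, method=method,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example loop (requires a running bridge):
# sid = call("POST", "/sessions", json.dumps({"url": "https://example.com"}).encode())["id"]
# state = call("GET", f"/sessions/{sid}/state")
# call("POST", f"/sessions/{sid}/act", act_payload("clickElement", index=5))
# call("DELETE", f"/sessions/{sid}")
```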
Scraping to CSV Workflow
Step 1: Navigate and Get DOM State
Start a session on your target URL and fetch the DOM state:
curl -X POST http://localhost:9876/sessions \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com/products", "headless": true}'
Step 2: Extract Structured Data via JavaScript
Use the executeJavascript action to extract data from the page:
cat > /tmp/extract.json << 'EOF'
{"action": "executeJavascript", "params": {"script": "const items = Array.from(document.querySelectorAll('.product')).map(el => ({name: el.querySelector('.title').textContent.trim(), price: el.querySelector('.price').textContent.trim(), url: el.querySelector('a').href})); return JSON.stringify(items);"}}
EOF
curl -X POST http://localhost:9876/sessions/SESSION_ID/act \
-H "Content-Type: application/json" \
-d @/tmp/extract.json
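Hand-escaping JavaScript inside a JSON string is error-prone; generating the payload file from Python sidesteps it entirely. A sketch using the same assumed `.product` selectors as above:

```python
import json

# The extraction script as a plain Python string; no manual JSON escaping needed.
script = """
const items = Array.from(document.querySelectorAll('.product')).map(el => ({
  name: el.querySelector('.title').textContent.trim(),
  price: el.querySelector('.price').textContent.trim(),
}));
return JSON.stringify(items);
"""

payload = {"action": "executeJavascript", "params": {"script": script}}
with open("/tmp/extract.json", "w") as f:
    json.dump(payload, f)  # json.dump handles all quoting and newlines
```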
Step 3: Convert JSON to CSV
Option A — Python (recommended)
python3 << 'PYEOF'
import json, csv, re
# The bridge returns: "✅ Executed JavaScript. Result: [{...}, {...}]"
# Extract the JSON array from the message
msg = """PASTE_BRIDGE_RESPONSE_HERE"""
match = re.search(r'Result: (\[.*\])', msg)
if match:
    data = json.loads(match.group(1))
    with open('output.csv', 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=data[0].keys())
        writer.writeheader()
        writer.writerows(data)
    print(f"Wrote {len(data)} rows to output.csv")
PYEOF
Option B — csvkit
# Install csvkit: pip install csvkit
# Extract the JSON array from the bridge response, then:
echo '[{"name":"A","price":"$10"},{"name":"B","price":"$20"}]' | \
  in2csv -f json > output.csv
Option C — Node.js (no extra deps)
const fs = require('fs');
const data = JSON.parse(fs.readFileSync('data.json', 'utf8'));
const headers = Object.keys(data[0]);
const csv = [
  headers.join(','),
  // String() guards against non-string values (numbers, null) before escaping quotes
  ...data.map(row => headers.map(h => `"${String(row[h] ?? '').replace(/"/g, '""')}"`).join(','))
].join('\n');
fs.writeFileSync('output.csv', csv);
Complete Example: Scrape Anthropic News to CSV
# 1. Start bridge (in separate terminal)
node .claude/agents/page-agent-bridge.mjs
# 2. Create session
curl -s -X POST http://localhost:9876/sessions \
-H "Content-Type: application/json" \
-d '{"url": "https://www.anthropic.com/news", "headless": true}'
# → { "id": "abc123" }
# 3. Extract news data
cat > /tmp/extract.json << 'EOF'
{"action": "executeJavascript", "params": {"script": "const items = []; document.querySelectorAll('a[href*=\"/news/\"]').forEach(a => { const href = a.href; if (!href.includes('anthropic.com/news/')) return; const h2 = a.querySelector('h2, h3'); const title = h2 ? h2.textContent.trim() : ''; const time = a.querySelector('time'); const date = time ? time.textContent.trim() : ''; if (title && date) items.push({title, date, url: href}); }); return JSON.stringify(items.slice(0, 15));"}}
EOF
curl -s -X POST http://localhost:9876/sessions/abc123/act \
-H "Content-Type: application/json" -d @/tmp/extract.json
# 4. Convert to CSV (see Python script above)
# 5. Close session
curl -X DELETE http://localhost:9876/sessions/abc123
Available Actions
| Action | Params | Description |
|---|---|---|
| getBrowserState | — | Refresh DOM tree and return full page state |
| clickElement | { index: number } | Click the interactive element at index |
| inputText | { index: number, text: string } | Click, then type into the input element |
| selectOption | { index: number, optionText: string } | Select a dropdown option by visible text |
| scroll | { down?, num_pages?, pixels?, index? } | Scroll vertically |
| scrollHorizontally | { right?, pixels, index? } | Scroll horizontally |
| executeJavascript | { script: string } | Run arbitrary JS in page context (async/await supported) |
| wait | { seconds: number } | Pause execution |
| cleanUpHighlights | — | Remove all Page-Agent visual highlights |
| updateTree | — | Re-index the DOM manually |
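A driver script can guard against malformed calls before they reach the bridge. A sketch using the required params from the table above (the bridge itself remains the source of truth):

```python
# Required parameters per action, per the table above (optional-param actions omitted).
REQUIRED_PARAMS = {
    "getBrowserState": set(),
    "clickElement": {"index"},
    "inputText": {"index", "text"},
    "selectOption": {"index", "optionText"},
    "executeJavascript": {"script"},
    "wait": {"seconds"},
}

def validate_action(action, params):
    """Raise ValueError if an action is unknown or missing required params."""
    if action not in REQUIRED_PARAMS:
        raise ValueError(f"unknown action: {action}")
    missing = REQUIRED_PARAMS[action] - params.keys()
    if missing:
        raise ValueError(f"{action} missing params: {sorted(missing)}")
```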
Natural Language Commands
When this skill is active, Claude accepts natural language commands:
/scrape-to-csv <url> <description>
General scraping task with CSV export.
/scrape-to-csv https://example.com/products
"Extract all product names, prices, and availability. Save as CSV."
/scrape-table <url> <selector>
Extract a specific HTML table.
/scrape-table https://example.com/pricing ".pricing-table"
/scrape-news <url>
Extract news/blog articles with titles, dates, and URLs.
/scrape-news https://www.anthropic.com/news
/test-frontend <url> <task>
Test forms, workflows, or UI interactions.
/test-frontend https://staging.example.com/signup
"Fill the form with test data, submit, and verify welcome page"
Output Format
Claude produces a structured markdown report:
## Scraping Report — example.com/products
**Session:** a1b2c3d4 | **Duration:** 4.2s | **Rows:** 12
### Task
Extract all product names, prices, and availability. Save as CSV.
### Execution Log
| Step | Action | Target | Result |
|------|--------|--------|--------|
| 1 | getBrowserState | — | 24 interactive elements found |
| 2 | executeJavascript | products | ✅ 12 items extracted |
| 3 | — | — | ✅ CSV written: 12 rows |
### Sample Data
| name | price | availability |
|------|-------|-------------|
| Widget A | $19.99 | In stock |
| Widget B | $29.99 | Out of stock |
### File
`./output.csv` — 12 rows, 3 columns
Troubleshooting
Bridge won't start
Error: Cannot find module 'playwright'
Fix: npm install -D playwright && npx playwright install chromium
Browser page is blank
Cause: Page didn't finish loading before Page-Agent injection.
Fix: The bridge already uses waitUntil: 'networkidle'. For SPAs, add a wait action after navigation:
curl -X POST http://localhost:9876/sessions/SESSION_ID/act \
  -H "Content-Type: application/json" \
  -d '{"action": "wait", "params": {"seconds": 2}}'
Element index not found
Cause: DOM tree stale; element was added after last updateTree().
Fix: Call getBrowserState (which refreshes the tree) before acting.
CORS errors in browser console
Cause: Page-Agent IIFE loaded from CDN on a strict CSP page.
Fix: The bridge injects via page.addScriptTag({ url: CDN_URL }), which sidesteps most inline-script restrictions. If the page still blocks it, launch the Playwright browser context with the bypassCSP: true option.
Headless vs headed mode
- Headed (headless: false): You can watch the browser. Good for debugging.
- Headless (headless: true): Faster, good for CI.
Comparison with Other Tools
| Tool | DOM Type | LLM Required | Speed | Best For |
|---|---|---|---|---|
| Page-Agent Bridge | Text | Host (Claude) | Fast | Precise UI tasks, forms, data extraction |
| /browse (gstack) | Visual + DOM | Host (Claude) | Medium | General QA, screenshots, visual checks |
| Playwright E2E | Code | None | Fastest | Repeatable CI suites, regression |
| Browser-Use | Text | External API | Medium | Complex multi-page research |
| Scrapy | Code | None | Fast | Large-scale crawling, pipelines |
Use this skill when:
- You want natural language scraping commands
- You don't want to write CSS selectors or XPath
- You need quick one-off data extraction to CSV
- You're iterating on frontend behavior and need verification
- You want structured text evidence (DOM snapshots) instead of screenshots
Bridge API Reference
POST /sessions
Launch a new browser session.
Body:
{ "url": "https://example.com", "headless": false, "viewport": { "width": 1280, "height": 720 } }
Response: { "id": "abc123", "url": "https://example.com" }
GET /sessions/:id/state
Get current browser state including simplified DOM text.
Response: BrowserState object with url, title, header, content, footer.
POST /sessions/:id/act
Execute a Page-Agent action.
Body:
{ "action": "executeJavascript", "params": { "script": "return document.title;" } }
Response: { "success": true, "message": "✅ Executed JavaScript. Result: ..." }
POST /sessions/:id/navigate
Navigate to a new URL within the same session.
Body: { "url": "https://example.com/other" }
DELETE /sessions/:id
Close the browser tab and session.
POST /shutdown
Stop the bridge server and close all sessions.
GET /health
Health check. Returns { "status": "ok", "sessions": 0, "maxSessions": 5 }.
Skill: auto-scraping-to-csv v1.0.0 | Bridge: page-agent-bridge.mjs | Powered by Alibaba Page-Agent + Playwright