auto-scraping-to-csv

Scrape any webpage using text-based DOM manipulation and export structured data to CSV. Controls a local browser via Playwright + Alibaba Page-Agent. No external LLM needed.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Install skill "auto-scraping-to-csv" with this command:

npx skills add science-prof-robot/auto-scraping-to-csv

Auto Scraping to CSV — Page-Agent Bridge

Scrape any webpage using text-based DOM manipulation and export structured data to CSV. Controls a local browser via Playwright + Alibaba Page-Agent. No external LLM required — Claude acts as the host model.

When to Use

  • Data extraction: "Extract all product names and prices from the listing"
  • Table scraping: "Get the top 10 rows from the pricing table"
  • News aggregation: "Scrape latest blog posts with titles, dates, and URLs"
  • Form & workflow testing: "Fill the signup form with test@example.com and submit"
  • UI verification: "Verify the dashboard shows 3 items in the table"
  • End-to-end journeys: "Login → add item to cart → checkout → confirm order"
  • Regression testing: Re-run natural language test scripts after deploys

How It Works

Claude (Host Model)
    ↕  HTTP
Bridge Server (Node.js + Playwright)
    ↕  page.evaluate()
Browser (Chromium) ← Page-Agent injected
  1. Bridge launches a local Chromium browser via Playwright
  2. Page-Agent is injected into the target page as an IIFE script loaded from a CDN
  3. Page-Agent indexes the DOM and generates a simplified text representation of interactive elements with numeric indices:
    [5]<button>Submit</button>
    [12]<input placeholder="Email" type="email"/>
    
  4. Claude receives the text state, decides the next action, and instructs the bridge to execute it
  5. Loop continues until the task is complete or max steps reached
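
The five steps above can be sketched as a minimal Python driver (a sketch only, assuming the default port and the request/response shapes shown in the API reference; the session id `a1b2c3d4` is illustrative):

```python
import json
import urllib.request

BRIDGE = "http://localhost:9876"  # default bridge port

def act_payload(action, **params):
    """Build the JSON body for POST /sessions/:id/act."""
    return json.dumps({"action": action, "params": params})

def observe(session_id):
    """Fetch the simplified DOM text state for a session."""
    with urllib.request.urlopen(f"{BRIDGE}/sessions/{session_id}/state") as resp:
        return json.loads(resp.read())

def act(session_id, action, **params):
    """Execute one Page-Agent action via the bridge."""
    req = urllib.request.Request(
        f"{BRIDGE}/sessions/{session_id}/act",
        data=act_payload(action, **params).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# One turn of the loop (Claude performs the "think" step between the calls):
#   state = observe("a1b2c3d4")          # -> text like [5]<button>Submit</button>
#   act("a1b2c3d4", "clickElement", index=5)
```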

Key Design Decisions

| Decision | Rationale |
|----------|-----------|
| Text-based DOM | No screenshots, no vision model needed. Faster and cheaper. |
| Host model | Claude is the reasoning engine. No OpenAI/Qwen API key needed. |
| HTTP bridge | Playwright runs in Node.js; Claude communicates via simple HTTP. |
| Turn-based loop | Compatible with Claude Code's chat interaction model. |
| CDN injection | No npm install of page-agent needed; auto-updates to latest. |
| CSV export | Built-in workflow to convert scraped JSON data to CSV files. |

First-Time Setup

1. Install Playwright

npm install -D playwright
npx playwright install chromium

2. Place the Bridge Script

After installing this skill, copy the bundled bridge script to .claude/agents/:

cp .claude/skills/auto-scraping-to-csv/page-agent-bridge.mjs .claude/agents/

3. Start the Bridge

In a separate terminal (the bridge must stay running):

node .claude/agents/page-agent-bridge.mjs

Default port: 9876. Custom port:

node .claude/agents/page-agent-bridge.mjs 8888

You should see:

🚀  Page-Agent Bridge (Host Model) running on http://localhost:9876

4. Verify Health

curl http://localhost:9876/health

Expected: { "status": "ok", "sessions": 0, "maxSessions": 5 }


Workflow

Phase 1: Initialize Session

curl -X POST http://localhost:9876/sessions \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "headless": false}'

Response:

{ "id": "a1b2c3d4", "url": "https://example.com" }

Phase 2: Observe → Think → Act Loop

Step 2a. Observe (fetch DOM state)

curl http://localhost:9876/sessions/a1b2c3d4/state

Step 2b. Think (Claude decides)

Based on the content text, identify the target element index and choose an action.

Step 2c. Act (execute action)

curl -X POST http://localhost:9876/sessions/a1b2c3d4/act \
  -H "Content-Type: application/json" \
  -d '{"action": "clickElement", "params": {"index": 5}}'

Repeat observe → act until complete.

Phase 3: Close Session

curl -X DELETE http://localhost:9876/sessions/a1b2c3d4

Or stop the bridge:

curl -X POST http://localhost:9876/shutdown

Scraping to CSV Workflow

Step 1: Navigate and Get DOM State

Start a session on your target URL and fetch the DOM state:

curl -X POST http://localhost:9876/sessions \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/products", "headless": true}'

Step 2: Extract Structured Data via JavaScript

Use the executeJavascript action to extract data from the page:

cat > /tmp/extract.json << 'EOF'
{"action": "executeJavascript", "params": {"script": "const items = Array.from(document.querySelectorAll('.product')).map(el => ({name: (el.querySelector('.title')?.textContent || '').trim(), price: (el.querySelector('.price')?.textContent || '').trim(), url: el.querySelector('a')?.href || ''})); return JSON.stringify(items);"}}
EOF

curl -X POST http://localhost:9876/sessions/SESSION_ID/act \
  -H "Content-Type: application/json" \
  -d @/tmp/extract.json

Step 3: Convert JSON to CSV

Option A — Python (recommended)

python3 << 'PYEOF'
import json, csv, re, sys

# The bridge returns: "✅ Executed JavaScript. Result: [{...}, {...}]"
# Extract the JSON array from the message
msg = """PASTE_BRIDGE_RESPONSE_HERE"""
match = re.search(r'Result: (\[.*\])', msg, re.DOTALL)
if match:
    data = json.loads(match.group(1))
    with open('output.csv', 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=data[0].keys())
        writer.writeheader()
        writer.writerows(data)
    print(f"Wrote {len(data)} rows to output.csv")
else:
    sys.exit("No JSON array found in bridge response")
PYEOF
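
Note that fieldnames=data[0].keys() raises if a later row contains a key the first row lacks. A defensive variant that unions the keys across all rows (a standalone sketch, not part of the bridge):

```python
import csv
import io
import json

def rows_to_csv(rows):
    """Convert a list of flat dicts to CSV text; columns are the union of all keys."""
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    buf = io.StringIO()
    # restval="" fills cells for rows missing a key instead of raising
    writer = csv.DictWriter(buf, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

data = json.loads('[{"name": "A", "price": "$10"}, {"name": "B"}]')
print(rows_to_csv(data))
```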

Option B — csvkit

# Install csvkit: pip install csvkit
# Feed the extracted JSON array to in2csv
echo '[{"name":"A","price":"$10"},{"name":"B","price":"$20"}]' | \
  in2csv --format json > output.csv

Option C — Node.js (no extra deps)

// Read the extracted JSON array and emit quoted CSV
const fs = require('fs');
const data = JSON.parse(fs.readFileSync('data.json', 'utf8'));
const headers = Object.keys(data[0]);
const csv = [
  headers.join(','),
  // String(... ?? '') tolerates missing and non-string values; quotes are doubled per CSV rules
  ...data.map(row => headers.map(h => `"${String(row[h] ?? '').replace(/"/g, '""')}"`).join(','))
].join('\n');
fs.writeFileSync('output.csv', csv);

Complete Example: Scrape Anthropic News to CSV

# 1. Start bridge (in separate terminal)
node .claude/agents/page-agent-bridge.mjs

# 2. Create session
curl -s -X POST http://localhost:9876/sessions \
  -H "Content-Type: application/json" \
  -d '{"url": "https://www.anthropic.com/news", "headless": true}'
# → { "id": "abc123" }

# 3. Extract news data
cat > /tmp/extract.json << 'EOF'
{"action": "executeJavascript", "params": {"script": "const items = []; document.querySelectorAll('a[href*=\"/news/\"]').forEach(a => { const href = a.href; if (!href.includes('anthropic.com/news/')) return; const h2 = a.querySelector('h2, h3'); const title = h2 ? h2.textContent.trim() : ''; const time = a.querySelector('time'); const date = time ? time.textContent.trim() : ''; if (title && date) items.push({title, date, url: href}); }); return JSON.stringify(items.slice(0, 15));"}}
EOF

curl -s -X POST http://localhost:9876/sessions/abc123/act \
  -H "Content-Type: application/json" -d @/tmp/extract.json

# 4. Convert to CSV (see Python script above)

# 5. Close session
curl -X DELETE http://localhost:9876/sessions/abc123

Available Actions

| Action | Params | Description |
|--------|--------|-------------|
| getBrowserState | — | Refresh DOM tree and return full page state |
| clickElement | `{ index: number }` | Click the interactive element at index |
| inputText | `{ index: number, text: string }` | Click then type into input element |
| selectOption | `{ index: number, optionText: string }` | Select dropdown option by visible text |
| scroll | `{ down?, num_pages?, pixels?, index? }` | Scroll vertically |
| scrollHorizontally | `{ right?, pixels, index? }` | Scroll horizontally |
| executeJavascript | `{ script: string }` | Run arbitrary JS in page context (async/await supported) |
| wait | `{ seconds: number }` | Pause execution |
| cleanUpHighlights | — | Remove all Page-Agent visual highlights |
| updateTree | — | Re-index the DOM manually |
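
Hand-escaping an executeJavascript script inside a curl -d string is fragile (the heredoc trick in the workflow above exists for that reason); letting a JSON serializer do the quoting is safer. A Python sketch (the table-extraction script is illustrative):

```python
import json

# Multi-line JS with quotes and newlines; no manual escaping needed.
script = """
const rows = Array.from(document.querySelectorAll('table tr')).map(tr =>
  Array.from(tr.querySelectorAll('td, th')).map(td => td.textContent.trim()));
return JSON.stringify(rows);
"""

# json.dumps escapes the quotes and newlines so the script survives the HTTP round trip.
payload = json.dumps({"action": "executeJavascript", "params": {"script": script}})
# POST `payload` to /sessions/:id/act with Content-Type: application/json
```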

Natural Language Commands

When this skill is active, Claude accepts natural language commands:

/scrape-to-csv <url> <description>

General scraping task with CSV export.

/scrape-to-csv https://example.com/products
  "Extract all product names, prices, and availability. Save as CSV."

/scrape-table <url> <selector>

Extract a specific HTML table.

/scrape-table https://example.com/pricing ".pricing-table"

/scrape-news <url>

Extract news/blog articles with titles, dates, and URLs.

/scrape-news https://www.anthropic.com/news

/test-frontend <url> <task>

Test forms, workflows, or UI interactions.

/test-frontend https://staging.example.com/signup
  "Fill the form with test data, submit, and verify welcome page"

Output Format

Claude produces a structured markdown report:

## Scraping Report — example.com/products
**Session:** a1b2c3d4 | **Duration:** 4.2s | **Rows:** 12

### Task
Extract all product names, prices, and availability. Save as CSV.

### Execution Log

| Step | Action | Target | Result |
|------|--------|--------|--------|
| 1 | getBrowserState | — | 24 interactive elements found |
| 2 | executeJavascript | products | ✅ 12 items extracted |
| 3 | — | — | ✅ CSV written: 12 rows |

### Sample Data

| name | price | availability |
|------|-------|-------------|
| Widget A | $19.99 | In stock |
| Widget B | $29.99 | Out of stock |

### File
`./output.csv` — 12 rows, 3 columns

Troubleshooting

Bridge won't start

Error: Cannot find module 'playwright'

Fix: npm install -D playwright && npx playwright install chromium

Browser page is blank

Cause: Page didn't finish loading before Page-Agent injection.
Fix: The bridge already uses waitUntil: 'networkidle'. For SPAs, add a wait action after navigation.

Element index not found

Cause: DOM tree stale; element was added after last updateTree().
Fix: Call getBrowserState (which refreshes the tree) before acting.

CORS errors in browser console

Cause: Page-Agent IIFE loaded from CDN on a strict CSP page.
Fix: The bridge injects via page.addScriptTag({ url: CDN_URL }) which usually bypasses CSP.

Headless vs headed mode

  • Headed (headless: false): You can watch the browser. Good for debugging.
  • Headless (headless: true): Faster, good for CI.

Comparison with Other Tools

| Tool | DOM Type | LLM Required | Speed | Best For |
|------|----------|--------------|-------|----------|
| Page-Agent Bridge | Text | Host (Claude) | Fast | Precise UI tasks, forms, data extraction |
| /browse (gstack) | Visual + DOM | Host (Claude) | Medium | General QA, screenshots, visual checks |
| Playwright E2E | Code | None | Fastest | Repeatable CI suites, regression |
| Browser-Use | Text | External API | Medium | Complex multi-page research |
| Scrapy | Code | None | Fast | Large-scale crawling, pipelines |

Use this skill when:

  • You want natural language scraping commands
  • You don't want to write CSS selectors or XPath
  • You need quick one-off data extraction to CSV
  • You're iterating on frontend behavior and need verification
  • You want structured text evidence (DOM snapshots) instead of screenshots

Bridge API Reference

POST /sessions

Launch a new browser session.

Body:

{ "url": "https://example.com", "headless": false, "viewport": { "width": 1280, "height": 720 } }

Response: { "id": "abc123", "url": "https://example.com" }

GET /sessions/:id/state

Get current browser state including simplified DOM text.

Response: BrowserState object with url, title, header, content, footer.

POST /sessions/:id/act

Execute a Page-Agent action.

Body:

{ "action": "executeJavascript", "params": { "script": "return document.title;" } }

Response: { "success": true, "message": "✅ Executed JavaScript. Result: ..." }

POST /sessions/:id/navigate

Navigate to a new URL within the same session.

Body: { "url": "https://example.com/other" }

DELETE /sessions/:id

Close the browser tab and session.

POST /shutdown

Stop the bridge server and close all sessions.

GET /health

Health check. Returns { "status": "ok", "sessions": 0, "maxSessions": 5 }.
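
The endpoints above can be wrapped in a small client. A sketch using only the standard library (error handling and the /navigate and /shutdown endpoints omitted for brevity):

```python
import json
import urllib.request

class BridgeClient:
    """Thin wrapper over the bridge HTTP API (sketch; no retries or error handling)."""

    def __init__(self, base="http://localhost:9876"):
        self.base = base

    def _request(self, method, path, body=None):
        data = json.dumps(body).encode() if body is not None else None
        req = urllib.request.Request(
            self.base + path, data=data, method=method,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())

    def open_session(self, url, headless=True):
        return self._request("POST", "/sessions", {"url": url, "headless": headless})

    def state(self, sid):
        return self._request("GET", f"/sessions/{sid}/state")

    def act(self, sid, action, **params):
        return self._request("POST", f"/sessions/{sid}/act",
                             {"action": action, "params": params})

    def close(self, sid):
        return self._request("DELETE", f"/sessions/{sid}")
```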


Skill: auto-scraping-to-csv v1.0.0 | Bridge: page-agent-bridge.mjs | Powered by Alibaba Page-Agent + Playwright

