Authenticated Web Scraper
Purpose
Scrapes content from websites that require authentication (2FA, SSO, corporate login) by leveraging the user's Windows Edge browser via Chrome DevTools Protocol (CDP). Designed for WSL2 environments where Playwright/Puppeteer can't directly reach Windows browser ports.
When to Use
-
Mirroring internal documentation sites behind corporate auth
-
Scraping content from sites requiring 2FA/SSO that can't be automated
-
Extracting structured content (text, HTML, links) from authenticated web pages
-
Crawling site navigation trees and following links to a configurable depth
Architecture
WSL2 Windows ┌─────────────────┐ ┌──────────────────────┐ │ Claude Code │ │ Edge Browser │ │ │ kill │ (user's profile) │ │ 1. Kill Edge ───┼──────────>│ │ │ │ launch │ │ │ 2. Launch Edge ─┼──────────>│ --remote-debug:9222 │ │ │ │ --debug-addr:0.0.0.0 │ │ [User auths │ │ │ │ in browser] │ │ CDP WebSocket on :9222│ │ │ cmd.exe │ │ │ 3. Run scraper ─┼──────────>│ node scraper.mjs │ │ │ │ connects localhost:9222│ │ 4. Read output <┼───────────│ writes to C:\Temp... │ └─────────────────┘ └──────────────────────┘
Key insight: WSL2 cannot reach Windows localhost:9222 directly. The scraper script must run on the Windows side via cmd.exe /c "node script.mjs" .
Quick Start
When a user asks to scrape an authenticated website:
-
Kill existing Edge processes and relaunch with debug flags
-
User authenticates in the headed browser
-
Copy scraper script to Windows temp and run via cmd.exe
-
Script connects to CDP, navigates pages, extracts content
-
Read results from shared filesystem (/mnt/c/Temp/... )
Core Workflow
Phase 0: Prerequisites
-
Node.js must be installed on Windows (cmd.exe /c "where node" )
-
The ws npm package on Windows side (cmd.exe /c "cd C:\Temp && npm install ws" )
-
Edge browser installed (check /mnt/c/Program Files (x86)/Microsoft/Edge/Application/msedge.exe )
Phase 1: Launch Edge with Remote Debugging
import { execSync, spawn } from "child_process";
// CRITICAL: Kill ALL Edge processes first, otherwise debug flags are ignored execSync('cmd.exe /c "taskkill /F /IM msedge.exe /T"'); await sleep(3000);
const EDGE = "/mnt/c/Program Files (x86)/Microsoft/Edge/Application/msedge.exe"; spawn( EDGE, [ "--remote-debugging-port=9222", "--remote-debugging-address=0.0.0.0", "--remote-allow-origins=*", targetUrl, ], { detached: true, stdio: "ignore" } ).unref();
Phase 2: Verify CDP and User Auth
Verify CDP is running (must query from Windows side)
powershell.exe -Command "Invoke-RestMethod -Uri http://localhost:9222/json/version"
Tell user to authenticate, then confirm they can see content.
Phase 3: Scrape via CDP
Write a Node.js script that:
-
Queries http://localhost:9222/json/list for open pages
-
Connects to the target page via WebSocket (ws package)
-
Uses Runtime.evaluate to extract DOM content
-
Uses Page.navigate
- Page.enable for crawling
- Saves .txt (clean text), .html (full), _links.json per page
Run on Windows side:
cp script.mjs /mnt/c/Temp/scraper.mjs cmd.exe /c "cd C:\Temp && node scraper.mjs C:\Temp\output" 2>&1
Phase 4: Crawl Navigation
-
Extract sidebar/nav links from the initial page
-
Filter to same-domain pages (skip anchor links)
-
Visit each nav page, extract content + links
-
Follow discovered links one level deep (deduplicating)
-
Write summary JSON with page inventory
CDP Command Reference
// Navigate to a page await cdpSend(ws, "Page.navigate", { url });
// Extract text content await cdpSend(ws, "Runtime.evaluate", { expression: 'document.querySelector("main").innerText', returnByValue: true, });
// Extract links as JSON await cdpSend(ws, "Runtime.evaluate", { expression: 'JSON.stringify([...document.querySelectorAll("a[href]")].map(a => ({href: a.href, text: a.textContent.trim()})))', returnByValue: true, });
// Get full HTML await cdpSend(ws, "Runtime.evaluate", { expression: "document.documentElement.outerHTML", returnByValue: true, });
Critical Details
-
Must kill Edge first: If Edge is already running, new instances join the existing process and ignore --remote-debugging-port
-
WSL2 networking: WSL2 has its own network stack; 127.0.0.1 in WSL does NOT reach Windows. Scripts must run on Windows via cmd.exe
-
Respectful crawling: Add 2-second delays between page loads
-
Auth persistence: Edge uses the user's default profile with saved sessions
-
Output path: Use Windows paths (C:\Temp... ) in scripts, read via /mnt/c/Temp/... from WSL
Integration Points
-
Works with any documentation site behind corporate auth (SSO, SAML, FIDO2, etc.)
-
Output can be fed to other skills for analysis, summarization, or knowledge base building
-
Pairs well with investigation-workflow and knowledge-builder skills