authenticated-web-scraper

Authenticated Web Scraper

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "authenticated-web-scraper" with this command: npx skills add rysweet/amplihack/rysweet-amplihack-authenticated-web-scraper

Authenticated Web Scraper

Purpose

Scrapes content from websites that require authentication (2FA, SSO, corporate login) by leveraging the user's Windows Edge browser via Chrome DevTools Protocol (CDP). Designed for WSL2 environments where Playwright/Puppeteer can't directly reach Windows browser ports.

When to Use

  • Mirroring internal documentation sites behind corporate auth

  • Scraping content from sites requiring 2FA/SSO that can't be automated

  • Extracting structured content (text, HTML, links) from authenticated web pages

  • Crawling site navigation trees and following links to a configurable depth

Architecture

WSL2 Windows ┌─────────────────┐ ┌──────────────────────┐ │ Claude Code │ │ Edge Browser │ │ │ kill │ (user's profile) │ │ 1. Kill Edge ───┼──────────>│ │ │ │ launch │ │ │ 2. Launch Edge ─┼──────────>│ --remote-debug:9222 │ │ │ │ --debug-addr:0.0.0.0 │ │ [User auths │ │ │ │ in browser] │ │ CDP WebSocket on :9222│ │ │ cmd.exe │ │ │ 3. Run scraper ─┼──────────>│ node scraper.mjs │ │ │ │ connects localhost:9222│ │ 4. Read output <┼───────────│ writes to C:\Temp... │ └─────────────────┘ └──────────────────────┘

Key insight: WSL2 cannot reach Windows localhost:9222 directly. The scraper script must run on the Windows side via cmd.exe /c "node script.mjs" .

Quick Start

When a user asks to scrape an authenticated website:

  • Kill existing Edge processes and relaunch with debug flags

  • User authenticates in the headed browser

  • Copy scraper script to Windows temp and run via cmd.exe

  • Script connects to CDP, navigates pages, extracts content

  • Read results from shared filesystem (/mnt/c/Temp/... )

Core Workflow

Phase 0: Prerequisites

  • Node.js must be installed on Windows (cmd.exe /c "where node" )

  • The ws npm package on Windows side (cmd.exe /c "cd C:\Temp && npm install ws" )

  • Edge browser installed (check /mnt/c/Program Files (x86)/Microsoft/Edge/Application/msedge.exe )

Phase 1: Launch Edge with Remote Debugging

import { execSync, spawn } from "child_process";

// CRITICAL: Kill ALL Edge processes first, otherwise debug flags are ignored execSync('cmd.exe /c "taskkill /F /IM msedge.exe /T"'); await sleep(3000);

const EDGE = "/mnt/c/Program Files (x86)/Microsoft/Edge/Application/msedge.exe"; spawn( EDGE, [ "--remote-debugging-port=9222", "--remote-debugging-address=0.0.0.0", "--remote-allow-origins=*", targetUrl, ], { detached: true, stdio: "ignore" } ).unref();

Phase 2: Verify CDP and User Auth

Verify CDP is running (must query from Windows side)

powershell.exe -Command "Invoke-RestMethod -Uri http://localhost:9222/json/version"

Tell user to authenticate, then confirm they can see content.

Phase 3: Scrape via CDP

Write a Node.js script that:

  • Queries http://localhost:9222/json/list for open pages

  • Connects to the target page via WebSocket (ws package)

  • Uses Runtime.evaluate to extract DOM content

  • Uses Page.navigate

  • Page.enable for crawling
  • Saves .txt (clean text), .html (full), _links.json per page

Run on Windows side:

cp script.mjs /mnt/c/Temp/scraper.mjs cmd.exe /c "cd C:\Temp && node scraper.mjs C:\Temp\output" 2>&1

Phase 4: Crawl Navigation

  • Extract sidebar/nav links from the initial page

  • Filter to same-domain pages (skip anchor links)

  • Visit each nav page, extract content + links

  • Follow discovered links one level deep (deduplicating)

  • Write summary JSON with page inventory

CDP Command Reference

// Navigate to a page await cdpSend(ws, "Page.navigate", { url });

// Extract text content await cdpSend(ws, "Runtime.evaluate", { expression: 'document.querySelector("main").innerText', returnByValue: true, });

// Extract links as JSON await cdpSend(ws, "Runtime.evaluate", { expression: 'JSON.stringify([...document.querySelectorAll("a[href]")].map(a => ({href: a.href, text: a.textContent.trim()})))', returnByValue: true, });

// Get full HTML await cdpSend(ws, "Runtime.evaluate", { expression: "document.documentElement.outerHTML", returnByValue: true, });

Critical Details

  • Must kill Edge first: If Edge is already running, new instances join the existing process and ignore --remote-debugging-port

  • WSL2 networking: WSL2 has its own network stack; 127.0.0.1 in WSL does NOT reach Windows. Scripts must run on Windows via cmd.exe

  • Respectful crawling: Add 2-second delays between page loads

  • Auth persistence: Edge uses the user's default profile with saved sessions

  • Output path: Use Windows paths (C:\Temp... ) in scripts, read via /mnt/c/Temp/... from WSL

Integration Points

  • Works with any documentation site behind corporate auth (SSO, SAML, FIDO2, etc.)

  • Output can be fed to other skills for analysis, summarization, or knowledge base building

  • Pairs well with investigation-workflow and knowledge-builder skills

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

General

pptx

No summary provided by upstream source.

Repository SourceNeeds Review
General

lawyer-analyst

No summary provided by upstream source.

Repository SourceNeeds Review
General

economist-analyst

No summary provided by upstream source.

Repository SourceNeeds Review