phoenix-scraper

Resilient multi-layer web scraper with automatic failover. Use when scraping web content that may be JS-rendered, behind bot protection, or on sites that block standard HTTP requests. Supports three-tier failover chain: Brave Search API, then Bright Data Web Unlocker (with optional browser render mode), then Playwright headless browser. Use for job boards (LinkedIn, Glassdoor, Reed, Indeed, CWJobs, TotalJobs), social media monitoring (X/Twitter via API v2), news sites, and any page requiring JS rendering. Includes zone routing logic for Bright Data (web_unlocker vs job_search_scraper zones).

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "phoenix-scraper" with this command: npx skills add stevojarvisai-star/phoenix-scraper

Phoenix Scraper

Resilient three-tier failover scraper. Never returns empty — if one method fails, the next activates automatically.

Failover Chain

Tier 1: Brave Search API (fast, free tier, 2k req/month)
    ↓ (on block/empty/timeout)
Tier 2: Bright Data Web Unlocker (residential proxy, JS-render optional)
    ↓ (on block/429/timeout)
Tier 3: Playwright headless browser (full JS execution)

Quick Start

from scripts.phoenix_scraper import scrape

# Basic fetch
result = scrape("https://example.com/page")

# With JS rendering (for SPA/dynamic sites)
result = scrape("https://example.com/page", render_js=True)

# With specific Bright Data zone
result = scrape("https://linkedin.com/jobs/...", zone="job_search_scraper")

Zone Routing

Use CaseZone
Job boards (LinkedIn, Glassdoor, Reed, Indeed)job_search_scraper
Social media, news, general webweb_unlocker
X.com / TwitterUse X API v2 (see references/x-api.md)

Bright Data render_js

Set render_js=True for JS-heavy sites (CWJobs, TotalJobs, ContractorUK). Adds "render": True (boolean) to payload and uses 60s timeout.

Critical: Use boolean True, not string "html" — Bright Data validation rejects strings.

Bright Data Premium Domains (Cost Note)

LinkedIn, Glassdoor, and other heavily-protected job boards may be classified as Premium Domains in your Bright Data zone (updated quarterly). API call syntax is identical — but cost per request is higher. Check your zone's Premium Domains list if costs spike unexpectedly.

Playwright Stealth (2026 Enhancement)

For Tier 3, consider installing playwright-stealth to patch headless browser fingerprints — reduces detection on Cloudflare/advanced bot-protected sites:

pip install playwright-stealth
# Optional enhancement in phoenix_scraper.py Tier 3:
from playwright_stealth import stealth_sync
stealth_sync(page)

The base Playwright tier works without this, but stealth patching significantly improves success rates on heavily protected sites (Coupang, Naver, etc.) as of 2026.

URL Formatting

  • CWJobs/TotalJobs: use hyphen-slugs — finance-systems-consultant NOT finance+systems+consultant
  • Glassdoor: https://www.glassdoor.co.uk/Job/united-kingdom-{slug}-jobs-SRCH_IL.0,14_IN2_KO15,{end}.htm

Environment Variables

BRIGHT_DATA_API_KEY=<key>          # Bright Data API key
BRIGHT_DATA_ZONE=job_search_scraper # Default zone (override per-call)
BRAVE_API_KEY=<key>                # Brave Search API key
X_BEARER_TOKEN=<token>             # X API v2 bearer token (for X.com)

X.com Monitoring

For X/Twitter, use X API v2 (not scraping). See references/x-api.md for endpoint details and rate limits.

Error Handling

All tiers log failures before escalating. On total failure, returns {"success": False, "html": "", "method": "all_failed", "error": "<reason>"}.

Never raises exceptions — always returns a result dict.

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

Automation

AutoClaw Browser Automation

Complete browser automation skill with MCP protocol support and Chrome extension

Registry SourceRecently Updated
1K0Profile unavailable
Automation

AgentGo Cloud Browser

Automates browser interactions using AgentGo's distributed cloud browser cluster via playwright@1.51.0. Use when the user needs to navigate websites, interac...

Registry Source
3660Profile unavailable
Automation

Riddle

Hosted browser automation API for agents. Screenshots, Playwright scripts, workflows — no local Chrome needed.

Registry Source
1.5K1Profile unavailable
Automation

Browser Cash

Spin up unblocked browser sessions via Browser.cash for web automation. Sessions bypass anti-bot protections (Cloudflare, DataDome, etc.) making them ideal for scraping and automation.

Registry Source
3.8K4Profile unavailable