web-scraping

This skill activates for web scraping and Actor development. It proactively discovers sitemaps and APIs, recommends an optimal strategy (sitemap, API, Playwright, or hybrid), and implements it iteratively. For production, it guides TypeScript Actor creation via the Apify CLI.


Install this skill with:

Install skill "web-scraping" with this command: npx skills add yfe404/web-scraper/yfe404-web-scraper-web-scraping

Web Scraping with Intelligent Strategy Selection

When This Skill Activates

Activates automatically when the user requests:

  • "Scrape [website]"
  • "Extract data from [site]"
  • "Get product information from [URL]"
  • "Find all links/pages on [site]"
  • "I'm getting blocked" or "Getting 403 errors" (loads strategies/anti-blocking.md)
  • "Make this an Apify Actor" (loads apify/ subdirectory)
  • "Productionize this scraper"

Proactive Workflow

This skill follows a systematic 5-phase approach to web scraping, always starting with interactive reconnaissance and ending with production-ready code.

Phase 1: INTERACTIVE RECONNAISSANCE (Critical First Step)

When the user says "scrape X", immediately start hands-on reconnaissance using MCP tools:

DO NOT jump to automated checks or implementation - reconnaissance prevents wasted effort and discovers hidden APIs.

Use Playwright MCP & Chrome DevTools MCP:

1. Open site in real browser (Playwright MCP)

  • Navigate like a real user
  • Observe page loading behavior (SSR? SPA? Loading states?)
  • Take screenshots for reference
  • Test basic interactions

2. Monitor network traffic (Chrome DevTools via Playwright)

  • Watch XHR/Fetch requests in real-time
  • Find API endpoints returning JSON (10-100x faster than HTML scraping!)
  • Analyze request/response patterns
  • Document headers, cookies, authentication tokens
  • Extract pagination parameters

3. Test site interactions

  • Pagination: URL-based? API? Infinite scroll?
  • Filtering and search: How do they work?
  • Dynamic content loading: Triggers and patterns
  • Authentication flows: Required? Optional?

4. Assess protection mechanisms

  • Cloudflare/bot detection
  • CAPTCHA requirements
  • Rate limiting behavior (test with multiple requests)
  • Fingerprinting scripts

5. Generate Intelligence Report

  • Site architecture (framework, rendering method)
  • Discovered APIs/endpoints with full specs
  • Protection mechanisms and required countermeasures
  • Optimal extraction strategy (API > Sitemap > HTML)
  • Time/complexity estimates
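Once the Network tab reveals a JSON endpoint (step 2), it is worth replaying it outside the browser to confirm it works without session state. A minimal sketch, assuming a hypothetical paginated endpoint; Node 18+ provides a global `fetch`:

```javascript
// Hypothetical endpoint observed in the Network tab during reconnaissance.
const API_BASE = 'https://example.com/api/products';

// Pure helper: rebuild the exact URL DevTools showed, parameters included.
function buildUrl(page, limit = 100) {
    const url = new URL(API_BASE);
    url.searchParams.set('page', String(page));
    url.searchParams.set('limit', String(limit));
    return url.toString();
}

// Replay the request outside the browser; copy only headers the site requires.
async function fetchPage(page) {
    const res = await fetch(buildUrl(page), {
        headers: { accept: 'application/json' },
    });
    if (!res.ok) throw new Error(`HTTP ${res.status} for page ${page}`);
    return res.json();
}

// Usage (response shape depends on the API): const data = await fetchPage(1);
```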

See: workflows/reconnaissance.md for complete reconnaissance guide with MCP examples

Why this matters: Reconnaissance discovers hidden APIs (eliminating need for HTML scraping), identifies blockers before coding, and provides intelligence for optimal strategy selection. Never skip this step.
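The rate-limiting check from step 4 can also be scripted: a short burst of cheap requests makes throttling responses (429, 403, 503) visible. A sketch with a placeholder target URL:

```javascript
// Placeholder target; robots.txt is a cheap page to hit repeatedly.
const TARGET = 'https://example.com/robots.txt';

// Pure helper: tally status codes so throttling (429, 403, 503) stands out.
function summarize(statuses) {
    return statuses.reduce((acc, s) => {
        acc[s] = (acc[s] ?? 0) + 1;
        return acc;
    }, {});
}

// Fire a short, politely spaced burst and report what came back.
async function probeRateLimit(count = 20, delayMs = 100) {
    const statuses = [];
    for (let i = 0; i < count; i += 1) {
        const res = await fetch(TARGET, { method: 'HEAD' });
        statuses.push(res.status);
        await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
    console.log(summarize(statuses));
}
```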

Phase 2: AUTOMATIC DISCOVERY (Validate Reconnaissance)

After Phase 1 reconnaissance, validate findings with automated checks:

1. Check for Sitemaps

# Automatically check these locations
curl -s https://[site]/robots.txt | grep -i Sitemap
curl -I https://[site]/sitemap.xml
curl -I https://[site]/sitemap_index.xml

Log findings clearly:

  • ✓ "Found sitemap at /sitemap.xml with ~1,234 URLs"
  • ✓ "Found sitemap index with 5 sub-sitemaps"
  • ✗ "No sitemap detected at common locations"

Why this matters: Sitemaps provide instant URL discovery (60x faster than crawling)
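As a quick sanity check, a discovered sitemap's URL count can be estimated with a naive `<loc>` tag count (a sitemap index lists sub-sitemaps, so it needs one more hop). A sketch with a placeholder URL:

```javascript
// Pure helper: naive <loc> count, testable without a network call.
function countLocs(xml) {
    return (xml.match(/<loc>/g) ?? []).length;
}

// Placeholder sitemap URL; Node 18+ provides a global fetch.
async function countSitemapUrls(sitemapUrl) {
    const res = await fetch(sitemapUrl);
    return countLocs(await res.text());
}

// Usage: console.log(await countSitemapUrls('https://example.com/sitemap.xml'));
```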

2. Investigate APIs

Prompt user:

Should I check for JSON APIs first? (Highly recommended)

Benefits of APIs vs HTML scraping:
• 10-100x faster execution
• More reliable (structured JSON vs fragile HTML)
• Less bandwidth usage
• Easier to maintain

Check for APIs? [Y/n]

If yes, guide user:

  1. Open browser DevTools → Network tab
  2. Navigate the target website
  3. Look for XHR/Fetch requests
  4. Check for endpoints: /api/, /v1/, /v2/, /graphql, /_next/data/
  5. Analyze request/response format (JSON, GraphQL, REST)

Log findings:

  • ✓ "Found API: GET /api/products/{id} (returns JSON)"
  • ✓ "Found GraphQL endpoint: /graphql"
  • ✗ "No obvious public APIs detected"
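The endpoint checklist above can also be probed directly; any non-404 status or JSON content type deserves a closer look in DevTools. A sketch against a placeholder origin:

```javascript
// Placeholder origin; the candidate paths mirror the checklist above.
const ORIGIN = 'https://example.com';
const CANDIDATES = ['/api/', '/v1/', '/v2/', '/graphql', '/_next/data/'];

// Pure helper: resolve each candidate path against the origin.
function candidateUrls(origin) {
    return CANDIDATES.map((path) => new URL(path, origin).toString());
}

// Log status and content type for each candidate without following redirects.
async function probeApis() {
    for (const url of candidateUrls(ORIGIN)) {
        const res = await fetch(url, { redirect: 'manual' });
        console.log(res.status, res.headers.get('content-type') ?? '-', url);
    }
}
```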

3. Analyze Site Structure

Automatically assess:

  • JavaScript-heavy? (Look for React, Vue, Angular indicators)
  • Authentication required? (Login walls, auth tokens)
  • Page count estimate (from sitemap or site exploration)
  • Rate limiting indicators (robots.txt directives)
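The JavaScript-heaviness check can be partly automated by scanning the raw HTML for well-known framework markers. These markers are heuristics, not guarantees:

```javascript
// Heuristic framework detection; check the most specific markers first.
function detectFramework(html) {
    if (html.includes('__NEXT_DATA__') || html.includes('/_next/')) return 'Next.js';
    if (html.includes('__NUXT__')) return 'Nuxt';
    if (html.includes('ng-version')) return 'Angular';
    if (html.includes('data-reactroot') || html.includes('data-reactid')) return 'React';
    if (html.includes('data-v-app')) return 'Vue';
    return 'unknown (possibly static HTML)';
}

// Usage: detectFramework(await (await fetch('https://example.com')).text())
```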

Phase 3: STRATEGY RECOMMENDATION

Based on Phases 1-2 findings, present 2-3 options with clear reasoning:

Example Output Template:

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📊 Analysis of example.com
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Phase 1 Intelligence (Reconnaissance):
✓ API discovered via DevTools: GET /api/products?page=N&limit=100
✓ Framework: Next.js (SSR + CSR hybrid)
✓ Protection: Cloudflare detected, rate limit ~60/min
✗ No authentication required

Phase 2 Validation:
✓ Sitemap found: 1,234 product URLs (validates API total)
✓ Static HTML fallback available if needed

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Recommended Approaches:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

⭐ Option 1: Hybrid (Sitemap + API) [RECOMMENDED]
   ✓ Use sitemap to get all 1,234 product URLs instantly
   ✓ Extract product IDs from URLs
   ✓ Fetch data via API (fast, reliable JSON)

   Estimated time: 8-12 minutes
   Complexity: Low-Medium
   Data quality: Excellent
   Speed: Very Fast

⚡ Option 2: Sitemap + Playwright
   ✓ Use sitemap for URLs
   ✓ Scrape HTML with Playwright

   Estimated time: 15-20 minutes
   Complexity: Medium
   Data quality: Good
   Speed: Fast

🔧 Option 3: Pure API (if sitemap fails)
   ✓ Discover product IDs through API exploration
   ✓ Fetch all data via API

   Estimated time: 10-15 minutes
   Complexity: Medium
   Data quality: Excellent
   Speed: Fast

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
My Recommendation: Option 1 (Hybrid)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Reasoning:
• Sitemap gives us complete URL list (instant discovery)
• API provides clean, structured data (no HTML parsing)
• Combines speed of sitemap with reliability of API
• Best of both worlds

Proceed with Option 1? [Y/n]

Key principles:

  • Always recommend the SIMPLEST approach that works
  • Sitemap > API > Playwright (in terms of simplicity)
  • Show time estimates and complexity
  • Explain reasoning clearly

Phase 4: ITERATIVE IMPLEMENTATION

Implement scraper incrementally, starting simple and adding complexity only as needed.

Core Pattern:

  1. Implement recommended approach (minimal code)
  2. Test with small batch (5-10 items)
  3. Validate data quality
  4. Scale to full dataset or fallback
  5. Handle blocking if encountered
  6. Add robustness (error handling, retries, logging)

See: workflows/implementation.md for complete implementation patterns and code examples
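The small-batch-then-scale loop can be sketched as follows; `scrapeOne` and the required field names are placeholders for whatever extractor Phase 3 selected:

```javascript
// Hypothetical required fields; adjust to whatever the target data needs.
const REQUIRED_FIELDS = ['title', 'price', 'url'];

// Pure helper: flag items missing or blanking any required field.
function validateBatch(items) {
    const bad = items.filter((item) =>
        REQUIRED_FIELDS.some((f) => item[f] === undefined || item[f] === null || item[f] === ''),
    );
    return { ok: bad.length === 0, badCount: bad.length, total: items.length };
}

// Scrape a small sample first; scale only once the sample validates cleanly.
async function scrapeAll(urls, scrapeOne) {
    const sample = await Promise.all(urls.slice(0, 10).map(scrapeOne));
    const report = validateBatch(sample);
    if (!report.ok) {
        throw new Error(`${report.badCount}/${report.total} sample items failed validation`);
    }
    const rest = await Promise.all(urls.slice(10).map(scrapeOne));
    return [...sample, ...rest];
}
```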

Phase 5: PRODUCTIONIZATION (On Request)

Convert scraper to production-ready Apify Actor.

Activation triggers:

  • "Make this an Apify Actor"
  • "Productionize this scraper"
  • "Deploy to Apify"
  • "Create an actor from this"

Core Pattern:

  1. Confirm TypeScript preference (STRONGLY RECOMMENDED)
  2. Initialize with apify create command (CRITICAL)
  3. Port scraping logic to Actor format
  4. Test locally and deploy

See: workflows/productionization.md for complete productionization workflow and apify/ directory for all Actor development guides
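The CLI side of this pattern typically looks like the following (commands are from the Apify CLI; the template list shown by `apify create` is interactive and varies by CLI version):

```shell
# One-time install of the Apify CLI.
npm install -g apify-cli
# Scaffold a new Actor; pick a TypeScript Crawlee template from the prompt.
apify create my-actor
cd my-actor
# Run locally against the input stored under storage/.
apify run
# Authenticate once, then build and deploy to the Apify platform.
apify login
apify push
```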

Quick Reference

| Task | Pattern/Command | Documentation |
|---|---|---|
| Reconnaissance | Playwright + DevTools MCP | workflows/reconnaissance.md |
| Find sitemaps | RobotsFile.find(url) | strategies/sitemap-discovery.md |
| Filter sitemap URLs | RequestList + regex | reference/regex-patterns.md |
| Discover APIs | DevTools → Network tab | strategies/api-discovery.md |
| Playwright scraping | PlaywrightCrawler | strategies/playwright-scraping.md |
| HTTP scraping | CheerioCrawler | strategies/cheerio-scraping.md |
| Hybrid approach | Sitemap + API | strategies/hybrid-approaches.md |
| Handle blocking | fingerprint-suite + proxies | strategies/anti-blocking.md |
| Fingerprint configs | Quick patterns | reference/fingerprint-patterns.md |
| Create Apify Actor | apify create | apify/cli-workflow.md |
| Template selection | Cheerio vs Playwright | workflows/productionization.md |
| Input schema | .actor/input_schema.json | apify/input-schemas.md |
| Deploy actor | apify push | apify/deployment.md |

Common Patterns

Pattern 1: Sitemap-Based Scraping

import { RobotsFile, PlaywrightCrawler, Dataset } from 'crawlee';

// Auto-discover and parse sitemaps
const robots = await RobotsFile.find('https://example.com');
const urls = await robots.parseUrlsFromSitemaps();

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request }) {
        const data = await page.evaluate(() => ({
            title: document.title,
            // ... extract data
        }));
        await Dataset.pushData(data);
    },
});

await crawler.addRequests(urls);
await crawler.run();

See examples/sitemap-basic.js for complete example.

Pattern 2: API-Based Scraping

import { gotScraping } from 'got-scraping';

const productIds = [123, 456, 789];

for (const id of productIds) {
    const response = await gotScraping({
        url: `https://api.example.com/products/${id}`,
        responseType: 'json',
    });

    console.log(response.body);
}

See examples/api-scraper.js for complete example.

Pattern 3: Hybrid (Sitemap + API)

import { RobotsFile } from 'crawlee';
import { gotScraping } from 'got-scraping';

// Get URLs from sitemap
const robots = await RobotsFile.find('https://shop.com');
const urls = await robots.parseUrlsFromSitemaps();

// Extract IDs from URLs
const productIds = urls
    .map(url => url.match(/\/products\/(\d+)/)?.[1])
    .filter(Boolean);

// Fetch data via API
for (const id of productIds) {
    const data = await gotScraping({
        url: `https://api.shop.com/v1/products/${id}`,
        responseType: 'json',
    });
    // Process data
}

See examples/hybrid-sitemap-api.js for complete example.

Directory Navigation

This skill uses progressive disclosure - detailed information is organized in subdirectories and loaded only when needed.

Workflows (Implementation Patterns)

For: Step-by-step workflow guides for each phase

  • workflows/reconnaissance.md - Phase 1 interactive reconnaissance (CRITICAL)
  • workflows/implementation.md - Phase 4 iterative implementation patterns
  • workflows/productionization.md - Phase 5 Apify Actor creation workflow

Strategies (Deep Dives)

For: Detailed guides on specific scraping approaches

  • strategies/sitemap-discovery.md - Complete sitemap guide (4 patterns)
  • strategies/api-discovery.md - Finding and using APIs
  • strategies/playwright-scraping.md - Browser-based scraping
  • strategies/cheerio-scraping.md - HTTP-only scraping
  • strategies/hybrid-approaches.md - Combining strategies
  • strategies/anti-blocking.md - Fingerprinting & proxies for blocked sites

Examples (Runnable Code)

For: Working code to reference or execute

JavaScript Learning Examples (Simple standalone scripts):

  • examples/sitemap-basic.js - Simple sitemap scraper
  • examples/api-scraper.js - Pure API approach
  • examples/playwright-basic.js - Basic Playwright scraper
  • examples/hybrid-sitemap-api.js - Combined approach
  • examples/iterative-fallback.js - Try sitemap→API→Playwright

TypeScript Production Examples (Complete Actors):

  • apify/examples/basic-scraper/ - Sitemap + Playwright
  • apify/examples/anti-blocking/ - Fingerprinting + proxies
  • apify/examples/hybrid-api/ - Sitemap + API (optimal)

Reference (Quick Lookup)

For: Quick patterns and troubleshooting

  • reference/regex-patterns.md - Common URL regex patterns
  • reference/selector-guide.md - Playwright selector strategies
  • reference/fingerprint-patterns.md - Common fingerprint configurations
  • reference/anti-patterns.md - What NOT to do

Apify (Production Deployment)

For: Creating production Apify Actors

  • apify/README.md - When and how to use Apify
  • apify/typescript-first.md - Why TypeScript for actors
  • apify/cli-workflow.md - apify create workflow (CRITICAL)
  • apify/initialization.md - Complete setup guide
  • apify/input-schemas.md - Input validation patterns
  • apify/configuration.md - actor.json setup
  • apify/deployment.md - Testing and deployment
  • apify/templates/ - TypeScript boilerplate

Note: Each file is self-contained and can be read independently. Claude will navigate to specific files as needed.

Core Principles

1. Progressive Enhancement

Start with the simplest approach that works:

  • Sitemap > API > Playwright
  • Static > Dynamic
  • HTTP > Browser

2. Proactive Discovery

Always investigate before implementing:

  • Check for sitemaps automatically
  • Look for APIs (ask user to check DevTools)
  • Analyze site structure

3. Iterative Implementation

Build incrementally:

  • Small test batch first (5-10 items)
  • Validate quality
  • Scale or fallback
  • Add robustness last

4. Production-Ready Code

When productionizing:

  • Use TypeScript (strongly recommended)
  • Use apify create (never manual setup)
  • Add proper error handling
  • Include logging and monitoring

Remember: Sitemaps first, APIs second, scraping last!

For detailed guidance on any topic, navigate to the relevant subdirectory file listed above.
