web-scraping

This skill activates for web scraping and Actor development. It proactively discovers sitemaps and APIs, recommends an optimal strategy (sitemap, API, Playwright, or hybrid), and implements it iteratively. For production, it guides TypeScript Actor creation via the Apify CLI.


Install this skill with:

Install skill "web-scraping" with this command: npx skills add yfe404/web-scraper/yfe404-web-scraper-web-scraping

Web Scraping with Intelligent Strategy Selection

When This Skill Activates

Activates automatically when the user requests:

  • "Scrape [website]"
  • "Extract data from [site]"
  • "Get product information from [URL]"
  • "Find all links/pages on [site]"
  • "I'm getting blocked" or "Getting 403 errors" (loads strategies/anti-blocking.md)
  • "Make this an Apify Actor" (loads apify/ subdirectory)
  • "Productionize this scraper"

Proactive Workflow

This skill follows a systematic 5-phase approach to web scraping, always starting with interactive reconnaissance and ending with production-ready code.

Phase 1: INTERACTIVE RECONNAISSANCE (Critical First Step)

When the user says "scrape X", immediately start hands-on reconnaissance using MCP tools:

DO NOT jump to automated checks or implementation - reconnaissance prevents wasted effort and discovers hidden APIs.

Use Playwright MCP & Chrome DevTools MCP:

1. Open site in real browser (Playwright MCP)

  • Navigate like a real user
  • Observe page loading behavior (SSR? SPA? Loading states?)
  • Take screenshots for reference
  • Test basic interactions

2. Monitor network traffic (Chrome DevTools via Playwright)

  • Watch XHR/Fetch requests in real-time
  • Find API endpoints returning JSON (10-100x faster than HTML scraping!)
  • Analyze request/response patterns
  • Document headers, cookies, authentication tokens
  • Extract pagination parameters

3. Test site interactions

  • Pagination: URL-based? API? Infinite scroll?
  • Filtering and search: How do they work?
  • Dynamic content loading: Triggers and patterns
  • Authentication flows: Required? Optional?

4. Assess protection mechanisms

  • Cloudflare/bot detection
  • CAPTCHA requirements
  • Rate limiting behavior (test with multiple requests)
  • Fingerprinting scripts

5. Generate Intelligence Report

  • Site architecture (framework, rendering method)
  • Discovered APIs/endpoints with full specs
  • Protection mechanisms and required countermeasures
  • Optimal extraction strategy (API > Sitemap > HTML)
  • Time/complexity estimates
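Once the Network tab reveals a JSON endpoint (step 2), it is worth replaying it outside the browser to confirm it works without session state. A minimal sketch, assuming a hypothetical paginated endpoint; Node 18+ provides a global `fetch`:

```javascript
// Hypothetical endpoint observed in the Network tab during reconnaissance.
const API_BASE = 'https://example.com/api/products';

// Pure helper: rebuild the exact URL DevTools showed, parameters included.
function buildUrl(page, limit = 100) {
    const url = new URL(API_BASE);
    url.searchParams.set('page', String(page));
    url.searchParams.set('limit', String(limit));
    return url.toString();
}

// Replay the request outside the browser; copy only headers the site requires.
async function fetchPage(page) {
    const res = await fetch(buildUrl(page), {
        headers: { accept: 'application/json' },
    });
    if (!res.ok) throw new Error(`HTTP ${res.status} for page ${page}`);
    return res.json();
}

// Usage (response shape depends on the API): const data = await fetchPage(1);
```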

See: workflows/reconnaissance.md for complete reconnaissance guide with MCP examples

Why this matters: Reconnaissance discovers hidden APIs (eliminating need for HTML scraping), identifies blockers before coding, and provides intelligence for optimal strategy selection. Never skip this step.
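The rate-limiting check from step 4 can also be scripted: a short burst of cheap requests makes throttling responses (429, 403, 503) visible. A sketch with a placeholder target URL:

```javascript
// Placeholder target; robots.txt is a cheap page to hit repeatedly.
const TARGET = 'https://example.com/robots.txt';

// Pure helper: tally status codes so throttling (429, 403, 503) stands out.
function summarize(statuses) {
    return statuses.reduce((acc, s) => {
        acc[s] = (acc[s] ?? 0) + 1;
        return acc;
    }, {});
}

// Fire a short, politely spaced burst and report what came back.
async function probeRateLimit(count = 20, delayMs = 100) {
    const statuses = [];
    for (let i = 0; i < count; i += 1) {
        const res = await fetch(TARGET, { method: 'HEAD' });
        statuses.push(res.status);
        await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
    console.log(summarize(statuses));
}
```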

Phase 2: AUTOMATIC DISCOVERY (Validate Reconnaissance)

After Phase 1 reconnaissance, validate findings with automated checks:

1. Check for Sitemaps

# Automatically check these locations
curl -s https://[site]/robots.txt | grep -i Sitemap
curl -I https://[site]/sitemap.xml
curl -I https://[site]/sitemap_index.xml

Log findings clearly:

  • ✓ "Found sitemap at /sitemap.xml with ~1,234 URLs"
  • ✓ "Found sitemap index with 5 sub-sitemaps"
  • ✗ "No sitemap detected at common locations"

Why this matters: Sitemaps provide instant URL discovery (60x faster than crawling)
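As a quick sanity check, a discovered sitemap's URL count can be estimated with a naive `<loc>` tag count (a sitemap index lists sub-sitemaps, so it needs one more hop). A sketch with a placeholder URL:

```javascript
// Pure helper: naive <loc> count, testable without a network call.
function countLocs(xml) {
    return (xml.match(/<loc>/g) ?? []).length;
}

// Placeholder sitemap URL; Node 18+ provides a global fetch.
async function countSitemapUrls(sitemapUrl) {
    const res = await fetch(sitemapUrl);
    return countLocs(await res.text());
}

// Usage: console.log(await countSitemapUrls('https://example.com/sitemap.xml'));
```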

2. Investigate APIs

Prompt user:

Should I check for JSON APIs first? (Highly recommended)

Benefits of APIs vs HTML scraping:
• 10-100x faster execution
• More reliable (structured JSON vs fragile HTML)
• Less bandwidth usage
• Easier to maintain

Check for APIs? [Y/n]

If yes, guide user:

  1. Open browser DevTools → Network tab
  2. Navigate the target website
  3. Look for XHR/Fetch requests
  4. Check for endpoints: /api/, /v1/, /v2/, /graphql, /_next/data/
  5. Analyze request/response format (JSON, GraphQL, REST)

Log findings:

  • ✓ "Found API: GET /api/products/{id} (returns JSON)"
  • ✓ "Found GraphQL endpoint: /graphql"
  • ✗ "No obvious public APIs detected"
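The endpoint checklist above can also be probed directly; any non-404 status or JSON content type deserves a closer look in DevTools. A sketch against a placeholder origin:

```javascript
// Placeholder origin; the candidate paths mirror the checklist above.
const ORIGIN = 'https://example.com';
const CANDIDATES = ['/api/', '/v1/', '/v2/', '/graphql', '/_next/data/'];

// Pure helper: resolve each candidate path against the origin.
function candidateUrls(origin) {
    return CANDIDATES.map((path) => new URL(path, origin).toString());
}

// Log status and content type for each candidate without following redirects.
async function probeApis() {
    for (const url of candidateUrls(ORIGIN)) {
        const res = await fetch(url, { redirect: 'manual' });
        console.log(res.status, res.headers.get('content-type') ?? '-', url);
    }
}
```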

3. Analyze Site Structure

Automatically assess:

  • JavaScript-heavy? (Look for React, Vue, Angular indicators)
  • Authentication required? (Login walls, auth tokens)
  • Page count estimate (from sitemap or site exploration)
  • Rate limiting indicators (robots.txt directives)
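The JavaScript-heaviness check can be partly automated by scanning the raw HTML for well-known framework markers. These markers are heuristics, not guarantees:

```javascript
// Heuristic framework detection; check the most specific markers first.
function detectFramework(html) {
    if (html.includes('__NEXT_DATA__') || html.includes('/_next/')) return 'Next.js';
    if (html.includes('__NUXT__')) return 'Nuxt';
    if (html.includes('ng-version')) return 'Angular';
    if (html.includes('data-reactroot') || html.includes('data-reactid')) return 'React';
    if (html.includes('data-v-app')) return 'Vue';
    return 'unknown (possibly static HTML)';
}

// Usage: detectFramework(await (await fetch('https://example.com')).text())
```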

Phase 3: STRATEGY RECOMMENDATION

Based on Phases 1-2 findings, present 2-3 options with clear reasoning:

Example Output Template:

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📊 Analysis of example.com
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Phase 1 Intelligence (Reconnaissance):
✓ API discovered via DevTools: GET /api/products?page=N&limit=100
✓ Framework: Next.js (SSR + CSR hybrid)
✓ Protection: Cloudflare detected, rate limit ~60/min
✗ No authentication required

Phase 2 Validation:
✓ Sitemap found: 1,234 product URLs (validates API total)
✓ Static HTML fallback available if needed

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Recommended Approaches:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

⭐ Option 1: Hybrid (Sitemap + API) [RECOMMENDED]
   ✓ Use sitemap to get all 1,234 product URLs instantly
   ✓ Extract product IDs from URLs
   ✓ Fetch data via API (fast, reliable JSON)

   Estimated time: 8-12 minutes
   Complexity: Low-Medium
   Data quality: Excellent
   Speed: Very Fast

⚡ Option 2: Sitemap + Playwright
   ✓ Use sitemap for URLs
   ✓ Scrape HTML with Playwright

   Estimated time: 15-20 minutes
   Complexity: Medium
   Data quality: Good
   Speed: Fast

🔧 Option 3: Pure API (if sitemap fails)
   ✓ Discover product IDs through API exploration
   ✓ Fetch all data via API

   Estimated time: 10-15 minutes
   Complexity: Medium
   Data quality: Excellent
   Speed: Fast

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
My Recommendation: Option 1 (Hybrid)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Reasoning:
• Sitemap gives us complete URL list (instant discovery)
• API provides clean, structured data (no HTML parsing)
• Combines speed of sitemap with reliability of API
• Best of both worlds

Proceed with Option 1? [Y/n]

Key principles:

  • Always recommend the SIMPLEST approach that works
  • Sitemap > API > Playwright (in terms of simplicity)
  • Show time estimates and complexity
  • Explain reasoning clearly

Phase 4: ITERATIVE IMPLEMENTATION

Implement scraper incrementally, starting simple and adding complexity only as needed.

Core Pattern:

  1. Implement recommended approach (minimal code)
  2. Test with small batch (5-10 items)
  3. Validate data quality
  4. Scale to full dataset or fallback
  5. Handle blocking if encountered
  6. Add robustness (error handling, retries, logging)

See: workflows/implementation.md for complete implementation patterns and code examples
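The small-batch-then-scale loop can be sketched as follows; `scrapeOne` and the required field names are placeholders for whatever extractor Phase 3 selected:

```javascript
// Hypothetical required fields; adjust to whatever the target data needs.
const REQUIRED_FIELDS = ['title', 'price', 'url'];

// Pure helper: flag items missing or blanking any required field.
function validateBatch(items) {
    const bad = items.filter((item) =>
        REQUIRED_FIELDS.some((f) => item[f] === undefined || item[f] === null || item[f] === ''),
    );
    return { ok: bad.length === 0, badCount: bad.length, total: items.length };
}

// Scrape a small sample first; scale only once the sample validates cleanly.
async function scrapeAll(urls, scrapeOne) {
    const sample = await Promise.all(urls.slice(0, 10).map(scrapeOne));
    const report = validateBatch(sample);
    if (!report.ok) {
        throw new Error(`${report.badCount}/${report.total} sample items failed validation`);
    }
    const rest = await Promise.all(urls.slice(10).map(scrapeOne));
    return [...sample, ...rest];
}
```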

Phase 5: PRODUCTIONIZATION (On Request)

Convert scraper to production-ready Apify Actor.

Activation triggers:

  • "Make this an Apify Actor"
  • "Productionize this scraper"
  • "Deploy to Apify"
  • "Create an actor from this"

Core Pattern:

  1. Confirm TypeScript preference (STRONGLY RECOMMENDED)
  2. Initialize with apify create command (CRITICAL)
  3. Port scraping logic to Actor format
  4. Test locally and deploy

See: workflows/productionization.md for complete productionization workflow and apify/ directory for all Actor development guides
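The CLI side of this pattern typically looks like the following (commands are from the Apify CLI; the template list shown by `apify create` is interactive and varies by CLI version):

```shell
# One-time install of the Apify CLI.
npm install -g apify-cli
# Scaffold a new Actor; pick a TypeScript Crawlee template from the prompt.
apify create my-actor
cd my-actor
# Run locally against the input stored under storage/.
apify run
# Authenticate once, then build and deploy to the Apify platform.
apify login
apify push
```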

Quick Reference

| Task | Pattern/Command | Documentation |
|---|---|---|
| Reconnaissance | Playwright + DevTools MCP | workflows/reconnaissance.md |
| Find sitemaps | RobotsFile.find(url) | strategies/sitemap-discovery.md |
| Filter sitemap URLs | RequestList + regex | reference/regex-patterns.md |
| Discover APIs | DevTools → Network tab | strategies/api-discovery.md |
| Playwright scraping | PlaywrightCrawler | strategies/playwright-scraping.md |
| HTTP scraping | CheerioCrawler | strategies/cheerio-scraping.md |
| Hybrid approach | Sitemap + API | strategies/hybrid-approaches.md |
| Handle blocking | fingerprint-suite + proxies | strategies/anti-blocking.md |
| Fingerprint configs | Quick patterns | reference/fingerprint-patterns.md |
| Create Apify Actor | apify create | apify/cli-workflow.md |
| Template selection | Cheerio vs Playwright | workflows/productionization.md |
| Input schema | .actor/input_schema.json | apify/input-schemas.md |
| Deploy actor | apify push | apify/deployment.md |

Common Patterns

Pattern 1: Sitemap-Based Scraping

import { RobotsFile, PlaywrightCrawler, Dataset } from 'crawlee';

// Auto-discover and parse sitemaps
const robots = await RobotsFile.find('https://example.com');
const urls = await robots.parseUrlsFromSitemaps();

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request }) {
        const data = await page.evaluate(() => ({
            title: document.title,
            // ... extract data
        }));
        await Dataset.pushData(data);
    },
});

await crawler.addRequests(urls);
await crawler.run();

See examples/sitemap-basic.js for complete example.

Pattern 2: API-Based Scraping

import { gotScraping } from 'got-scraping';

const productIds = [123, 456, 789];

for (const id of productIds) {
    const response = await gotScraping({
        url: `https://api.example.com/products/${id}`,
        responseType: 'json',
    });

    console.log(response.body);
}

See examples/api-scraper.js for complete example.

Pattern 3: Hybrid (Sitemap + API)

import { RobotsFile } from 'crawlee';
import { gotScraping } from 'got-scraping';

// Get URLs from sitemap
const robots = await RobotsFile.find('https://shop.com');
const urls = await robots.parseUrlsFromSitemaps();

// Extract IDs from URLs
const productIds = urls
    .map(url => url.match(/\/products\/(\d+)/)?.[1])
    .filter(Boolean);

// Fetch data via API
for (const id of productIds) {
    const data = await gotScraping({
        url: `https://api.shop.com/v1/products/${id}`,
        responseType: 'json',
    });
    // Process data
}

See examples/hybrid-sitemap-api.js for complete example.

Directory Navigation

This skill uses progressive disclosure - detailed information is organized in subdirectories and loaded only when needed.

Workflows (Implementation Patterns)

For: Step-by-step workflow guides for each phase

  • workflows/reconnaissance.md - Phase 1 interactive reconnaissance (CRITICAL)
  • workflows/implementation.md - Phase 4 iterative implementation patterns
  • workflows/productionization.md - Phase 5 Apify Actor creation workflow

Strategies (Deep Dives)

For: Detailed guides on specific scraping approaches

  • strategies/sitemap-discovery.md - Complete sitemap guide (4 patterns)
  • strategies/api-discovery.md - Finding and using APIs
  • strategies/playwright-scraping.md - Browser-based scraping
  • strategies/cheerio-scraping.md - HTTP-only scraping
  • strategies/hybrid-approaches.md - Combining strategies
  • strategies/anti-blocking.md - Fingerprinting & proxies for blocked sites

Examples (Runnable Code)

For: Working code to reference or execute

JavaScript Learning Examples (Simple standalone scripts):

  • examples/sitemap-basic.js - Simple sitemap scraper
  • examples/api-scraper.js - Pure API approach
  • examples/playwright-basic.js - Basic Playwright scraper
  • examples/hybrid-sitemap-api.js - Combined approach
  • examples/iterative-fallback.js - Try sitemap→API→Playwright

TypeScript Production Examples (Complete Actors):

  • apify/examples/basic-scraper/ - Sitemap + Playwright
  • apify/examples/anti-blocking/ - Fingerprinting + proxies
  • apify/examples/hybrid-api/ - Sitemap + API (optimal)

Reference (Quick Lookup)

For: Quick patterns and troubleshooting

  • reference/regex-patterns.md - Common URL regex patterns
  • reference/selector-guide.md - Playwright selector strategies
  • reference/fingerprint-patterns.md - Common fingerprint configurations
  • reference/anti-patterns.md - What NOT to do

Apify (Production Deployment)

For: Creating production Apify Actors

  • apify/README.md - When and how to use Apify
  • apify/typescript-first.md - Why TypeScript for actors
  • apify/cli-workflow.md - apify create workflow (CRITICAL)
  • apify/initialization.md - Complete setup guide
  • apify/input-schemas.md - Input validation patterns
  • apify/configuration.md - actor.json setup
  • apify/deployment.md - Testing and deployment
  • apify/templates/ - TypeScript boilerplate

Note: Each file is self-contained and can be read independently. Claude will navigate to specific files as needed.

Core Principles

1. Progressive Enhancement

Start with the simplest approach that works:

  • Sitemap > API > Playwright
  • Static > Dynamic
  • HTTP > Browser

2. Proactive Discovery

Always investigate before implementing:

  • Check for sitemaps automatically
  • Look for APIs (ask user to check DevTools)
  • Analyze site structure

3. Iterative Implementation

Build incrementally:

  • Small test batch first (5-10 items)
  • Validate quality
  • Scale or fallback
  • Add robustness last

4. Production-Ready Code

When productionizing:

  • Use TypeScript (strongly recommended)
  • Use apify create (never manual setup)
  • Add proper error handling
  • Include logging and monitoring

Remember: Sitemaps first, APIs second, scraping last!

For detailed guidance on any topic, navigate to the relevant subdirectory file listed above.
