scrape-webpage

Extract content, metadata, and images from a webpage for import/migration.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "scrape-webpage" with this command: npx skills add adobe/skills/adobe-skills-scrape-webpage

When to Use This Skill

Use this skill when:

  • You are starting a page import and need to extract content from the source URL

  • You need webpage analysis with local image downloads

  • You want metadata extraction (Open Graph, JSON-LD, etc.)

Invoked by: page-import skill (Step 1)

Prerequisites

Before using this skill, ensure:

  • ✅ Node.js is available

  • ✅ The playwright npm package is installed (npm install playwright)

  • ✅ Chromium browser is installed (npx playwright install chromium)

  • ✅ Sharp image library is installed (cd .claude/skills/scrape-webpage/scripts && npm install)

Related Skills

  • page-import - Orchestrator that invokes this skill

  • identify-page-structure - Uses this skill's output (screenshot, HTML, metadata)

  • generate-import-html - Uses image mapping and paths from this skill

Scraping Workflow

Step 1: Run Analysis Script

Command:

node .claude/skills/scrape-webpage/scripts/analyze-webpage.js "https://example.com/page" --output ./import-work

What the script does:

  • Sets up network interception to capture all images

  • Loads page in headless Chromium

  • Scrolls through entire page to trigger lazy-loaded images

  • Downloads all images locally (converts WebP/AVIF/SVG to PNG)

  • Captures full-page screenshot for visual reference

  • Extracts metadata (title, description, Open Graph, JSON-LD, canonical)

  • Fixes images in DOM (background-image→img, picture elements, srcset→src, relative→absolute, inline SVG→img)

  • Extracts cleaned HTML (removes scripts/styles)

  • Replaces image URLs in HTML with local paths (./images/...)

  • Generates document paths (sanitized, lowercase, no .html extension)

  • Saves complete analysis with image mapping to metadata.json

For detailed explanation: See resources/web-page-analysis.md
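
The path-generation step (sanitized, lowercase, no .html extension) can be illustrated with a minimal sketch. The helper names derivePaths and safeDecode are hypothetical, and the sanitizing rules (hyphens for unsafe characters, "index" for the site root) are assumptions rather than the script's confirmed behavior; only the output field names come from metadata.json.

```javascript
// Hypothetical helper, not part of analyze-webpage.js.
// Decodes a URL segment, falling back to the raw text if it is malformed.
function safeDecode(segment) {
  try {
    return decodeURIComponent(segment);
  } catch {
    return segment;
  }
}

// Derives the "paths" fields from a source URL under assumed sanitizing rules.
function derivePaths(sourceUrl) {
  const { pathname } = new URL(sourceUrl);

  const segments = pathname
    .replace(/\.html?$/i, "") // drop a trailing .html / .htm extension
    .split("/")
    .filter(Boolean)
    .map((segment) =>
      safeDecode(segment)
        .toLowerCase()
        .replace(/[^a-z0-9]+/g, "-") // assumption: unsafe runs become a hyphen
        .replace(/^-+|-+$/g, "") // trim leading and trailing hyphens
    )
    .filter(Boolean);

  // Assumption: the site root maps to "index".
  if (segments.length === 0) segments.push("index");

  const relative = segments.join("/");
  return {
    documentPath: `/${relative}`,
    htmlFilePath: `${relative}.plain.html`,
    mdFilePath: `${relative}.md`,
    dirPath: segments.slice(0, -1).join("/"),
    filename: segments[segments.length - 1],
  };
}

console.log(derivePaths("https://example.com/US/en/About.html"));
```

Working on the parsed pathname rather than the raw URL keeps query strings and fragments out of the generated paths.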

Step 2: Verify Output

Output files:

  • ./import-work/metadata.json - Complete analysis with paths and image mapping

  • ./import-work/screenshot.png - Visual reference for layout comparison

  • ./import-work/cleaned.html - Main content HTML with local image paths

  • ./import-work/images/ - All downloaded images (WebP/AVIF/SVG converted to PNG)

Verify files exist:

ls -lh ./import-work/metadata.json ./import-work/screenshot.png ./import-work/cleaned.html
ls -lh ./import-work/images/ | head -5
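
If the check needs to run from a Node script instead of the shell, a sketch along these lines may help. The function name findMissingOutputs is hypothetical, and treating an empty file as missing is an added assumption; the file names are the ones listed above.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Hypothetical helper: returns the expected outputs that are missing
// (or empty) under outputDir. An empty result means everything is present.
function findMissingOutputs(outputDir) {
  const expectedFiles = ["metadata.json", "screenshot.png", "cleaned.html"];

  const missing = expectedFiles.filter((name) => {
    const filePath = path.join(outputDir, name);
    // Assumption: a zero-byte file counts as a failed output.
    return !fs.existsSync(filePath) || fs.statSync(filePath).size === 0;
  });

  const imagesDir = path.join(outputDir, "images");
  if (!fs.existsSync(imagesDir) || !fs.statSync(imagesDir).isDirectory()) {
    missing.push("images/");
  }
  return missing;
}

const missingOutputs = findMissingOutputs("./import-work");
console.log(
  missingOutputs.length === 0
    ? "All outputs present"
    : `Missing: ${missingOutputs.join(", ")}`
);
```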

Step 3: Review Metadata JSON

Output JSON structure:

{
  "url": "https://example.com/page",
  "timestamp": "2025-01-12T10:30:00.000Z",
  "paths": {
    "documentPath": "/us/en/about",
    "htmlFilePath": "us/en/about.plain.html",
    "mdFilePath": "us/en/about.md",
    "dirPath": "us/en",
    "filename": "about"
  },
  "screenshot": "./import-work/screenshot.png",
  "html": {
    "filePath": "./import-work/cleaned.html",
    "size": 45230
  },
  "metadata": {
    "title": "Page Title",
    "description": "Page description",
    "og:image": "https://example.com/image.jpg",
    "canonical": "https://example.com/page"
  },
  "images": {
    "count": 15,
    "mapping": {
      "https://example.com/hero.jpg": "./images/a1b2c3d4e5f6.jpg",
      "https://example.com/logo.webp": "./images/f6e5d4c3b2a1.png"
    },
    "stats": {
      "total": 15,
      "converted": 3,
      "skipped": 12,
      "failed": 0
    }
  }
}

Key fields:

  • paths.documentPath - Used for browser preview URL

  • paths.htmlFilePath - Where to save final HTML file

  • images.mapping - Original URLs → local paths

  • metadata - Extracted page metadata
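
As an illustration of how a downstream step might consume images.mapping, the sketch below rewrites original image URLs in an HTML string to their local paths. The function name applyImageMapping is hypothetical, and plain string replacement is an assumption; the actual script may rewrite URLs through the DOM instead.

```javascript
// Hypothetical helper: replaces every occurrence of each original image URL
// in an HTML string with its local path from images.mapping.
function applyImageMapping(html, mapping) {
  // Replace longer URLs first so that a URL which is a prefix of another
  // (for example the same image with a query string) cannot clobber it.
  const urls = Object.keys(mapping).sort((a, b) => b.length - a.length);

  let result = html;
  for (const url of urls) {
    result = result.split(url).join(mapping[url]);
  }
  return result;
}

const exampleMapping = {
  "https://example.com/hero.jpg": "./images/a1b2c3d4e5f6.jpg",
  "https://example.com/logo.webp": "./images/f6e5d4c3b2a1.png",
};

console.log(
  applyImageMapping('<img src="https://example.com/hero.jpg">', exampleMapping)
);
```

String replacement is simple but blunt: it also rewrites URLs that appear in link targets or text, so a DOM-based approach may be preferable where that matters.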

Output

This skill provides:

  • ✅ metadata.json with paths, metadata, image mapping

  • ✅ screenshot.png for visual reference

  • ✅ cleaned.html with local image references

  • ✅ images/ folder with all downloaded images

Next step: Pass these outputs to the identify-page-structure skill

Troubleshooting

Browser not installed:

npx playwright install chromium

Sharp not installed:

cd .claude/skills/scrape-webpage/scripts && npm install

Image download failures:

  • Check images.stats.failed count in metadata.json

  • Some images may require authentication or be blocked by CORS

  • Failed images will be noted but won't stop the scraping process
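
A small check of images.stats can be scripted as follows. This is a sketch: the function name summarizeImageStats is hypothetical, and the field names are taken from the metadata.json structure shown in Step 3.

```javascript
// Hypothetical helper: summarizes download results from a parsed metadata.json.
function summarizeImageStats(metadata) {
  const images = metadata.images || {};
  const stats = images.stats || {};
  const failed = stats.failed || 0;

  return {
    failed,
    mapped: Object.keys(images.mapping || {}).length,
    ok: failed === 0,
    message:
      failed === 0
        ? "No image download failures"
        : `${failed} image(s) failed to download`,
  };
}

// Inline sample; in practice the object would come from
// JSON.parse(fs.readFileSync("./import-work/metadata.json", "utf8")).
const sampleMetadata = {
  images: {
    mapping: { "https://example.com/hero.jpg": "./images/a1b2c3d4e5f6.jpg" },
    stats: { total: 15, converted: 3, skipped: 12, failed: 2 },
  },
};

console.log(summarizeImageStats(sampleMetadata).message);
```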

Lazy-loaded images not captured:

  • Script scrolls through page to trigger lazy loading

  • Some advanced lazy-loading may need customization in scripts/analyze-webpage.js
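
One possible customization is a slower, stepwise scroll that pauses after each step so lazy loaders have time to fire. The sketch below is not the code in scripts/analyze-webpage.js; the function name scrollThroughPage and its option names are hypothetical. It assumes a Playwright page object and uses only page.evaluate and page.waitForTimeout.

```javascript
// Hypothetical helper: scrolls a Playwright page in fixed steps until the
// bottom is reached or maxSteps is hit. Returns the number of steps taken.
async function scrollThroughPage(page, options = {}) {
  const { step = 600, pauseMs = 250, maxSteps = 100 } = options;

  let steps = 0;
  while (steps < maxSteps) {
    steps += 1;

    // Runs in the browser: scroll down by one step.
    await page.evaluate((delta) => window.scrollBy(0, delta), step);

    // Give lazy loaders time to react (and possibly grow the page).
    await page.waitForTimeout(pauseMs);

    // Runs in the browser: check the position after the pause, so content
    // appended during the wait is taken into account.
    const atBottom = await page.evaluate(
      () =>
        window.innerHeight + window.scrollY >=
        document.documentElement.scrollHeight - 1
    );
    if (atBottom) break;
  }
  return steps;
}

module.exports = { scrollThroughPage };
```

Checking the position after the pause rather than before lets pages that append content while scrolling extend the loop; maxSteps bounds the loop on infinite-scroll pages.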

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
