web-scraping

Web scraping tools for fetching and extracting data from web pages

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Install skill "web-scraping" with this command: npx skills add paulgnz/xpr-web-scraping

Web Scraping

You have web scraping tools for fetching and extracting data from web pages:

Single page:

  • scrape_url — fetch a URL and get cleaned text content + metadata (title, description, link count)
    • Use format="text" (default) for most tasks — strips all HTML
    • Use format="markdown" to preserve headings, links, lists, bold/italic
    • Use format="html" only when you need raw HTML

Link discovery:

  • extract_links — fetch a page and extract all links with text and type (internal/external)
    • Use the pattern parameter to filter by regex (e.g. "\\.pdf$" for PDF links)
    • Links are deduplicated and resolved to absolute URLs
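
The link handling described above (regex filtering, deduplication, resolution to absolute URLs, internal/external typing) can be sketched roughly as follows. This is an illustration only, not the skill's actual implementation: `processLinks`, the `FoundLink` shape, and the choice to compare hosts for the internal/external label are all assumptions made for the example.

```typescript
// Hypothetical shape for one discovered link; the real tool's fields may differ.
interface FoundLink {
  url: string;
  text: string;
  type: "internal" | "external";
}

// Resolve, deduplicate, classify, and optionally filter raw hrefs from a page.
function processLinks(
  baseUrl: string,
  rawLinks: Array<{ href: string; text: string }>,
  pattern?: string,
): FoundLink[] {
  const base = new URL(baseUrl);
  const filter = pattern ? new RegExp(pattern) : undefined;
  const seen = new Set<string>();
  const result: FoundLink[] = [];

  for (const { href, text } of rawLinks) {
    let resolved: URL;
    try {
      resolved = new URL(href, base); // relative hrefs become absolute
    } catch {
      continue; // skip hrefs that cannot be parsed
    }
    const url = resolved.href;
    if (seen.has(url)) continue; // keep only the first occurrence
    seen.add(url);
    if (filter && !filter.test(url)) continue; // apply the regex filter
    result.push({
      url,
      text,
      // Assumption: "internal" means same host as the page being scanned.
      type: resolved.host === base.host ? "internal" : "external",
    });
  }
  return result;
}

const links = processLinks(
  "https://example.com/docs/",
  [
    { href: "report.pdf", text: "Report" },
    { href: "/docs/report.pdf", text: "Report (again)" },
    { href: "https://other.example.org/paper.pdf", text: "Paper" },
    { href: "about.html", text: "About" },
  ],
  "\\.pdf$",
);
console.log(links);
```

In this sketch the two spellings of the report link resolve to the same absolute URL and collapse into one entry, and the non-PDF link is dropped by the pattern.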

Multi-page research:

  • scrape_multiple — fetch up to 10 URLs in parallel for comparison/research
    • One failure doesn't block others (uses Promise.allSettled)
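
To illustrate why one failure does not block the others, here is a minimal sketch of the `Promise.allSettled` pattern. The names `scrapeAll`, `ScrapeOutcome`, and `fakeFetch` are hypothetical and exist only for this example; the skill's real internals and result shape may differ.

```typescript
// Hypothetical per-URL result; the real tool's output may be shaped differently.
type ScrapeOutcome =
  | { url: string; ok: true; content: string }
  | { url: string; ok: false; error: string };

const MAX_URLS = 10; // batch limit stated in the skill description

async function scrapeAll(
  urls: string[],
  fetchPage: (url: string) => Promise<string>,
): Promise<ScrapeOutcome[]> {
  if (urls.length > MAX_URLS) {
    throw new Error(`At most ${MAX_URLS} URLs per batch, got ${urls.length}`);
  }
  // allSettled waits for every promise and never rejects itself,
  // so a single failing URL cannot abort the whole batch.
  const settled = await Promise.allSettled(urls.map((u) => fetchPage(u)));
  return settled.map((r, i): ScrapeOutcome =>
    r.status === "fulfilled"
      ? { url: urls[i], ok: true, content: r.value }
      : { url: urls[i], ok: false, error: String(r.reason) },
  );
}

// Stand-in fetcher so the sketch runs offline; it fails for one URL on purpose.
async function fakeFetch(url: string): Promise<string> {
  if (url.includes("broken")) throw new Error("HTTP 500");
  return `content of ${url}`;
}

const demo = scrapeAll(
  ["https://example.com/a", "https://example.com/broken", "https://example.com/b"],
  fakeFetch,
);
demo.then((outcomes) => console.log(outcomes));
```

With `Promise.all` the rejected middle request would reject the whole batch; with `Promise.allSettled` the two successful pages are still returned alongside the recorded error.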

Best practices:

  • Prefer "text" format for content extraction, "markdown" for preserving structure
  • Don't scrape the same domain more than 5 times per minute
  • Combine with store_deliverable to save scraped content as job evidence
  • For very large pages, the content is limited to 5MB
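
The per-domain limit above (no more than 5 scrapes per minute) could be enforced on the calling side with a small sliding-window counter. The following is a sketch under assumptions: `DomainRateLimiter` is a hypothetical helper, not part of the skill, and it treats each hostname as a separate domain.

```typescript
// Hypothetical helper: tracks recent request timestamps per hostname.
class DomainRateLimiter {
  private readonly history = new Map<string, number[]>();
  private readonly maxRequests: number;
  private readonly windowMs: number;

  constructor(maxRequests = 5, windowMs = 60000) {
    this.maxRequests = maxRequests;
    this.windowMs = windowMs;
  }

  // Returns true and records the request if the domain is under its limit.
  // `now` is injectable so the behavior can be checked without waiting.
  tryAcquire(url: string, now: number = Date.now()): boolean {
    const domain = new URL(url).hostname; // assumption: hostname == domain
    const recent = (this.history.get(domain) ?? []).filter(
      (t) => now - t < this.windowMs, // keep only timestamps inside the window
    );
    const allowed = recent.length < this.maxRequests;
    if (allowed) recent.push(now);
    this.history.set(domain, recent);
    return allowed;
  }
}

const limiter = new DomainRateLimiter();
const attempts: boolean[] = [];
for (let i = 0; i < 6; i++) {
  // Six requests to the same host, one second apart, starting at t = 0.
  attempts.push(limiter.tryAcquire(`https://example.com/page${i}`, i * 1000));
}
const otherDomain = limiter.tryAcquire("https://example.org/", 5000);
const afterWindow = limiter.tryAcquire("https://example.com/later", 61000);
console.log(attempts, otherDomain, afterWindow);
```

A caller would check `tryAcquire` before each scrape and delay or skip the request when it returns false.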

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

Coding

Scrapling

Web scraping and data extraction using the Python Scrapling library. Use to scrape static HTML pages, JavaScript-rendered pages (Playwright), and anti-bot or...

General

Url Images To Pdf

Extract images from a URL and generate a PDF (preserves the original order; no sorting)

General

claw-text-and-pics

Extract text and embedded images from scanned documents, PDFs, and photos via Mistral OCR API. Use when reading receipts, invoices, contracts, handwritten no...

General

Huo15 Js Scraper

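Scraping tool for JavaScript-rendered websites. Use this skill when you need to scrape JS-rendered pages (such as WeCom documents or Vue/React SPAs), retrieve company data from Qichacha, bypass anti-scraping measures, or fetch sites whose content plain curl/wget/web_fetch cannot retrieve. Supports automatic switching between two engines, Playwright and scrapling.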
