scraping

Web scraping using nu-shell and browser tools for data extraction.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "scraping" with this command: npx skills add knoopx/pi/knoopx-pi-scraping

Prerequisites

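  • nu-shell installed (nu)

  • query web plugin installed (for HTML scraping). The command is provided by nu_plugin_query: install it with cargo install nu_plugin_query, then register the binary with nu -c "plugin add ~/.cargo/bin/nu_plugin_query" (path assumes the default cargo install location)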

  • Browser extension enabled (for dynamic content): Enable the browser extension in your agent configuration

Common Tasks

Fetching Web Pages

Use http get to retrieve HTML content:

Simple GET request

nu -c 'http get https://example.com'

With headers

nu -c 'http get -H [User-Agent "My Scraper"] https://example.com'

HTML Parsing and Data Extraction

Use the query web plugin to parse HTML and extract data using CSS selectors:

Extract text from elements

nu -c 'http get https://example.com | query web -q "h1, h2" | str trim'

Extract attributes

nu -c 'http get https://example.com | query web -q "a" -a href'

Parse tables as structured data

nu -c 'http get https://example.com/table-page | query web --as-table ["Column1" "Column2"]'
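
The selector and attribute flags above can be tried offline before pointing them at a live site. The sketch below is only an illustration: the HTML fragment and the PAGE variable are hypothetical, and it assumes the query web plugin is already registered. It passes the fragment to nu through the environment, uses -q to select elements, and adds -a to read an attribute from each match. The flatten step is there because, depending on the plugin version, text results may come back as one list of text nodes per element.

```shell
# Hypothetical HTML fragment standing in for a fetched page.
export PAGE='<h2> First </h2><a href="/one">One</a><h2>Second</h2><a href="/two">Two</a>'

# Text: -q selects elements; flatten unwraps per-element text lists, str trim tidies whitespace.
titles=$(nu -c '$env.PAGE | query web -q "h2" | flatten | str trim | str join "|"')

# Attributes: -q picks the elements, -a names the attribute to read from each one.
links=$(nu -c '$env.PAGE | query web -q "a" -a href | flatten | str join ","')

echo "$titles"
echo "$links"
```

Replacing $env.PAGE with an http get call gives the same pipeline against a real page.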

Browser-Based Scraping for Dynamic Content

For websites requiring JavaScript execution or complex DOM interactions, use browser automation tools.

Start browser

start-browser

Navigate to page

navigate-browser --url https://example.com

Extract data with JavaScript evaluation

evaluate-javascript --code "Array.from(document.querySelectorAll('selector')).map(e => e.textContent)"

Screenshot for visual inspection

take-screenshot

Query HTML fragments

query-html-elements --selector ".content"

API Interactions

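For JSON APIs, http get parses the response automatically when the server reports a JSON content type. Fetch the body with --raw and pipe it to from json only when you need to parse the text yourself:

GET JSON API

nu -c 'http get https://api.example.com/data'

GET raw text and parse explicitly

nu -c 'http get --raw https://api.example.com/data | from json'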
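
Because from json operates on text, it applies when the body arrives as a raw string (for example from http get --raw, or from a file saved earlier). A minimal sketch of that step, using a hypothetical inline body and a hypothetical BODY variable in place of a real response so it runs without network access:

```shell
# Hypothetical raw response body; in practice this might come from `http get --raw <url>`.
export BODY='[{"name": "alice", "id": 1}, {"name": "bob", "id": 2}]'

# from json turns the text into a table; get name reads one column from it.
names=$(nu -c '$env.BODY | from json | get name | str join ","')
echo "$names"
```

Once the data is structured, the usual nu commands (get, select, where) apply in the same way as for an automatically parsed response.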

POST requests

nu -c 'http post https://api.example.com/submit -t application/json {key: value}'

Handling Authentication

Basic auth

nu -c 'http get -u username -p password https://api.example.com'

Bearer token

nu -c 'http get -H [Authorization "Bearer YOUR_TOKEN"] https://api.example.com'

Custom headers

nu -c 'http get -H [X-API-Key "YOUR_KEY" User-Agent "Scraper"] https://api.example.com'
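
The placeholders above put the secret directly in the command text. One possible alternative, sketched here under the assumption that the token is exported in an environment variable (API_TOKEN is a hypothetical name), is to build the header list inside nu from $env:

```shell
# Hypothetical token; in practice export it from a secret store rather than typing it inline.
export API_TOKEN="example-token"

# Build the header list inside nu from $env, so the token never appears in the command text.
header=$(nu -c 'let headers = [Authorization $"Bearer ($env.API_TOKEN)"]; $headers | str join ": "')
echo "$header"

# The same list can then be passed to a request, for example:
#   nu -c 'http get -H [Authorization $"Bearer ($env.API_TOKEN)"] https://api.example.com'
```

The echo is only there to show what would be sent; drop it in real use so the token does not end up in logs.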

Rate Limiting and Delays

Add delays between requests

nu -c 'let urls = ["https://example.com/1" "https://example.com/2"]; $urls | each { |url| let page = (http get $url); sleep 1sec; $page }'

Parallel Processing

Scrape multiple pages in parallel

nu -c 'let urls = ["https://example.com/1" "https://example.com/2"]; $urls | par-each { |url| http get $url | query web -q ".data" }'
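
Unbounded parallelism can overwhelm a site, and par-each does not promise to return results in input order. A small offline sketch of two flags that address this, --threads to cap concurrency and --keep-order to preserve input order; the paths are hypothetical, and the string-building closure is a stand-in for the http get call:

```shell
# Stand-in work items; real code would call `http get` inside the closure.
# --threads caps concurrency; --keep-order returns results in input order.
pages=$(nu -c '["/a" "/b" "/c"] | par-each --threads 2 --keep-order { |path| $"https://example.com($path)" } | str join ","')
echo "$pages"
```

A low thread count combined with a short sleep inside the closure is a reasonable starting point for sites that publish no rate limits.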

One-liner Examples

Basic HTML Scraping

Extract all h1 titles

nu -c 'http get https://example.com | query web -q "h1"'

Get all links

nu -c 'http get https://example.com | query web -q "a" -a href'

Scrape product prices

nu -c 'http get https://store.example.com | query web -q ".price"'

HTML Scraping Example: Hacker News

Scrape HN front page titles and URLs

nu -c 'let html = (http get https://news.ycombinator.com/); $html | query web -q ".titleline > a" | each { |t| $t | str join } | zip ($html | query web -q ".titleline > a" -a href) | each { |pair| $"($pair.0) - ($pair.1)" }'

For static sites like HN, use http get directly. Reserve browser tools for dynamic content requiring JavaScript execution.

GitHub Stars Scraper

Get star count for a repo

nu -c 'http get https://api.github.com/repos/nushell/nushell | get stargazers_count'

API Data Extraction

Fetch JSON and extract fields

nu -c 'http get https://api.example.com/users | get -i 0.name'

API Authentication

Bearer token

nu -c 'http get -H [Authorization "Bearer YOUR_TOKEN"] https://api.example.com/data'

API key

nu -c 'http get -H [X-API-Key "YOUR_API_KEY"] https://api.example.com/data'

Basic auth

nu -c 'http get -u username -p password https://api.example.com/protected'

Related Skills

  • nu-shell: Core nu-shell scripting patterns and commands.

Related Tools

  • start-browser: Start Cromite browser via Puppeteer.

  • navigate-browser: Navigate to a URL in the browser.

  • evaluate-javascript: Evaluate JavaScript code in the active browser tab.

  • take-screenshot: Take a screenshot of the active browser tab.

  • query-html-elements: Extract HTML elements by CSS selector.

  • list-browser-tabs: List all open browser tabs with their titles and URLs.

  • close-tab: Close a browser tab by index or title.

  • switch-tab: Switch to a specific tab by index.

  • refresh-tab: Refresh the current tab.

  • current-url: Get the URL of the current active tab.

  • page-title: Get the title of the current active tab.

  • wait-for-element: Wait for a CSS selector to appear on the page.

  • click-element: Click on an element by CSS selector.

  • type-text: Type text into an input field.

  • extract-text: Extract text content from elements by CSS selector.

  • search-web: Perform web searches and extract information from search results.

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
