article-extractor

This skill should be used when the user wants to "download article", "extract article", "save blog post", "get article text", or provides a web URL and asks to extract the main content without ads, navigation, or clutter. Saves clean, readable text from web articles and blog posts.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "article-extractor" with this command: npx skills add ryanhudson/tapestry-skills-for-claude-code/ryanhudson-tapestry-skills-for-claude-code-article-extractor

Article Extractor

Extract main content from web articles and blog posts, removing navigation, ads, and clutter. Saves clean, readable text.

Prerequisites

This skill requires UV for dependency management. Run from the tapestry-skills project root.

Workflow

URL → Validate → Detect Tool → Extract Content → Sanitize Filename → Save

Tools (in priority order):

  1. reader (Mozilla Readability) - best for most articles
  2. trafilatura - excellent for blogs/news (included in dependencies)
  3. Fallback (tapestry-extract-html) - works without external tools

Security Requirements

All security utilities are available via UV from the project root.

URL Validation

# Run security validation (checks protocol, blocks SSRF, etc.)
uv run tapestry-validate-url "$URL" || exit 1

Filename Sanitization

# Use tapestry sanitization utility
SAFE_TITLE=$(uv run tapestry-sanitize-filename "$TITLE")

Step 1: Check Available Tools

if command -v reader &> /dev/null; then
    TOOL="reader"
else
    # trafilatura is included in project dependencies
    TOOL="trafilatura"
fi

echo "Using: $TOOL"

Optional: Install reader (npm)

For best results, install reader separately:

npm install -g @aspect/readability-cli
# or
npm install -g reader-cli

Step 2: Extract Content

Using reader

TEMP_FILE=$(mktemp)
trap "rm -f '$TEMP_FILE'" EXIT

reader "$URL" > "$TEMP_FILE"

# Get title (first line in markdown format)
TITLE=$(head -n 1 "$TEMP_FILE" | sed 's/^# //')

Using trafilatura (via UV)

TEMP_FILE=$(mktemp)
trap "rm -f '$TEMP_FILE'" EXIT

# Get content
uv run trafilatura --URL "$URL" --output-format txt --no-comments > "$TEMP_FILE"

# Get title from metadata
TITLE=$(uv run trafilatura --URL "$URL" --json 2>/dev/null | \
    python3 -c "import json,sys; print(json.load(sys.stdin).get('title','Article'))" 2>/dev/null || echo "Article")

Fallback Method (tapestry-extract-html)

Use the built-in HTML extractor when other tools aren't available:

# Extract content using tapestry's HTML extractor
uv run tapestry-extract-html "$URL" --output article.txt

Or get title and content separately:

TEMP_FILE=$(mktemp)
trap "rm -f '$TEMP_FILE'" EXIT

# Fetch and extract
uv run tapestry-extract-html "$URL" --output "$TEMP_FILE"

# Title is on first line (after "# ")
TITLE=$(head -n 1 "$TEMP_FILE" | sed 's/^# //')

Step 3: Save with Clean Filename

SAFE_TITLE=$(uv run tapestry-sanitize-filename "$TITLE")
CONTENT_FILE="${SAFE_TITLE}.txt"

# Verify content was extracted
if [ ! -s "$TEMP_FILE" ]; then
    echo "Error: No content extracted"
    exit 1
fi

mv "$TEMP_FILE" "$CONTENT_FILE"
trap - EXIT

WORD_COUNT=$(wc -w < "$CONTENT_FILE" | tr -d ' ')

echo "Extracted: $TITLE"
echo "Saved to: $CONTENT_FILE"
echo "Words: $WORD_COUNT"

Complete Workflow Script

#!/bin/bash
set -e

URL="$1"

# Validate URL
uv run tapestry-validate-url "$URL" || exit 1

# Detect tool
if command -v reader &> /dev/null; then
    TOOL="reader"
else
    TOOL="trafilatura"
fi

echo "Extracting with: $TOOL"

# Create temp file
TEMP_FILE=$(mktemp)
trap "rm -f '$TEMP_FILE'" EXIT

# Extract based on tool
case $TOOL in
    reader)
        reader "$URL" > "$TEMP_FILE"
        TITLE=$(head -n 1 "$TEMP_FILE" | sed 's/^# //')
        ;;
    trafilatura)
        uv run trafilatura --URL "$URL" --output-format txt --no-comments > "$TEMP_FILE"
        TITLE=$(uv run trafilatura --URL "$URL" --json 2>/dev/null | \
            python3 -c "import json,sys; print(json.load(sys.stdin).get('title','Article'))" 2>/dev/null || echo "Article")
        ;;
esac

# If extraction failed, try fallback
if [ ! -s "$TEMP_FILE" ]; then
    echo "Primary extraction failed, trying fallback..."
    uv run tapestry-extract-html "$URL" --output "$TEMP_FILE"
    TITLE=$(head -n 1 "$TEMP_FILE" | sed 's/^# //')
fi

# Verify extraction
if [ ! -s "$TEMP_FILE" ]; then
    echo "Error: No content extracted. Site may require authentication."
    exit 1
fi

# Save with clean filename
SAFE_TITLE=$(uv run tapestry-sanitize-filename "$TITLE")
CONTENT_FILE="${SAFE_TITLE}.txt"
mv "$TEMP_FILE" "$CONTENT_FILE"
trap - EXIT

# Show results
WORD_COUNT=$(wc -w < "$CONTENT_FILE" | tr -d ' ')
echo ""
echo "Extracted: $TITLE"
echo "Saved to: $CONTENT_FILE"
echo "Words: $WORD_COUNT"
echo ""
echo "Preview:"
head -n 10 "$CONTENT_FILE"

Error Handling

IssueSolution
UV not installedInstall with curl -LsSf https://astral.sh/uv/install.sh | sh
Invalid URLReject with clear message
Internal URL (SSRF)Block localhost/private IPs
Paywall/login requiredInform user, cannot extract
Empty extractionTry fallback method, inform user
TimeoutFallback uses 30s timeout

What Gets Extracted

Included:

  • Article title
  • Author (if available)
  • Main text content
  • Section headings

Removed:

  • Navigation menus
  • Ads and promotions
  • Newsletter signups
  • Related articles
  • Comment sections
  • Social buttons
  • Cookie notices

Tool Comparison

ToolStrengthsAvailability
readerBest overall, Firefox algorithmnpm install separately
trafilaturaNews/blogs, multi-languageIncluded in dependencies
tapestry-extract-htmlNo external dependenciesBuilt-in fallback

Dependencies

All dependencies are managed via UV and pyproject.toml:

  • trafilatura: Article extraction (pinned version)
  • tapestry-extract-html: Built-in fallback extractor

Optional (install separately):

  • reader: Mozilla Readability CLI (npm install -g reader-cli)

Security Reference

For complete security guidelines: ../shared/references/security-guidelines.md

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

Coding

article-extractor

No summary provided by upstream source.

Repository SourceNeeds Review
Coding

tapestry

No summary provided by upstream source.

Repository SourceNeeds Review
Coding

ship-learn-next

No summary provided by upstream source.

Repository SourceNeeds Review
Coding

youtube-transcript

No summary provided by upstream source.

Repository SourceNeeds Review