Web Fetch: Article Download & Image Handling
Overview
This skill provides comprehensive patterns for downloading web articles with images and converting them to clean markdown for offline reference. It covers multiple extraction methods with Jina AI Reader as the primary approach.
Context
User provides a URL to extract as markdown with images. This skill is appropriate when:
- Downloading blog posts, news articles, or documentation for offline reference
- Preserving articles with their images in a local directory structure
- Converting web content to markdown for knowledge management
Process
- Fetch article content using Jina AI Reader (preferred) or fallback methods
- Create directory structure for article and images
- Download all images locally with parallel requests
- Update markdown paths to reference local images
- Add source metadata to the markdown file
- Verification: Confirm all files exist and markdown renders correctly
Detailed Steps
1. Fetch Article Content
Option A: Jina AI Reader (Recommended)
curl "https://r.jina.ai/https://example.com/article" > article.md
Option B: WebFetch (Claude Code)
WebFetch(
url: "https://example.com/article",
prompt: "Convert this entire article to clean, well-formatted markdown.
Include all headings, paragraphs, code blocks, lists, and
preserve all image URLs with their alt text in markdown format
. Capture the full article including title,
author, date, and all sections."
)
Option C: curl + pandoc
# Download HTML
curl "https://example.com/article" -o article.html
# Convert to markdown
pandoc -f html -t markdown article.html -o article.md
# Install pandoc if needed:
# macOS: brew install pandoc
# Ubuntu: apt install pandoc
Option D: lynx
lynx -dump -nolist "https://example.com/article" > article.md
# Install if needed:
# macOS: brew install lynx
# Ubuntu: apt install lynx
Option E: html2text
curl "https://example.com/article" | html2text > article.md
# Install if needed:
# pip install html2text
2. Create Directory Structure
mkdir -p references/article-name/images
3. Download Images
Parallel download (recommended for multiple images):
curl -s -o "references/article-name/images/01-image.png" "https://cdn.example.com/image1.png" &
curl -s -o "references/article-name/images/02-image.png" "https://cdn.example.com/image2.png" &
curl -s -o "references/article-name/images/03-image.png" "https://cdn.example.com/image3.png" &
wait
Flags: -s (silent), -o (output file), & (background), wait (wait for all)
4. Update Markdown Paths
Replace remote URLs with local paths:
# Before:

# After:

Add source metadata:
# Article Title
**Source:** [Original Article](https://example.com/article)
**Downloaded:** 2024-12-11
**Authors:** Name Here
---
[Content...]
5. Verify
ls -lh references/article-name/
ls -lh references/article-name/images/
Complete Example
Using Jina AI Reader:
# 1. Setup
mkdir -p references/building-effective-agents/images
# 2. Fetch article (using Jina AI - no install needed)
curl "https://r.jina.ai/https://www.anthropic.com/engineering/building-effective-agents" > temp.md
# 3. Download images (parallel)
curl -s -o "references/building-effective-agents/images/01-augmented-llm.png" \
"https://cdn.sanity.io/images/4zrzovbb/website/d3083d3f40bb2b6f477901cc9a240738d3dd1371-2401x1000.png" &
curl -s -o "references/building-effective-agents/images/02-prompt-chaining.png" \
"https://cdn.sanity.io/images/4zrzovbb/website/7418719e3dab222dccb379b8879e1dc08ad34c78-2401x1000.png" &
wait
# 4. Update paths (manual edit or sed)
# Use Write tool or text editor to replace URLs with local paths
# 5. Verify
ls -lh references/building-effective-agents/
Advanced: Auto-Extract Image URLs
Extract all image URLs from markdown:
grep -o '!\[.*\](https://[^)]*)' article.md | sed 's/!\[.*\](\(.*\))/\1/'
Extract and download automatically:
grep -o '!\[.*\](https://[^)]*)' article.md | \
sed 's/!\[.*\](\(.*\))/\1/' | \
while IFS= read -r url; do
filename=$(basename "$url")
curl -s -o "images/$filename" "$url" &
done
wait
Tool Recommendations
Priority Order
-
Jina AI Reader (Best) ✅
- Works reliably across most sites
- Converts HTML to clean markdown automatically
- Preserves image URLs with alt text
- No installation required
- Handles redirects well
-
WebFetch (Good for exploration)
- Useful for initial investigation
- Sometimes provides better formatting than Jina
- Good fallback for sites that block curl
-
Local tools (When needed)
- pandoc, lynx, html2text → More control but require installation
- Use only if Jina fails
Source-Specific Patterns
Academic/Journal Sites (MDPI)
- Jina conversion works well for content
- Challenge: Embedded figures are in HTML, not directly downloadable
- Solution: Extract images directly from HTML source or search for figure URLs
Medium Articles
- Jina handles content extraction well
- Challenge: Article content may reference images as "Press enter or click to view..."
- Solution: Search for actual image URLs in the markdown output
grep -oE "https://miro\.medium\.com/[^[:space:]]*\.(png|jpg)" article.md
Blog Posts (Analytics Vidhya, Google Cloud, Anthropic)
- Jina works extremely well
- Images are usually directly referenced and downloadable
- Use auto-extract method
Notion Pages
- Requires JavaScript to render
- WebFetch may fail with 403 or render JavaScript placeholder
- Workaround: Copy-paste from browser or use headless browser tools
Image Extraction Patterns
For images in markdown:
grep -oE '!\[.*\]\(https://[^)]*\)' article.md | sed 's/.*(\(.*\))/\1/' | sort -u
For generic image URLs in HTML:
grep -oE 'https://[^[:space:]]*\.(png|jpg|jpeg|gif|webp)' article.md | sort -u
For CDN images with special characters:
# Some CDNs use % encoding - decode them:
grep -oE "https://[^[:space:]]*%20[^[:space:]]*\.(png|jpg)" article.md | \
sed 's/%20/ /g'
Image Download Best Practices
Sequential numbering:
# Always use 01-, 02-, 03- format for easy sorting and reference
curl -s -L -o "images/01-first-image.png" "URL1" &
curl -s -L -o "images/02-second-image.png" "URL2" &
wait
Use -L flag for redirects:
# Many CDNs redirect - always include -L
curl -L -o "image.png" "https://cdn.example.com/image.png"
Timeout for slow/failing downloads:
# Add timeout to prevent hanging
curl -m 10 --connect-timeout 5 -L -o "image.png" "URL" &
Markdown Metadata Format
Maintain consistency with this template:
# Article Title
**Source:** [Full Link](https://example.com/article)
**Published:** Month DD, YYYY
**Author(s):** Name(s)
---
## Content...
Final Checklist
✅ Create directory: mkdir -p references/article-name/images
✅ Try Jina first: curl "https://r.jina.ai/FULL_URL" > article.md
✅ Extract image URLs: grep -oE 'https://.*\.(png|jpg)'
✅ Download images with -L flag
✅ Update markdown paths: replace https://... with images/01-name.png
✅ Add source metadata block at top
✅ Verify: ls -lh both directories
✅ Test markdown renders locally
Batch Processing Example
Scenario: Extracting 4+ articles from presentation references
# 1. Batch fetch using Jina (non-interactive, reliable)
mkdir -p references/{article1,article2,article3,article4}/images
curl "https://r.jina.ai/https://www.analyticsvidhya.com/blog/2023/05/..." > /tmp/analytics.md
curl "https://r.jina.ai/https://cloud.google.com/blog/..." > /tmp/google.md
curl "https://r.jina.ai/https://medium.com/..." > /tmp/medium.md
# 2. Extract image URLs from each
grep -oE "https://[^[:space:]]*\.(png|jpg|jpeg|gif)" /tmp/analytics.md > /tmp/urls-analytics.txt
grep -oE "https://[^[:space:]]*\.(png|jpg|jpeg|gif)" /tmp/google.md > /tmp/urls-google.txt
grep -oE "https://[^[:space:]]*\.(png|jpg|jpeg|gif)" /tmp/medium.md > /tmp/urls-medium.txt
# 3. Download images in parallel (watch for 0-byte files)
while read url; do
filename=$(echo "$url" | sed 's/.*\///' | sed 's/%20/-/g')
curl -s -L -m 10 -o "references/analytics/images/$filename" "$url" &
done < /tmp/urls-analytics.txt
wait
# 4. Check for failed downloads
find references/ -name "*.png" -size 0 -delete # Remove 0-byte files
find references/ -name "*.png" -exec ls -lh {} \; # Verify sizes
# 5. In markdown: replace https://cdn... with images/filename.png
sed -i '' 's|https://cdn\.analyticsvidhya\.com[^)]*|images/01-diagram.png|g' article.md
Troubleshooting
404 errors or 0-byte files:
# Remove -s to see errors
curl -o "image.png" "https://example.com/image.png"
# Follow redirects
curl -L -o "image.png" "https://example.com/image.png"
# Add user agent
curl -A "Mozilla/5.0" -o "image.png" "https://example.com/image.png"
Images don't render:
# Use relative paths (images/01.png) not absolute paths
# Verify files exist
ls -la images/
JavaScript-required sites (Notion, etc.):
# Some sites won't render content without JavaScript execution
# Solutions:
# 1. Use a headless browser: puppeteer, playwright, or selenium
# 2. Access the article through alternative sources
# 3. Fall back to copy-paste from browser rendering
# Not recommended: These tools require Node.js/Python and significant setup
Quick Reference
# 1. Create structure
mkdir -p references/ARTICLE_NAME/images
# 2. Fetch content (choose one):
# Jina AI (no install):
curl "https://r.jina.ai/URL" > article.md
# OR pandoc (local):
curl "URL" | pandoc -f html -t markdown -o article.md
# OR lynx:
lynx -dump -nolist "URL" > article.md
# OR html2text:
curl "URL" | html2text > article.md
# 3. Download images (parallel with &)
curl -s -o "images/01.png" "IMAGE_URL" &
# 4. Update markdown: replace URLs with images/01.png
# 5. Verify
ls -lh references/ARTICLE_NAME/
Guidelines
- Always try Jina AI Reader first - it's the most reliable and requires no installation
- Use sequential numbering (01-, 02-, 03-) for images to maintain order
- Always include
-Lflag with curl to follow redirects - Add source metadata (URL, date, author) at the top of markdown files
- Verify downloads by checking file sizes - 0-byte files indicate failures
- For JavaScript-heavy sites, fall back to headless browser tools or manual copy-paste
This skill provides comprehensive patterns for extracting web articles with images for offline reference and knowledge management.