# Firecrawl Web Scraper Skill

**Status:** Production Ready
**Last Updated:** 2026-01-20
**Official Docs:** https://docs.firecrawl.dev
**API Version:** v2
**SDK Versions:** firecrawl-py 4.13.0+, @mendable/firecrawl-js 4.11.1+

## What is Firecrawl?
Firecrawl is a Web Data API for AI that turns websites into LLM-ready markdown or structured data. It handles:
- **JavaScript rendering** - Executes client-side JavaScript to capture dynamic content
- **Anti-bot bypass** - Gets past CAPTCHA and bot detection systems
- **Format conversion** - Outputs as markdown, HTML, JSON, screenshots, summaries
- **Document parsing** - Processes PDFs, DOCX files, and images
- **Autonomous agents** - AI-powered web data gathering without URLs
- **Change tracking** - Monitors content changes over time
- **Branding extraction** - Extracts color schemes, typography, logos
## API Endpoints Overview

| Endpoint | Purpose | Use Case |
|----------|---------|----------|
| `/scrape` | Single page | Extract article, product page |
| `/crawl` | Full site | Index docs, archive sites |
| `/map` | URL discovery | Find all pages, plan strategy |
| `/search` | Web search + scrape | Research with live data |
| `/extract` | Structured data | Product prices, contacts |
| `/agent` | Autonomous gathering | No URLs needed, AI navigates |
| `/batch-scrape` | Multiple URLs | Bulk processing |
## Scrape Endpoint (`/v2/scrape`)

Scrapes a single webpage and returns clean, structured content.

### Basic Usage

```python
from firecrawl import Firecrawl
import os

app = Firecrawl(api_key=os.environ.get("FIRECRAWL_API_KEY"))

# Basic scrape
doc = app.scrape(
    url="https://example.com/article",
    formats=["markdown", "html"],
    only_main_content=True
)

print(doc.markdown)
print(doc.metadata)
```
```javascript
import Firecrawl from '@mendable/firecrawl-js';

const app = new Firecrawl({ apiKey: process.env.FIRECRAWL_API_KEY });

// v2 SDK: scrapeUrl() was renamed to scrape() (see Issue #2 below)
const result = await app.scrape('https://example.com/article', {
  formats: ['markdown', 'html'],
  onlyMainContent: true
});

console.log(result.markdown);
```
### Output Formats

| Format | Description |
|--------|-------------|
| `markdown` | LLM-optimized content |
| `html` | Full HTML |
| `rawHtml` | Unprocessed HTML |
| `screenshot` | Page capture (with viewport options) |
| `links` | All URLs on page |
| `json` | Structured data extraction |
| `summary` | AI-generated summary |
| `branding` | Design system data |
| `changeTracking` | Content change detection |
### Advanced Options

```python
doc = app.scrape(
    url="https://example.com",
    formats=["markdown", "screenshot"],
    only_main_content=True,
    remove_base64_images=True,
    wait_for=5000,  # Wait 5s for JS
    timeout=30000,
    # Location & language
    location={"country": "AU", "languages": ["en-AU"]},
    # Cache control
    max_age=0,            # Fresh content (no cache)
    store_in_cache=True,
    # Stealth mode for complex sites
    stealth=True,
    # Custom headers
    headers={"User-Agent": "Custom Bot 1.0"}
)
```
### Browser Actions

Perform interactions before scraping:

```python
doc = app.scrape(
    url="https://example.com",
    actions=[
        {"type": "click", "selector": "button.load-more"},
        {"type": "wait", "milliseconds": 2000},
        {"type": "scroll", "direction": "down"},
        {"type": "write", "selector": "input#search", "text": "query"},
        {"type": "press", "key": "Enter"},
        {"type": "screenshot"}  # Capture state mid-action
    ]
)
```
### JSON Mode (Structured Extraction)

```python
# With schema
doc = app.scrape(
    url="https://example.com/product",
    formats=["json"],
    json_options={
        "schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "price": {"type": "number"},
                "in_stock": {"type": "boolean"}
            }
        }
    }
)

# Without schema (prompt-only)
doc = app.scrape(
    url="https://example.com/product",
    formats=["json"],
    json_options={
        "prompt": "Extract the product name, price, and availability"
    }
)
```
### Branding Extraction

Extract design system and brand identity:

```python
doc = app.scrape(
    url="https://example.com",
    formats=["branding"]
)
```
Returns:
- Color schemes and palettes
- Typography (fonts, sizes, weights)
- Spacing and layout metrics
- UI component styles
- Logo and imagery URLs
- Brand personality traits
## Crawl Endpoint (`/v2/crawl`)

Crawls all accessible pages from a starting URL.

```python
result = app.crawl(
    url="https://docs.example.com",
    limit=100,
    max_discovery_depth=3,  # v2 name (maxDepth was removed, see Issue #4)
    allowed_domains=["docs.example.com"],
    exclude_paths=["/api/", "/admin/"],
    scrape_options={
        "formats": ["markdown"],
        "only_main_content": True
    }
)

for page in result.data:
    print(f"Scraped: {page.metadata.source_url}")
    print(f"Content: {page.markdown[:200]}...")
```
### Async Crawl with Webhooks

```python
# Start crawl (returns immediately)
job = app.start_crawl(
    url="https://docs.example.com",
    limit=1000,
    webhook="https://your-domain.com/webhook"
)

print(f"Job ID: {job.id}")

# Or poll for status
status = app.get_crawl_status(job.id)
```
## Map Endpoint (`/v2/map`)

Rapidly discover all URLs on a website without scraping content.

```python
urls = app.map(url="https://example.com")

print(f"Found {len(urls)} pages")
for url in urls[:10]:
    print(url)
```
Use for: sitemap discovery, crawl planning, website audits.
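For crawl planning, the mapped URLs can be filtered before spending crawl credits. A small sketch, assuming `plan_crawl_targets` as an illustrative helper (not part of the SDK) and path-prefix rules matching the `exclude_paths` example above:

```python
from urllib.parse import urlparse

def plan_crawl_targets(urls, include_prefix="/docs", exclude_prefixes=("/api/", "/admin/")):
    """Filter mapped URLs down to the paths worth scraping."""
    targets = []
    for url in urls:
        path = urlparse(url).path
        if not path.startswith(include_prefix):
            continue
        if any(path.startswith(p) for p in exclude_prefixes):
            continue
        targets.append(url)
    return targets

# Example with hypothetical mapped URLs
mapped = [
    "https://example.com/docs/intro",
    "https://example.com/api/v1",
    "https://example.com/docs/guide",
]
print(plan_crawl_targets(mapped))
# → ['https://example.com/docs/intro', 'https://example.com/docs/guide']
```

The filtered list can then feed a batch scrape instead of a full crawl, which keeps credit usage predictable.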
## Search Endpoint (`/v2/search`) - NEW

Perform web searches and optionally scrape the results in one operation.

```python
# Basic search
results = app.search(
    query="best practices for React server components",
    limit=10
)

for result in results:
    print(f"{result.title}: {result.url}")
```

```python
# Search + scrape results
results = app.search(
    query="React server components tutorial",
    limit=5,
    scrape_options={
        "formats": ["markdown"],
        "only_main_content": True
    }
)

for result in results:
    print(f"{result.title}")
    print(result.markdown[:500])
```
### Search Options

```python
results = app.search(
    query="machine learning papers",
    limit=20,
    # Filter by source type
    sources=["web", "news", "images"],
    # Filter by category
    categories=["github", "research", "pdf"],
    # Location
    location={"country": "US"},
    # Time filter: qdr:h=hour, qdr:d=day, qdr:w=week, qdr:m=month, qdr:y=year
    tbs="qdr:m",  # Past month
    timeout=30000
)
```

**Cost:** 2 credits per 10 results + scraping costs if enabled.
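For budgeting, the pricing above can be turned into a rough estimator. A sketch — `estimate_search_credits` is an illustrative helper, and the assumption that partial blocks of 10 results round up is mine, not confirmed by the docs:

```python
import math

def estimate_search_credits(num_results: int, scraped_pages: int = 0,
                            credits_per_scrape: int = 1) -> int:
    """Estimate credits for one search: 2 credits per block of 10 results
    (assumed to round up), plus per-page cost when scraping is enabled."""
    search_cost = 2 * math.ceil(num_results / 10)
    return search_cost + scraped_pages * credits_per_scrape

print(estimate_search_credits(10))                   # → 2
print(estimate_search_credits(25))                   # → 6
print(estimate_search_credits(5, scraped_pages=5))   # → 7
```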
## Extract Endpoint (`/v2/extract`)

AI-powered structured data extraction from single pages, multiple pages, or entire domains.

### Single Page

```python
from pydantic import BaseModel

class Product(BaseModel):
    name: str
    price: float
    description: str
    in_stock: bool

result = app.extract(
    urls=["https://example.com/product"],
    schema=Product,
    system_prompt="Extract product information"
)

print(result.data)
```
### Multi-Page / Domain Extraction

```python
# Extract from entire domain using wildcard
result = app.extract(
    urls=["example.com/*"],  # All pages on domain
    schema=Product,
    system_prompt="Extract all products"
)

# Enable web search for additional context
result = app.extract(
    urls=["example.com/products"],
    schema=Product,
    enable_web_search=True  # Follow external links
)
```
### Prompt-Only Extraction (No Schema)

```python
result = app.extract(
    urls=["https://example.com/about"],
    prompt="Extract the company name, founding year, and key executives"
)
# LLM determines output structure
```
## Agent Endpoint (`/agent`) - NEW

Autonomous web data gathering without requiring specific URLs. The agent searches, navigates, and gathers data using natural language prompts.

```python
# Basic agent usage
result = app.agent(
    prompt="Find the pricing plans for the top 3 headless CMS platforms and compare their features"
)

print(result.data)
```
```python
# With schema for structured output
from pydantic import BaseModel
from typing import List

class CMSPricing(BaseModel):
    name: str
    free_tier: bool
    starter_price: float
    features: List[str]

result = app.agent(
    prompt="Find pricing for Contentful, Sanity, and Strapi",
    schema=CMSPricing
)
```
```python
# Optional: focus on specific URLs
result = app.agent(
    prompt="Extract the enterprise pricing details",
    urls=["https://contentful.com/pricing", "https://sanity.io/pricing"]
)
```
### Agent Models

| Model | Best For | Cost |
|-------|----------|------|
| `spark-1-mini` (default) | Simple extractions, high volume | Standard |
| `spark-1-pro` | Complex analysis, ambiguous data | 60% more |

```python
result = app.agent(
    prompt="Analyze competitive positioning...",
    model="spark-1-pro"  # For complex tasks
)
```
### Async Agent

```python
# Start agent (returns immediately)
job = app.start_agent(
    prompt="Research market trends..."
)

# Poll for results
status = app.check_agent_status(job.id)
if status.status == "completed":
    print(status.data)
```
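Single status checks like the one above are usually wrapped in a polling loop with a timeout. A hedged sketch — `poll_until_done` is an illustrative helper that takes any status-fetching callable (e.g. `lambda: app.check_agent_status(job.id)`), so the loop itself can be tested without the API:

```python
import time

def poll_until_done(fetch_status, interval=2.0, timeout=300.0, sleep=time.sleep):
    """Call fetch_status() until it reports a terminal state or the timeout expires.

    Assumes the returned object exposes a .status attribute and that
    "completed" / "failed" are the terminal values (as shown above).
    """
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status.status in ("completed", "failed"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("job did not finish in time")
        sleep(interval)

# Usage (hypothetical): poll_until_done(lambda: app.check_agent_status(job.id))
```

Injecting `sleep` as a parameter keeps the helper trivially testable and reusable for crawl or batch-scrape jobs as well.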
**Note:** Agent is in Research Preview. 5 free daily requests, then credit-based billing.
## Batch Scrape - NEW

Process multiple URLs efficiently in a single operation.

```python
# Synchronous (waits for completion)
results = app.batch_scrape(
    urls=[
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3"
    ],
    formats=["markdown"],
    only_main_content=True
)

for page in results.data:
    print(f"{page.metadata.source_url}: {len(page.markdown)} chars")
```
```python
# Asynchronous (with webhooks)
job = app.start_batch_scrape(
    urls=url_list,
    formats=["markdown"],
    webhook="https://your-domain.com/webhook"
)
```

Webhook receives events: `started`, `page`, `completed`, `failed`.

```javascript
const job = await app.startBatchScrape(urls, {
  formats: ['markdown'],
  webhook: 'https://your-domain.com/webhook'
});

// Poll for status
const status = await app.checkBatchScrapeStatus(job.id);
```
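On the receiving side, the webhook events can be routed by type. A minimal dispatcher sketch, assuming only the event names documented above (`started`, `page`, `completed`, `failed`) and a JSON payload with `"type"`, `"data"`, and `"error"` keys — those field names are illustrative, not confirmed by the docs:

```python
def handle_batch_event(event: dict, pages: list) -> str:
    """Route one batch-scrape webhook event; accumulate page payloads as they arrive."""
    kind = event.get("type")
    if kind == "started":
        return "job started"
    if kind == "page":
        pages.append(event.get("data"))
        return f"{len(pages)} pages received"
    if kind == "completed":
        return f"done with {len(pages)} pages"
    if kind == "failed":
        return f"failed: {event.get('error', 'unknown')}"
    return "ignored"

collected = []
handle_batch_event({"type": "started"}, collected)
handle_batch_event({"type": "page", "data": {"markdown": "..."}}, collected)
print(handle_batch_event({"type": "completed"}, collected))  # → done with 1 pages
```

The same shape works inside a Flask/FastAPI route or a Workers handler: parse the body, call the dispatcher, return 200 quickly.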
## Change Tracking - NEW

Monitor content changes over time by comparing scrapes.

```python
# Enable change tracking
doc = app.scrape(
    url="https://example.com/pricing",
    formats=["markdown", "changeTracking"]
)

# Response includes:
print(doc.change_tracking.status)             # new, same, changed, removed
print(doc.change_tracking.previous_scrape_at)
print(doc.change_tracking.visibility)         # visible, hidden
```
### Comparison Modes

```python
# Git-diff mode (default)
doc = app.scrape(
    url="https://example.com/docs",
    formats=["markdown", "changeTracking"],
    change_tracking_options={
        "mode": "diff"
    }
)
print(doc.change_tracking.diff)  # Line-by-line changes
```

```python
# JSON mode (structured comparison)
doc = app.scrape(
    url="https://example.com/pricing",
    formats=["markdown", "changeTracking"],
    change_tracking_options={
        "mode": "json",
        "schema": {"type": "object", "properties": {"price": {"type": "number"}}}
    }
)
```

JSON mode costs 5 credits per page.
**Change States:**

- `new` - Page not seen before
- `same` - No changes since last scrape
- `changed` - Content modified
- `removed` - Page no longer accessible
## Authentication

```bash
# Get API key from https://www.firecrawl.dev/app
# Store in environment
FIRECRAWL_API_KEY=fc-your-api-key-here
```

**Never hardcode API keys!**
## Cloudflare Workers Integration

The Firecrawl SDK cannot run in Cloudflare Workers (it requires Node.js). Use the REST API directly:

```typescript
interface Env {
  FIRECRAWL_API_KEY: string;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const { url } = await request.json<{ url: string }>();

    const response = await fetch('https://api.firecrawl.dev/v2/scrape', {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${env.FIRECRAWL_API_KEY}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({
        url,
        formats: ['markdown'],
        onlyMainContent: true
      })
    });

    const result = await response.json();
    return Response.json(result);
  }
};
```
## Rate Limits & Pricing

> **Warning: Stealth Mode Pricing Change (May 2025)**
> Stealth mode now costs 5 credits per request when actively used. Default behavior uses "auto" mode, which only charges stealth credits if basic fails.

Recommended pattern:

```python
# Use auto mode (default) - only charges 5 credits if stealth is needed
doc = app.scrape(url, formats=["markdown"])

# Or conditionally enable stealth for specific errors
if error_status_code in [401, 403, 500]:
    doc = app.scrape(url, formats=["markdown"], proxy="stealth")
```
### Unified Billing (November 2025)

Credits and tokens merged into a single system. The Extract endpoint uses credits (15 tokens = 1 credit).

### Pricing Tiers

| Tier | Credits/Month | Notes |
|------|---------------|-------|
| Free | 500 | Good for testing |
| Hobby | 3,000 | $19/month |
| Standard | 100,000 | $99/month |
| Growth | 500,000 | $399/month |
**Credit Costs:**

- Scrape: 1 credit (basic), 5 credits (stealth)
- Crawl: 1 credit per page
- Search: 2 credits per 10 results
- Extract: 5 credits per page (changed from tokens in v2.6.0)
- Agent: Dynamic (complexity-based)
- Change Tracking JSON mode: +5 credits
## Common Issues & Solutions

| Issue | Cause | Solution |
|-------|-------|----------|
| Empty content | JS not loaded | Add `wait_for=5000` or use actions |
| Rate limit exceeded | Over quota | Check dashboard, upgrade plan |
| Timeout error | Slow page | Increase `timeout`, use `stealth=True` |
| Bot detection | Anti-scraping | Use `stealth=True`, add `location` |
| Invalid API key | Wrong format | Must start with `fc-` |
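The remedies in the table can be chained into a single escalation path. A sketch — `scrape_with_escalation` is an illustrative wrapper, not part of the SDK; it takes the scrape callable as a parameter (e.g. `lambda **kw: app.scrape(**kw)`) so the escalation logic is testable offline, and the option names follow the table above:

```python
def scrape_with_escalation(scrape, url):
    """Try a plain scrape, then retry with a JS wait, then with stealth."""
    attempts = [
        {},                                   # plain scrape
        {"wait_for": 5000},                   # empty content / JS not loaded
        {"stealth": True, "timeout": 60000},  # bot detection or slow page
    ]
    last_error = None
    for extra in attempts:
        try:
            return scrape(url=url, formats=["markdown"], **extra)
        except Exception as e:
            last_error = e
    raise last_error

# Usage (hypothetical):
# doc = scrape_with_escalation(lambda **kw: app.scrape(**kw), "https://example.com")
```

Note that the stealth attempt costs 5 credits when used (see the pricing warning above), so escalation order matters.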
## Known Issues Prevention

This skill prevents 10 documented issues:

### Issue #1: Stealth Mode Pricing Change (May 2025)

**Error:** Unexpected credit costs when using stealth mode
**Source:** Stealth Mode Docs | Changelog
**Why It Happens:** Starting May 8th, 2025, Stealth Mode proxy requests cost 5 credits per request (previously included in standard pricing). This is a significant billing change.
**Prevention:** Use auto mode (default), which only charges stealth credits if basic fails.

```python
# RECOMMENDED: Use auto mode (default)
doc = app.scrape(url, formats=['markdown'])
# Auto retries with stealth (5 credits) only if basic fails

# Or conditionally enable based on error status
try:
    doc = app.scrape(url, formats=['markdown'], proxy='basic')
except Exception as e:
    if getattr(e, "status_code", None) in [401, 403, 500]:
        doc = app.scrape(url, formats=['markdown'], proxy='stealth')
```
**Stealth Mode Options:**

- `auto` (default): Charges 5 credits only if stealth succeeds after basic fails
- `basic`: Standard proxies, 1 credit cost
- `stealth`: 5 credits per request when actively used
### Issue #2: v2.0.0 Breaking Changes - Method Renames

**Error:** `AttributeError: 'FirecrawlApp' object has no attribute 'scrape_url'`
**Source:** v2.0.0 Release | Migration Guide
**Why It Happens:** v2.0.0 (August 2025) renamed SDK methods across all languages.
**Prevention:** Use the new method names.

**JavaScript/TypeScript:**

- `scrapeUrl()` → `scrape()`
- `crawlUrl()` → `crawl()` or `startCrawl()`
- `asyncCrawlUrl()` → `startCrawl()`
- `checkCrawlStatus()` → `getCrawlStatus()`

**Python:**

- `scrape_url()` → `scrape()`
- `crawl_url()` → `crawl()` or `start_crawl()`

```python
# OLD (v1)
doc = app.scrape_url("https://example.com")

# NEW (v2)
doc = app.scrape("https://example.com")
```
### Issue #3: v2.0.0 Breaking Changes - Format Changes

**Error:** `'extract' is not a valid format`
**Source:** v2.0.0 Release
**Why It Happens:** The old "extract" format was renamed to "json" in v2.0.0.
**Prevention:** Use the new object format for JSON extraction.

```python
# OLD (v1)
doc = app.scrape_url(
    url="https://example.com",
    params={
        "formats": ["extract"],
        "extract": {"prompt": "Extract title"}
    }
)

# NEW (v2)
doc = app.scrape(
    url="https://example.com",
    formats=[{"type": "json", "prompt": "Extract title"}]
)

# With schema
doc = app.scrape(
    url="https://example.com",
    formats=[{
        "type": "json",
        "prompt": "Extract product info",
        "schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "price": {"type": "number"}
            }
        }
    }]
)
```

Screenshot format also changed:

```python
# NEW: Screenshot as object
formats=[{
    "type": "screenshot",
    "fullPage": True,
    "quality": 80,
    "viewport": {"width": 1920, "height": 1080}
}]
```
### Issue #4: v2.0.0 Breaking Changes - Crawl Options

**Error:** `'allowBackwardCrawling' is not a valid parameter`
**Source:** v2.0.0 Release
**Why It Happens:** Several crawl parameters were renamed or removed in v2.0.0.
**Prevention:** Use the new parameter names.

**Parameter Changes:**

- `allowBackwardCrawling` → use `crawlEntireDomain` instead
- `maxDepth` → use `maxDiscoveryDepth` instead
- `ignoreSitemap` (bool) → `sitemap` ("only", "skip", "include")

```python
# OLD (v1)
app.crawl_url(
    url="https://docs.example.com",
    params={
        "allowBackwardCrawling": True,
        "maxDepth": 3,
        "ignoreSitemap": False
    }
)

# NEW (v2)
app.crawl(
    url="https://docs.example.com",
    crawl_entire_domain=True,
    max_discovery_depth=3,
    sitemap="include"  # "only", "skip", or "include"
)
```
### Issue #5: v2.0.0 Default Behavior Changes

**Error:** Stale cached content returned unexpectedly
**Source:** v2.0.0 Release
**Why It Happens:** v2.0.0 changed several defaults.
**Prevention:** Be aware of the new defaults.

**Default Changes:**

- `maxAge` now defaults to 2 days (cached by default)
- `blockAds`, `skipTlsVerification`, `removeBase64Images` enabled by default

```python
# Force fresh data if needed
doc = app.scrape(url, formats=['markdown'], max_age=0)

# Disable cache entirely
doc = app.scrape(url, formats=['markdown'], store_in_cache=False)
```
### Issue #6: Job Status Race Condition

**Error:** "Job not found" when checking crawl status immediately after creation
**Source:** GitHub Issue #2662
**Why It Happens:** Database replication delay between job creation and status endpoint availability.
**Prevention:** Wait 1-3 seconds before the first status check, or implement retry logic.

```python
import time

# Start crawl
job = app.start_crawl(url="https://docs.example.com")
print(f"Job ID: {job.id}")

# REQUIRED: Wait before first status check
time.sleep(2)  # 1-3 seconds recommended

# Now status check succeeds
status = app.get_crawl_status(job.id)

# Or implement retry logic
def get_status_with_retry(job_id, max_retries=3, delay=1):
    for attempt in range(max_retries):
        try:
            return app.get_crawl_status(job_id)
        except Exception as e:
            if "Job not found" in str(e) and attempt < max_retries - 1:
                time.sleep(delay)
                continue
            raise

status = get_status_with_retry(job.id)
```
### Issue #7: DNS Errors Return HTTP 200

**Error:** DNS resolution failures return `success: false` with HTTP 200 status instead of 4xx
**Source:** GitHub Issue #2402 | Fixed in v2.7.0
**Why It Happens:** Changed in v2.7.0 for consistent error handling.
**Prevention:** Check the `success` and `code` fields; don't rely on the HTTP status alone.

```javascript
const result = await app.scrape('https://nonexistent-domain-xyz.com');

// DON'T rely on the HTTP status code
// Response: HTTP 200 with { success: false, code: "SCRAPE_DNS_RESOLUTION_ERROR" }

// DO check the success field
if (!result.success) {
  if (result.code === 'SCRAPE_DNS_RESOLUTION_ERROR') {
    console.error('DNS resolution failed');
  }
  throw new Error(result.error);
}
```

**Note:** DNS resolution errors still charge 1 credit despite failure.
### Issue #8: Bot Detection Still Charges Credits

**Error:** Cloudflare error page returned as a "successful" scrape, credits charged
**Source:** GitHub Issue #2413
**Why It Happens:** The Fire-1 engine charges credits even when bot detection prevents access.
**Prevention:** Validate that content isn't an error page before processing; use stealth mode for protected sites.

```python
url = "https://protected-site.com"

# First attempt without stealth
doc = app.scrape(url=url, formats=["markdown"])

# Validate content isn't an error page
if "cloudflare" in doc.markdown.lower() or "access denied" in doc.markdown.lower():
    # Retry with stealth (costs 5 credits if successful)
    doc = app.scrape(url=url, formats=["markdown"], stealth=True)
```

**Cost Impact:** Basic scrape charges 1 credit even on failure; a stealth retry charges an additional 5 credits.
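The inline check above can be factored into a reusable predicate. A sketch — `looks_like_block_page` is an illustrative helper, and the marker strings are common Cloudflare/WAF phrases, not an exhaustive or official list:

```python
BLOCK_MARKERS = (
    "cloudflare",
    "access denied",
    "verify you are human",
    "attention required",
)

def looks_like_block_page(markdown: str) -> bool:
    """Heuristic: pages containing common anti-bot phrases are likely block pages."""
    text = (markdown or "").lower()
    return any(marker in text for marker in BLOCK_MARKERS)

print(looks_like_block_page("Attention Required! | Cloudflare"))       # → True
print(looks_like_block_page("# Real Article\n\nActual content here.")) # → False
```

Running this check before any downstream processing avoids both wasted LLM calls on error pages and unnecessary stealth retries on pages that scraped fine.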
### Issue #9: Self-Hosted Anti-Bot Fingerprinting Weakness

**Error:** "All scraping engines failed!" (`SCRAPE_ALL_ENGINES_FAILED`) on sites with anti-bot measures
**Source:** GitHub Issue #2257
**Why It Happens:** Self-hosted Firecrawl lacks the advanced anti-fingerprinting techniques present in the cloud service.
**Prevention:** Use the Firecrawl cloud service for sites with strong anti-bot measures, or configure a proxy.

```bash
# Self-hosted fails on Cloudflare-protected sites
curl -X POST 'http://localhost:3002/v2/scrape' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -d '{
    "url": "https://www.example.com/",
    "pageOptions": { "engine": "playwright" }
  }'
# Error: "All scraping engines failed!"

# Workaround: use the cloud service instead - it has better anti-fingerprinting
```

**Note:** This affects self-hosted v2.3.0+ with the default docker-compose setup. Warning present: "⚠️ WARNING: No proxy server provided. Your IP address may be blocked."
### Issue #10: Cache Performance Best Practices (Community-sourced)

**Suboptimal:** Not leveraging the cache can make requests 500% slower
**Source:** Fast Scraping Docs | Blog Post
**Why It Matters:** The default `maxAge` is 2 days in v2+, but many use cases need different strategies.
**Prevention:** Use the appropriate cache strategy for your content type.

```python
# Fresh data (real-time pricing, stock prices)
doc = app.scrape(url, formats=["markdown"], max_age=0)

# 10-minute cache (news, blogs)
doc = app.scrape(url, formats=["markdown"], max_age=600000)  # milliseconds

# Use default cache (2 days) for static content
doc = app.scrape(url, formats=["markdown"])  # maxAge defaults to 172800000

# Don't store in cache (one-time scrape)
doc = app.scrape(url, formats=["markdown"], store_in_cache=False)

# Require minimum age before re-scraping (v2.7.0+)
doc = app.scrape(url, formats=["markdown"], min_age=3600000)  # 1 hour minimum
```

**Performance Impact:**

- Cached response: milliseconds
- Fresh scrape: seconds
- Speed difference: up to 500%
## Package Versions

| Package | Version | Last Checked |
|---------|---------|--------------|
| firecrawl-py | 4.13.0+ | 2026-01-20 |
| @mendable/firecrawl-js | 4.11.1+ | 2026-01-20 |
| API Version | v2 | Current |
## Official Documentation

- Python SDK: https://docs.firecrawl.dev/sdks/python
- Node.js SDK: https://docs.firecrawl.dev/sdks/node
- API Reference: https://docs.firecrawl.dev/api-reference
- Dashboard: https://www.firecrawl.dev/app
---

**Token Savings:** ~65% vs manual integration
**Error Prevention:** 10 documented issues (v2 migration, stealth pricing, job status race, DNS errors, bot detection billing, self-hosted limitations, cache optimization)
**Production Ready:** Yes

Last verified: 2026-01-21 | Skill version: 2.0.0 | Changes: Added Known Issues Prevention section with 10 documented errors from TIER 1-2 research findings; added v2 migration guidance; documented stealth mode pricing change and unified billing model