paper-fetch

Use when the user wants to download a paper PDF from a DOI, title, or URL via legal open-access sources. Tries Unpaywall, arXiv, bioRxiv/medRxiv, PubMed Central, and Semantic Scholar in order. Never uses Sci-Hub or paywall bypass.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Install with: npx skills add agents365-ai/paper-fetch


Fetch the legal open-access PDF for a paper given a DOI (or title). Tries multiple OA sources in priority order and stops at the first hit.

Agent-native. Structured JSON envelope on stdout, NDJSON progress on stderr, stable exit codes, machine-readable schema, TTY-aware format default, idempotent retries.

Resolution order

  1. Unpaywall — https://api.unpaywall.org/v2/{doi}?email=$UNPAYWALL_EMAIL; read best_oa_location.url_for_pdf (skipped if UNPAYWALL_EMAIL is not set)
  2. Semantic Scholar — https://api.semanticscholar.org/graph/v1/paper/DOI:{doi}?fields=openAccessPdf,externalIds
  3. arXiv — if externalIds.ArXiv present, https://arxiv.org/pdf/{arxiv_id}.pdf
  4. PubMed Central OA — if PMCID present, https://www.ncbi.nlm.nih.gov/pmc/articles/{pmcid}/pdf/
  5. bioRxiv / medRxiv — if DOI prefix is 10.1101, query https://api.biorxiv.org/details/{server}/{doi} for the latest version PDF URL
  6. Otherwise → report failure with title/authors so the user can request via ILL
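The stop-at-first-hit fallthrough above can be sketched as a small loop. This is a minimal illustration, not the actual fetch.py code; the `resolve` helper and the resolver callables are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# A resolver maps a DOI to a PDF URL, or None on a miss.
Resolver = Callable[[str], Optional[str]]

def resolve(doi: str, resolvers: List[Tuple[str, Resolver]]):
    """Try resolvers in priority order; stop at the first PDF URL.

    Returns (source_name, pdf_url, sources_tried). On total failure the
    caller reports title/authors so the user can request the paper via ILL.
    """
    tried = []
    for name, fn in resolvers:
        tried.append(name)
        url = fn(doi)
        if url:
            return name, url, tried
    return None, None, tried
```

The returned `sources_tried` list is what surfaces in the per-item result envelope.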

If only a title is given, resolve to a DOI first via Semantic Scholar search_paper_by_title (asta MCP) or Crossref.
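The Crossref leg of that title-to-DOI fallback could look like the sketch below, using the public Crossref works API. Function names are illustrative, not taken from fetch.py; only the endpoint and the `query.bibliographic` parameter are real Crossref API surface.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

CROSSREF_WORKS = "https://api.crossref.org/works"

def crossref_query_url(title: str, rows: int = 1) -> str:
    """Build a bibliographic-title query URL for Crossref."""
    params = urllib.parse.urlencode({"query.bibliographic": title, "rows": rows})
    return f"{CROSSREF_WORKS}?{params}"

def pick_doi(response: dict) -> Optional[str]:
    """Extract the best-match DOI from a Crossref works response."""
    items = response.get("message", {}).get("items", [])
    return items[0].get("DOI") if items else None

def resolve_title(title: str, timeout: float = 30.0) -> Optional[str]:
    """Network call, kept separate so the parsing above stays testable."""
    with urllib.request.urlopen(crossref_query_url(title), timeout=timeout) as resp:
        return pick_doi(json.load(resp))
```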

Usage

python scripts/fetch.py <DOI> [options]
python scripts/fetch.py --batch <FILE|-> [options]
python scripts/fetch.py schema           # machine-readable self-description

Flags

Flag | Default | Description
doi | (none) | DOI to fetch (positional). Use - to read a single DOI from stdin
--batch FILE | (none) | File with one DOI per line for bulk download. Use - to read from stdin
--out DIR | pdfs | Output directory
--dry-run | off | Resolve sources without downloading; preview PDF URL and destination
--format | auto | json for agents, text for humans. Auto-detects: json when stdout is not a TTY, text when it is
--pretty | off | Pretty-print JSON with 2-space indent
--stream | off | Emit one NDJSON result per line on stdout as each DOI resolves, then a summary line (batch mode)
--overwrite | off | Re-download even when the destination file already exists
--idempotency-key KEY | (none) | Safe-retry key. Re-running with the same key replays the original envelope from <out>/.paper-fetch-idem/ without network I/O
--timeout SECONDS | 30 | HTTP timeout per request
--version | (none) | Print CLI + schema version and exit

Agent discovery: schema subcommand

python scripts/fetch.py schema

Emits a complete machine-readable description of the CLI on stdout (no network). Includes cli_version, schema_version, parameter types, exit codes, error codes, envelope shapes, and environment variables. Agents should read this once, cache it against schema_version, and re-read when the cached version drifts.
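The cache-against-schema_version pattern can be sketched as below. The loader is injected so the logic is testable; in practice it would shell out to `python scripts/fetch.py schema`. The helper name is hypothetical.

```python
from typing import Callable, Dict

def schema_for(observed_version: str, cache: Dict, load_schema: Callable[[], dict]) -> dict:
    """Return the cached schema; reload only when the observed
    schema_version (from an envelope's meta slot) drifts from the cache."""
    if cache.get("schema_version") != observed_version:
        schema = load_schema()  # e.g. run `python scripts/fetch.py schema`
        cache["schema_version"] = schema.get("schema_version", observed_version)
        cache["schema"] = schema
    return cache["schema"]
```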

Output contract

stdout emits a single JSON envelope. Every envelope carries a meta slot.

Success (all DOIs resolved):

{
  "ok": true,
  "data": {
    "results": [
      {
        "doi": "10.1038/s41586-021-03819-2",
        "success": true,
        "source": "unpaywall",
        "pdf_url": "https://www.nature.com/articles/s41586-021-03819-2.pdf",
        "file": "pdfs/Jumper_2021_Highly_accurate_protein_structure_predic.pdf",
        "meta": {"title": "Highly accurate protein structure prediction with AlphaFold", "year": 2021, "author": "Jumper"},
        "sources_tried": ["unpaywall"]
      }
    ],
    "summary": {"total": 1, "succeeded": 1, "failed": 0},
    "next": []
  },
  "meta": {
    "request_id": "req_a908f5156fc1",
    "latency_ms": 2036,
    "schema_version": "1.1.0",
    "cli_version": "0.3.0",
    "sources_tried": ["unpaywall"]
  }
}

Partial (batch mode — some DOIs failed, exit code reflects the failure class):

{
  "ok": "partial",
  "data": {
    "results": [
      { "doi": "10.1038/s41586-021-03819-2", "success": true, "source": "unpaywall", ... },
      {
        "doi": "10.1234/nonexistent",
        "success": false,
        "source": null,
        "pdf_url": null,
        "file": null,
        "meta": {},
        "sources_tried": ["unpaywall", "semantic_scholar"],
        "error": {
          "code": "not_found",
          "message": "No open-access PDF found",
          "retryable": true,
          "retry_after_hours": 168,
          "reason": "OA availability changes over time; retry after embargo lifts or preprint appears"
        }
      }
    ],
    "summary": {"total": 2, "succeeded": 1, "failed": 1},
    "next": ["paper-fetch 10.1234/nonexistent --out pdfs"]
  },
  "meta": { ... }
}

The next slot is an array of suggested follow-up commands: re-invoking them retries only the failed subset. Combine with --idempotency-key to make the whole batch safely retriable without re-downloading the already-succeeded items.
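An orchestrator can turn those suggestions into concrete retry invocations like this. `retry_commands` is a hypothetical helper, shown only to illustrate consuming `data.next`; appending `--idempotency-key` follows the pairing recommended above.

```python
import shlex
from typing import List

def retry_commands(envelope: dict, idempotency_key: str) -> List[List[str]]:
    """Turn the envelope's data.next suggestions into argv lists,
    each made safely retriable with the same idempotency key."""
    cmds = []
    for suggestion in envelope.get("data", {}).get("next", []):
        argv = shlex.split(suggestion)
        cmds.append(argv + ["--idempotency-key", idempotency_key])
    return cmds
```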

Failure (bad arguments, exit code 3):

{
  "ok": false,
  "error": {
    "code": "validation_error",
    "message": "Provide a DOI or --batch file",
    "retryable": false
  },
  "meta": { ... }
}

Per-item skipped (destination already exists, no --overwrite):

{
  "doi": "10.1038/s41586-021-03819-2",
  "success": true,
  "source": "unpaywall",
  "pdf_url": "https://...",
  "file": "pdfs/Jumper_2021_...pdf",
  "skipped": true,
  "skip_reason": "file_exists",
  "sources_tried": ["unpaywall"]
}

Idempotency replay (re-run with the same --idempotency-key):

The cached envelope is returned verbatim, but meta.request_id and meta.latency_ms are re-stamped for the current call, and meta.replayed_from_idempotency_key is set. No network I/O occurs.
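The re-stamping described above amounts to something like the following sketch (a hypothetical helper inferred from the description, not the real implementation):

```python
import copy

def restamp_replay(cached: dict, request_id: str, latency_ms: int, key: str) -> dict:
    """Return the cached envelope verbatim except for the per-call
    meta fields, leaving the on-disk cache entry untouched."""
    env = copy.deepcopy(cached)
    meta = env.setdefault("meta", {})
    meta["request_id"] = request_id
    meta["latency_ms"] = latency_ms
    meta["replayed_from_idempotency_key"] = key
    return env
```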

Stderr progress (NDJSON)

When --format json, stderr emits one JSON object per line for liveness:

{"event": "start",       "request_id": "req_...", "elapsed_ms": 2,    "doi": "10.1038/..."}
{"event": "source_try",  "request_id": "req_...", "elapsed_ms": 2,    "doi": "...", "source": "unpaywall"}
{"event": "source_hit",  "request_id": "req_...", "elapsed_ms": 2036, "doi": "...", "source": "unpaywall", "pdf_url": "..."}
{"event": "download_ok", "request_id": "req_...", "elapsed_ms": 4120, "doi": "...", "file": "..."}

Event types: start, source_try, source_hit, source_miss, source_skip, source_enrich, source_enrich_failed, download_ok, download_error, download_skip, dry_run, not_found, update_check_spawned. All events share request_id and elapsed_ms, letting an orchestrator correlate progress across stderr and the final stdout envelope.

source_enrich fires when Semantic Scholar is called purely to backfill missing author / title after another source already provided the PDF URL; its fields array lists exactly which fields were filled in. source_enrich_failed fires when that enrichment call fails — the already-resolved PDF URL is still used and the filename falls back to unknown_<year>_….

When --format text, stderr emits human-readable prose.
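An orchestrator can pair the progress stream with the final envelope via the shared request_id, as in this sketch (`events_for_request` is a hypothetical helper, not part of the skill):

```python
import json
from typing import List

def events_for_request(stderr_lines, envelope) -> List[str]:
    """Collect NDJSON event names that belong to the envelope's request_id."""
    rid = envelope["meta"]["request_id"]
    events = []
    for line in stderr_lines:
        try:
            evt = json.loads(line)
        except ValueError:
            continue  # text-mode prose lines are not NDJSON; skip them
        if evt.get("request_id") == rid:
            events.append(evt["event"])
    return events
```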

Exit codes

Code | Meaning | Retryable class
0 | All DOIs resolved / previewed | (n/a)
1 | Unresolved — one or more DOIs had no OA copy; no transport failure | Not now (retry after retry_after_hours)
2 | Reserved for auth errors (currently unused) | (n/a)
3 | Validation error (bad arguments, missing input) | No
4 | Transport error (network / download / IO failure) | Yes

The taxonomy lets an orchestrator route failures deterministically: exit 4 is worth retrying immediately, exit 1 is not, exit 3 is a bug in the caller.
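A minimal routing sketch under that taxonomy (the helper and its label strings are illustrative, not part of the CLI):

```python
def retry_policy(exit_code: int) -> str:
    """Map the documented exit codes onto an orchestrator's decision."""
    if exit_code == 0:
        return "done"
    if exit_code == 1:
        return "retry_later"   # no OA copy yet; honor retry_after_hours
    if exit_code == 4:
        return "retry_now"     # transient transport failure
    return "fix_caller"        # 3 = caller bug; 2 reserved, treated likewise here
```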

Error codes in JSON

Code | Meaning | Retryable
validation_error | Bad arguments or empty input | No
not_found | No open-access PDF found | Yes (retry after retry_after_hours: 168)
download_network_error | Network failure during download | Yes
download_not_a_pdf | Response was not a PDF (HTML landing page) | No
download_host_not_allowed | PDF URL host not in allowlist | No
download_size_exceeded | Response exceeded 50 MB limit | Yes
download_io_error | Local filesystem write failed | Yes
internal_error | Unexpected error | No

Examples

# Single DOI (JSON output when piped; text when in a terminal)
python scripts/fetch.py 10.1038/s41586-020-2649-2

# Dry-run preview (resolve without downloading)
python scripts/fetch.py 10.1038/s41586-020-2649-2 --dry-run

# Force JSON (for agents even inside a terminal)
python scripts/fetch.py 10.1038/s41586-020-2649-2 --format json

# Force human-readable text output even in a pipeline
python scripts/fetch.py 10.1038/s41586-020-2649-2 --format text

# Batch download, safely retriable
python scripts/fetch.py --batch dois.txt --out ./papers \
    --idempotency-key monday-review-batch

# Pipe DOIs from another tool
zot -F ids.json query ... | jq -r '.[].doi' | python scripts/fetch.py --batch -

# Agent discovery
python scripts/fetch.py schema --pretty

# Streaming mode — one result per line as each DOI resolves
python scripts/fetch.py --batch dois.txt --stream

# Works without UNPAYWALL_EMAIL (skips Unpaywall, uses remaining 4 sources)
python scripts/fetch.py 10.1038/s41586-020-2649-2

Environment

Variable | Default | Purpose
UNPAYWALL_EMAIL | unset | Contact email for the Unpaywall API. Optional but recommended. Without it, Unpaywall is skipped (the remaining 4 sources still work).
PAPER_FETCH_ALLOWED_HOSTS | unset | Comma-separated extra hostnames to extend the download allowlist
PAPER_FETCH_NO_AUTO_UPDATE | unset | Set to any value to disable the silent background self-update
PAPER_FETCH_UPDATE_INTERVAL | 86400 | Cooldown in seconds between update checks
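The allowlist extension via PAPER_FETCH_ALLOWED_HOSTS could work as sketched below. The helper name and the idea of passing the base allowlist in explicitly are assumptions for testability; the real base allowlist contents are not documented here.

```python
import os
from urllib.parse import urlparse

def host_allowed(pdf_url: str, base_allowlist, env=None) -> bool:
    """Check a PDF URL's host against the built-in allowlist, extended
    by the comma-separated PAPER_FETCH_ALLOWED_HOSTS variable."""
    env = os.environ if env is None else env
    extra = env.get("PAPER_FETCH_ALLOWED_HOSTS", "")
    allowed = set(base_allowlist) | {h.strip() for h in extra.split(",") if h.strip()}
    return urlparse(pdf_url).hostname in allowed
```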

Notes

  • Auth is delegated. The agent never runs a login subcommand. The human or the orchestrator sets UNPAYWALL_EMAIL in the environment; the agent inherits it. Missing email degrades gracefully to the remaining 4 sources.
  • Trust is directional. CLI arguments are validated once at the entry point. The host allowlist and 50 MB size cap are enforced in the environment layer, not at the agent's request. An agent cannot loosen safety by passing a flag — only by the operator setting PAPER_FETCH_ALLOWED_HOSTS.
  • Downloads are naturally idempotent. Re-running against the same --out skips files that already exist (deterministic filename: {first_author}_{year}_{short_title}.pdf). Pair with --idempotency-key to also replay the exact envelope without any network I/O.
  • Never attempts to bypass paywalls. If no OA copy exists, the skill reports failure honestly — do not suggest Sci-Hub or similar.
  • Default output directory: ./pdfs/.
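The deterministic filename convention from the notes above can be reproduced roughly as follows. The 40-character title cap and the underscore sanitisation are inferred from the example filename in the success envelope; the actual rules in fetch.py may differ.

```python
import re

def pdf_filename(first_author: str, year: int, title: str, max_title: int = 40) -> str:
    """Sketch of the {first_author}_{year}_{short_title}.pdf naming scheme:
    non-alphanumeric runs become underscores, then the title is capped."""
    short = re.sub(r"[^A-Za-z0-9]+", "_", title).strip("_")[:max_title].rstrip("_")
    return f"{first_author}_{year}_{short}.pdf"
```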

Auto-update

When installed via git clone, the skill keeps itself in sync with upstream automatically. On each invocation, fetch.py spawns a detached background git pull --ff-only in the skill directory:

  • Non-blocking — the current invocation is not delayed; the pull runs in a new session and is fully detached
  • Silent — all git output goes to /dev/null, the stdout envelope is never polluted
  • Throttled — at most once every 24 hours (stamped via .git/.paper-fetch-last-update)
  • Safe — --ff-only refuses to merge when you have local edits, so conflicts cannot occur
  • Observable — when a pull is spawned, stderr emits {"event": "update_check_spawned", ...} (JSON mode) or a prose notice (text mode)
  • Convergence — updates apply on the next invocation, not the current one (because the pull is backgrounded)

Force an immediate check with rm <skill_dir>/.git/.paper-fetch-last-update.
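The throttle-then-spawn behaviour can be sketched as below. Reading the stamp's mtime for the cooldown check is an assumption (the listing only says the check is "stamped via" that file); the detached, silenced Popen mirrors the non-blocking and silent properties described above.

```python
import os
import subprocess
import time

def update_due(stamp_path: str, interval: int = 86400, now=None) -> bool:
    """True when the last update check is older than the cooldown,
    or when no stamp exists yet (first check is always due)."""
    now = time.time() if now is None else now
    try:
        return now - os.path.getmtime(stamp_path) >= interval
    except OSError:
        return True

def spawn_update(skill_dir: str) -> None:
    """Detached, silent fast-forward pull; the current invocation
    never waits on it and the stdout envelope stays clean."""
    subprocess.Popen(
        ["git", "-C", skill_dir, "pull", "--ff-only"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # runs in a new session, fully detached
    )
```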

