arxiv-search

arXiv Search (metadata-first)

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "arxiv-search" with this command: npx skills add willoscar/research-units-pipeline-skills/willoscar-research-units-pipeline-skills-arxiv-search

Collect an initial paper set with enough metadata to support downstream ranking, taxonomy building, and citation generation.

When online, prefer rich arXiv metadata (categories, arxiv_id, pdf_url, published/updated, etc.). When offline, accept an export and convert it cleanly.

Load Order

Always read:

  • references/domain_pack_overview.md — how domain packs drive topic-specific behavior

Domain packs (loaded by topic match):

  • assets/domain_packs/llm_agents.json — pinned IDs, query rewrite rules for LLM agent topics

Script Boundary

Use scripts/run.py only for:

  • arXiv API retrieval and XML parsing

  • offline export conversion (CSV/JSON/JSONL normalization)

  • metadata enrichment via id_list backfill

Do not treat run.py as the place for:

  • hardcoded topic detection or query rewriting (use domain packs)

  • domain-specific pinned paper lists (externalize to assets/domain_packs/)
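As an illustration of the "arXiv API retrieval and XML parsing" responsibility, here is a minimal sketch of turning an arXiv Atom response into record dicts with the standard library. The helper names (parse_arxiv_feed, _text) are hypothetical and this is not the actual code in scripts/run.py; it only assumes the Atom and arXiv namespaces that the public API uses.

```python
import xml.etree.ElementTree as ET

# XML namespaces used in arXiv Atom responses.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}


def _text(node, path):
    """Return whitespace-collapsed text at `path`, or "" when absent."""
    return " ".join((node.findtext(path, default="", namespaces=NS) or "").split())


def parse_arxiv_feed(xml_text):
    """Hypothetical helper: turn an Atom feed string into a list of record dicts."""
    records = []
    for entry in ET.fromstring(xml_text).findall("atom:entry", NS):
        abs_url = _text(entry, "atom:id")
        published = _text(entry, "atom:published")
        primary = entry.find("arxiv:primary_category", NS)
        pdf_links = [
            link.get("href")
            for link in entry.findall("atom:link", NS)
            if link.get("title") == "pdf"
        ]
        records.append({
            "title": _text(entry, "atom:title"),
            "authors": [_text(a, "atom:name") for a in entry.findall("atom:author", NS)],
            "year": int(published[:4]) if published[:4].isdigit() else None,
            "url": abs_url,
            "abstract": _text(entry, "atom:summary"),
            # Keeps any version suffix (e.g. "v1") that the feed reports.
            "arxiv_id": abs_url.rsplit("/abs/", 1)[-1],
            "pdf_url": pdf_links[0] if pdf_links else None,
            "categories": [c.get("term") for c in entry.findall("atom:category", NS)],
            "primary_category": primary.get("term") if primary is not None else None,
            "published": published,
            "updated": _text(entry, "atom:updated"),
        })
    return records
```

Keeping the parser free of topic logic matches the boundary above: anything topic-specific would come from a domain pack, not from this function.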

Input

  • queries.md (keywords, excludes, time window)

Outputs

  • papers/papers_raw.jsonl (JSONL; 1 paper per line)

  • Each record includes at least: title, authors, year, url, abstract

  • In online mode (arXiv API), records also include helpful metadata: arxiv_id, pdf_url, categories, primary_category, published, updated, doi, journal_ref, comment

  • Convenience index (optional but generated by the script): papers/papers_raw.csv
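For orientation, a record in papers/papers_raw.jsonl might look like the sketch below. All values are placeholders rather than a real paper, and the to_jsonl helper is hypothetical; only the field names come from the list above.

```python
import json

# Placeholder values only; this is not a real paper.
record = {
    "title": "Placeholder title",
    "authors": ["A. One", "B. Two"],  # authors is an array
    "year": 2024,
    "url": "https://arxiv.org/abs/0000.00000",
    "abstract": "Placeholder abstract.",
    # Online-mode extras (present when the arXiv API was used):
    "arxiv_id": "0000.00000",
    "primary_category": "cs.AI",
    "categories": ["cs.AI", "cs.CL"],
}


def to_jsonl(records):
    """Serialize records as JSONL text: one compact JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


text = to_jsonl([record, record])
```

Each line of the resulting text parses independently with json.loads, which is what lets downstream steps stream the file.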

Decision: online vs offline

  • If you have network access: run arXiv API retrieval.

  • If not: import an export the user provides (CSV/JSON/JSONL) and normalize fields.

  • Hybrid: if you import offline and gain network access later, you can enrich missing fields (abstract/authors/categories) via arXiv id_list, using --enrich-metadata or enrich_metadata: true in queries.md.
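The hybrid path can be sketched as: find records with gaps, then build id_list requests for them. This is a minimal sketch under assumptions, not the script's implementation: the function names and the batch size of 50 are illustrative, and records are assumed to carry an arxiv_id. The endpoint and the id_list and max_results parameters are those of the public arXiv API.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"  # public arXiv API endpoint
ENRICHABLE = ("abstract", "authors", "categories")


def ids_needing_enrichment(records):
    """Return arxiv_id values for records that lack any enrichable field."""
    return [
        r["arxiv_id"]
        for r in records
        if r.get("arxiv_id") and any(not r.get(f) for f in ENRICHABLE)
    ]


def id_list_urls(arxiv_ids, batch_size=50):
    """Build one API URL per batch of IDs (batch_size is an arbitrary choice)."""
    urls = []
    for start in range(0, len(arxiv_ids), batch_size):
        batch = arxiv_ids[start:start + batch_size]
        params = {"id_list": ",".join(batch), "max_results": len(batch)}
        # safe="," keeps the comma-separated ID list readable in the URL.
        urls.append(ARXIV_API + "?" + urlencode(params, safe=","))
    return urls
```

Records without an arxiv_id are skipped here, since id_list cannot look them up; they would stay as imported.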

Workflow (heuristic)

  • Read queries.md and expand into concrete query strings.

  • Retrieve results (online) or import an export (offline).

  • Normalize every record to include at least: title, authors (array), year, url, abstract

  • Keep the set broad at this stage; dedupe/ranking comes next.

  • Apply time window and max_results if specified.
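The normalization step can be sketched as follows. This is an illustrative helper, not the script's code: it assumes author strings are semicolon-separated and that a missing year can be taken from the first four digits of a published timestamp.

```python
import re


def normalize_record(raw):
    """Hypothetical normalizer: coerce a raw dict to the five required fields."""
    authors = raw.get("authors") or []
    if isinstance(authors, str):
        # Assumption: a flat author string uses ";" as the separator.
        authors = [a.strip() for a in authors.split(";") if a.strip()]

    year = raw.get("year")
    if year in (None, ""):
        # Assumption: fall back to the year prefix of `published`, if present.
        match = re.match(r"\d{4}", str(raw.get("published", "")))
        year = match.group(0) if match else None

    return {
        "title": " ".join(str(raw.get("title", "")).split()),
        "authors": list(authors),
        "year": int(year) if year is not None and str(year).isdigit() else None,
        "url": str(raw.get("url", "")).strip(),
        "abstract": " ".join(str(raw.get("abstract", "")).split()),
    }
```

Missing values become empty strings, an empty list, or None rather than raising, which keeps the set broad at this stage.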

Quality checklist

  • papers/papers_raw.jsonl exists.

  • Each line is valid JSON and contains title, authors, year, url.
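The checklist can be automated with a few lines of standard-library Python. The function name below is hypothetical, and it only checks what the checklist states: each line parses as JSON and carries the four listed keys.

```python
import json

REQUIRED = ("title", "authors", "year", "url")


def check_jsonl_lines(lines):
    """Return (line_number, problem) tuples; an empty list means all lines pass."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        missing = [key for key in REQUIRED if key not in record]
        if missing:
            problems.append((number, "missing: " + ", ".join(missing)))
    return problems
```

Because an open file iterates line by line, it can be passed directly, for example check_jsonl_lines(open("papers/papers_raw.jsonl", encoding="utf-8")) from inside the workspace.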

Side effects

  • Allowed: create/overwrite papers/papers_raw.jsonl; append notes to STATUS.md.

  • Not allowed: write prose sections in output/ before writing is approved.

Script

Quick Start

  • python .codex/skills/arxiv-search/scripts/run.py --help

  • Online: python .codex/skills/arxiv-search/scripts/run.py --workspace <workspace_dir> --query "<query>" --max-results 200

  • Offline import: python .codex/skills/arxiv-search/scripts/run.py --workspace <workspace_dir> --input <export.csv|json|jsonl>

All Options

  • --query <q>: repeatable; multiple queries are unioned

  • --exclude <term>: repeatable; excludes are applied after retrieval

  • --max-results <n>: cap on the total number of records retrieved

  • --input <export.*>: offline mode (CSV/JSON/JSONL)

  • --enrich-metadata: best-effort enrichment via arXiv id_list (needs network)

  • queries.md also supports: keywords, exclude, time window, max_results, enrich_metadata
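To make "unioned" and "excludes applied after retrieval" concrete, here is a rough sketch of how several result sets could be combined. It is an assumption-laden illustration, not the script's logic: the identity key (arxiv_id, then url, then lowercased title) and matching exclude terms against title plus abstract are both guesses.

```python
def union_and_exclude(result_sets, excludes=(), max_results=None):
    """Union per-query results, drop excluded records, then cap the total."""
    seen = set()
    merged = []
    for results in result_sets:
        for record in results:
            # Assumed identity key; the real script may choose differently.
            key = record.get("arxiv_id") or record.get("url") or record.get("title", "").lower()
            if key in seen:
                continue
            seen.add(key)
            merged.append(record)

    terms = [term.lower() for term in excludes]

    def is_excluded(record):
        # Assumption: exclude terms are matched case-insensitively as substrings.
        text = f"{record.get('title', '')} {record.get('abstract', '')}".lower()
        return any(term in text for term in terms)

    kept = [record for record in merged if not is_excluded(record)]
    return kept if max_results is None else kept[:max_results]
```

Filtering after the union means an exclude term removes a paper no matter which query found it.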

Examples

  • Online (multi-query + excludes):

  • python .codex/skills/arxiv-search/scripts/run.py --workspace <ws> --query "LLM agent" --query "tool use" --exclude "survey" --max-results 300

  • Fetch a single paper by arXiv ID (direct id_list fetch):

  • python .codex/skills/arxiv-search/scripts/run.py --workspace <ws> --query 2509.02547 --max-results 1

  • Offline auto-detect (no --input flag needed):

  • Place papers/import.csv (or .json/.jsonl) under the workspace, then run: python .codex/skills/arxiv-search/scripts/run.py --workspace <ws>

  • Offline import + time window (via queries.md):

  • In queries.md, set - time window: { from: 2022, to: 2025 }, then run the offline import as usual.
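A time window such as { from: 2022, to: 2025 } amounts to a filter on each record's year. The sketch below is a guess at the semantics rather than the script's code: it treats both bounds as inclusive and keeps records whose year is unknown.

```python
def within_window(record, year_from=None, year_to=None):
    """True if the record's year lies in [year_from, year_to]; None means open."""
    year = record.get("year")
    if not isinstance(year, int):
        return True  # assumption: keep records whose year is unknown
    if year_from is not None and year < year_from:
        return False
    if year_to is not None and year > year_to:
        return False
    return True


def apply_time_window(records, window):
    """Filter records with a dict shaped like {"from": 2022, "to": 2025}."""
    return [
        r for r in records
        if within_window(r, window.get("from"), window.get("to"))
    ]
```

Dropping unknown-year records instead would be a one-line change; which behavior the script uses is not stated here.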

Troubleshooting

Common Issues

Issue: papers/papers_raw.jsonl is empty

Symptom:

  • Script exits with “No results returned …” or output file is empty.

Causes:

  • Network is blocked (online mode).

  • Queries are too narrow or queries.md is empty.

Solutions:

  • Use offline import: place papers/import.csv (or .json/.jsonl) in the workspace, or pass --input <export.csv|json|jsonl>.

  • Broaden keywords and reduce excludes in queries.md.

  • Run with an explicit --query to sanity-check the parser.

Issue: Offline import records are missing fields

Symptom:

  • Downstream steps fail because records are missing authors/year/abstract/url.

Causes:

  • Export columns don’t match the expected fields, or the upstream export is incomplete.

Solutions:

  • Ensure the export contains at least title, authors, year, url, abstract.

  • If you later have network access, use --enrich-metadata to backfill missing fields (best effort).
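When export columns do not line up with the expected fields, a quick diagnostic is to map headers through an alias table and report what is still missing. The alias names below are invented for illustration and are not the set the script recognizes.

```python
import csv
import io

# Hypothetical header aliases; adjust to match the export at hand.
ALIASES = {
    "title": ("title", "paper title"),
    "authors": ("authors", "author"),
    "year": ("year", "publication year"),
    "url": ("url", "link"),
    "abstract": ("abstract", "summary"),
}


def map_export_rows(csv_text):
    """Return (rows, missing_fields) after mapping CSV headers to expected fields."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Case-insensitive lookup from lowercased header to the original header.
    headers = {(h or "").strip().lower(): h for h in (reader.fieldnames or [])}
    column_for = {
        field: next((headers[a] for a in aliases if a in headers), None)
        for field, aliases in ALIASES.items()
    }
    missing = [field for field, column in column_for.items() if column is None]
    rows = [
        {
            field: (row.get(column) or "").strip() if column else ""
            for field, column in column_for.items()
        }
        for row in reader
    ]
    return rows, missing
```

A non-empty missing list points at the columns to fix in the export, or at the fields to backfill later.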

Recovery Checklist

  • Confirm queries.md has non-empty keywords (or pass --query).

  • If offline: confirm workspace has papers/import.* and rerun.

  • Spot-check 3–5 JSONL lines: valid JSON + required fields.
