arXiv Search (metadata-first)
Collect an initial paper set with enough metadata to support downstream ranking, taxonomy building, and citation generation.
When online, prefer rich arXiv metadata (categories, arxiv_id, pdf_url, published/updated, etc.). When offline, accept an export and convert it cleanly.
Load Order
Always read:
- references/domain_pack_overview.md — how domain packs drive topic-specific behavior
Domain packs (loaded by topic match):
- assets/domain_packs/llm_agents.json — pinned IDs, query rewrite rules for LLM agent topics
Script Boundary
Use scripts/run.py only for:
- arXiv API retrieval and XML parsing
- offline export conversion (CSV/JSON/JSONL normalization)
- metadata enrichment via id_list backfill
Do not treat run.py as the place for:
- hardcoded topic detection or query rewriting (use domain packs)
- domain-specific pinned paper lists (externalize to assets/domain_packs/)
Input
- queries.md (keywords, excludes, time window)
Outputs
- papers/papers_raw.jsonl (JSONL; 1 paper per line)
  - Each record includes at least: title, authors, year, url, abstract
  - When using the arXiv API online mode, records also include helpful metadata: arxiv_id, pdf_url, categories, primary_category, published, updated, doi, journal_ref, comment
- Convenience index (optional but generated by the script):
  - papers/papers_raw.csv
Decision: online vs offline
- If you have network access: run arXiv API retrieval.
- If not: import an export the user provides (CSV/JSON/JSONL) and normalize fields.
- Hybrid: if you import offline but still have network later, you can enrich missing fields (abstract/authors/categories) via arXiv id_list using --enrich-metadata or queries.md enrich_metadata: true.
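To make the id_list backfill concrete, the sketch below builds arXiv API query URLs for batches of IDs. It is an illustration only and is not taken from run.py: the helper name build_id_list_urls and the batch size of 50 are assumptions. The endpoint and the id_list and max_results parameters are those of the public arXiv API.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"


def build_id_list_urls(arxiv_ids, batch_size=50):
    """Return one arXiv API URL per batch of IDs (hypothetical helper)."""
    # Drop empty entries so a sparse offline export does not produce bad queries.
    ids = [i.strip() for i in arxiv_ids if i and i.strip()]
    urls = []
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        params = {"id_list": ",".join(batch), "max_results": len(batch)}
        urls.append(f"{ARXIV_API}?{urlencode(params)}")
    return urls


if __name__ == "__main__":
    for url in build_id_list_urls(["2509.02547"]):
        print(url)
```

Each URL returns an Atom XML feed, so the response still needs the same XML parsing as a regular online query.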
Workflow (heuristic)
- Read queries.md and expand into concrete query strings.
- Retrieve results (online) or import an export (offline).
- Normalize every record to include at least:
  - title, authors (array), year, url, abstract
- Keep the set broad at this stage; dedupe/ranking comes next.
- Apply time window and max_results if specified.
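The normalization step above can be sketched roughly as follows. This is a minimal illustration, not the logic in run.py: the function name normalize_record, the ";" author separator, and the fallback from published to year are all assumptions about what an export might look like.

```python
def normalize_record(raw):
    """Map a loosely structured export row onto the five required fields."""
    # Case-insensitive column lookup, since export headers vary ("Title", "URL", ...).
    row = {str(k).strip().lower(): v for k, v in raw.items()}

    authors = row.get("authors") or []
    if isinstance(authors, str):
        # Assumed convention: names separated by ';'. Commas are left alone
        # because "Last, First" would otherwise be split incorrectly.
        authors = [a.strip() for a in authors.split(";") if a.strip()]

    # Fall back to the first four characters of a 'published' timestamp, if any.
    year = row.get("year") or str(row.get("published") or "")[:4]

    return {
        "title": " ".join(str(row.get("title") or "").split()),
        "authors": list(authors),
        "year": int(year) if str(year).isdigit() else None,
        "url": str(row.get("url") or "").strip(),
        "abstract": " ".join(str(row.get("abstract") or "").split()),
    }
```

Fields that cannot be recovered are left empty (or None for year) rather than guessed, so a later enrichment pass can tell what is still missing.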
Quality checklist
- papers/papers_raw.jsonl exists.
- Each line is valid JSON and contains title, authors, year, url.
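The checklist can be automated with a short script along these lines. This is a sketch under stated assumptions: check_jsonl is a hypothetical helper, "contains" is read as "the key is present", and blank lines are skipped instead of reported.

```python
import json

REQUIRED = ("title", "authors", "year", "url")


def check_jsonl(path):
    """Return a list of (line_number, problem) tuples; an empty list means OK."""
    problems = []
    with open(path, encoding="utf-8") as fh:
        for n, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # assumption: tolerate blank lines
            try:
                rec = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((n, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(rec, dict):
                problems.append((n, "not a JSON object"))
                continue
            missing = [k for k in REQUIRED if k not in rec]
            if missing:
                problems.append((n, "missing " + ", ".join(missing)))
    return problems
```

Calling check_jsonl("papers/papers_raw.jsonl") from the workspace root raises FileNotFoundError when the file does not exist, which covers the first checklist item.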
Side effects
- Allowed: create/overwrite papers/papers_raw.jsonl; append notes to STATUS.md.
- Not allowed: write prose sections in output/ before writing is approved.
Script
Quick Start
- python .codex/skills/arxiv-search/scripts/run.py --help
- Online: python .codex/skills/arxiv-search/scripts/run.py --workspace <workspace_dir> --query "<query>" --max-results 200
- Offline import: python .codex/skills/arxiv-search/scripts/run.py --workspace <workspace_dir> --input <export.csv|json|jsonl>
All Options
- --query <q>: repeatable; multiple queries are unioned
- --exclude <term>: repeatable; excludes applied after retrieval
- --max-results <n>: cap total retrieved
- --input <export.*>: offline mode (CSV/JSON/JSONL)
- --enrich-metadata: best-effort enrich via arXiv id_list (needs network)
- queries.md also supports: keywords, exclude, time window, max_results, enrich_metadata
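The "union, then exclude" behavior of --query and --exclude can be pictured with the sketch below. It is illustrative only: the function name, the choice of arxiv_id or url as the identity key, and matching exclude terms against title plus abstract are assumptions, not documented behavior of run.py.

```python
def union_and_exclude(result_sets, exclude_terms):
    """Merge per-query result lists, then drop records matching any exclude term."""
    merged = {}
    for results in result_sets:
        for rec in results:
            # Assumed identity key: arxiv_id when present, else url.
            # Records with neither get a unique placeholder key so they are kept.
            key = rec.get("arxiv_id") or rec.get("url") or f"_row{len(merged)}"
            merged.setdefault(key, rec)

    terms = [t.lower() for t in exclude_terms]
    kept = []
    for rec in merged.values():
        # Assumed match target: case-insensitive substring of title + abstract.
        text = f"{rec.get('title', '')} {rec.get('abstract', '')}".lower()
        if not any(t in text for t in terms):
            kept.append(rec)
    return kept
```

Applying excludes after the union means a paper returned by several queries is filtered once, and the order of the --query flags does not change which papers survive.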
Examples
- Online (multi-query + excludes):
  - python .codex/skills/arxiv-search/scripts/run.py --workspace <ws> --query "LLM agent" --query "tool use" --exclude "survey" --max-results 300
- Fetch a single paper by arXiv ID (direct id_list fetch):
  - python .codex/skills/arxiv-search/scripts/run.py --workspace <ws> --query 2509.02547 --max-results 1
- Offline auto-detect (no flags):
  - Place papers/import.csv (or .json/.jsonl) under the workspace, then run: python .codex/skills/arxiv-search/scripts/run.py --workspace <ws>
- Offline import + time window (via queries.md):
  - Set - time window: { from: 2022, to: 2025 } then run offline import normally
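For the time-window example, a filter along the following lines is one plausible reading. It is a sketch, not the script's implementation: the helper name is hypothetical, both bounds are assumed to be inclusive, and records without a usable year are assumed to be kept so they can be enriched later.

```python
def in_time_window(record, year_from=None, year_to=None):
    """Return True if the record's year falls inside the (inclusive) window."""
    year = record.get("year")
    if not isinstance(year, int):
        return True  # assumption: unknown years are kept, not dropped
    if year_from is not None and year < year_from:
        return False
    if year_to is not None and year > year_to:
        return False
    return True


if __name__ == "__main__":
    papers = [{"year": 2021}, {"year": 2022}, {"year": 2025}, {"year": None}]
    window = {"from": 2022, "to": 2025}
    kept = [p for p in papers if in_time_window(p, window["from"], window["to"])]
    print(kept)
```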
Troubleshooting
Common Issues
Issue: papers/papers_raw.jsonl is empty
Symptom:
- Script exits with “No results returned …” or output file is empty.
Causes:
- Network is blocked (online mode).
- Queries are too narrow or queries.md is empty.
Solutions:
- Use offline import: place papers/import.csv|json|jsonl in the workspace or pass --input.
- Broaden keywords and reduce excludes in queries.md.
- Run with explicit --query to sanity-check the parser.
Issue: Offline import records are missing fields
Symptom:
- Downstream steps fail because records are missing authors/year/abstract/url.
Causes:
- Export columns don’t match expected fields; upstream export is incomplete.
Solutions:
- Ensure the export contains at least title, authors, year, url, abstract.
- If you later have network, use --enrich-metadata to backfill missing fields (best effort).
Recovery Checklist
- Confirm queries.md has non-empty keywords (or pass --query).
- If offline: confirm workspace has papers/import.* and rerun.
- Spot-check 3–5 JSONL lines: valid JSON + required fields.