
Literature Engineer (evidence collector)

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.


Install skill "literature-engineer" with this command: npx skills add willoscar/research-units-pipeline-skills/willoscar-research-units-pipeline-skills-literature-engineer


Goal: build a large, verifiable candidate pool for downstream dedupe/rank, mapping, notes, citations, and drafting.

This skill is intentionally evidence-first: if you can't reach the target size with verifiable IDs/provenance, the correct behavior is to block and ask for more exports or for network access, not to fabricate.

Load Order

Always read:

  • references/domain_pack_overview.md — how domain packs drive topic-specific behavior

Domain packs (loaded by topic match):

  • assets/domain_packs/llm_agents.json — pinned classic/survey arXiv IDs for LLM agent topics

Script Boundary

Use scripts/run.py only for:

  • multi-route offline import, normalization, and provenance tagging

  • online arXiv/Semantic Scholar API retrieval

  • snowball expansion and deduplication

  • retrieval report generation

Do not treat run.py as the place for:

  • hardcoded pinned arXiv ID lists (use domain packs)

  • hardcoded topic detection logic (use domain packs)

Inputs

  • queries.md

  • keywords, exclude, max_results, time window

  • Optional offline sources (any combination; all are merged):

      • papers/import.(csv|json|jsonl|bib)

      • papers/arxiv_export.(csv|json|jsonl|bib)

      • papers/imports/*.(csv|json|jsonl|bib)

  • Optional snowball exports (offline):

      • papers/snowball/*.(csv|json|jsonl|bib)
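The exact layout of queries.md isn't specified on this page; a hypothetical sketch covering the fields listed above (the key names and values are illustrative, not the skill's own template) might look like:

```markdown
# queries.md (hypothetical layout; check the skill's own template)

keywords: llm agents, tool use, autonomous planning
exclude: robotics-only, vision-only
max_results: 400
time_window: 2020-2025
```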

Outputs

  • papers/papers_raw.jsonl (1 record per line); minimum fields:

      • title (str), authors (list[str]), year (int|""), url (str)

      • stable identifier(s): arxiv_id and/or doi

      • abstract (str; may be empty in offline mode)

      • source (str) + provenance (list[dict])

  • papers/papers_raw.csv (human scan)

  • papers/retrieval_report.md (route counts, missing-meta stats, next actions)
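As an illustration of the minimum record shape, here is how one papers_raw.jsonl line could be built. The paper metadata is real arXiv data used only as an example; the keys inside the provenance dict are an assumption, not a documented schema.

```python
import json

# Illustrative record carrying the minimum fields listed above.
record = {
    "title": "ReAct: Synergizing Reasoning and Acting in Language Models",
    "authors": ["Shunyu Yao", "Jeffrey Zhao"],
    "year": 2022,
    "url": "https://arxiv.org/abs/2210.03629",
    "arxiv_id": "2210.03629",
    "doi": "",
    "abstract": "",  # may be empty in offline mode
    "source": "arxiv_api",
    # Provenance keys below are hypothetical; run.py may use different ones.
    "provenance": [{"route": "online", "query": "llm agents"}],
}

# One JSONL line = one compact JSON object.
line = json.dumps(record, ensure_ascii=False)
print(line)
```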

Workflow (multi-route)

  • Offline-first merge: ingest all available offline exports (and label provenance per file).

  • Online retrieval (optional): if enabled, run arXiv API retrieval for each keyword query.

  • Snowballing (optional): expand from seed papers via references/cited-by (online), or merge offline snowball exports.

  • Normalize + dedupe: canonicalize IDs/URLs, merge duplicates while unioning provenance.

  • Report: write a concise retrieval report with coverage buckets and missing-meta counts.
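The normalize + dedupe step above can be sketched as follows. The canonical-key rules here (prefer arxiv_id, then doi, then a normalized title) are an assumption for illustration; run.py's actual logic may differ.

```python
def canonical_key(rec):
    """Prefer a stable ID; fall back to a whitespace-normalized lowercase title."""
    return (
        rec.get("arxiv_id")
        or rec.get("doi")
        or " ".join(rec.get("title", "").lower().split())
    )

def dedupe(records):
    """Merge duplicates under one canonical key, unioning provenance lists."""
    merged = {}
    for rec in records:
        key = canonical_key(rec)
        if key in merged:
            merged[key]["provenance"].extend(rec.get("provenance", []))
        else:
            merged[key] = {**rec, "provenance": list(rec.get("provenance", []))}
    return list(merged.values())

# Two routes returning the same paper collapse into one record
# whose provenance remembers both origins.
pool = dedupe([
    {"arxiv_id": "2210.03629", "title": "ReAct", "provenance": [{"route": "online"}]},
    {"arxiv_id": "2210.03629", "title": "ReAct", "provenance": [{"route": "import"}]},
])
```

URL canonicalization (e.g., stripping arXiv version suffixes before comparison) would slot into canonical_key.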

Quality checklist

  • Candidate pool size target met (A150++: ≥1200) without fabrication.

  • Each record has a stable identifier (arxiv_id or doi, plus url).

  • Each record has provenance: which route/file/API produced it.
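The checklist can be enforced mechanically. A small checker in that spirit (field names and the ≥1200 target come from the checklist above; this is a sketch, not the skill's own validation):

```python
import json
import os
import tempfile

def check_pool(path, target=1200):
    """Count records and flag those missing a stable ID or provenance."""
    missing_id = missing_prov = n = 0
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue
            rec = json.loads(line)
            n += 1
            if not (rec.get("arxiv_id") or rec.get("doi")):
                missing_id += 1
            if not rec.get("provenance"):
                missing_prov += 1
    return {"total": n, "target_met": n >= target,
            "missing_id": missing_id, "missing_provenance": missing_prov}

# Demo on a tiny sample (target lowered just to illustrate).
sample = [
    {"title": "A", "arxiv_id": "1234.5678", "provenance": [{"route": "import"}]},
    {"title": "B", "arxiv_id": "", "doi": "", "provenance": []},
]
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as tmp:
    tmp.write("\n".join(json.dumps(r) for r in sample))
report = check_pool(tmp.name, target=2)
os.unlink(tmp.name)
```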

Script

Quick Start

  • python .codex/skills/literature-engineer/scripts/run.py --help

All Options

  • See python .codex/skills/literature-engineer/scripts/run.py --help.

  • Reads retrieval config from queries.md.

  • Offline inputs (merged if present): papers/import.(csv|json|jsonl|bib), papers/arxiv_export.(csv|json|jsonl|bib), papers/imports/*.(csv|json|jsonl|bib).

  • Optional offline snowball inputs: papers/snowball/*.(csv|json|jsonl|bib).

  • Online expansion requires network: use --online and/or --snowball .

  • Online retrieval is best-effort: arXiv API can be flaky in some environments; the script will also attempt a Semantic Scholar route when needed.

  • For LLM-agent topics, the script also performs a best-effort pinned arXiv id_list fetch (canonical classics like ReAct/Toolformer/Reflexion/Voyager/Tree-of-Thoughts + a small prior-survey seed set) so ref.bib can include must-cite anchors even when keyword search misses them.

  • If HTTPS/TLS to external domains is unstable, the Semantic Scholar route is fetched via the r.jina.ai proxy so the pipeline can still self-boot without manual exports.

  • When an online run returns 0 records due to transient network errors, a simple rerun is often sufficient (the pipeline should not fabricate).

Examples

Offline imports only:

  • Put exports under papers/imports/ then run:

  • python .codex/skills/literature-engineer/scripts/run.py --workspace <ws>

Explicit offline inputs (multi-route):

  • python .codex/skills/literature-engineer/scripts/run.py --workspace <ws> --input path/to/a.bib --input path/to/b.jsonl

Online arXiv retrieval (needs network):

  • python .codex/skills/literature-engineer/scripts/run.py --workspace <ws> --online

Snowballing (needs network unless you provide offline snowball exports):

  • python .codex/skills/literature-engineer/scripts/run.py --workspace <ws> --snowball

Troubleshooting

Issue: can't reach ≥1200 papers

Symptom:

  • papers/papers_raw.jsonl holds far fewer records than the target; later stages will then fail on mapping/bindings and on citation density.

Causes:

  • Only a small offline export was provided.

  • Network is blocked so online retrieval/snowballing can't run.

Solutions:

  • Provide additional exports under papers/imports/ (multiple routes/queries).

  • Provide snowball exports under papers/snowball/.

  • Enable network and rerun with --online --snowball .

Issue: many records missing stable IDs

Symptom:

  • Report shows many entries with empty arxiv_id and doi.

Solutions:

  • Prefer arXiv/OpenReview/ACL exports that include stable IDs.

  • If you have network, rerun with --online to backfill arXiv IDs.

  • Filter out ID-less entries before downstream citation generation.
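The last bullet can be sketched as a small in-memory filter over JSONL lines (a sketch; field names follow the output schema above):

```python
import json

def filter_stable_ids(lines):
    """Keep only JSONL lines whose record has an arxiv_id or doi."""
    kept = []
    for line in lines:
        rec = json.loads(line)
        if rec.get("arxiv_id") or rec.get("doi"):
            kept.append(line)
    return kept

# The second record has no stable ID and is dropped.
lines = [
    json.dumps({"title": "A", "arxiv_id": "1234.5678"}),
    json.dumps({"title": "B", "arxiv_id": "", "doi": ""}),
]
kept = filter_stable_ids(lines)
```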

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
