agent-survey-corpus

Agent Survey Corpus (arXiv PDFs → text extracts)

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.


Install skill "agent-survey-corpus" with this command: npx skills add willoscar/research-units-pipeline-skills/willoscar-research-units-pipeline-skills-agent-survey-corpus


Goal: create a small, local reference library so you can learn from real agent surveys when refining:

  • C2 outline structure (paper-like sectioning)

  • C4 tables/claims organization

  • C5 writing style and density

This is intentionally not part of the pipeline; it is an optional, repo-level toolkit.

Inputs

  • ref/agent-surveys/arxiv_ids.txt
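The input file holds one arXiv id per line. Below is a minimal sketch of how such a file could be sanity-checked before running the downloader. It assumes new-style ids shaped like YYMM.NNNNN with an optional version suffix; the helper name `read_arxiv_ids` and the handling of blank lines and `#` comments are illustrative assumptions, not documented behavior of run.py.

```python
import re
from pathlib import Path

# New-style arXiv identifiers: YYMM.NNNN or YYMM.NNNNN, optionally followed by vN.
ARXIV_ID = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")


def read_arxiv_ids(path):
    """Return (valid_ids, rejected_lines) from a one-id-per-line file.

    Assumption: blank lines and lines starting with '#' are skipped here;
    the real script may treat them differently.
    """
    valid, rejected = [], []
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if ARXIV_ID.match(line):
            if line not in valid:  # drop duplicates, keep the first occurrence
                valid.append(line)
        else:
            rejected.append(line)
    return valid, rejected
```

Old-style ids (those containing a slash) are rejected by this pattern, so extend it if your list includes older papers.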

Outputs

  • ref/agent-surveys/pdfs/

  • ref/agent-surveys/text/

  • ref/agent-surveys/STYLE_REPORT.md (tracked; auto-generated summary)

Workflow

  • Edit ref/agent-surveys/arxiv_ids.txt (one arXiv id per line).

  • Run the downloader to fetch PDFs and extract the first N pages to text.

  • Skim the extracted text under ref/agent-surveys/text/:

      • Look at section counts (H2), subsection granularity (H3), and how the authors transition between sections.

      • Identify repeated rhetorical patterns you want the pipeline writer to imitate.
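The skim in the last step can be partly automated. The rough sketch below assumes the extracted text keeps numbered headings on their own lines (for example "3 Planning" for a section and "3.1 Task Decomposition" for a subsection). PDF extraction often breaks this layout, so treat the counts as a hint. `count_headings` is a hypothetical helper, not part of the skill's script.

```python
import re

# Assumed heading shapes in extracted text:
#   "3 Planning"              -> section (H2-like)
#   "3.1 Task Decomposition"  -> subsection (H3-like)
SECTION = re.compile(r"^\d{1,2}\.?\s+[A-Z][^\n]{2,80}$")
SUBSECTION = re.compile(r"^\d{1,2}\.\d{1,2}\.?\s+[A-Z][^\n]{2,80}$")


def count_headings(text):
    """Collect lines that look like numbered section or subsection headings."""
    sections, subsections = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if SUBSECTION.match(line):
            subsections.append(line)
        elif SECTION.match(line):
            sections.append(line)
    return {"sections": sections, "subsections": subsections}
```

Comparing `len(result["sections"])` and `len(result["subsections"])` across several surveys gives a quick feel for typical outline depth before you read any of them closely.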

Script

Quick Start

  • python .codex/skills/agent-survey-corpus/scripts/run.py --help

  • python .codex/skills/agent-survey-corpus/scripts/run.py --workspace . --max-pages 20

All Options

  • --workspace <dir> (use . to write into repo root)

  • --inputs <semicolon-separated> (default: ref/agent-surveys/arxiv_ids.txt)

  • --max-pages <N> (default: 20)

  • --sleep <seconds> (default: 1.0)

  • --overwrite (re-download + re-extract)
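For orientation, the documented flags map onto a command-line parser roughly as follows. This is an illustrative `argparse` sketch built only from the option list above, not the actual contents of run.py; the names `build_parser` and `split_inputs`, the help strings, and the choice to make `--workspace` required are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented options and defaults."""
    p = argparse.ArgumentParser(description="Fetch arXiv survey PDFs and extract text.")
    # Assumption: no default is documented for --workspace, so this sketch requires it.
    p.add_argument("--workspace", required=True, help="workspace root; use . for the repo root")
    p.add_argument("--inputs", default="ref/agent-surveys/arxiv_ids.txt",
                   help="semicolon-separated list of id files")
    p.add_argument("--max-pages", type=int, default=20, help="pages to extract per PDF")
    p.add_argument("--sleep", type=float, default=1.0, help="seconds to wait between downloads")
    p.add_argument("--overwrite", action="store_true", help="re-download and re-extract")
    return p


def split_inputs(value):
    """Split a semicolon-separated --inputs value into clean paths."""
    return [part.strip() for part in value.split(";") if part.strip()]
```

With this shape, `build_parser().parse_args(["--workspace", "."])` yields the documented defaults for every other option.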

Examples

  • Download/extract into repo root ref/:

      • python .codex/skills/agent-survey-corpus/scripts/run.py --workspace . --max-pages 20

  • Download/extract into a specific folder (treated as workspace root):

      • python .codex/skills/agent-survey-corpus/scripts/run.py --workspace /tmp/surveys --max-pages 30

Troubleshooting

  • Download fails / timeout: rerun with a larger --sleep, or try fewer ids.

  • Text extract is empty: the PDF may be scanned; try another survey or increase --max-pages.

  • Files showing up in git status: PDFs/text are ignored via .gitignore (ref//pdfs/, ref//text/).
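To spot empty or near-empty extracts quickly, a small check like the following can help. It is a sketch that assumes extracts are stored as .txt files directly under ref/agent-surveys/text/; the file extension, the `find_thin_extracts` name, and the 200-character threshold are all assumptions.

```python
from pathlib import Path


def find_thin_extracts(text_dir, min_chars=200):
    """Return names of .txt files with fewer than min_chars non-whitespace characters."""
    thin = []
    for path in sorted(Path(text_dir).glob("*.txt")):
        content = path.read_text(encoding="utf-8", errors="replace")
        # Ignore whitespace so files made of blank lines still count as empty.
        if len("".join(content.split())) < min_chars:
            thin.append(path.name)
    return thin
```

Files reported by `find_thin_extracts("ref/agent-surveys/text")` are candidates for the "scanned PDF" case above.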

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
