Agent Survey Corpus (arXiv PDFs → text extracts)
Goal: create a small, local reference library so you can learn from real agent surveys when refining:
- C2 outline structure (paper-like sectioning)
- C4 tables/claims organization
- C5 writing style and density
This is intentionally not part of the pipeline; it is an optional, repo-level toolkit.
Inputs
- ref/agent-surveys/arxiv_ids.txt
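The input file holds one arXiv id per line. A minimal sketch of how such a file could be read is shown here; it is not the actual logic in run.py. The function name read_arxiv_ids, the treatment of '#' as a comment marker, and the de-duplication step are all assumptions made for illustration.

```python
from pathlib import Path


def read_arxiv_ids(path):
    """Return arXiv ids from a one-id-per-line file, in order, without duplicates."""
    ids = []
    seen = set()
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        # Assumption: anything after '#' is a comment; run.py may not support this.
        line = raw.split("#", 1)[0].strip()
        if not line or line in seen:
            continue  # skip blank lines and repeated ids
        seen.add(line)
        ids.append(line)
    return ids
```

Calling read_arxiv_ids("ref/agent-surveys/arxiv_ids.txt") from the repo root would return the ids as a list of strings, which is a convenient way to sanity-check the file before running the downloader.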
Outputs
- ref/agent-surveys/pdfs/
- ref/agent-surveys/text/
- ref/agent-surveys/STYLE_REPORT.md (tracked; auto-generated summary)
Workflow
- Edit ref/agent-surveys/arxiv_ids.txt (one arXiv id per line).
- Run the downloader to fetch PDFs and extract the first N pages to text.
- Skim the extracted text under ref/agent-surveys/text/:
  - look at section counts (H2), subsection granularity (H3), and how each survey transitions between sections.
  - identify repeated rhetorical patterns you want the pipeline writer to imitate.
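The skim step can be partly automated. The sketch below counts heading-like lines in an extracted text by depth, treating depth 1 as roughly H2 (sections) and depth 2 as roughly H3 (subsections). It assumes headings survive extraction as numbered lines such as "2 Background" or "2.1 Tool use"; that assumption, the HEADING pattern, and the heading_counts helper are all hypothetical, and the heuristic will miss unnumbered headings and occasionally match ordinary lines that start with a number.

```python
import re
from collections import Counter

# Assumption: headings look like "2 Background", "3. Methods", or "2.1 Tool use".
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+([A-Z][^\n]{0,80})$")


def heading_counts(text):
    """Count heading-like lines by depth: 1 = section, 2 = subsection, and so on."""
    counts = Counter()
    for line in text.splitlines():
        match = HEADING.match(line.strip())
        if match:
            # "2" has depth 1, "2.1" has depth 2, "2.1.3" has depth 3.
            depth = match.group(1).count(".") + 1
            counts[depth] += 1
    return dict(counts)
```

Running heading_counts over each file under ref/agent-surveys/text/ gives a rough per-survey picture of section and subsection granularity to compare against your own outline.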
Script
Quick Start
- python .codex/skills/agent-survey-corpus/scripts/run.py --help
- python .codex/skills/agent-survey-corpus/scripts/run.py --workspace . --max-pages 20
All Options
- --workspace <dir> (use . to write into repo root)
- --inputs <semicolon-separated> (default: ref/agent-surveys/arxiv_ids.txt)
- --max-pages <N> (default: 20)
- --sleep <seconds> (default: 1.0)
- --overwrite (re-download + re-extract)
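For readers who want to see how these flags fit together, here is a hedged argparse sketch that mirrors the documented options and defaults. It is an illustration only, not the parser in run.py: build_parser and split_inputs are hypothetical names, and the --workspace default is left unset because it is not documented above.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Download arXiv surveys and extract the first pages to text."
    )
    # The real default for --workspace is not documented here, so none is set.
    parser.add_argument("--workspace", help="workspace root; use . for the repo root")
    parser.add_argument(
        "--inputs",
        default="ref/agent-surveys/arxiv_ids.txt",
        help="semicolon-separated list of id files",
    )
    parser.add_argument("--max-pages", type=int, default=20)
    parser.add_argument("--sleep", type=float, default=1.0)
    parser.add_argument(
        "--overwrite", action="store_true", help="re-download and re-extract"
    )
    return parser


def split_inputs(value):
    """Split the semicolon-separated --inputs value, dropping empty entries."""
    return [part.strip() for part in value.split(";") if part.strip()]
```

With this sketch, build_parser().parse_args(["--workspace", "."]) yields max_pages of 20, sleep of 1.0, and overwrite set to False, matching the defaults listed above.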
Examples
- Download/extract into repo root ref/:
  - python .codex/skills/agent-survey-corpus/scripts/run.py --workspace . --max-pages 20
- Download/extract into a specific folder (treated as workspace root):
  - python .codex/skills/agent-survey-corpus/scripts/run.py --workspace /tmp/surveys --max-pages 30
Troubleshooting
- Download fails / timeout: rerun with a larger --sleep, or try fewer ids.
- Text extract is empty: the PDF may be scanned; try another survey or increase --max-pages.
- Files showing up in git status: PDFs/text should be ignored via .gitignore (ref//pdfs/, ref//text/); if they appear, check that those patterns are present.
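To spot empty or near-empty extracts without opening each file, a small check like the following may help. It assumes extracts are stored as .txt files directly under the text directory; the empty_extracts helper and the 200-character threshold are hypothetical choices, not part of the toolkit.

```python
from pathlib import Path


def empty_extracts(text_dir, min_chars=200):
    """Return names of extract files whose stripped content is shorter than min_chars."""
    suspicious = []
    # Assumption: extracts are .txt files directly under text_dir.
    for path in sorted(Path(text_dir).glob("*.txt")):
        content = path.read_text(encoding="utf-8", errors="replace").strip()
        if len(content) < min_chars:
            suspicious.append(path.name)
    return suspicious
```

Calling empty_extracts("ref/agent-surveys/text") lists candidates that are likely scanned PDFs or failed extractions and worth replacing with another survey.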