Taxonomy Builder (router, compatibility mode)
Build outline/taxonomy.yml from papers/core_set.csv .
P0 compatibility note:
-
The output contract stays the same (outline/taxonomy.yml , YAML list, >=2 levels, concrete descriptions).
-
Curated domain taxonomies now live in assets/domain_packs/*.yaml instead of Python prose.
-
scripts/run.py stays a deterministic scaffold/helper: detect domain pack -> load pack when available -> otherwise fall back to the generic builder.
Load Order
-
references/overview.md
-
references/taxonomy_principles.md
-
If a domain pack applies, read its references/domain_pack_<domain>.md and assets/domain_packs/<domain>.yaml
-
Otherwise read references/archetypes_generic.md
-
Calibrate naming/description quality with references/examples_good.md and references/examples_bad.md
Current compatibility packs:
-
llm_agents
-
gen_image
-
embodied_ai
Inputs
-
papers/core_set.csv (required)
-
Optional: papers/papers_dedup.jsonl
-
Optional: DECISIONS.md , GOAL.md , queries.md
Outputs
- outline/taxonomy.yml
Asset contract
-
assets/taxonomy_schema.json : machine-readable shape for domain packs / output expectations
-
assets/domain_packs/*.yaml : compatibility domain packs for supported domains
Script role
Use scripts/run.py only for deterministic help:
-
never overwrite non-placeholder user taxonomy
-
preserve current CLI flags / output path
-
load supported domain taxonomies from assets instead of hard-coded Python prose
-
keep the generic fallback builder for non-packed domains
When to refine manually
Refine the generated taxonomy before marking the unit DONE if:
-
top-level buckets feel like keyword clusters instead of chapter-level questions
-
leaf names are generic (Overview , Benchmarks , Open Problems , Misc )
-
descriptions lack scope cues or representative paper anchors
-
domain detection chose the wrong pack
Quick start
-
python .codex/skills/taxonomy-builder/scripts/run.py --help
-
python .codex/skills/taxonomy-builder/scripts/run.py --workspace <workspace_dir>
Execution notes
When running in compatibility mode, scripts/run.py currently reads:
-
papers/core_set.csv as the required corpus input
-
papers/papers_dedup.jsonl when present for extra title/abstract signals
-
GOAL.md , queries.md , and DECISIONS.md as optional domain/profile hints during pack selection
Script
Quick Start
- python .codex/skills/taxonomy-builder/scripts/run.py --workspace <workspace_dir>
All Options
-
--workspace <dir>
-
--top-k <int>
-
--min-freq <int>
-
--unit-id <id>
-
--inputs <a;b;...>
-
--outputs <a;b;...>
-
--checkpoint <C*>
Examples
- python .codex/skills/taxonomy-builder/scripts/run.py --workspace workspaces/<ws>
Troubleshooting
-
If the wrong domain pack is chosen, inspect GOAL.md , queries.md , and the pack detect rules before changing Python.
-
If outline/taxonomy.yml already contains a real non-placeholder taxonomy, the script intentionally returns without overwriting it.
-
If no pack matches, the script falls back to the generic builder.