dedupe-rank

Turn a broad retrieved set into a smaller core set for taxonomy/outline building.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "dedupe-rank" with this command: npx skills add willoscar/research-units-pipeline-skills/willoscar-research-units-pipeline-skills-dedupe-rank

Dedupe + Rank

This is a deterministic “curation” step: it should be stable and repeatable.

Load Order

Always read:

  • references/domain_pack_overview.md — how domain packs drive topic-specific behavior

Domain packs (loaded by topic match):

  • assets/domain_packs/llm_agents.json — pinned classics, survey detection, ranking signals for LLM agent topics

Script Boundary

Use scripts/run.py only for:

  • title normalization and deduplication logic

  • relevance scoring from query tokens

  • core set CSV generation with stable paper_id values

Do not treat run.py as the place for:

  • hardcoded pinned paper IDs (use domain packs)

  • hardcoded survey detection rules (use domain packs)

  • domain-specific topic detection logic (use domain packs)
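
The two generic responsibilities listed above (relevance scoring from query tokens and stable paper_id values) can be sketched as follows. This is a minimal illustration, not the actual contents of scripts/run.py: the function names, the 2:1 title/abstract weighting, and the "P" + SHA-1 prefix ID format are all assumptions made for the example.

```python
import hashlib
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def relevance_score(query: str, title: str, abstract: str = "") -> float:
    """Score in [0, 1]: share of query tokens found, title hits weighted double.

    The weighting is an assumption for illustration only.
    """
    query_tokens = tokenize(query)
    if not query_tokens:
        return 0.0
    title_hits = len(query_tokens & tokenize(title))
    abstract_hits = len(query_tokens & tokenize(abstract))
    return (2 * title_hits + abstract_hits) / (3 * len(query_tokens))


def stable_paper_id(title: str, year) -> str:
    """Derive an ID from normalized title + year so reruns give the same value.

    The "P" + 10 hex chars format is hypothetical, not the upstream format.
    """
    normalized = " ".join(re.findall(r"[a-z0-9]+", title.lower()))
    digest = hashlib.sha1(f"{normalized}|{year}".encode("utf-8")).hexdigest()
    return "P" + digest[:10]
```

Because the ID depends only on the normalized title and year, it does not change with input order or between runs, which is the property the core set CSV needs.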

Input

  • papers/papers_raw.jsonl

Outputs

  • papers/papers_dedup.jsonl

  • papers/core_set.csv

Workflow (high level)

  • Dedupe by normalized (title, year) and keep the richest metadata per duplicate cluster.

  • Rank by relevance/recency signals (and optionally pin known classics for certain topics). For LLM-agent topics, also ensure a small quota of prior surveys/reviews is present to support a paper-like Related Work section.

  • Write papers/core_set.csv with stable paper_id values and useful metadata columns (arxiv_id, pdf_url, categories).
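
The first workflow step can be sketched as below. It is a hedged illustration, assuming each record is a dict with title and year keys and that "richest metadata" means "most non-empty fields". The helper names are hypothetical, and the real script may normalize titles differently.

```python
import re
import unicodedata


def normalize_title(title: str) -> str:
    """Fold Unicode forms, lowercase, and keep only ASCII alphanumeric tokens."""
    text = unicodedata.normalize("NFKC", title).lower()
    return " ".join(re.findall(r"[a-z0-9]+", text))


def richness(record: dict) -> int:
    """Count non-empty fields; used to pick the best record in a cluster."""
    return sum(1 for value in record.values() if value not in (None, "", [], {}))


def dedupe(records: list[dict]) -> list[dict]:
    """Keep one record per normalized (title, year) key.

    Ties keep the first record seen, so output is stable for a fixed input order.
    """
    best: dict[tuple[str, str], dict] = {}
    for record in records:
        key = (normalize_title(record.get("title") or ""), str(record.get("year", "")))
        current = best.get(key)
        if current is None or richness(record) > richness(current):
            best[key] = record
    return list(best.values())
```

For example, "Foo: Bar" (2023) and "foo bar" (2023) collapse into one cluster, and the record carrying more populated fields survives.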

Quality checklist

  • papers/papers_dedup.jsonl exists and is valid JSONL.

  • papers/core_set.csv exists and has a header row.
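
The checklist above can be automated with a small standalone check. This is a sketch under stated assumptions: check_outputs is a hypothetical helper (not part of the skill), and it assumes the CSV header includes a paper_id column, which the description implies but does not spell out.

```python
import csv
import json
from pathlib import Path


def check_outputs(workspace: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means both checks pass."""
    problems: list[str] = []
    dedup = Path(workspace) / "papers" / "papers_dedup.jsonl"
    core = Path(workspace) / "papers" / "core_set.csv"

    if not dedup.is_file():
        problems.append(f"missing file: {dedup}")
    else:
        with dedup.open(encoding="utf-8") as fh:
            for lineno, line in enumerate(fh, 1):
                if not line.strip():
                    continue  # tolerate blank lines
                try:
                    json.loads(line)
                except json.JSONDecodeError as exc:
                    problems.append(f"{dedup}:{lineno}: invalid JSON ({exc.msg})")

    if not core.is_file():
        problems.append(f"missing file: {core}")
    else:
        with core.open(newline="", encoding="utf-8") as fh:
            header = next(csv.reader(fh), None)
        # Assumption: a valid header row contains a paper_id column.
        if not header or "paper_id" not in header:
            problems.append(f"{core}: header row missing or lacks paper_id")

    return problems
```

Calling check_outputs("<workspace_dir>") after a run and printing the result gives a quick pass/fail view of both outputs.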

Script

Quick Start

  • python .codex/skills/dedupe-rank/scripts/run.py --help

  • python .codex/skills/dedupe-rank/scripts/run.py --workspace <workspace_dir> --core-size 300

All Options

  • --core-size <n> : target size for papers/core_set.csv

  • queries.md may also set core_size, core_set_size, or dedupe_core_size; when present, the value overrides the default core size.

Examples

  • Smaller core set for fast iteration (non-A150++): python .codex/skills/dedupe-rank/scripts/run.py --workspace <ws> --core-size 25

Notes

  • This step may annotate papers/core_set.csv:reason with tags such as pinned_classic and prior_survey (deterministic, topic-aware guards for survey writing).

  • Systematic-review default: if the active pipeline is systematic-review and core_size is not specified, the script keeps the full deduped pool in papers/core_set.csv (so screening does not silently drop candidates).

  • This step is deterministic; reruns should be stable for the same inputs.
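
One way to get the behavior described in these notes is a total sort order with an explicit tie-breaker. The sketch below assumes each paper is a dict with paper_id, score, and year keys, and that pinned IDs come from a domain pack. select_core and its parameters are hypothetical names, and the empty reason for non-pinned rows is an illustration only.

```python
def select_core(papers: list[dict], core_size=None, pinned_ids=frozenset()) -> list[dict]:
    """Rank papers deterministically and keep the top core_size rows.

    core_size=None keeps the full pool, mirroring the systematic-review default.
    """
    def sort_key(paper: dict):
        # Pinned first, then higher score, then newer year; paper_id breaks ties
        # so the order never depends on input order.
        return (
            paper["paper_id"] not in pinned_ids,
            -paper["score"],
            -paper["year"],
            paper["paper_id"],
        )

    ranked = sorted(papers, key=sort_key)
    if core_size is not None:
        ranked = ranked[:core_size]
    return [
        {**paper, "reason": "pinned_classic" if paper["paper_id"] in pinned_ids else ""}
        for paper in ranked
    ]
```

Because every comparison ends on paper_id, two papers never compare equal, so shuffling the input cannot change the output.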

Troubleshooting

Common Issues

Issue: papers/core_set.csv is too small / empty

Symptom:

  • Core set has very few rows, or none.

Causes:

  • Input papers/papers_raw.jsonl is small, or many rows are missing required fields.

Solutions:

  • Broaden retrieval (or provide a richer offline export) and rerun.

  • Lower --core-size only if you intentionally want a small core set.

Issue: Duplicates still appear after dedupe

Symptom:

  • Near-identical titles remain.

Causes:

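  • Noisy exports (stray prefixes/suffixes, encoding errors) defeat title normalization, so near-identical titles end up with different normalized forms.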

Solutions:

  • Clean title fields in the export (strip prefixes/suffixes, fix encoding) and rerun.
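
A pre-cleaning pass over the export might look like the sketch below. Which prefixes and suffixes count as noise depends on the export, so the patterns here (bracketed tags, arXiv IDs, a trailing "(preprint)" marker or period) are assumptions chosen for illustration, and clean_title is a hypothetical helper.

```python
import html
import re
import unicodedata

# Assumed noise patterns; adjust to whatever the export actually adds.
_PREFIX = re.compile(
    r"^\s*(?:\[[^\]]*\]|arxiv:\s*\d{4}\.\d{4,5}(?:v\d+)?)\s*[:\-]?\s*",
    re.IGNORECASE,
)
_SUFFIX = re.compile(r"\s*(?:\(preprint\)|\[pdf\]|\.)\s*$", re.IGNORECASE)


def clean_title(raw: str) -> str:
    """Decode HTML entities, fold Unicode forms, strip noise, collapse whitespace."""
    text = html.unescape(raw)
    text = unicodedata.normalize("NFKC", text)
    text = _PREFIX.sub("", text)
    text = _SUFFIX.sub("", text)
    return re.sub(r"\s+", " ", text).strip()
```

Applying clean_title to every title in papers/papers_raw.jsonl before rerunning gives the dedupe step a better chance of clustering near-identical titles.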

Recovery Checklist

  • papers/papers_raw.jsonl lines contain title, year, and url.

  • papers/core_set.csv has stable paper_id values.
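
The first recovery check can be scripted as a quick field audit of the raw input. This is a minimal sketch: missing_field_report is a hypothetical helper, and it treats null or empty-string values as missing, which is an assumption about what the script requires.

```python
import json
from collections import Counter

REQUIRED_FIELDS = ("title", "year", "url")


def missing_field_report(lines) -> Counter:
    """Count invalid rows and rows missing each required field.

    `lines` is any iterable of strings, such as an open JSONL file handle.
    """
    counts: Counter = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            counts["invalid_json"] += 1
            continue
        if not isinstance(row, dict):
            counts["invalid_json"] += 1
            continue
        for field in REQUIRED_FIELDS:
            if row.get(field) in (None, ""):
                counts[f"missing_{field}"] += 1
    return counts
```

Usage: open papers/papers_raw.jsonl with encoding="utf-8" and pass the file handle to missing_field_report. A high missing_* count points to the "rows are missing required fields" cause listed above.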

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
