ref-hallucination-arena

Benchmark LLM reference recommendation capabilities by verifying every cited paper against Crossref, PubMed, arXiv, and DBLP. Measures hallucination rate, per-field accuracy (title/author/year/DOI), discipline breakdown, and year constraint compliance. Supports tool-augmented (ReAct + web search) mode. Use when the user asks to evaluate, benchmark, or compare models on academic reference hallucination, literature recommendation quality, or citation accuracy.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Installation

Install the skill with:

npx skills add agentscope-ai/openjudge/agentscope-ai-openjudge-ref-hallucination-arena

Reference Hallucination Arena Skill

Evaluate how accurately LLMs recommend real academic references using the OpenJudge RefArenaPipeline:

  1. Load queries — from JSON/JSONL dataset
  2. Collect responses — BibTeX-formatted references from target models
  3. Extract references — parse BibTeX entries from model output
  4. Verify references — cross-check against Crossref / PubMed / arXiv / DBLP
  5. Score & rank — compute verification rate, per-field accuracy, discipline breakdown
  6. Generate report — Markdown report + visualization charts
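
Step 3 (reference extraction) can be illustrated with a minimal BibTeX parser. This is a simplified sketch, not the pipeline's actual parser — the real extractor presumably handles nested braces, quoted values, and malformed entries:

```python
import re

def extract_bibtex_entries(text: str) -> list[dict]:
    """Naive extraction of BibTeX entries from model output (illustrative only)."""
    entries = []
    # Match @type{key, ...} blocks; allows one level of braces inside fields.
    for match in re.finditer(r"@(\w+)\s*\{\s*([^,]+),((?:[^{}]|\{[^{}]*\})*)\}", text):
        entry = {"type": match.group(1).lower(), "key": match.group(2).strip()}
        # Pull out field = {value} pairs from the entry body.
        for field, value in re.findall(r"(\w+)\s*=\s*\{([^{}]*)\}", match.group(3)):
            entry[field.lower()] = value.strip()
        entries.append(entry)
    return entries

sample = """@article{vaswani2017,
  title = {Attention Is All You Need},
  author = {Vaswani, Ashish and others},
  year = {2017}
}"""
print(extract_bibtex_entries(sample))
```

The extracted title/author/year/DOI fields are what the verification step (step 4) cross-checks against the bibliographic databases.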

Prerequisites

# Install OpenJudge
pip install py-openjudge

# Extra dependency for ref_hallucination_arena (chart generation)
pip install matplotlib

Gather from user before running

| Info | Required? | Notes |
| --- | --- | --- |
| Config YAML path | Yes | Defines endpoints, dataset, verification settings |
| Dataset path | Yes | JSON/JSONL file with queries (can be set in config) |
| API keys | Yes | Env vars: OPENAI_API_KEY, DASHSCOPE_API_KEY, etc. |
| Crossref email | No | Improves API rate limits for verification |
| PubMed API key | No | Improves PubMed rate limits |
| Output directory | No | Default: ./evaluation_results/ref_hallucination_arena |
| Report language | No | "zh" (default) or "en" |
| Tavily API key | No | Required only if using tool-augmented mode |

Quick start

CLI

# Run evaluation with config file
python -m cookbooks.ref_hallucination_arena --config config.yaml --save

# Resume from checkpoint (default behavior)
python -m cookbooks.ref_hallucination_arena --config config.yaml --save

# Start fresh, ignore checkpoint
python -m cookbooks.ref_hallucination_arena --config config.yaml --fresh --save

# Override output directory
python -m cookbooks.ref_hallucination_arena --config config.yaml \
  --output_dir ./my_results --save

Python API

import asyncio
from cookbooks.ref_hallucination_arena.pipeline import RefArenaPipeline

async def main():
    pipeline = RefArenaPipeline.from_config("config.yaml")
    result = await pipeline.evaluate()

    for rank, (model, score) in enumerate(result.rankings, 1):
        print(f"{rank}. {model}: {score:.1%}")

asyncio.run(main())

CLI options

| Flag | Default | Description |
| --- | --- | --- |
| --config | — | Path to YAML configuration file (required) |
| --output_dir | config value | Override output directory |
| --save | False | Save results to file |
| --fresh | False | Start fresh, ignore checkpoint |

Minimal config file

task:
  description: "Evaluate LLM reference recommendation capabilities"

dataset:
  path: "./data/queries.json"

target_endpoints:
  model_a:
    base_url: "https://api.openai.com/v1"
    api_key: "${OPENAI_API_KEY}"
    model: "gpt-4"
    system_prompt: "You are an academic literature recommendation expert. Recommend {num_refs} real papers in BibTeX format. Only recommend papers you are confident actually exist."

  model_b:
    base_url: "https://dashscope.aliyuncs.com/compatible-mode/v1"
    api_key: "${DASHSCOPE_API_KEY}"
    model: "qwen3-max"
    system_prompt: "You are an academic literature recommendation expert. Recommend {num_refs} real papers in BibTeX format. Only recommend papers you are confident actually exist."
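
The `${ENV_VAR}` syntax in `api_key` is resolved from the environment at load time. A minimal sketch of that substitution pattern — the function below is illustrative, and OpenJudge's actual config loader may behave differently (e.g. around unset variables):

```python
import os
import re

def expand_env_vars(value: str) -> str:
    """Replace ${VAR} placeholders with environment variable values."""
    def repl(match: re.Match) -> str:
        var = match.group(1)
        if var not in os.environ:
            raise KeyError(f"Environment variable {var} is not set")
        return os.environ[var]
    return re.sub(r"\$\{(\w+)\}", repl, value)

os.environ["OPENAI_API_KEY"] = "sk-demo"  # for illustration only
print(expand_env_vars("${OPENAI_API_KEY}"))  # sk-demo
```

Keeping keys in environment variables rather than in the YAML file avoids committing secrets alongside the config.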

Full config reference

task

| Field | Required | Description |
| --- | --- | --- |
| description | Yes | Evaluation task description |
| scenario | No | Usage scenario |

dataset

| Field | Default | Description |
| --- | --- | --- |
| path | — | Path to JSON/JSONL dataset file (required) |
| shuffle | false | Shuffle queries before evaluation |
| max_queries | null | Max queries to use (null = all) |

target_endpoints.<name>

| Field | Default | Description |
| --- | --- | --- |
| base_url | — | API base URL (required) |
| api_key | — | API key; supports ${ENV_VAR} (required) |
| model | — | Model name (required) |
| system_prompt | built-in | System prompt; use {num_refs} placeholder |
| max_concurrency | 5 | Max concurrent requests for this endpoint |
| extra_params | — | Extra API request params (e.g. temperature) |
| tool_config.enabled | false | Enable ReAct agent with Tavily web search |
| tool_config.tavily_api_key | env var | Tavily API key |
| tool_config.max_iterations | 10 | Max ReAct iterations (1–30) |
| tool_config.search_depth | "advanced" | "basic" or "advanced" |

verification

| Field | Default | Description |
| --- | --- | --- |
| crossref_mailto | — | Email for Crossref polite pool |
| pubmed_api_key | — | PubMed API key |
| max_workers | 10 | Concurrent verification threads (1–50) |
| timeout | 30 | Per-request timeout in seconds |
| verified_threshold | 0.7 | Min composite score to count as VERIFIED |

evaluation

| Field | Default | Description |
| --- | --- | --- |
| timeout | 120 | Model API request timeout in seconds |
| retry_times | 3 | Number of retry attempts |

output

| Field | Default | Description |
| --- | --- | --- |
| output_dir | ./evaluation_results/ref_hallucination_arena | Output directory |
| save_queries | true | Save loaded queries |
| save_responses | true | Save model responses |
| save_details | true | Save verification details |

report

| Field | Default | Description |
| --- | --- | --- |
| enabled | true | Enable report generation |
| language | "zh" | Report language: "zh" or "en" |
| include_examples | 3 | Examples per section (1–10) |
| chart.enabled | true | Generate charts |
| chart.orientation | "vertical" | "horizontal" or "vertical" |
| chart.show_values | true | Show values on bars |
| chart.highlight_best | true | Highlight best model |

Dataset format

Each query in the JSON/JSONL dataset:

{
  "query": "Please recommend papers on Transformer architectures for NLP.",
  "discipline": "computer_science",
  "num_refs": 5,
  "language": "en",
  "year_constraint": {"min_year": 2020}
}

| Field | Required | Description |
| --- | --- | --- |
| query | Yes | Prompt for reference recommendation |
| discipline | No | computer_science, biomedical, physics, chemistry, social_science, interdisciplinary, other |
| num_refs | No | Expected number of references (default: 5) |
| language | No | "zh" or "en" (default: "zh") |
| year_constraint | No | {"exact": 2023}, {"min_year": 2020}, {"max_year": 2015}, or {"min_year": 2020, "max_year": 2024} |
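
Loading this format amounts to parsing either a JSON array or one JSON object per line, then filling in defaults. A sketch under the assumption that the defaults above (`num_refs: 5`, `language: "zh"`) are applied per record — the actual loader may validate more fields:

```python
import json

REQUIRED = {"query"}
DEFAULTS = {"num_refs": 5, "language": "zh"}

def parse_queries(text: str) -> list[dict]:
    """Parse a JSON array or JSONL payload of queries, applying defaults."""
    text = text.strip()
    if text.startswith("["):
        records = json.loads(text)  # JSON array
    else:
        # JSONL: one JSON object per non-empty line
        records = [json.loads(line) for line in text.splitlines() if line.strip()]
    for rec in records:
        missing = REQUIRED - rec.keys()
        if missing:
            raise ValueError(f"query record missing fields: {missing}")
        for key, default in DEFAULTS.items():
            rec.setdefault(key, default)
    return records

jsonl = '{"query": "Recommend papers on Transformers.", "num_refs": 3}'
print(parse_queries(jsonl))
```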

Official dataset: OpenJudge/ref-hallucination-arena

Interpreting results

Overall accuracy (verification rate):

  • > 75% — Excellent: model rarely hallucinates references
  • 60–75% — Good: most references are real, some fabrication
  • 40–60% — Fair: significant hallucination, use with caution
  • < 40% — Poor: model frequently fabricates references
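
The bands above can be applied mechanically when comparing runs. The thresholds come from the list; the boundary handling at exactly 60% and 40% is an assumption:

```python
def accuracy_band(rate: float) -> str:
    """Map an overall verification rate (0.0-1.0) to a qualitative band."""
    if rate > 0.75:
        return "Excellent"
    if rate >= 0.60:
        return "Good"
    if rate >= 0.40:
        return "Fair"
    return "Poor"

print(accuracy_band(0.68))  # Good
```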

Per-field accuracy:

  • title_accuracy — % of titles matching real papers
  • author_accuracy — % of correct author lists
  • year_accuracy — % of correct publication years
  • doi_accuracy — % of valid DOIs

Verification status:

  • VERIFIED — title + author + year all exactly match a real paper
  • SUSPECT — partial match (e.g. title matches but authors differ)
  • NOT_FOUND — no match in any database
  • ERROR — API timeout or network failure
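
How these four statuses relate can be sketched as a decision cascade. This is a hypothetical classifier for intuition only — the real pipeline scores a composite match against `verified_threshold` rather than taking boolean flags:

```python
def classify(title_ok: bool, author_ok: bool, year_ok: bool,
             found: bool, error: bool = False) -> str:
    """Illustrative mapping of match signals to a verification status."""
    if error:
        return "ERROR"       # API timeout or network failure
    if not found:
        return "NOT_FOUND"   # no candidate in any database
    if title_ok and author_ok and year_ok:
        return "VERIFIED"    # all key fields match a real paper
    return "SUSPECT"         # partial match, e.g. right title, wrong authors

print(classify(True, False, True, found=True))  # SUSPECT
```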

Ranking order: overall accuracy → year compliance rate → avg confidence → completeness
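
That tie-breaking order corresponds to a descending sort on a tuple of the four metrics. The metric key names below are illustrative, not the exact keys in `evaluation_results.json`:

```python
def rank_models(scores: dict[str, dict]) -> list[str]:
    """Rank models by the arena's tie-breaking order (all metrics descending)."""
    return sorted(
        scores,
        key=lambda m: (
            scores[m]["overall_accuracy"],   # primary criterion
            scores[m]["year_compliance"],    # first tie-breaker
            scores[m]["avg_confidence"],     # second tie-breaker
            scores[m]["completeness"],       # final tie-breaker
        ),
        reverse=True,
    )

scores = {
    "model_a": {"overall_accuracy": 0.72, "year_compliance": 0.90,
                "avg_confidence": 0.80, "completeness": 1.0},
    "model_b": {"overall_accuracy": 0.72, "year_compliance": 0.95,
                "avg_confidence": 0.70, "completeness": 1.0},
}
print(rank_models(scores))  # model_b wins the tie on year compliance
```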

Output files

evaluation_results/ref_hallucination_arena/
├── evaluation_report.md          # Detailed Markdown report
├── evaluation_results.json       # Rankings, per-field accuracy, scores
├── verification_chart.png        # Per-field accuracy bar chart
├── discipline_chart.png          # Per-discipline accuracy chart
├── queries.json                  # Loaded evaluation queries
├── responses.json                # Raw model responses
├── extracted_refs.json           # Extracted BibTeX references
├── verification_results.json     # Per-reference verification details
└── checkpoint.json               # Pipeline checkpoint for resume

API key by model

| Model prefix | Environment variable |
| --- | --- |
| gpt-*, o1-*, o3-* | OPENAI_API_KEY |
| claude-* | ANTHROPIC_API_KEY |
| qwen-*, dashscope/* | DASHSCOPE_API_KEY |
| deepseek-* | DEEPSEEK_API_KEY |
| Custom endpoint | set api_key + base_url in config |

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
