nemo-evaluator

NeMo Evaluator SDK - Enterprise LLM Benchmarking

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install with:

npx skills add eyadsibai/ltk/eyadsibai-ltk-nemo-evaluator

Quick Start

NeMo Evaluator SDK evaluates LLMs across 100+ benchmarks from 18+ harnesses, using containerized, reproducible evaluation jobs that run on multiple backends (local Docker, Slurm HPC clusters, Lepton cloud).

Installation:

pip install nemo-evaluator-launcher

Basic evaluation:

export NGC_API_KEY=nvapi-your-key-here

cat > config.yaml << 'EOF'
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: ./results

target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY

evaluation:
  tasks:
    - name: ifeval
EOF

nemo-evaluator-launcher run --config-dir . --config-name config
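If you run many similar evaluations, a config like the one above can be rendered from a template. A minimal sketch mirroring the quick-start fields (the `write_config` helper is illustrative, not part of the SDK):

```python
from pathlib import Path

# Illustrative template mirroring the quick-start config above; not part of the SDK.
CONFIG_TEMPLATE = """\
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: {output_dir}

target:
  api_endpoint:
    model_id: {model_id}
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY

evaluation:
  tasks:
{task_lines}
"""

def write_config(path, model_id, tasks, output_dir="./results"):
    """Render the template for one model and task list, write it, return the text."""
    task_lines = "\n".join(f"    - name: {t}" for t in tasks)
    text = CONFIG_TEMPLATE.format(
        output_dir=output_dir, model_id=model_id, task_lines=task_lines
    )
    Path(path).write_text(text)
    return text
```

For example, `write_config("config.yaml", "meta/llama-3.1-8b-instruct", ["ifeval"])`, then launch with the same `nemo-evaluator-launcher run` command.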

Common Workflows

Workflow 1: Standard Model Evaluation

Checklist:

  • Configure API endpoint (NVIDIA Build or self-hosted)
  • Select benchmarks (MMLU, GSM8K, IFEval, HumanEval)
  • Run evaluation
  • Check results

Step 1: Configure endpoint

For NVIDIA Build:

target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY

For self-hosted (vLLM, TRT-LLM):

target:
  api_endpoint:
    model_id: my-model
    url: http://localhost:8000/v1/chat/completions
    api_key_name: ""

Step 2: Select benchmarks

evaluation:
  tasks:
    - name: ifeval               # Instruction following
    - name: gpqa_diamond         # Graduate-level QA
      env_vars:
        HF_TOKEN: HF_TOKEN
    - name: gsm8k_cot_instruct   # Math reasoning
    - name: humaneval            # Code generation

Step 3: Run and check results

nemo-evaluator-launcher run --config-dir . --config-name config
nemo-evaluator-launcher status <invocation_id>
cat results/<invocation_id>/<task>/artifacts/results.yml
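After a run completes, the per-task results files can be gathered programmatically. A minimal sketch, assuming the `results/<invocation_id>/<task>/artifacts/results.yml` layout shown above (`find_results` is an illustrative helper, not an SDK function):

```python
from pathlib import Path

def find_results(results_dir, invocation_id):
    """Collect every per-task results.yml under one invocation directory,
    assuming the results/<invocation_id>/<task>/artifacts/results.yml layout."""
    base = Path(results_dir) / invocation_id
    return sorted(base.glob("*/artifacts/results.yml"))
```

Each returned path's grandparent directory is the task name, so results can be keyed by task when aggregating.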

Workflow 2: Slurm HPC Evaluation

defaults:
  - execution: slurm
  - deployment: vllm
  - _self_

execution:
  hostname: cluster.example.com
  account: my_slurm_account
  partition: gpu
  output_dir: /shared/results
  walltime: "04:00:00"
  nodes: 1
  gpus_per_node: 8

deployment:
  checkpoint_path: /shared/models/llama-3.1-8b
  tensor_parallel_size: 2
  data_parallel_size: 4
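A quick sanity check on the parallelism figures above: under the usual vLLM deployment semantics, each model replica spans tensor_parallel_size GPUs and data_parallel_size replicas run side by side, so their product should fit within nodes * gpus_per_node. An illustrative sketch under that assumption:

```python
def gpu_budget(tensor_parallel, data_parallel, nodes, gpus_per_node):
    """Return (gpus_needed, gpus_available), assuming vLLM-style deployment
    where total GPUs used = tensor_parallel * data_parallel."""
    return tensor_parallel * data_parallel, nodes * gpus_per_node

# The values from the config above: 2 * 4 = 8 GPUs needed, 1 * 8 = 8 available.
needed, available = gpu_budget(2, 4, 1, 8)
assert needed <= available
```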

Workflow 3: Model Comparison

Run the same config against different models:

nemo-evaluator-launcher run --config-dir . --config-name base_eval \
  -o target.api_endpoint.model_id=meta/llama-3.1-8b-instruct

nemo-evaluator-launcher run --config-dir . --config-name base_eval \
  -o target.api_endpoint.model_id=mistralai/mistral-7b-instruct-v0.3
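The two commands above generalize to a sweep over any model list. A minimal sketch that builds the command lines (the model ids come from the commands above; `sweep_commands` is illustrative, not part of the launcher):

```python
MODELS = [
    "meta/llama-3.1-8b-instruct",
    "mistralai/mistral-7b-instruct-v0.3",
]

def sweep_commands(models, config_name="base_eval"):
    """Build one launcher command per model, overriding only model_id."""
    return [
        [
            "nemo-evaluator-launcher", "run",
            "--config-dir", ".", "--config-name", config_name,
            "-o", f"target.api_endpoint.model_id={m}",
        ]
        for m in models
    ]
```

Launch each command with `subprocess.run(cmd, check=True)`.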

Export results

nemo-evaluator-launcher export <id> --dest mlflow
nemo-evaluator-launcher export <id> --dest wandb

Supported Harnesses

Harness                      Tasks   Categories
lm-evaluation-harness        60+     MMLU, GSM8K, HellaSwag, ARC
simple-evals                 20+     GPQA, MATH, AIME
bigcode-evaluation-harness   25+     HumanEval, MBPP, MultiPL-E
safety-harness               3       Aegis, WildGuard
vlmevalkit                   6+      OCRBench, ChartQA, MMMU
bfcl                         6       Function calling v2/v3

CLI Reference

Command        Description
run            Execute evaluation with config
status <id>    Check job status
ls tasks       List available benchmarks
ls runs        List all invocations
export <id>    Export results (mlflow/wandb/local)
kill <id>      Terminate running job

When to Use vs Alternatives

Use NeMo Evaluator when:

  • Need 100+ benchmarks from 18+ harnesses

  • Running on Slurm HPC clusters

  • Requiring reproducible containerized evaluation

  • Evaluating against OpenAI-compatible APIs

Use alternatives instead:

  • lm-evaluation-harness: Simpler local evaluation

  • bigcode-evaluation-harness: Code-only benchmarks

  • HELM: Broader evaluation (fairness, efficiency)

Common Issues

Container pull fails: Configure NGC credentials

echo "$NGC_API_KEY" | docker login nvcr.io -u '$oauthtoken' --password-stdin

Task requires env var: Add to task config

tasks:
  - name: gpqa_diamond
    env_vars:
      HF_TOKEN: HF_TOKEN

Run too slow or too long: increase request parallelism, or cap the sample count for a quick smoke test

nemo-evaluator-launcher run --config-dir . --config-name config \
  -o +evaluation.nemo_evaluator_config.config.params.parallelism=8 \
  -o +evaluation.nemo_evaluator_config.config.params.limit_samples=100
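These `-o` flags use Hydra's override syntax, where a leading `+` appends a key not already present in the config. A small illustrative helper for composing such arguments (not part of the launcher CLI):

```python
def override_args(params, prefix="evaluation.nemo_evaluator_config.config.params"):
    """Turn {key: value} pairs into -o +prefix.key=value launcher arguments."""
    args = []
    for key, value in params.items():
        args += ["-o", f"+{prefix}.{key}={value}"]
    return args

override_args({"parallelism": 8, "limit_samples": 100})
# -> ['-o', '+evaluation.nemo_evaluator_config.config.params.parallelism=8',
#     '-o', '+evaluation.nemo_evaluator_config.config.params.limit_samples=100']
```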

Requirements

  • Python 3.10-3.13

  • Docker (for local execution)

  • NGC API Key (for NVIDIA Build)

  • HF_TOKEN (for some benchmarks)

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
