NeMo Evaluator SDK - Enterprise LLM Benchmarking
Quick Start
NeMo Evaluator SDK evaluates LLMs across 100+ benchmarks from 18+ harnesses using containerized, reproducible evaluation with multi-backend execution (local Docker, Slurm HPC, Lepton cloud).
Installation:
```bash
pip install nemo-evaluator-launcher
```
Basic evaluation:
```bash
export NGC_API_KEY=nvapi-your-key-here

cat > config.yaml << 'EOF'
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: ./results

target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY

evaluation:
  tasks:
    - name: ifeval
EOF

nemo-evaluator-launcher run --config-dir . --config-name config
```
Common Workflows
Workflow 1: Standard Model Evaluation
Checklist:
- Configure API endpoint (NVIDIA Build or self-hosted)
- Select benchmarks (MMLU, GSM8K, IFEval, HumanEval)
- Run evaluation
- Check results
Step 1: Configure endpoint
For NVIDIA Build:
```yaml
target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY
```
For self-hosted (vLLM, TRT-LLM):
```yaml
target:
  api_endpoint:
    model_id: my-model
    url: http://localhost:8000/v1/chat/completions
    api_key_name: ""
```
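Before launching a full evaluation against a self-hosted server, it can help to confirm the endpoint actually speaks the OpenAI-compatible chat-completions protocol. The sketch below is not part of the launcher; it is a minimal stdlib-only smoke test, with the URL and model name taken from the config above (adjust for your server). `build_chat_request` is a hypothetical helper, not a library API.

```python
import json
import urllib.request

def build_chat_request(url, model_id, prompt, api_key=None):
    """Build an OpenAI-compatible chat-completions request (sketch)."""
    payload = {
        "model": model_id,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 16,
    }
    headers = {"Content-Type": "application/json"}
    if api_key:  # api_key_name is "" in the config above, so no auth header
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        url, data=json.dumps(payload).encode(), headers=headers
    )

req = build_chat_request(
    "http://localhost:8000/v1/chat/completions", "my-model", "Say hi"
)
# To actually probe the running server:
# with urllib.request.urlopen(req, timeout=10) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

If the guarded call returns a completion, the endpoint is ready for the launcher.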
Step 2: Select benchmarks
```yaml
evaluation:
  tasks:
    - name: ifeval               # Instruction following
    - name: gpqa_diamond         # Graduate-level QA
      env_vars:
        HF_TOKEN: HF_TOKEN
    - name: gsm8k_cot_instruct   # Math reasoning
    - name: humaneval            # Code generation
```
Step 3: Run and check results
```bash
nemo-evaluator-launcher run --config-dir . --config-name config
nemo-evaluator-launcher status <invocation_id>
cat results/<invocation_id>/<task>/artifacts/results.yml
```
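Since each task writes its scores as YAML, a few lines of Python make the nested metrics easy to skim. The structure below is illustrative only; the real `results.yml` schema may differ, and in practice you would feed `flatten_scores` the output of `yaml.safe_load()` rather than a hand-written dict.

```python
def flatten_scores(node, prefix=""):
    """Flatten a nested mapping into (dotted-path, value) rows (sketch)."""
    rows = []
    for key, value in node.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            rows.extend(flatten_scores(value, path))
        else:
            rows.append((path, value))
    return rows

# Hypothetical scores, shaped like a parsed results file
sample = {"ifeval": {"strict_accuracy": 0.81}, "gsm8k": {"exact_match": 0.74}}
for path, score in flatten_scores(sample):
    print(f"{path}: {score}")
```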
Workflow 2: Slurm HPC Evaluation
```yaml
defaults:
  - execution: slurm
  - deployment: vllm
  - _self_

execution:
  hostname: cluster.example.com
  account: my_slurm_account
  partition: gpu
  output_dir: /shared/results
  walltime: "04:00:00"
  nodes: 1
  gpus_per_node: 8

deployment:
  checkpoint_path: /shared/models/llama-3.1-8b
  tensor_parallel_size: 2
  data_parallel_size: 4
```
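The deployment's parallelism has to fit the Slurm allocation: the config above requests 2 tensor-parallel × 4 data-parallel = 8 GPUs, which matches 1 node × 8 GPUs per node. This sanity check is not part of the launcher, just a worked version of that arithmetic:

```python
def gpus_required(tensor_parallel_size, data_parallel_size):
    """GPUs consumed by the deployment: one model shard per TP rank,
    replicated across DP ranks."""
    return tensor_parallel_size * data_parallel_size

allocated = 1 * 8            # nodes * gpus_per_node from the config above
needed = gpus_required(2, 4)
assert needed <= allocated   # deployment fits the allocation
print(needed)  # 8
```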
Workflow 3: Model Comparison
Same config, different models
```bash
nemo-evaluator-launcher run --config-dir . --config-name base_eval \
  -o target.api_endpoint.model_id=meta/llama-3.1-8b-instruct

nemo-evaluator-launcher run --config-dir . --config-name base_eval \
  -o target.api_endpoint.model_id=mistralai/mistral-7b-instruct-v0.3
```
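Once both runs finish, a small script can put the per-task scores side by side. The numbers below are made up for illustration; in practice you would collect them from each run's exported results.

```python
def compare(scores_a, scores_b):
    """Pair up tasks present in both runs and compute the score delta."""
    rows = []
    for task in sorted(set(scores_a) & set(scores_b)):
        rows.append((task, scores_a[task], scores_b[task],
                     round(scores_b[task] - scores_a[task], 3)))
    return rows

# Hypothetical per-task scores from the two runs above
llama = {"ifeval": 0.81, "gsm8k": 0.74}
mistral = {"ifeval": 0.69, "gsm8k": 0.52}
for task, a, b, delta in compare(llama, mistral):
    print(f"{task:10s} {a:.2f} {b:.2f} {delta:+.3f}")
```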
Export results
```bash
nemo-evaluator-launcher export <id> --dest mlflow
nemo-evaluator-launcher export <id> --dest wandb
```
Supported Harnesses
| Harness | Tasks | Categories |
|---|---|---|
| lm-evaluation-harness | 60+ | MMLU, GSM8K, HellaSwag, ARC |
| simple-evals | 20+ | GPQA, MATH, AIME |
| bigcode-evaluation-harness | 25+ | HumanEval, MBPP, MultiPL-E |
| safety-harness | 3 | Aegis, WildGuard |
| vlmevalkit | 6+ | OCRBench, ChartQA, MMMU |
| bfcl | 6 | Function calling v2/v3 |
CLI Reference
| Command | Description |
|---|---|
| `run` | Execute evaluation with config |
| `status <id>` | Check job status |
| `ls tasks` | List available benchmarks |
| `ls runs` | List all invocations |
| `export <id>` | Export results (mlflow/wandb/local) |
| `kill <id>` | Terminate running job |
When to Use vs Alternatives
Use NeMo Evaluator when:
- Need 100+ benchmarks from 18+ harnesses
- Running on Slurm HPC clusters
- Requiring reproducible containerized evaluation
- Evaluating against OpenAI-compatible APIs
Use alternatives instead:
- lm-evaluation-harness: Simpler local evaluation
- bigcode-evaluation-harness: Code-only benchmarks
- HELM: Broader evaluation (fairness, efficiency)
Common Issues
Container pull fails: Configure NGC credentials
```bash
docker login nvcr.io -u '$oauthtoken' -p $NGC_API_KEY
```
Task requires env var: Add to task config
```yaml
tasks:
  - name: gpqa_diamond
    env_vars:
      HF_TOKEN: HF_TOKEN
```
Increase parallelism or cap the sample count (Hydra overrides appended to `run`):
```bash
-o +evaluation.nemo_evaluator_config.config.params.parallelism=8 \
-o +evaluation.nemo_evaluator_config.config.params.limit_samples=100
```
Requirements
- Python 3.10-3.13
- Docker (for local execution)
- NGC API Key (for NVIDIA Build)
- HF_TOKEN (for some benchmarks)