NeMo Evaluator SDK - Enterprise LLM Benchmarking
Quick Start
NeMo Evaluator SDK evaluates LLMs across 100+ benchmarks from 18+ harnesses using containerized, reproducible evaluation with multi-backend execution (local Docker, Slurm HPC, Lepton cloud).
Installation:
```bash
pip install nemo-evaluator-launcher
```
Set API key and run evaluation:
```bash
export NGC_API_KEY=nvapi-your-key-here
```
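Optionally, sanity-check the endpoint and key with a direct request before launching anything. This is a plain OpenAI-compatible chat completions call; the model ID matches the config created below:

```bash
curl -s https://integrate.api.nvidia.com/v1/chat/completions \
  -H "Authorization: Bearer $NGC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "meta/llama-3.1-8b-instruct", "messages": [{"role": "user", "content": "Say hi"}], "max_tokens": 16}'
```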
Create a minimal config:
```bash
cat > config.yaml << 'EOF'
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: ./results

target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY

evaluation:
  tasks:
    - name: ifeval
EOF
```
Run the evaluation:
```bash
nemo-evaluator-launcher run --config-dir . --config-name config
```
View available tasks:
```bash
nemo-evaluator-launcher ls tasks
```
Common Workflows
Workflow 1: Evaluate Model on Standard Benchmarks
Run core academic benchmarks (MMLU, GSM8K, IFEval) on any OpenAI-compatible endpoint.
Checklist:
Standard Evaluation:
- Step 1: Configure API endpoint
- Step 2: Select benchmarks
- Step 3: Run evaluation
- Step 4: Check results
Step 1: Configure API endpoint
```yaml
# config.yaml
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: ./results

target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY
```
For self-hosted endpoints (vLLM, TRT-LLM):
```yaml
target:
  api_endpoint:
    model_id: my-model
    url: http://localhost:8000/v1/chat/completions
    api_key_name: ""   # No key needed for local endpoints
```
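If no local endpoint is running yet, one way to stand one up is vLLM's OpenAI-compatible server. This is a minimal sketch; the checkpoint name and port are examples and are not part of the launcher itself:

```bash
# Exposes an OpenAI-compatible API at http://localhost:8000/v1/chat/completions
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
```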
Step 2: Select benchmarks
Add tasks to your config:
```yaml
evaluation:
  tasks:
    - name: ifeval               # Instruction following
    - name: gpqa_diamond         # Graduate-level QA
      env_vars:
        HF_TOKEN: HF_TOKEN       # Some tasks need an HF token
    - name: gsm8k_cot_instruct   # Math reasoning
    - name: humaneval            # Code generation
```
Step 3: Run evaluation
```bash
# Run with config file
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name config

# Override output directory
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name config \
  -o execution.output_dir=./my_results

# Limit samples for quick testing
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name config \
  -o +evaluation.nemo_evaluator_config.config.params.limit_samples=10
```
Step 4: Check results
```bash
# Check job status
nemo-evaluator-launcher status <invocation_id>

# List all runs
nemo-evaluator-launcher ls runs

# View results
cat results/<invocation_id>/<task>/artifacts/results.yml
```
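For a quick look across all tasks in a run, a small script can collect every results.yml under the output directory. This is only a sketch: it assumes the results/<invocation_id>/<task>/artifacts/results.yml layout shown above and prints the raw YAML rather than assuming any particular schema.

```python
from pathlib import Path

import yaml  # requires: pip install pyyaml

results_root = Path("./results")
# Matches results/<invocation_id>/<task>/artifacts/results.yml
for results_file in sorted(results_root.glob("*/*/artifacts/results.yml")):
    task = results_file.parent.parent.name
    data = yaml.safe_load(results_file.read_text())
    print(f"=== {task} ===")
    print(yaml.safe_dump(data, sort_keys=False))
```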
Workflow 2: Run Evaluation on Slurm HPC Cluster
Execute large-scale evaluation on HPC infrastructure.
Checklist:
Slurm Evaluation:
- Step 1: Configure Slurm settings
- Step 2: Set up model deployment
- Step 3: Launch evaluation
- Step 4: Monitor job status
Step 1: Configure Slurm settings
```yaml
# slurm_config.yaml
defaults:
  - execution: slurm
  - deployment: vllm
  - _self_

execution:
  hostname: cluster.example.com
  account: my_slurm_account
  partition: gpu
  output_dir: /shared/results
  walltime: "04:00:00"
  nodes: 1
  gpus_per_node: 8
```
Step 2: Set up model deployment
```yaml
deployment:
  checkpoint_path: /shared/models/llama-3.1-8b
  tensor_parallel_size: 2
  data_parallel_size: 4
  max_model_len: 4096

target:
  api_endpoint:
    model_id: llama-3.1-8b
    # URL is auto-generated by the deployment
```
Step 3: Launch evaluation
```bash
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name slurm_config
```
Step 4: Monitor job status
```bash
# Check status (queries sacct)
nemo-evaluator-launcher status <invocation_id>

# View detailed info
nemo-evaluator-launcher info <invocation_id>

# Kill if needed
nemo-evaluator-launcher kill <invocation_id>
```
Workflow 3: Compare Multiple Models
Benchmark multiple models on the same tasks for comparison.
Checklist:
Model Comparison:
- Step 1: Create base config
- Step 2: Run evaluations with overrides
- Step 3: Export and compare results
Step 1: Create base config
```yaml
# base_eval.yaml
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: ./comparison_results

evaluation:
  nemo_evaluator_config:
    config:
      params:
        temperature: 0.01
        parallelism: 4
  tasks:
    - name: mmlu_pro
    - name: gsm8k_cot_instruct
    - name: ifeval
```
Step 2: Run evaluations with model overrides
```bash
# Evaluate Llama 3.1 8B
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name base_eval \
  -o target.api_endpoint.model_id=meta/llama-3.1-8b-instruct \
  -o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions

# Evaluate Mistral 7B
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name base_eval \
  -o target.api_endpoint.model_id=mistralai/mistral-7b-instruct-v0.3 \
  -o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions
```
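To compare more than two models, the same pattern can be wrapped in a shell loop. A minimal sketch, using only the two model IDs from the commands above as examples:

```bash
# Run the same base config against each model endpoint in turn
for model in meta/llama-3.1-8b-instruct mistralai/mistral-7b-instruct-v0.3; do
  nemo-evaluator-launcher run \
    --config-dir . \
    --config-name base_eval \
    -o target.api_endpoint.model_id="$model" \
    -o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions
done
```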
Step 3: Export and compare
```bash
# Export to MLflow
nemo-evaluator-launcher export <invocation_id_1> --dest mlflow
nemo-evaluator-launcher export <invocation_id_2> --dest mlflow

# Export to local JSON
nemo-evaluator-launcher export <invocation_id> --dest local --format json

# Export to Weights & Biases
nemo-evaluator-launcher export <invocation_id> --dest wandb
```
Workflow 4: Safety and Vision-Language Evaluation
Evaluate models on safety benchmarks and VLM tasks.
Checklist:
Safety/VLM Evaluation:
- Step 1: Configure safety tasks
- Step 2: Set up VLM tasks (if applicable)
- Step 3: Run evaluation
Step 1: Configure safety tasks
```yaml
evaluation:
  tasks:
    - name: aegis       # Safety harness
    - name: wildguard   # Safety classification
    - name: garak       # Security probing
```
Step 2: Configure VLM tasks
For vision-language models:

```yaml
target:
  api_endpoint:
    type: vlm   # Vision-language endpoint
    model_id: nvidia/llama-3.2-90b-vision-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions

evaluation:
  tasks:
    - name: ocrbench   # OCR evaluation
    - name: chartqa    # Chart understanding
    - name: mmmu       # Multimodal understanding
```
When to Use vs Alternatives
Use NeMo Evaluator when:
- You need 100+ benchmarks from 18+ harnesses in one platform
- You run evaluations on Slurm HPC clusters or in the cloud
- You require reproducible, containerized evaluation
- You evaluate against OpenAI-compatible APIs (vLLM, TRT-LLM, NIMs)
- You need enterprise-grade evaluation with result export (MLflow, W&B)

Use alternatives instead:
- lm-evaluation-harness: simpler setup for quick local evaluation
- bigcode-evaluation-harness: focused solely on code benchmarks
- HELM: Stanford's broader evaluation (fairness, efficiency)
- Custom scripts: highly specialized domain evaluation
Supported Harnesses and Tasks
| Harness | Task Count | Categories |
|---|---|---|
| lm-evaluation-harness | 60+ | MMLU, GSM8K, HellaSwag, ARC |
| simple-evals | 20+ | GPQA, MATH, AIME |
| bigcode-evaluation-harness | 25+ | HumanEval, MBPP, MultiPL-E |
| safety-harness | 3 | Aegis, WildGuard |
| garak | 1 | Security probing |
| vlmevalkit | 6+ | OCRBench, ChartQA, MMMU |
| bfcl | 6 | Function calling v2/v3 |
| mtbench | 2 | Multi-turn conversation |
| livecodebench | 10+ | Live coding evaluation |
| helm | 15 | Medical domain |
| nemo-skills | 8 | Math, science, agentic |
Common Issues
Issue: Container pull fails
Ensure NGC credentials are configured:
```bash
docker login nvcr.io -u '$oauthtoken' -p $NGC_API_KEY
```
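To keep the key out of your shell history and process list, Docker can also read the password from stdin:

```bash
# '$oauthtoken' is the literal username expected by nvcr.io, so it stays single-quoted
echo "$NGC_API_KEY" | docker login nvcr.io -u '$oauthtoken' --password-stdin
```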
Issue: Task requires environment variable
Some tasks need HF_TOKEN or JUDGE_API_KEY:
```yaml
evaluation:
  tasks:
    - name: gpqa_diamond
      env_vars:
        HF_TOKEN: HF_TOKEN   # Maps the task's variable to a host env var of the same name
```
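The env_vars entry points the task at a variable that must be set in your shell before the run, so export it first (the values below are placeholders):

```bash
export HF_TOKEN=<your-hf-token>
export JUDGE_API_KEY=<your-judge-api-key>   # only for tasks that use an LLM judge
```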
Issue: Evaluation timeout
Increase parallelism or reduce samples:
```bash
-o +evaluation.nemo_evaluator_config.config.params.parallelism=8 \
-o +evaluation.nemo_evaluator_config.config.params.limit_samples=100
```
Issue: Slurm job not starting
Check Slurm account and partition:
```yaml
execution:
  account: correct_account
  partition: gpu
  qos: normal   # May need a specific QOS
```
Issue: Different results than expected
Verify configuration matches reported settings:
```yaml
evaluation:
  nemo_evaluator_config:
    config:
      params:
        temperature: 0.0   # Deterministic decoding
        num_fewshot: 5     # Check the paper's few-shot count
```
CLI Reference
| Command | Description |
|---|---|
| `run` | Execute evaluation with config |
| `status <id>` | Check job status |
| `info <id>` | View detailed job info |
| `ls tasks` | List available benchmarks |
| `ls runs` | List all invocations |
| `export <id>` | Export results (mlflow/wandb/local) |
| `kill <id>` | Terminate running job |
Configuration Override Examples
Override model endpoint:

```bash
-o target.api_endpoint.model_id=my-model \
-o target.api_endpoint.url=http://localhost:8000/v1/chat/completions
```

Add evaluation parameters:

```bash
-o +evaluation.nemo_evaluator_config.config.params.temperature=0.5 \
-o +evaluation.nemo_evaluator_config.config.params.parallelism=8 \
-o +evaluation.nemo_evaluator_config.config.params.limit_samples=50
```

Change execution settings:

```bash
-o execution.output_dir=/custom/path \
-o execution.mode=parallel
```

Dynamically set tasks:

```bash
-o 'evaluation.tasks=[{name: ifeval}, {name: gsm8k}]'
```
Python API Usage
For programmatic evaluation without the CLI:
```python
from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import (
    EvaluationConfig,
    EvaluationTarget,
    ApiEndpoint,
    EndpointType,
    ConfigParams,
)

# Configure evaluation
eval_config = EvaluationConfig(
    type="mmlu_pro",
    output_dir="./results",
    params=ConfigParams(
        limit_samples=10,
        temperature=0.0,
        max_new_tokens=1024,
        parallelism=4,
    ),
)

# Configure target endpoint
target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        model_id="meta/llama-3.1-8b-instruct",
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        type=EndpointType.CHAT,
        api_key="nvapi-your-key-here",
    ),
)

# Run evaluation
result = evaluate(eval_cfg=eval_config, target_cfg=target_config)
```
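To run several benchmarks against the same target programmatically, one pattern is to build one EvaluationConfig per task. This is a sketch that reuses the imports and target_config from the example above; the task names are just examples from the harness table:

```python
# Run multiple benchmarks against the same endpoint, one output directory per task
for task in ["mmlu_pro", "ifeval", "gsm8k_cot_instruct"]:
    cfg = EvaluationConfig(
        type=task,
        output_dir=f"./results/{task}",
        params=ConfigParams(limit_samples=10, temperature=0.0, parallelism=4),
    )
    evaluate(eval_cfg=cfg, target_cfg=target_config)
```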
Advanced Topics
- Multi-backend execution: see references/execution-backends.md
- Configuration deep-dive: see references/configuration.md
- Adapter and interceptor system: see references/adapter-system.md
- Custom benchmark integration: see references/custom-benchmarks.md
Requirements
- Python: 3.10-3.13
- Docker: required for local execution
- NGC API Key: for pulling containers and using NVIDIA Build
- HF_TOKEN: required for some benchmarks (GPQA, MMLU)
Resources
- NGC Containers: nvcr.io/nvidia/eval-factory/
- NVIDIA Build: https://build.nvidia.com (free hosted models)
- Documentation: https://github.com/NVIDIA-NeMo/Evaluator/tree/main/docs