NeMo Evaluator SDK - Enterprise LLM Benchmarking
Quick Start
NeMo Evaluator SDK evaluates LLMs across 100+ benchmarks from 18+ harnesses using containerized, reproducible evaluation with multi-backend execution (local Docker, Slurm HPC, Lepton cloud).
Installation:
```bash
pip install nemo-evaluator-launcher
```
Set API key and run evaluation:
```bash
export NGC_API_KEY=nvapi-your-key-here
```
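Optionally, sanity-check the endpoint and key with a direct request before launching anything. This is a plain OpenAI-compatible chat completions call; the model ID matches the config created below:

```bash
curl -s https://integrate.api.nvidia.com/v1/chat/completions \
  -H "Authorization: Bearer $NGC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "meta/llama-3.1-8b-instruct", "messages": [{"role": "user", "content": "Say hi"}], "max_tokens": 16}'
```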
Create a minimal config:
```bash
cat > config.yaml << 'EOF'
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: ./results

target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY

evaluation:
  tasks:
    - name: ifeval
EOF
```
Run the evaluation:
```bash
nemo-evaluator-launcher run --config-dir . --config-name config
```
View available tasks:
```bash
nemo-evaluator-launcher ls tasks
```
Common Workflows
Workflow 1: Evaluate Model on Standard Benchmarks
Run core academic benchmarks (MMLU, GSM8K, IFEval) on any OpenAI-compatible endpoint.
Checklist:
Standard Evaluation:
- Step 1: Configure API endpoint
- Step 2: Select benchmarks
- Step 3: Run evaluation
- Step 4: Check results
Step 1: Configure API endpoint
```yaml
# config.yaml
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: ./results

target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY
```
For self-hosted endpoints (vLLM, TRT-LLM):
```yaml
target:
  api_endpoint:
    model_id: my-model
    url: http://localhost:8000/v1/chat/completions
    api_key_name: ""   # No key needed for local endpoints
```
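If no local endpoint is running yet, one way to stand one up is vLLM's OpenAI-compatible server. This is a minimal sketch; the checkpoint name and port are examples and are not part of the launcher itself:

```bash
# Exposes an OpenAI-compatible API at http://localhost:8000/v1/chat/completions
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
```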
Step 2: Select benchmarks
Add tasks to your config:
```yaml
evaluation:
  tasks:
    - name: ifeval               # Instruction following
    - name: gpqa_diamond         # Graduate-level QA
      env_vars:
        HF_TOKEN: HF_TOKEN       # Some tasks need an HF token
    - name: gsm8k_cot_instruct   # Math reasoning
    - name: humaneval            # Code generation
```
Step 3: Run evaluation
```bash
# Run with config file
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name config

# Override output directory
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name config \
  -o execution.output_dir=./my_results

# Limit samples for quick testing
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name config \
  -o +evaluation.nemo_evaluator_config.config.params.limit_samples=10
```
Step 4: Check results
```bash
# Check job status
nemo-evaluator-launcher status <invocation_id>

# List all runs
nemo-evaluator-launcher ls runs

# View results
cat results/<invocation_id>/<task>/artifacts/results.yml
```
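For a quick look across all tasks in a run, a small script can collect every results.yml under the output directory. This is only a sketch: it assumes the results/<invocation_id>/<task>/artifacts/results.yml layout shown above and prints the raw YAML rather than assuming any particular schema.

```python
from pathlib import Path

import yaml  # requires: pip install pyyaml

results_root = Path("./results")
# Matches results/<invocation_id>/<task>/artifacts/results.yml
for results_file in sorted(results_root.glob("*/*/artifacts/results.yml")):
    task = results_file.parent.parent.name
    data = yaml.safe_load(results_file.read_text())
    print(f"=== {task} ===")
    print(yaml.safe_dump(data, sort_keys=False))
```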
Workflow 2: Run Evaluation on Slurm HPC Cluster
Execute large-scale evaluation on HPC infrastructure.
Checklist:
Slurm Evaluation:
- Step 1: Configure Slurm settings
- Step 2: Set up model deployment
- Step 3: Launch evaluation
- Step 4: Monitor job status
Step 1: Configure Slurm settings
```yaml
# slurm_config.yaml
defaults:
  - execution: slurm
  - deployment: vllm
  - _self_

execution:
  hostname: cluster.example.com
  account: my_slurm_account
  partition: gpu
  output_dir: /shared/results
  walltime: "04:00:00"
  nodes: 1
  gpus_per_node: 8
```
Step 2: Set up model deployment
```yaml
deployment:
  checkpoint_path: /shared/models/llama-3.1-8b
  tensor_parallel_size: 2
  data_parallel_size: 4
  max_model_len: 4096

target:
  api_endpoint:
    model_id: llama-3.1-8b
    # URL is auto-generated by the deployment
```
Step 3: Launch evaluation
```bash
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name slurm_config
```
Step 4: Monitor job status
```bash
# Check status (queries sacct)
nemo-evaluator-launcher status <invocation_id>

# View detailed info
nemo-evaluator-launcher info <invocation_id>

# Kill if needed
nemo-evaluator-launcher kill <invocation_id>
```
Workflow 3: Compare Multiple Models
Benchmark multiple models on the same tasks for comparison.
Checklist:
Model Comparison:
- Step 1: Create base config
- Step 2: Run evaluations with overrides
- Step 3: Export and compare results
Step 1: Create base config
```yaml
# base_eval.yaml
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: ./comparison_results

evaluation:
  nemo_evaluator_config:
    config:
      params:
        temperature: 0.01
        parallelism: 4
  tasks:
    - name: mmlu_pro
    - name: gsm8k_cot_instruct
    - name: ifeval
```
Step 2: Run evaluations with model overrides
```bash
# Evaluate Llama 3.1 8B
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name base_eval \
  -o target.api_endpoint.model_id=meta/llama-3.1-8b-instruct \
  -o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions

# Evaluate Mistral 7B
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name base_eval \
  -o target.api_endpoint.model_id=mistralai/mistral-7b-instruct-v0.3 \
  -o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions
```
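To compare more than two models, the same pattern can be wrapped in a shell loop. A minimal sketch, using only the two model IDs from the commands above as examples:

```bash
# Run the same base config against each model endpoint in turn
for model in meta/llama-3.1-8b-instruct mistralai/mistral-7b-instruct-v0.3; do
  nemo-evaluator-launcher run \
    --config-dir . \
    --config-name base_eval \
    -o target.api_endpoint.model_id="$model" \
    -o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions
done
```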
Step 3: Export and compare
```bash
# Export to MLflow
nemo-evaluator-launcher export <invocation_id_1> --dest mlflow
nemo-evaluator-launcher export <invocation_id_2> --dest mlflow

# Export to local JSON
nemo-evaluator-launcher export <invocation_id> --dest local --format json

# Export to Weights & Biases
nemo-evaluator-launcher export <invocation_id> --dest wandb
```
Workflow 4: Safety and Vision-Language Evaluation
Evaluate models on safety benchmarks and VLM tasks.
Checklist:
Safety/VLM Evaluation:
- Step 1: Configure safety tasks
- Step 2: Set up VLM tasks (if applicable)
- Step 3: Run evaluation
Step 1: Configure safety tasks
```yaml
evaluation:
  tasks:
    - name: aegis       # Safety harness
    - name: wildguard   # Safety classification
    - name: garak       # Security probing
```
Step 2: Configure VLM tasks
For vision-language models:

```yaml
target:
  api_endpoint:
    type: vlm   # Vision-language endpoint
    model_id: nvidia/llama-3.2-90b-vision-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions

evaluation:
  tasks:
    - name: ocrbench   # OCR evaluation
    - name: chartqa    # Chart understanding
    - name: mmmu       # Multimodal understanding
```
When to Use vs Alternatives
Use NeMo Evaluator when:
- You need 100+ benchmarks from 18+ harnesses in one platform
- You run evaluations on Slurm HPC clusters or in the cloud
- You require reproducible, containerized evaluation
- You evaluate against OpenAI-compatible APIs (vLLM, TRT-LLM, NIMs)
- You need enterprise-grade evaluation with result export (MLflow, W&B)

Use alternatives instead:
- lm-evaluation-harness: simpler setup for quick local evaluation
- bigcode-evaluation-harness: focused solely on code benchmarks
- HELM: Stanford's broader evaluation (fairness, efficiency)
- Custom scripts: highly specialized domain evaluation
Supported Harnesses and Tasks
| Harness | Task Count | Categories |
|---|---|---|
| lm-evaluation-harness | 60+ | MMLU, GSM8K, HellaSwag, ARC |
| simple-evals | 20+ | GPQA, MATH, AIME |
| bigcode-evaluation-harness | 25+ | HumanEval, MBPP, MultiPL-E |
| safety-harness | 3 | Aegis, WildGuard |
| garak | 1 | Security probing |
| vlmevalkit | 6+ | OCRBench, ChartQA, MMMU |
| bfcl | 6 | Function calling v2/v3 |
| mtbench | 2 | Multi-turn conversation |
| livecodebench | 10+ | Live coding evaluation |
| helm | 15 | Medical domain |
| nemo-skills | 8 | Math, science, agentic |
Common Issues
Issue: Container pull fails
Ensure NGC credentials are configured:
```bash
docker login nvcr.io -u '$oauthtoken' -p $NGC_API_KEY
```
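To keep the key out of your shell history and process list, Docker can also read the password from stdin:

```bash
# '$oauthtoken' is the literal username expected by nvcr.io, so it stays single-quoted
echo "$NGC_API_KEY" | docker login nvcr.io -u '$oauthtoken' --password-stdin
```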
Issue: Task requires environment variable
Some tasks need HF_TOKEN or JUDGE_API_KEY:
```yaml
evaluation:
  tasks:
    - name: gpqa_diamond
      env_vars:
        HF_TOKEN: HF_TOKEN   # Maps the task's variable to a host env var of the same name
```
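The env_vars entry points the task at a variable that must be set in your shell before the run, so export it first (the values below are placeholders):

```bash
export HF_TOKEN=<your-hf-token>
export JUDGE_API_KEY=<your-judge-api-key>   # only for tasks that use an LLM judge
```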
Issue: Evaluation timeout
Increase parallelism or reduce samples:
```bash
-o +evaluation.nemo_evaluator_config.config.params.parallelism=8 \
-o +evaluation.nemo_evaluator_config.config.params.limit_samples=100
```
Issue: Slurm job not starting
Check Slurm account and partition:
```yaml
execution:
  account: correct_account
  partition: gpu
  qos: normal   # May need a specific QOS
```
Issue: Different results than expected
Verify configuration matches reported settings:
```yaml
evaluation:
  nemo_evaluator_config:
    config:
      params:
        temperature: 0.0   # Deterministic decoding
        num_fewshot: 5     # Check the paper's few-shot count
```
CLI Reference
| Command | Description |
|---|---|
| `run` | Execute evaluation with config |
| `status <id>` | Check job status |
| `info <id>` | View detailed job info |
| `ls tasks` | List available benchmarks |
| `ls runs` | List all invocations |
| `export <id>` | Export results (mlflow/wandb/local) |
| `kill <id>` | Terminate running job |
Configuration Override Examples
Override model endpoint:

```bash
-o target.api_endpoint.model_id=my-model \
-o target.api_endpoint.url=http://localhost:8000/v1/chat/completions
```

Add evaluation parameters:

```bash
-o +evaluation.nemo_evaluator_config.config.params.temperature=0.5 \
-o +evaluation.nemo_evaluator_config.config.params.parallelism=8 \
-o +evaluation.nemo_evaluator_config.config.params.limit_samples=50
```

Change execution settings:

```bash
-o execution.output_dir=/custom/path \
-o execution.mode=parallel
```

Dynamically set tasks:

```bash
-o 'evaluation.tasks=[{name: ifeval}, {name: gsm8k}]'
```
Python API Usage
For programmatic evaluation without the CLI:
```python
from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import (
    EvaluationConfig,
    EvaluationTarget,
    ApiEndpoint,
    EndpointType,
    ConfigParams,
)

# Configure evaluation
eval_config = EvaluationConfig(
    type="mmlu_pro",
    output_dir="./results",
    params=ConfigParams(
        limit_samples=10,
        temperature=0.0,
        max_new_tokens=1024,
        parallelism=4,
    ),
)

# Configure target endpoint
target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        model_id="meta/llama-3.1-8b-instruct",
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        type=EndpointType.CHAT,
        api_key="nvapi-your-key-here",
    ),
)

# Run evaluation
result = evaluate(eval_cfg=eval_config, target_cfg=target_config)
```
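To run several benchmarks against the same target programmatically, one pattern is to build one EvaluationConfig per task. This is a sketch that reuses the imports and target_config from the example above; the task names are just examples from the harness table:

```python
# Run multiple benchmarks against the same endpoint, one output directory per task
for task in ["mmlu_pro", "ifeval", "gsm8k_cot_instruct"]:
    cfg = EvaluationConfig(
        type=task,
        output_dir=f"./results/{task}",
        params=ConfigParams(limit_samples=10, temperature=0.0, parallelism=4),
    )
    evaluate(eval_cfg=cfg, target_cfg=target_config)
```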
Advanced Topics
- Multi-backend execution: see references/execution-backends.md
- Configuration deep-dive: see references/configuration.md
- Adapter and interceptor system: see references/adapter-system.md
- Custom benchmark integration: see references/custom-benchmarks.md
Requirements
- Python: 3.10-3.13
- Docker: required for local execution
- NGC API Key: for pulling containers and using NVIDIA Build
- HF_TOKEN: required for some benchmarks (GPQA, MMLU)
Resources
- NGC Containers: nvcr.io/nvidia/eval-factory/
- NVIDIA Build: https://build.nvidia.com (free hosted models)
- Documentation: https://github.com/NVIDIA-NeMo/Evaluator/tree/main/docs