lm-evaluation-harness - LLM Benchmarking

Quick start

lm-evaluation-harness evaluates LLMs across 60+ academic benchmarks using standardized prompts and metrics.

Installation:

pip install lm-eval

Evaluate any HuggingFace model:

lm_eval --model hf
--model_args pretrained=meta-llama/Llama-2-7b-hf
--tasks mmlu,gsm8k,hellaswag
--device cuda:0
--batch_size 8

View available tasks:

lm_eval --tasks list

Common workflows

Workflow 1: Standard benchmark evaluation

Evaluate model on core benchmarks (MMLU, GSM8K, HumanEval).

Copy this checklist:

Benchmark Evaluation:

Step 1: Choose benchmark suite
Step 2: Configure model
Step 3: Run evaluation
Step 4: Analyze results

Step 1: Choose benchmark suite

Core reasoning benchmarks:

MMLU (Massive Multitask Language Understanding) - 57 subjects, multiple choice
GSM8K - Grade school math word problems
HellaSwag - Common sense reasoning
TruthfulQA - Truthfulness and factuality
ARC (AI2 Reasoning Challenge) - Science questions

Code benchmarks:

HumanEval - Python code generation (164 problems)
MBPP (Mostly Basic Python Problems) - Python coding

Standard suite (recommended for model releases):

--tasks mmlu,gsm8k,hellaswag,truthfulqa,arc_challenge

Step 2: Configure model

HuggingFace model:

lm_eval --model hf
--model_args pretrained=meta-llama/Llama-2-7b-hf,dtype=bfloat16
--tasks mmlu
--device cuda:0
--batch_size auto # Auto-detect optimal batch size

Quantized model (4-bit/8-bit):

lm_eval --model hf
--model_args pretrained=meta-llama/Llama-2-7b-hf,load_in_4bit=True
--tasks mmlu
--device cuda:0

Custom checkpoint:

lm_eval --model hf
--model_args pretrained=/path/to/my-model,tokenizer=/path/to/tokenizer
--tasks mmlu
--device cuda:0

Step 3: Run evaluation

Full MMLU evaluation (57 subjects)

lm_eval --model hf
--model_args pretrained=meta-llama/Llama-2-7b-hf
--tasks mmlu
--num_fewshot 5 \ # 5-shot evaluation (standard) --batch_size 8
--output_path results/
--log_samples # Save individual predictions

Multiple benchmarks at once

lm_eval --model hf
--model_args pretrained=meta-llama/Llama-2-7b-hf
--tasks mmlu,gsm8k,hellaswag,truthfulqa,arc_challenge
--num_fewshot 5
--batch_size 8
--output_path results/llama2-7b-eval.json

Step 4: Analyze results

Results saved to results/llama2-7b-eval.json :

{ "results": { "mmlu": { "acc": 0.459, "acc_stderr": 0.004 }, "gsm8k": { "exact_match": 0.142, "exact_match_stderr": 0.006 }, "hellaswag": { "acc_norm": 0.765, "acc_norm_stderr": 0.004 } }, "config": { "model": "hf", "model_args": "pretrained=meta-llama/Llama-2-7b-hf", "num_fewshot": 5 } }

Workflow 2: Track training progress

Evaluate checkpoints during training.

Training Progress Tracking:

Step 1: Set up periodic evaluation
Step 2: Choose quick benchmarks
Step 3: Automate evaluation
Step 4: Plot learning curves

Step 1: Set up periodic evaluation

Evaluate every N training steps:

#!/bin/bash

eval_checkpoint.sh

CHECKPOINT_DIR=$1 STEP=$2

lm_eval --model hf
--model_args pretrained=$CHECKPOINT_DIR/checkpoint-$STEP
--tasks gsm8k,hellaswag
--num_fewshot 0 \ # 0-shot for speed --batch_size 16
--output_path results/step-$STEP.json

Step 2: Choose quick benchmarks

Fast benchmarks for frequent evaluation:

HellaSwag: ~10 minutes on 1 GPU
GSM8K: ~5 minutes
PIQA: ~2 minutes

Avoid for frequent eval (too slow):

MMLU: ~2 hours (57 subjects)
HumanEval: Requires code execution

Step 3: Automate evaluation

Integrate with training script:

In training loop

if step % eval_interval == 0: model.save_pretrained(f"checkpoints/step-{step}")

# Run evaluation
os.system(f"./eval_checkpoint.sh checkpoints step-{step}")

Or use PyTorch Lightning callbacks:

from pytorch_lightning import Callback

class EvalHarnessCallback(Callback): def on_validation_epoch_end(self, trainer, pl_module): step = trainer.global_step checkpoint_path = f"checkpoints/step-{step}"

    # Save checkpoint
    trainer.save_checkpoint(checkpoint_path)

    # Run lm-eval
    os.system(f"lm_eval --model hf --model_args pretrained={checkpoint_path} ...")

Step 4: Plot learning curves

import json import matplotlib.pyplot as plt

Load all results

steps = [] mmlu_scores = []

for file in sorted(glob.glob("results/step-*.json")): with open(file) as f: data = json.load(f) step = int(file.split("-")[1].split(".")[0]) steps.append(step) mmlu_scores.append(data["results"]["mmlu"]["acc"])

Plot

plt.plot(steps, mmlu_scores) plt.xlabel("Training Step") plt.ylabel("MMLU Accuracy") plt.title("Training Progress") plt.savefig("training_curve.png")

Workflow 3: Compare multiple models

Benchmark suite for model comparison.

Model Comparison:

Step 1: Define model list
Step 2: Run evaluations
Step 3: Generate comparison table

Step 1: Define model list

models.txt

meta-llama/Llama-2-7b-hf meta-llama/Llama-2-13b-hf mistralai/Mistral-7B-v0.1 microsoft/phi-2

Step 2: Run evaluations

#!/bin/bash

eval_all_models.sh

TASKS="mmlu,gsm8k,hellaswag,truthfulqa"

while read model; do echo "Evaluating $model"

# Extract model name for output file
model_name=$(echo $model | sed 's/\//-/g')

lm_eval --model hf \
  --model_args pretrained=$model,dtype=bfloat16 \
  --tasks $TASKS \
  --num_fewshot 5 \
  --batch_size auto \
  --output_path results/$model_name.json

done < models.txt

Step 3: Generate comparison table

import json import pandas as pd

models = [ "meta-llama-Llama-2-7b-hf", "meta-llama-Llama-2-13b-hf", "mistralai-Mistral-7B-v0.1", "microsoft-phi-2" ]

tasks = ["mmlu", "gsm8k", "hellaswag", "truthfulqa"]

results = [] for model in models: with open(f"results/{model}.json") as f: data = json.load(f) row = {"Model": model.replace("-", "/")} for task in tasks: # Get primary metric for each task metrics = data["results"][task] if "acc" in metrics: row[task.upper()] = f"{metrics['acc']:.3f}" elif "exact_match" in metrics: row[task.upper()] = f"{metrics['exact_match']:.3f}" results.append(row)

df = pd.DataFrame(results) print(df.to_markdown(index=False))

Output:

Model	MMLU	GSM8K	HELLASWAG	TRUTHFULQA
meta-llama/Llama-2-7b	0.459	0.142	0.765	0.391
meta-llama/Llama-2-13b	0.549	0.287	0.801	0.430
mistralai/Mistral-7B	0.626	0.395	0.812	0.428
microsoft/phi-2	0.560	0.613	0.682	0.447

Workflow 4: Evaluate with vLLM (faster inference)

Use vLLM backend for 5-10x faster evaluation.

vLLM Evaluation:

Step 1: Install vLLM
Step 2: Configure vLLM backend
Step 3: Run evaluation

Step 1: Install vLLM

pip install vllm

Step 2: Configure vLLM backend

lm_eval --model vllm
--model_args pretrained=meta-llama/Llama-2-7b-hf,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.8
--tasks mmlu
--batch_size auto

Step 3: Run evaluation

vLLM is 5-10× faster than standard HuggingFace:

Standard HF: ~2 hours for MMLU on 7B model

lm_eval --model hf
--model_args pretrained=meta-llama/Llama-2-7b-hf
--tasks mmlu
--batch_size 8

vLLM: ~15-20 minutes for MMLU on 7B model

lm_eval --model vllm
--model_args pretrained=meta-llama/Llama-2-7b-hf,tensor_parallel_size=2
--tasks mmlu
--batch_size auto

When to use vs alternatives

Use lm-evaluation-harness when:

Benchmarking models for academic papers
Comparing model quality across standard tasks
Tracking training progress
Reporting standardized metrics (everyone uses same prompts)
Need reproducible evaluation

Use alternatives instead:

HELM (Stanford): Broader evaluation (fairness, efficiency, calibration)
AlpacaEval: Instruction-following evaluation with LLM judges
MT-Bench: Conversational multi-turn evaluation
Custom scripts: Domain-specific evaluation

Common issues

Issue: Evaluation too slow

Use vLLM backend:

lm_eval --model vllm
--model_args pretrained=model-name,tensor_parallel_size=2

Or reduce fewshot examples:

--num_fewshot 0 # Instead of 5

Or evaluate subset of MMLU:

--tasks mmlu_stem # Only STEM subjects

Issue: Out of memory

Reduce batch size:

--batch_size 1 # Or --batch_size auto

Use quantization:

--model_args pretrained=model-name,load_in_8bit=True

Enable CPU offloading:

--model_args pretrained=model-name,device_map=auto,offload_folder=offload

Issue: Different results than reported

Check fewshot count:

--num_fewshot 5 # Most papers use 5-shot

Check exact task name:

--tasks mmlu # Not mmlu_direct or mmlu_fewshot

Verify model and tokenizer match:

--model_args pretrained=model-name,tokenizer=same-model-name

Issue: HumanEval not executing code

Install execution dependencies:

pip install human-eval

Enable code execution:

lm_eval --model hf
--model_args pretrained=model-name
--tasks humaneval
--allow_code_execution # Required for HumanEval

Advanced topics

Benchmark descriptions: See references/benchmark-guide.md for detailed description of all 60+ tasks, what they measure, and interpretation.

Custom tasks: See references/custom-tasks.md for creating domain-specific evaluation tasks.

API evaluation: See references/api-evaluation.md for evaluating OpenAI, Anthropic, and other API models.

Multi-GPU strategies: See references/distributed-eval.md for data parallel and tensor parallel evaluation.

Hardware requirements

GPU: NVIDIA (CUDA 11.8+), works on CPU (very slow)
VRAM:
7B model: 16GB (bf16) or 8GB (8-bit)
13B model: 28GB (bf16) or 14GB (8-bit)
70B model: Requires multi-GPU or quantization
Time (7B model, single A100):
HellaSwag: 10 minutes
GSM8K: 5 minutes
MMLU (full): 2 hours
HumanEval: 20 minutes

Resources

GitHub: https://github.com/EleutherAI/lm-evaluation-harness
Docs: https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs
Task library: 60+ tasks including MMLU, GSM8K, HumanEval, TruthfulQA, HellaSwag, ARC, WinoGrande, etc.
Leaderboard: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard (uses this harness)

evaluating-llms-harness

Safety Notice

Copy this and send it to your AI assistant to learn

Full MMLU evaluation (57 subjects)

Multiple benchmarks at once

eval_checkpoint.sh

In training loop

Load all results

Plot

models.txt

eval_all_models.sh

Standard HF: ~2 hours for MMLU on 7B model

vLLM: ~15-20 minutes for MMLU on 7B model

Source Transparency

Related Skills

senior-data-scientist

senior-backend

senior-frontend

ui-ux-pro-max