evaluating-llms-harness

lm-evaluation-harness - LLM Benchmarking

Quick start

lm-evaluation-harness evaluates LLMs across 60+ academic benchmarks using standardized prompts and metrics.

Installation:

pip install lm-eval

Evaluate any HuggingFace model:

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks mmlu,gsm8k,hellaswag \
  --device cuda:0 \
  --batch_size 8

View available tasks:

lm_eval --tasks list

Common workflows

Workflow 1: Standard benchmark evaluation

Evaluate a model on core benchmarks (MMLU, GSM8K, HumanEval).

Copy this checklist:

Benchmark Evaluation:

  • Step 1: Choose benchmark suite
  • Step 2: Configure model
  • Step 3: Run evaluation
  • Step 4: Analyze results

Step 1: Choose benchmark suite

Core reasoning benchmarks:

  • MMLU (Massive Multitask Language Understanding) - 57 subjects, multiple choice

  • GSM8K - Grade school math word problems

  • HellaSwag - Common sense reasoning

  • TruthfulQA - Truthfulness and factuality

  • ARC (AI2 Reasoning Challenge) - Science questions

Code benchmarks:

  • HumanEval - Python code generation (164 problems)

  • MBPP (Mostly Basic Python Problems) - Python coding

Standard suite (recommended for model releases):

--tasks mmlu,gsm8k,hellaswag,truthfulqa,arc_challenge

Step 2: Configure model

HuggingFace model:

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf,dtype=bfloat16 \
  --tasks mmlu \
  --device cuda:0 \
  --batch_size auto  # Auto-detect optimal batch size

Quantized model (4-bit/8-bit):

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf,load_in_4bit=True \
  --tasks mmlu \
  --device cuda:0

Custom checkpoint:

lm_eval --model hf \
  --model_args pretrained=/path/to/my-model,tokenizer=/path/to/tokenizer \
  --tasks mmlu \
  --device cuda:0
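The three configurations above differ only in the comma-separated `--model_args` string. As a minimal sketch, a script can assemble that string and the rest of the command from a dictionary. The helper names `format_model_args` and `build_lm_eval_command` are hypothetical, not part of lm-eval, and only flags already shown in this guide are used.

```python
import shlex


def format_model_args(args: dict) -> str:
    """Join key/value pairs into the comma-separated form used by --model_args."""
    return ",".join(f"{key}={value}" for key, value in args.items())


def build_lm_eval_command(model_args, tasks, backend="hf", device=None,
                          batch_size="auto", num_fewshot=None, output_path=None):
    """Build an argv list for lm_eval (hypothetical helper, flags from this guide)."""
    cmd = [
        "lm_eval",
        "--model", backend,
        "--model_args", format_model_args(model_args),
        "--tasks", ",".join(tasks),
    ]
    if device is not None:
        cmd += ["--device", device]
    cmd += ["--batch_size", str(batch_size)]
    if num_fewshot is not None:
        cmd += ["--num_fewshot", str(num_fewshot)]
    if output_path is not None:
        cmd += ["--output_path", output_path]
    return cmd


command = build_lm_eval_command(
    {"pretrained": "meta-llama/Llama-2-7b-hf", "dtype": "bfloat16"},
    tasks=["mmlu"],
    device="cuda:0",
)
# Print a copy-pasteable shell line; pass `command` to subprocess.run to execute it.
print(shlex.join(command))
```

Building an argv list rather than a single shell string avoids quoting problems when a checkpoint path contains spaces.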

Step 3: Run evaluation

# Full MMLU evaluation (57 subjects)
# --num_fewshot 5: 5-shot evaluation (standard)
# --log_samples: save individual predictions
lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks mmlu \
  --num_fewshot 5 \
  --batch_size 8 \
  --output_path results/ \
  --log_samples

# Multiple benchmarks at once
lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks mmlu,gsm8k,hellaswag,truthfulqa,arc_challenge \
  --num_fewshot 5 \
  --batch_size 8 \
  --output_path results/llama2-7b-eval.json

Step 4: Analyze results

Results are saved to results/llama2-7b-eval.json:

{
  "results": {
    "mmlu": { "acc": 0.459, "acc_stderr": 0.004 },
    "gsm8k": { "exact_match": 0.142, "exact_match_stderr": 0.006 },
    "hellaswag": { "acc_norm": 0.765, "acc_norm_stderr": 0.004 }
  },
  "config": {
    "model": "hf",
    "model_args": "pretrained=meta-llama/Llama-2-7b-hf",
    "num_fewshot": 5
  }
}
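Each score comes with a standard error, which can be turned into an approximate 95% confidence interval (score ± 1.96 × stderr). The sketch below does this for the sample results shown above. The helper names are hypothetical, and the key matching is an assumption: it also accepts keys with a filter suffix such as `acc,none`, which some lm-eval versions emit.

```python
import json

# Sample values copied from the results shown above.
SAMPLE_RESULTS = json.loads("""
{
  "results": {
    "mmlu": {"acc": 0.459, "acc_stderr": 0.004},
    "gsm8k": {"exact_match": 0.142, "exact_match_stderr": 0.006},
    "hellaswag": {"acc_norm": 0.765, "acc_norm_stderr": 0.004}
  }
}
""")


def find_metric(metrics, name):
    """Return the value stored under `name`, ignoring any ',filter' suffix on the key."""
    for key, value in metrics.items():
        if key.split(",")[0] == name:
            return value
    return None


def summarize(results, preferred=("acc_norm", "exact_match", "acc"), z=1.96):
    """Map each task to (metric name, score, CI low, CI high) for its first available metric."""
    summary = {}
    for task, metrics in results["results"].items():
        for name in preferred:
            value = find_metric(metrics, name)
            stderr = find_metric(metrics, f"{name}_stderr")
            if value is not None and stderr is not None:
                summary[task] = (name, value, value - z * stderr, value + z * stderr)
                break
    return summary


for task, (name, value, low, high) in summarize(SAMPLE_RESULTS).items():
    print(f"{task:10s} {name:12s} {value:.3f}  95% CI [{low:.3f}, {high:.3f}]")
```

The interval uses a normal approximation, which is reasonable for benchmarks with thousands of items but loose for very small test sets.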

Workflow 2: Track training progress

Evaluate checkpoints during training.

Training Progress Tracking:

  • Step 1: Set up periodic evaluation
  • Step 2: Choose quick benchmarks
  • Step 3: Automate evaluation
  • Step 4: Plot learning curves

Step 1: Set up periodic evaluation

Evaluate every N training steps:

#!/bin/bash
# eval_checkpoint.sh
# Usage: ./eval_checkpoint.sh CHECKPOINT_DIR STEP

CHECKPOINT_DIR=$1
STEP=$2

# 0-shot for speed
lm_eval --model hf \
  --model_args pretrained="$CHECKPOINT_DIR/checkpoint-$STEP" \
  --tasks gsm8k,hellaswag \
  --num_fewshot 0 \
  --batch_size 16 \
  --output_path "results/step-$STEP.json"

Step 2: Choose quick benchmarks

Fast benchmarks for frequent evaluation:

  • HellaSwag: ~10 minutes on 1 GPU

  • GSM8K: ~5 minutes

  • PIQA: ~2 minutes

Avoid for frequent eval (too slow):

  • MMLU: ~2 hours (57 subjects)

  • HumanEval: Requires code execution
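The per-benchmark timings above add up quickly when evaluation runs at every checkpoint. A minimal sketch of the arithmetic follows, using the rough timings from this guide. The 10,000-step run with an evaluation every 1,000 steps is a hypothetical example, and the function name is made up for illustration.

```python
# Approximate minutes per evaluation, taken from the estimates in this guide.
EVAL_MINUTES = {"hellaswag": 10, "gsm8k": 5, "piqa": 2, "mmlu": 120}


def eval_overhead_minutes(tasks, total_steps, eval_interval):
    """Total evaluation time if every task runs sequentially at each interval."""
    num_evals = total_steps // eval_interval
    return num_evals * sum(EVAL_MINUTES[task] for task in tasks)


# Hypothetical run: 10,000 training steps, evaluating every 1,000 steps.
quick = eval_overhead_minutes(["gsm8k", "hellaswag"], 10_000, 1_000)
with_mmlu = eval_overhead_minutes(["gsm8k", "hellaswag", "mmlu"], 10_000, 1_000)
print(f"quick suite: {quick} min, with MMLU: {with_mmlu} min")
```

Under these assumptions the quick suite costs 150 minutes over the whole run, while adding MMLU raises that to 1,350 minutes, which is why MMLU is better reserved for final or milestone checkpoints.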

Step 3: Automate evaluation

Integrate with training script:

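# In training loop
import subprocess

if step % eval_interval == 0:
    # eval_checkpoint.sh loads <dir>/checkpoint-<step>, so save under that name
    model.save_pretrained(f"checkpoints/checkpoint-{step}")

    # Run evaluation (arguments: checkpoint directory, step number)
    subprocess.run(["./eval_checkpoint.sh", "checkpoints", str(step)], check=True)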

Or use PyTorch Lightning callbacks:

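import os

from pytorch_lightning import Callback

class EvalHarnessCallback(Callback):
    def on_validation_epoch_end(self, trainer, pl_module):
        step = trainer.global_step
        checkpoint_path = f"checkpoints/step-{step}"

        # Save checkpoint in Hugging Face format.
        # Assumption: pl_module.model is a Hugging Face PreTrainedModel.
        # trainer.save_checkpoint() writes a Lightning .ckpt file, which
        # --model hf cannot load through pretrained=.
        pl_module.model.save_pretrained(checkpoint_path)

        # Run lm-eval
        os.system(f"lm_eval --model hf --model_args pretrained={checkpoint_path} ...")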

Step 4: Plot learning curves

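import glob
import json
import re

import matplotlib.pyplot as plt

def step_of(path):
    # Extract the numeric step from a name like results/step-1000.json
    return int(re.search(r"step-(\d+)", path).group(1))

# Load all results in numeric step order (a plain sort puts step-1000 before step-200)
steps = []
scores = []

for file in sorted(glob.glob("results/step-*.json"), key=step_of):
    with open(file) as f:
        data = json.load(f)
    steps.append(step_of(file))
    # eval_checkpoint.sh runs gsm8k and hellaswag, so plot HellaSwag rather than MMLU
    scores.append(data["results"]["hellaswag"]["acc_norm"])

# Plot
plt.plot(steps, scores)
plt.xlabel("Training Step")
plt.ylabel("HellaSwag Accuracy (normalized)")
plt.title("Training Progress")
plt.savefig("training_curve.png")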

Workflow 3: Compare multiple models

Benchmark suite for model comparison.

Model Comparison:

  • Step 1: Define model list
  • Step 2: Run evaluations
  • Step 3: Generate comparison table

Step 1: Define model list

models.txt

meta-llama/Llama-2-7b-hf
meta-llama/Llama-2-13b-hf
mistralai/Mistral-7B-v0.1
microsoft/phi-2

Step 2: Run evaluations

#!/bin/bash
# eval_all_models.sh

TASKS="mmlu,gsm8k,hellaswag,truthfulqa"

while read -r model; do
  echo "Evaluating $model"

  # Extract model name for output file
  model_name=$(echo "$model" | sed 's/\//-/g')

  lm_eval --model hf \
    --model_args pretrained="$model",dtype=bfloat16 \
    --tasks "$TASKS" \
    --num_fewshot 5 \
    --batch_size auto \
    --output_path "results/$model_name.json"
done < models.txt

Step 3: Generate comparison table

import json

import pandas as pd  # DataFrame.to_markdown also requires the tabulate package

# Original model IDs; the result file name replaces "/" with "-"
models = [
    "meta-llama/Llama-2-7b-hf",
    "meta-llama/Llama-2-13b-hf",
    "mistralai/Mistral-7B-v0.1",
    "microsoft/phi-2",
]

tasks = ["mmlu", "gsm8k", "hellaswag", "truthfulqa"]

results = []
for model in models:
    with open(f"results/{model.replace('/', '-')}.json") as f:
        data = json.load(f)
    row = {"Model": model}
    for task in tasks:
        # Get primary metric for each task (first one present)
        metrics = data["results"][task]
        for name in ("acc_norm", "acc", "exact_match"):
            if name in metrics:
                row[task.upper()] = f"{metrics[name]:.3f}"
                break
    results.append(row)

df = pd.DataFrame(results)
print(df.to_markdown(index=False))

Output:

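| Model | MMLU | GSM8K | HELLASWAG | TRUTHFULQA |
|---|---|---|---|---|
| meta-llama/Llama-2-7b | 0.459 | 0.142 | 0.765 | 0.391 |
| meta-llama/Llama-2-13b | 0.549 | 0.287 | 0.801 | 0.430 |
| mistralai/Mistral-7B | 0.626 | 0.395 | 0.812 | 0.428 |
| microsoft/phi-2 | 0.560 | 0.613 | 0.682 | 0.447 |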
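Small gaps in a comparison table may fall within noise. One rough check, sketched below, divides the score difference by the combined standard error and treats |z| above 1.96 as significant at about the 95% level. The scores come from the table above. The standard errors of 0.004 are assumptions for illustration (only the Llama-2-7b values appear in this guide), and `z_score` is a hypothetical helper.

```python
import math


def z_score(score_a, stderr_a, score_b, stderr_b):
    """Difference between two scores in units of their combined standard error."""
    return (score_b - score_a) / math.sqrt(stderr_a ** 2 + stderr_b ** 2)


ASSUMED_STDERR = 0.004  # assumption: same stderr for every model, illustration only

# MMLU: Llama-2-7b (0.459) vs Llama-2-13b (0.549)
z_mmlu = z_score(0.459, ASSUMED_STDERR, 0.549, ASSUMED_STDERR)

# HellaSwag: Llama-2-13b (0.801) vs Mistral-7B (0.812)
z_hellaswag = z_score(0.801, ASSUMED_STDERR, 0.812, ASSUMED_STDERR)

for label, z in [("MMLU 7b vs 13b", z_mmlu), ("HellaSwag 13b vs Mistral", z_hellaswag)]:
    verdict = "significant" if abs(z) > 1.96 else "within noise"
    print(f"{label}: z = {z:.2f} ({verdict})")
```

This treats the two runs as independent samples. Because both models answer the same questions, a paired test on the per-sample logs (saved with `--log_samples`) would be the more precise comparison.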

Workflow 4: Evaluate with vLLM (faster inference)

Use the vLLM backend for roughly 5-10× faster evaluation.

vLLM Evaluation:

  • Step 1: Install vLLM
  • Step 2: Configure vLLM backend
  • Step 3: Run evaluation

Step 1: Install vLLM

pip install vllm

Step 2: Configure vLLM backend

lm_eval --model vllm \
  --model_args pretrained=meta-llama/Llama-2-7b-hf,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.8 \
  --tasks mmlu \
  --batch_size auto

Step 3: Run evaluation

vLLM is typically 5-10× faster than the standard HuggingFace backend. The timings below are rough estimates, and the vLLM example also uses two GPUs (tensor_parallel_size=2):

# Standard HF: ~2 hours for MMLU on 7B model
lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks mmlu \
  --batch_size 8

# vLLM: ~15-20 minutes for MMLU on 7B model
lm_eval --model vllm \
  --model_args pretrained=meta-llama/Llama-2-7b-hf,tensor_parallel_size=2 \
  --tasks mmlu \
  --batch_size auto

When to use vs alternatives

Use lm-evaluation-harness when:

  • Benchmarking models for academic papers

  • Comparing model quality across standard tasks

  • Tracking training progress

  • Reporting standardized metrics (everyone uses same prompts)

  • Running evaluations that must be reproducible

Use alternatives instead:

  • HELM (Stanford): Broader evaluation (fairness, efficiency, calibration)

  • AlpacaEval: Instruction-following evaluation with LLM judges

  • MT-Bench: Conversational multi-turn evaluation

  • Custom scripts: Domain-specific evaluation

Common issues

Issue: Evaluation too slow

Use vLLM backend:

lm_eval --model vllm \
  --model_args pretrained=model-name,tensor_parallel_size=2

Or reduce fewshot examples:

--num_fewshot 0 # Instead of 5

Or evaluate subset of MMLU:

--tasks mmlu_stem # Only STEM subjects

Issue: Out of memory

Reduce batch size:

--batch_size 1 # Or --batch_size auto

Use quantization:

--model_args pretrained=model-name,load_in_8bit=True

Enable CPU offloading:

--model_args pretrained=model-name,device_map=auto,offload_folder=offload

Issue: Different results than reported

Check fewshot count:

--num_fewshot 5 # Most papers use 5-shot

Check exact task name:

--tasks mmlu # Not mmlu_direct or mmlu_fewshot

Verify model and tokenizer match:

--model_args pretrained=model-name,tokenizer=same-model-name
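These checks can be automated by comparing the `config` block of a results file against the settings a paper reports. The sketch below assumes the results layout shown in the Workflow 1 sample. `config_mismatches` is a hypothetical helper, and real result files may store some settings elsewhere depending on the lm-eval version.

```python
import json

# Config block copied from the sample results in Workflow 1.
SAMPLE = json.loads("""
{
  "config": {
    "model": "hf",
    "model_args": "pretrained=meta-llama/Llama-2-7b-hf",
    "num_fewshot": 5
  }
}
""")


def config_mismatches(results, expected):
    """Return {setting: (expected, actual)} for every setting that differs."""
    config = results.get("config", {})
    return {
        key: (want, config.get(key))
        for key, want in expected.items()
        if config.get(key) != want
    }


# Matches the reported setup: no differences.
print(config_mismatches(SAMPLE, {"num_fewshot": 5}))

# A 0-shot expectation is flagged against the 5-shot run.
print(config_mismatches(SAMPLE, {"num_fewshot": 0, "model": "hf"}))
```

An empty dictionary means the run matches the expected settings; any entry points at a setting worth aligning before comparing scores.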

Issue: HumanEval not executing code

Install execution dependencies:

pip install human-eval

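Enable code execution. Flag names vary by version, so confirm with lm_eval --help: recent lm-eval releases appear to use --confirm_run_unsafe_code together with the HF_ALLOW_CODE_EVAL=1 environment variable, while --allow_code_execution is the flag used by bigcode-evaluation-harness:

HF_ALLOW_CODE_EVAL=1 lm_eval --model hf \
  --model_args pretrained=model-name \
  --tasks humaneval \
  --confirm_run_unsafe_code  # Required for tasks that execute generated code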

Advanced topics

Benchmark descriptions: See references/benchmark-guide.md for detailed description of all 60+ tasks, what they measure, and interpretation.

Custom tasks: See references/custom-tasks.md for creating domain-specific evaluation tasks.

API evaluation: See references/api-evaluation.md for evaluating OpenAI, Anthropic, and other API models.

Multi-GPU strategies: See references/distributed-eval.md for data parallel and tensor parallel evaluation.

Hardware requirements

  • GPU: NVIDIA (CUDA 11.8+); CPU also works but is very slow

  • VRAM:

      • 7B model: 16GB (bf16) or 8GB (8-bit)

      • 13B model: 28GB (bf16) or 14GB (8-bit)

      • 70B model: Requires multi-GPU or quantization

  • Time (7B model, single A100):

      • HellaSwag: 10 minutes

      • GSM8K: 5 minutes

      • MMLU (full): 2 hours

      • HumanEval: 20 minutes
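The VRAM figures above follow from parameter count times bytes per parameter, plus headroom for activations and the KV cache. A minimal sketch of the weight-only part of that arithmetic is below. The function name is hypothetical, and the result is a lower bound, not a measured requirement.

```python
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_total = params_billions * 1e9 * bits_per_param / 8
    return bytes_total / 1e9


for size in (7, 13, 70):
    bf16 = weight_memory_gb(size, 16)
    int8 = weight_memory_gb(size, 8)
    int4 = weight_memory_gb(size, 4)
    print(f"{size}B: bf16 {bf16:.0f} GB, 8-bit {int8:.0f} GB, 4-bit {int4:.0f} GB")
```

A 7B model needs about 14 GB for bf16 weights, which matches the 16GB recommendation once runtime overhead is included. A 70B model needs about 140 GB in bf16, which is why it requires multiple GPUs or quantization.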
