
BigCode Evaluation Harness - Code Model Benchmarking

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.


Install skill "evaluating-code-models" with this command: npx skills add orchestra-research/ai-research-skills/orchestra-research-ai-research-skills-evaluating-code-models


Quick Start

The BigCode Evaluation Harness evaluates code generation models on 15+ benchmarks, including HumanEval, MBPP, and MultiPL-E (18 languages).

Installation:

git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git
cd bigcode-evaluation-harness
pip install -e .
accelerate config

Evaluate on HumanEval:

accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks humaneval \
  --max_length_generation 512 \
  --temperature 0.2 \
  --n_samples 20 \
  --batch_size 10 \
  --allow_code_execution \
  --save_generations

View available tasks:

python -c "from bigcode_eval.tasks import ALL_TASKS; print(ALL_TASKS)"
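ALL_TASKS is a list of task-name strings, so it can be filtered programmatically, e.g. to list only the MultiPL-E variants (a small sketch; the import path is taken from the one-liner above):

from bigcode_eval.tasks import ALL_TASKS

# Keep only the MultiPL-E tasks, which are named multiple-<lang>.
print(sorted(t for t in ALL_TASKS if t.startswith("multiple-")))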

Common Workflows

Workflow 1: Standard Code Benchmark Evaluation

Evaluate a model on the core code benchmarks (HumanEval, MBPP, HumanEval+).

Checklist:

Code Benchmark Evaluation:

  • Step 1: Choose benchmark suite
  • Step 2: Configure model and generation
  • Step 3: Run evaluation with code execution
  • Step 4: Analyze pass@k results

Step 1: Choose benchmark suite

Python code generation (most common):

  • HumanEval: 164 handwritten problems, function completion

  • HumanEval+: Same 164 problems with 80× more tests (stricter)

  • MBPP: 500 crowd-sourced problems, entry-level difficulty

  • MBPP+: 399 curated problems with 35× more tests

Multi-language (18 languages):

  • MultiPL-E: HumanEval/MBPP translated to C++, Java, JavaScript, Go, Rust, etc.

Advanced:

  • APPS: 10,000 problems (introductory/interview/competition)

  • DS-1000: 1,000 data science problems across 7 libraries

Step 2: Configure model and generation

Standard HuggingFace model

accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks humaneval \
  --max_length_generation 512 \
  --temperature 0.2 \
  --do_sample True \
  --n_samples 200 \
  --batch_size 50 \
  --allow_code_execution

Quantized model (4-bit)

accelerate launch main.py \
  --model codellama/CodeLlama-34b-hf \
  --tasks humaneval \
  --load_in_4bit \
  --max_length_generation 512 \
  --allow_code_execution

Custom/private model

accelerate launch main.py \
  --model /path/to/my-code-model \
  --tasks humaneval \
  --trust_remote_code \
  --use_auth_token \
  --allow_code_execution

Step 3: Run evaluation

Full evaluation with pass@k estimation (k=1,10,100)

accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks humaneval \
  --temperature 0.8 \
  --n_samples 200 \
  --batch_size 50 \
  --allow_code_execution \
  --save_generations \
  --metric_output_path results/starcoder2-humaneval.json

Step 4: Analyze results

Results are written to results/starcoder2-humaneval.json:

{
  "humaneval": {
    "pass@1": 0.354,
    "pass@10": 0.521,
    "pass@100": 0.689
  },
  "config": {
    "model": "bigcode/starcoder2-7b",
    "temperature": 0.8,
    "n_samples": 200
  }
}

Workflow 2: Multi-Language Evaluation (MultiPL-E)

Evaluate code generation across 18 programming languages.

Checklist:

Multi-Language Evaluation:

  • Step 1: Generate solutions (host machine)
  • Step 2: Run evaluation in Docker (safe execution)
  • Step 3: Compare across languages

Step 1: Generate solutions on host

Generate without execution (safe)

accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks multiple-py,multiple-js,multiple-java,multiple-cpp \
  --max_length_generation 650 \
  --temperature 0.8 \
  --n_samples 50 \
  --batch_size 50 \
  --generation_only \
  --save_generations \
  --save_generations_path generations_multi.json

Step 2: Evaluate in Docker container

Pull the MultiPL-E Docker image

docker pull ghcr.io/bigcode-project/evaluation-harness-multiple

Run evaluation inside container

docker run -v "$(pwd)/generations_multi.json:/app/generations.json:ro" \
  -it ghcr.io/bigcode-project/evaluation-harness-multiple python3 main.py \
  --model bigcode/starcoder2-7b \
  --tasks multiple-py,multiple-js,multiple-java,multiple-cpp \
  --load_generations_path /app/generations.json \
  --allow_code_execution \
  --n_samples 50

Supported languages: Python, JavaScript, Java, C++, Go, Rust, TypeScript, C#, PHP, Ruby, Swift, Kotlin, Scala, Perl, Julia, Lua, R, Racket
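Step 3: Compare across languages

Each multiple-{lang} task reports its own pass@k in the metrics file. A minimal sketch that tabulates pass@1 per language, assuming the Docker run above wrote its metrics to results/multipl_e.json keyed by task name (the path and key layout are assumptions; adjust to your --metric_output_path):

import json

# Assumed output location; set via --metric_output_path in the Docker run.
with open("results/multipl_e.json") as f:
    data = json.load(f)

for task in ["multiple-py", "multiple-js", "multiple-java", "multiple-cpp"]:
    score = data.get(task, {}).get("pass@1")
    print(f"{task:15s} pass@1 = {score if score is not None else 'missing'}")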

Workflow 3: Instruction-Tuned Model Evaluation

Evaluate chat/instruction models with proper formatting.

Checklist:

Instruction Model Evaluation:

  • Step 1: Use instruction-tuned tasks
  • Step 2: Configure instruction tokens
  • Step 3: Run evaluation

Step 1: Choose instruction tasks

  • instruct-humaneval: HumanEval with instruction prompts

  • humanevalsynthesize-{lang}: HumanEvalPack synthesis tasks

Step 2: Configure instruction tokens

For models with chat templates (e.g., CodeLlama-Instruct)

accelerate launch main.py \
  --model codellama/CodeLlama-7b-Instruct-hf \
  --tasks instruct-humaneval \
  --instruction_tokens "<s>[INST],</s>,[/INST]" \
  --max_length_generation 512 \
  --allow_code_execution
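The three comma-separated values are the user, end, and assistant tokens, which the harness uses to wrap each instruction before generation. A rough illustration of the resulting prompt (a sketch of the idea only; the exact concatenation lives in the harness source):

# Sketch: how comma-separated --instruction_tokens wrap an instruction.
# Token roles (user, end, assistant) follow the harness docs; the
# concatenation below is illustrative, not the harness's own code.
user_tok, end_tok, assistant_tok = "<s>[INST],</s>,[/INST]".split(",")

instruction = "Write a Python function that reverses a string."
prompt = f"{user_tok}{instruction}{end_tok}{assistant_tok}"
print(prompt)
# <s>[INST]Write a Python function that reverses a string.</s>[/INST]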

Step 3: HumanEvalPack for instruction models

Test code synthesis across 6 languages

accelerate launch main.py \
  --model codellama/CodeLlama-7b-Instruct-hf \
  --tasks humanevalsynthesize-python,humanevalsynthesize-js \
  --prompt instruct \
  --max_length_generation 512 \
  --allow_code_execution

Workflow 4: Compare Multiple Models

Benchmark suite for model comparison.

Step 1: Create evaluation script

#!/bin/bash
# eval_models.sh

MODELS=(
  "bigcode/starcoder2-7b"
  "codellama/CodeLlama-7b-hf"
  "deepseek-ai/deepseek-coder-6.7b-base"
)
TASKS="humaneval,mbpp"

for model in "${MODELS[@]}"; do
  model_name=$(echo "$model" | tr '/' '-')
  echo "Evaluating $model"

  accelerate launch main.py \
    --model "$model" \
    --tasks "$TASKS" \
    --temperature 0.2 \
    --n_samples 20 \
    --batch_size 20 \
    --allow_code_execution \
    --metric_output_path "results/${model_name}.json"
done

Step 2: Generate comparison table

import json

import pandas as pd

models = [
    "bigcode-starcoder2-7b",
    "codellama-CodeLlama-7b-hf",
    "deepseek-ai-deepseek-coder-6.7b-base",
]
results = []

for model in models:
    with open(f"results/{model}.json") as f:
        data = json.load(f)
    results.append({
        "Model": model,
        "HumanEval pass@1": f"{data['humaneval']['pass@1']:.3f}",
        "MBPP pass@1": f"{data['mbpp']['pass@1']:.3f}",
    })

df = pd.DataFrame(results)
print(df.to_markdown(index=False))

When to Use vs Alternatives

Use BigCode Evaluation Harness when:

  • Evaluating code generation models specifically

  • Need multi-language evaluation (18 languages via MultiPL-E)

  • Testing functional correctness with unit tests (pass@k)

  • Benchmarking for BigCode/HuggingFace leaderboards

  • Evaluating fill-in-the-middle (FIM) capabilities

Use alternatives instead:

  • lm-evaluation-harness: General LLM benchmarks (MMLU, GSM8K, HellaSwag)

  • EvalPlus: Stricter HumanEval+/MBPP+ with more test cases

  • SWE-bench: Real-world GitHub issue resolution

  • LiveCodeBench: Contamination-free, continuously updated problems

  • CodeXGLUE: Code understanding tasks (clone detection, defect prediction)

Supported Benchmarks

| Benchmark | Problems | Languages | Metric | Use Case |
|---|---|---|---|---|
| HumanEval | 164 | Python | pass@k | Standard code completion |
| HumanEval+ | 164 | Python | pass@k | Stricter evaluation (80× tests) |
| MBPP | 500 | Python | pass@k | Entry-level problems |
| MBPP+ | 399 | Python | pass@k | Stricter evaluation (35× tests) |
| MultiPL-E | 164×18 | 18 languages | pass@k | Multi-language evaluation |
| APPS | 10,000 | Python | pass@k | Competition-level |
| DS-1000 | 1,000 | Python | pass@k | Data science (pandas, numpy, etc.) |
| HumanEvalPack | 164×3×6 | 6 languages | pass@k | Synthesis/fix/explain |
| Mercury | 1,889 | Python | Efficiency | Computational efficiency |

Common Issues

Issue: Different results than reported in papers

Check these factors:

1. Verify n_samples (need 200 for accurate pass@k)

--n_samples 200

2. Check temperature (papers typically use 0.2 for pass@1 and 0.8 for pass@10/100)

--temperature 0.8

3. Verify task name matches exactly

--tasks humaneval # Not "human_eval" or "HumanEval"

4. Check max_length_generation

--max_length_generation 512 # Increase for longer problems

Issue: CUDA out of memory

Use quantization

--load_in_8bit

OR

--load_in_4bit

Reduce batch size

--batch_size 1

Set memory limit

--max_memory_per_gpu "20GiB"

Issue: Code execution hangs or times out

Use Docker for safe execution:

Generate on host (no execution)

--generation_only --save_generations

Evaluate in Docker

docker run ... --allow_code_execution --load_generations_path ...

Issue: Low scores on instruction models

Ensure proper instruction formatting:

Use instruction-specific tasks

--tasks instruct-humaneval

Set instruction tokens for your model

--instruction_tokens "<s>[INST],</s>,[/INST]"

Issue: MultiPL-E language failures

Use the dedicated Docker image:

docker pull ghcr.io/bigcode-project/evaluation-harness-multiple

Command Reference

| Argument | Default | Description |
|---|---|---|
| --model | – | HuggingFace model ID or local path |
| --tasks | – | Comma-separated task names |
| --n_samples | 1 | Samples per problem (200 for pass@k) |
| --temperature | 0.2 | Sampling temperature |
| --max_length_generation | 512 | Max tokens (prompt + generation) |
| --batch_size | 1 | Batch size per GPU |
| --allow_code_execution | False | Enable code execution (required) |
| --generation_only | False | Generate without evaluation |
| --load_generations_path | – | Load pre-generated solutions |
| --save_generations | False | Save generated code |
| --metric_output_path | results.json | Output file for metrics |
| --load_in_8bit | False | 8-bit quantization |
| --load_in_4bit | False | 4-bit quantization |
| --trust_remote_code | False | Allow custom model code |
| --precision | fp32 | Model precision (fp32/fp16/bf16) |

Hardware Requirements

| Model Size | VRAM (fp16) | VRAM (4-bit) | Time (HumanEval, n=200) |
|---|---|---|---|
| 7B | 14 GB | 6 GB | ~30 min (A100) |
| 13B | 26 GB | 10 GB | ~1 hour (A100) |
| 34B | 68 GB | 20 GB | ~2 hours (A100) |
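The fp16 column follows from roughly 2 bytes per weight (4-bit is ~0.5 bytes plus quantization overhead); the KV cache and activations add more on top. A back-of-envelope sketch:

# Weights-only VRAM estimate: billions of params × bytes per param ≈ GB.
# Real usage is higher (KV cache, activations, CUDA context).
def weight_vram_gb(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * bytes_per_param

print(weight_vram_gb(7, 2.0))   # fp16 7B   -> ~14 GB
print(weight_vram_gb(34, 2.0))  # fp16 34B  -> ~68 GB
print(weight_vram_gb(34, 0.5))  # 4-bit 34B -> ~17 GB (+ overhead ≈ table's 20 GB)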

Resources

  • Repository: https://github.com/bigcode-project/bigcode-evaluation-harness

  • MultiPL-E Docker image: ghcr.io/bigcode-project/evaluation-harness-multiple
