ais-bench

AISBench Benchmark is an AI model evaluation tool for Ascend NPU. It supports accuracy evaluation of service-deployed and local models on text and multimodal datasets; performance evaluation (latency, throughput, stress testing, steady-state measurement, and real traffic simulation); vLLM and Triton inference services; 15+ benchmarks (MMLU, GSM8K, MMMU, docvqa, ocrbench_v2, and more); multi-turn dialogue; Function Call evaluation (BFCL); and custom datasets.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "ais-bench" with this command: npx skills add ascend-ai-coding/awesome-ascend-skills/ascend-ai-coding-awesome-ascend-skills-ais-bench

AISBench Benchmark Tool

AISBench Benchmark is a model evaluation tool built on OpenCompass. It supports both accuracy and performance evaluation of AI models on Ascend NPU.

Overview

  • Accuracy Evaluation: Accuracy verification of service-deployed models and local models on various QA and reasoning benchmark datasets, covering text, multimodal, and other scenarios.
  • Performance Evaluation: Latency and throughput evaluation of service-deployed models, extreme performance testing under stress test scenarios, steady-state performance evaluation, and real business traffic simulation.

Supported Scenarios

| Scenario | Description |
| --- | --- |
| Accuracy Evaluation | Model accuracy on text/multimodal datasets |
| Performance Evaluation | Latency, throughput, stress testing |
| Steady-State Performance | Obtain true optimal system performance |
| Real Traffic Simulation | Simulate real business traffic patterns |
| Multi-turn Dialogue | Evaluate multi-turn conversation models |
| Function Call (BFCL) | Function calling capability evaluation |

Supported Benchmarks

  • Text: GSM8K, MMLU, Ceval, FewCLUE series, dapo_math, leval
  • Multimodal: docvqa, infovqa, ocrbench_v2, omnidocbench, mmmu, mmmu_pro, mmstar, videomme, textvqa, videobench, vocalsound
  • Multi-turn Dialogue: sharegpt, mtbench
  • Function Call: BFCL (Berkeley Function Calling Leaderboard)

Installation

Environment Requirements

Python Version: Only Python 3.10, 3.11, or 3.12 is supported.

# Create conda environment
conda create --name ais_bench python=3.10 -y
conda activate ais_bench

Install from Source

git clone https://github.com/AISBench/benchmark.git
cd benchmark/
pip3 install -e ./ --use-pep517

Verify installation:

ais_bench -h

Optional Dependencies

# For service-deployed model evaluation (vLLM, Triton, etc.)
pip3 install -r requirements/api.txt
pip3 install -r requirements/extra.txt

# For Huggingface multimodal / vLLM offline inference
pip3 install -r requirements/hf_vl_dependency.txt

# For BFCL Function Calling evaluation
pip3 install -r requirements/datasets/bfcl_dependencies.txt --no-deps

Quick Start

Basic Command Structure

ais_bench --models <model_task> --datasets <dataset_task> [--summarizer example]

  • --models: Specifies the model task configuration
  • --datasets: Specifies the dataset task configuration
  • --summarizer: Specifies the result presentation task (default: example)

Find Configuration Files

# Print the config file paths for the given model and dataset tasks
ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --search

Example: Service Model Accuracy Evaluation

  1. Start vLLM inference service (follow vLLM documentation)

  2. Prepare dataset:

    • Download GSM8K from opencompass
    • Extract to ais_bench/datasets/gsm8k/
  3. Modify model configuration (vllm_api_general_chat.py):

    from ais_bench.benchmark.models import VLLMCustomAPIChat
    
    models = [
        dict(
            attr="service",                # evaluate a service-deployed model
            type=VLLMCustomAPIChat,
            abbr='vllm-api-general-chat',  # name used in result tables
            path="",
            model="",                      # served model name exposed by the service
            stream=False,                  # set True for streaming responses
            request_rate=0,                # 0 = send requests without rate limiting
            retry=2,                       # retries per failed request
            api_key="",
            host_ip="localhost",           # inference service address
            host_port=8080,                # inference service port
            url="",
            max_out_len=512,               # maximum generated tokens per request
            batch_size=1,                  # concurrent requests
            trust_remote_code=False,
            generation_kwargs=dict(
                temperature=0.01,          # near-greedy decoding for reproducible runs
                ignore_eos=False,
            )
        )
    ]
    
  4. Run evaluation:

    ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt
    

Output Results

dataset                 version  metric   mode  vllm_api_general_chat
----------------------- -------- -------- ----- ----------------------
demo_gsm8k              401e4c   accuracy gen                   62.50

Model Task Types

Service-Deployed Models

| Model Type | Description |
| --- | --- |
| vllm_api_general_chat | General vLLM API chat model |
| vllm_api_function_call_chat | Function calling model (BFCL) |
| triton_api_* | Triton inference service |

Local Models

| Model Type | Description |
| --- | --- |
| hf_* | HuggingFace models |
| vllm_offline_* | vLLM offline inference |

Performance Evaluation

Key Metrics

| Metric | Description |
| --- | --- |
| TTFT | Time to First Token |
| TPOT | Time Per Output Token |
| Throughput | Tokens per second |
| Latency | Request latency (P50, P90, P99) |
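
These metrics can all be derived from per-request timing records. The sketch below shows the arithmetic under one common set of definitions (TPOT as the mean gap between output tokens after the first); the field names (`send_ts`, `first_token_ts`, etc.) are illustrative, not AISBench's actual schema:

```python
# Sketch: deriving TTFT, TPOT, latency percentiles, and throughput from
# per-request timing records. Field names here are illustrative only.

def compute_metrics(records):
    """records: dicts with send_ts, first_token_ts, end_ts, output_tokens."""
    ttfts = [r["first_token_ts"] - r["send_ts"] for r in records]
    # TPOT: mean gap between output tokens after the first one
    tpots = [
        (r["end_ts"] - r["first_token_ts"]) / max(r["output_tokens"] - 1, 1)
        for r in records
    ]
    latencies = sorted(r["end_ts"] - r["send_ts"] for r in records)

    def percentile(vals, p):
        # nearest-rank percentile on a sorted list
        return vals[min(round(p / 100 * (len(vals) - 1)), len(vals) - 1)]

    total_tokens = sum(r["output_tokens"] for r in records)
    wall = max(r["end_ts"] for r in records) - min(r["send_ts"] for r in records)
    return {
        "mean_ttft": sum(ttfts) / len(ttfts),
        "mean_tpot": sum(tpots) / len(tpots),
        "p50_latency": percentile(latencies, 50),
        "p99_latency": percentile(latencies, 99),
        "throughput_tok_s": total_tokens / wall,
    }

records = [
    {"send_ts": 0.0, "first_token_ts": 0.2, "end_ts": 1.2, "output_tokens": 11},
    {"send_ts": 0.1, "first_token_ts": 0.4, "end_ts": 1.6, "output_tokens": 13},
]
metrics = compute_metrics(records)
```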

Performance Test Example

ais_bench --models vllm_api_general_chat --datasets custom_performance \
    --mode performance --concurrency 100

Steady-State Performance

For obtaining true optimal system performance:

ais_bench --models vllm_api_general_chat --datasets sharegpt \
    --stable-stage --duration 300

Real Traffic Simulation

ais_bench --models vllm_api_general_chat --datasets custom \
    --rps-distribution rps_config.json

Multi-task Evaluation

Multiple Models

ais_bench --models model1 model2 model3 --datasets dataset1

Multiple Datasets

ais_bench --models model1 --datasets dataset1 dataset2 dataset3

Parallel Execution

ais_bench --models model1 model2 --datasets dataset1 dataset2 --parallel 4

Custom Datasets

Performance Custom Dataset

Create a JSONL file with custom requests:

{"input": "Your prompt here", "max_output_length": 512}
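
Such a file can be generated with a few lines of Python; the field names below follow the example line above (the prompts and output path are placeholders):

```python
import json
import os
import tempfile

# Write a performance-test dataset in the JSONL format shown above:
# one JSON object per line with an input prompt and a max output length.
prompts = [
    "Summarize the plot of Hamlet in two sentences.",
    "Explain what a mutex is to a beginner.",
]

path = os.path.join(tempfile.mkdtemp(), "custom_perf.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for p in prompts:
        f.write(json.dumps({"input": p, "max_output_length": 512}) + "\n")

# Read it back to sanity-check the format
with open(path, encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]
```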

Accuracy Custom Dataset

Refer to the Custom Dataset Guide in the upstream documentation.


Output Structure

outputs/default/20250628_151326/
├── configs/           # Combined configuration
├── logs/              # Execution logs
│   ├── eval/          # Evaluation logs
│   └── infer/         # Inference logs
├── predictions/       # Raw inference results
├── results/           # Calculated scores
└── summary/           # Final summaries
    ├── summary_*.csv
    ├── summary_*.md
    └── summary_*.txt
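
Because run directories are named `YYYYMMDD_HHMMSS`, they sort chronologically as plain strings. A small sketch for picking the most recent run, demonstrated on a throwaway directory tree (the real path under `outputs/` depends on your runs):

```python
import tempfile
from pathlib import Path

def latest_run(root):
    """Return the most recent timestamped run directory under root.

    Names like 20250628_151326 sort chronologically as plain strings.
    """
    runs = sorted(p for p in Path(root).iterdir() if p.is_dir())
    return runs[-1] if runs else None

# Demonstrate on a throwaway tree mimicking outputs/default/
root = Path(tempfile.mkdtemp())
for name in ("20250628_151326", "20250701_090000"):
    (root / name).mkdir()
run = latest_run(root)
```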

Task Management Interface

During execution, a real-time task management interface displays:

  • Task name and progress
  • Time cost and status
  • Log path
  • Extended parameters

Controls:

  • P key: Pause/Resume screen refresh
  • Ctrl+C: Exit

Common CLI Options

| Option | Description |
| --- | --- |
| --models | Model task name(s) |
| --datasets | Dataset task name(s) |
| --summarizer | Result summarizer |
| --search | List config file paths |
| --debug | Print detailed logs |
| --mode | Evaluation mode (accuracy/performance) |
| --parallel | Number of parallel tasks |
| --resume | Resume from breakpoint |
| --failed-only | Re-run failed cases only |

Advanced Features

Breakpoint Resume

ais_bench --models model1 --datasets dataset1 --resume outputs/default/20250628_151326

Failed Case Re-run

ais_bench --models model1 --datasets dataset1 --failed-only --resume outputs/default/20250628_151326

Multi-file Dataset Merge

For datasets like MMLU with multiple files:

ais_bench --models model1 --datasets mmlu_merged

Repeated Inference for pass@k

ais_bench --models model1 --datasets dataset1 --repeat-n 5
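
With n repeated samples per problem, of which c are correct, pass@k is commonly estimated with the unbiased estimator 1 − C(n−c, k)/C(n, k). This is the standard formula from the literature; the source does not specify AISBench's exact internal computation:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: n samples per problem, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 5 samples and 1 correct, pass@1 reduces to c/n = 0.2
estimate = pass_at_k(5, 1, 1)
```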

Troubleshooting

Installation Issues

  1. Python version mismatch: use Python 3.10, 3.11, or 3.12
  2. Dependency conflicts: use a dedicated conda environment
  3. bfcl_eval pathlib issue: install the BFCL dependencies with the --no-deps flag

Runtime Issues

  1. Model connection failed: Check host_ip, host_port, and service status
  2. Dataset not found: Download dataset to ais_bench/datasets/
  3. Memory issues: Reduce batch_size or use smaller dataset

Helper Scripts

Quick utility scripts for common operations:

| Script | Description |
| --- | --- |
| scripts/check_env.sh | Verify environment setup |
| scripts/run_accuracy_test.sh | Quick accuracy test runner |
| scripts/run_performance_test.sh | Quick performance test runner |
| scripts/parse_results.py | Parse and summarize results |

# Check environment
bash scripts/check_env.sh

# Quick accuracy test
bash scripts/run_accuracy_test.sh vllm_api_general_chat demo_gsm8k --host-port 8080

# Quick performance test
bash scripts/run_performance_test.sh vllm_api_general_chat sharegpt --concurrency 100

# Parse results
python scripts/parse_results.py outputs/default/20250628_151326
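
Since the summary files are plain comma-separated text, the core of a results parser like `parse_results.py` can be a few lines of stdlib Python. The column names below mirror the example output table earlier in this document; treat them as illustrative rather than a guaranteed schema:

```python
import csv
import io

def parse_summary(csv_text):
    """Parse an AISBench summary CSV into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(csv_text)))

# Sample mirroring the example output table shown earlier
sample = (
    "dataset,version,metric,mode,vllm_api_general_chat\n"
    "demo_gsm8k,401e4c,accuracy,gen,62.50\n"
)
rows = parse_summary(sample)
```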


Templates

Ready-to-use templates for custom evaluation:

| Template | Description |
| --- | --- |
| assets/model_config_template.py | Model configuration template |
| assets/custom_qa_template.jsonl | QA dataset template |
| assets/custom_mcq_template.csv | Multiple choice dataset template |
| assets/custom_meta_template.json | Dataset metadata template |

Repository SourceNeeds Review