ais-bench

AISBench Benchmark is an AI model evaluation tool for Ascend NPU. It supports accuracy evaluation of service-deployed and local models on text and multimodal datasets; performance evaluation (latency, throughput, stress testing, steady-state measurement, and real traffic simulation); vLLM and Triton inference services; 15+ benchmarks (MMLU, GSM8K, MMMU, docvqa, ocrbench_v2, and more); multi-turn dialogue; Function Call evaluation (BFCL); and custom datasets.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "ais-bench" with this command: npx skills add ascend-ai-coding/awesome-ascend-skills/ascend-ai-coding-awesome-ascend-skills-ais-bench

AISBench Benchmark Tool

AISBench Benchmark is a model evaluation tool built on OpenCompass. It supports both accuracy and performance evaluation of AI models on Ascend NPU.

Overview

  • Accuracy Evaluation: Accuracy verification of service-deployed models and local models on various QA and reasoning benchmark datasets, covering text, multimodal, and other scenarios.
  • Performance Evaluation: Latency and throughput evaluation of service-deployed models, extreme performance testing under stress test scenarios, steady-state performance evaluation, and real business traffic simulation.

Supported Scenarios

| Scenario | Description |
| --- | --- |
| Accuracy Evaluation | Model accuracy on text/multimodal datasets |
| Performance Evaluation | Latency, throughput, stress testing |
| Steady-State Performance | Obtain true optimal system performance |
| Real Traffic Simulation | Simulate real business traffic patterns |
| Multi-turn Dialogue | Evaluate multi-turn conversation models |
| Function Call (BFCL) | Function calling capability evaluation |

Supported Benchmarks

  • Text: GSM8K, MMLU, Ceval, FewCLUE series, dapo_math, leval
  • Multimodal: docvqa, infovqa, ocrbench_v2, omnidocbench, mmmu, mmmu_pro, mmstar, videomme, textvqa, videobench, vocalsound
  • Multi-turn Dialogue: sharegpt, mtbench
  • Function Call: BFCL (Berkeley Function Calling Leaderboard)

Installation

Environment Requirements

Python Version: Only Python 3.10, 3.11, or 3.12 is supported.

# Create conda environment
conda create --name ais_bench python=3.10 -y
conda activate ais_bench

Install from Source

git clone https://github.com/AISBench/benchmark.git
cd benchmark/
pip3 install -e ./ --use-pep517

Verify installation:

ais_bench -h

Optional Dependencies

# For service-deployed model evaluation (vLLM, Triton, etc.)
pip3 install -r requirements/api.txt
pip3 install -r requirements/extra.txt

# For Huggingface multimodal / vLLM offline inference
pip3 install -r requirements/hf_vl_dependency.txt

# For BFCL Function Calling evaluation
pip3 install -r requirements/datasets/bfcl_dependencies.txt --no-deps

Quick Start

Basic Command Structure

ais_bench --models <model_task> --datasets <dataset_task> [--summarizer example]

  • --models: Specifies the model task configuration
  • --datasets: Specifies the dataset task configuration
  • --summarizer: Specifies the result presentation task (default: example)

Find Configuration Files

# Print the config file paths for the given model and dataset tasks
ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --search

Example: Service Model Accuracy Evaluation

  1. Start vLLM inference service (follow vLLM documentation)

  2. Prepare dataset:

    • Download GSM8K from opencompass
    • Extract to ais_bench/datasets/gsm8k/
  3. Modify model configuration (vllm_api_general_chat.py):

    from ais_bench.benchmark.models import VLLMCustomAPIChat
    
    models = [
        dict(
            attr="service",                # evaluate a service-deployed model
            type=VLLMCustomAPIChat,
            abbr='vllm-api-general-chat',  # name used in result tables
            path="",
            model="",                      # served model name exposed by the service
            stream=False,                  # set True for streaming responses
            request_rate=0,                # 0 = send requests without rate limiting
            retry=2,                       # retries per failed request
            api_key="",
            host_ip="localhost",           # inference service address
            host_port=8080,                # inference service port
            url="",
            max_out_len=512,               # maximum generated tokens per request
            batch_size=1,                  # concurrent requests
            trust_remote_code=False,
            generation_kwargs=dict(
                temperature=0.01,          # near-greedy decoding for reproducible runs
                ignore_eos=False,
            )
        )
    ]
    
  4. Run evaluation:

    ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt
    

Output Results

dataset                 version  metric   mode  vllm_api_general_chat
----------------------- -------- -------- ----- ----------------------
demo_gsm8k              401e4c   accuracy gen                   62.50

Model Task Types

Service-Deployed Models

| Model Type | Description |
| --- | --- |
| vllm_api_general_chat | General vLLM API chat model |
| vllm_api_function_call_chat | Function calling model (BFCL) |
| triton_api_* | Triton inference service |

Local Models

| Model Type | Description |
| --- | --- |
| hf_* | HuggingFace models |
| vllm_offline_* | vLLM offline inference |

Performance Evaluation

Key Metrics

| Metric | Description |
| --- | --- |
| TTFT | Time to First Token |
| TPOT | Time Per Output Token |
| Throughput | Tokens per second |
| Latency | Request latency (P50, P90, P99) |
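
These metrics can all be derived from per-request timing records. The sketch below shows the arithmetic under one common set of definitions (TPOT as the mean gap between output tokens after the first); the field names (`send_ts`, `first_token_ts`, etc.) are illustrative, not AISBench's actual schema:

```python
# Sketch: deriving TTFT, TPOT, latency percentiles, and throughput from
# per-request timing records. Field names here are illustrative only.

def compute_metrics(records):
    """records: dicts with send_ts, first_token_ts, end_ts, output_tokens."""
    ttfts = [r["first_token_ts"] - r["send_ts"] for r in records]
    # TPOT: mean gap between output tokens after the first one
    tpots = [
        (r["end_ts"] - r["first_token_ts"]) / max(r["output_tokens"] - 1, 1)
        for r in records
    ]
    latencies = sorted(r["end_ts"] - r["send_ts"] for r in records)

    def percentile(vals, p):
        # nearest-rank percentile on a sorted list
        return vals[min(round(p / 100 * (len(vals) - 1)), len(vals) - 1)]

    total_tokens = sum(r["output_tokens"] for r in records)
    wall = max(r["end_ts"] for r in records) - min(r["send_ts"] for r in records)
    return {
        "mean_ttft": sum(ttfts) / len(ttfts),
        "mean_tpot": sum(tpots) / len(tpots),
        "p50_latency": percentile(latencies, 50),
        "p99_latency": percentile(latencies, 99),
        "throughput_tok_s": total_tokens / wall,
    }

records = [
    {"send_ts": 0.0, "first_token_ts": 0.2, "end_ts": 1.2, "output_tokens": 11},
    {"send_ts": 0.1, "first_token_ts": 0.4, "end_ts": 1.6, "output_tokens": 13},
]
metrics = compute_metrics(records)
```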

Performance Test Example

ais_bench --models vllm_api_general_chat --datasets custom_performance \
    --mode performance --concurrency 100

Steady-State Performance

For obtaining true optimal system performance:

ais_bench --models vllm_api_general_chat --datasets sharegpt \
    --stable-stage --duration 300

Real Traffic Simulation

ais_bench --models vllm_api_general_chat --datasets custom \
    --rps-distribution rps_config.json

Multi-task Evaluation

Multiple Models

ais_bench --models model1 model2 model3 --datasets dataset1

Multiple Datasets

ais_bench --models model1 --datasets dataset1 dataset2 dataset3

Parallel Execution

ais_bench --models model1 model2 --datasets dataset1 dataset2 --parallel 4

Custom Datasets

Performance Custom Dataset

Create a JSONL file with custom requests:

{"input": "Your prompt here", "max_output_length": 512}
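
Such a file can be generated with a few lines of Python; the field names below follow the example line above (the prompts and output path are placeholders):

```python
import json
import os
import tempfile

# Write a performance-test dataset in the JSONL format shown above:
# one JSON object per line with an input prompt and a max output length.
prompts = [
    "Summarize the plot of Hamlet in two sentences.",
    "Explain what a mutex is to a beginner.",
]

path = os.path.join(tempfile.mkdtemp(), "custom_perf.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for p in prompts:
        f.write(json.dumps({"input": p, "max_output_length": 512}) + "\n")

# Read it back to sanity-check the format
with open(path, encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]
```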

Accuracy Custom Dataset

Refer to the Custom Dataset Guide in the upstream documentation.


Output Structure

outputs/default/20250628_151326/
├── configs/           # Combined configuration
├── logs/              # Execution logs
│   ├── eval/          # Evaluation logs
│   └── infer/         # Inference logs
├── predictions/       # Raw inference results
├── results/           # Calculated scores
└── summary/           # Final summaries
    ├── summary_*.csv
    ├── summary_*.md
    └── summary_*.txt
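
Because run directories are named `YYYYMMDD_HHMMSS`, they sort chronologically as plain strings. A small sketch for picking the most recent run, demonstrated on a throwaway directory tree (the real path under `outputs/` depends on your runs):

```python
import tempfile
from pathlib import Path

def latest_run(root):
    """Return the most recent timestamped run directory under root.

    Names like 20250628_151326 sort chronologically as plain strings.
    """
    runs = sorted(p for p in Path(root).iterdir() if p.is_dir())
    return runs[-1] if runs else None

# Demonstrate on a throwaway tree mimicking outputs/default/
root = Path(tempfile.mkdtemp())
for name in ("20250628_151326", "20250701_090000"):
    (root / name).mkdir()
run = latest_run(root)
```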

Task Management Interface

During execution, a real-time task management interface displays:

  • Task name and progress
  • Time cost and status
  • Log path
  • Extended parameters

Controls:

  • P key: Pause/Resume screen refresh
  • Ctrl+C: Exit

Common CLI Options

| Option | Description |
| --- | --- |
| --models | Model task name(s) |
| --datasets | Dataset task name(s) |
| --summarizer | Result summarizer |
| --search | List config file paths |
| --debug | Print detailed logs |
| --mode | Evaluation mode (accuracy/performance) |
| --parallel | Number of parallel tasks |
| --resume | Resume from breakpoint |
| --failed-only | Re-run failed cases only |

Advanced Features

Breakpoint Resume

ais_bench --models model1 --datasets dataset1 --resume outputs/default/20250628_151326

Failed Case Re-run

ais_bench --models model1 --datasets dataset1 --failed-only --resume outputs/default/20250628_151326

Multi-file Dataset Merge

For datasets like MMLU with multiple files:

ais_bench --models model1 --datasets mmlu_merged

Repeated Inference for pass@k

ais_bench --models model1 --datasets dataset1 --repeat-n 5
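
With n repeated samples per problem, of which c are correct, pass@k is commonly estimated with the unbiased estimator 1 − C(n−c, k)/C(n, k). This is the standard formula from the literature; the source does not specify AISBench's exact internal computation:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: n samples per problem, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 5 samples and 1 correct, pass@1 reduces to c/n = 0.2
estimate = pass_at_k(5, 1, 1)
```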

Troubleshooting

Installation Issues

  1. Python version mismatch: use Python 3.10, 3.11, or 3.12
  2. Dependency conflicts: use a dedicated conda environment
  3. bfcl_eval pathlib issue: install the BFCL dependencies with the --no-deps flag

Runtime Issues

  1. Model connection failed: Check host_ip, host_port, and service status
  2. Dataset not found: Download dataset to ais_bench/datasets/
  3. Memory issues: Reduce batch_size or use smaller dataset

Helper Scripts

Quick utility scripts for common operations:

| Script | Description |
| --- | --- |
| scripts/check_env.sh | Verify environment setup |
| scripts/run_accuracy_test.sh | Quick accuracy test runner |
| scripts/run_performance_test.sh | Quick performance test runner |
| scripts/parse_results.py | Parse and summarize results |

# Check environment
bash scripts/check_env.sh

# Quick accuracy test
bash scripts/run_accuracy_test.sh vllm_api_general_chat demo_gsm8k --host-port 8080

# Quick performance test
bash scripts/run_performance_test.sh vllm_api_general_chat sharegpt --concurrency 100

# Parse results
python scripts/parse_results.py outputs/default/20250628_151326
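
Since the summary files are plain comma-separated text, the core of a results parser like `parse_results.py` can be a few lines of stdlib Python. The column names below mirror the example output table earlier in this document; treat them as illustrative rather than a guaranteed schema:

```python
import csv
import io

def parse_summary(csv_text):
    """Parse an AISBench summary CSV into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(csv_text)))

# Sample mirroring the example output table shown earlier
sample = (
    "dataset,version,metric,mode,vllm_api_general_chat\n"
    "demo_gsm8k,401e4c,accuracy,gen,62.50\n"
)
rows = parse_summary(sample)
```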


Templates

Ready-to-use templates for custom evaluation:

| Template | Description |
| --- | --- |
| assets/model_config_template.py | Model configuration template |
| assets/custom_qa_template.jsonl | QA dataset template |
| assets/custom_mcq_template.csv | Multiple choice dataset template |
| assets/custom_meta_template.json | Dataset metadata template |

Repository SourceNeeds Review