aiter-ck-gemm-tune

Tune AITER's CK GEMM and fused MoE kernels for specific model shapes on AMD GPUs. Covers shape discovery from inference logs, baseline benchmarking, kernel tuning, and before/after performance comparison.

Install skill "aiter-ck-gemm-tune" with this command: npx skills add alexsun07/aiter-ck-gemm-tune

AITER CK GEMM & MoE Tune

A skill for tuning AITER's Composable Kernel (CK) GEMM and fused MoE kernels to achieve better performance for specific model shapes. The tuning workflow is a multi-step process: discover the environment, capture shapes, run baseline benchmarks, tune kernels, and compare results. The workflow supports both regular GEMM variants (a8w8, bf16, etc.) and the moe_2stages variant for fused MoE kernels used in Mixture-of-Experts models.

Background

AITER (AI Tensor Engine for ROCm) is AMD's high-performance operator library for LLM inference on ROCm/AMD GPUs. It provides optimized kernels for common operations in transformer models — most critically, GEMM (General Matrix Multiply), which dominates the compute in LLM inference (linear projections, attention, MLP/FFN layers, MoE expert computations).

Composable Kernel (CK) is AMD's open-source library of GPU kernel primitives. CK provides templated, composable building blocks for writing high-performance GPU kernels. AITER uses CK to implement its GEMM kernels, with many kernel variants optimized for different data types (INT8/FP8, FP4, BF16) and layouts (blockscale, B-matrix preshuffle, batched, MoE).

Why tuning matters: Each CK GEMM kernel has many implementation variants (tile sizes, pipeline configurations, split-K strategies). The optimal variant depends on the specific GEMM shape (M, N, K) and the GPU hardware (number of compute units). AITER's tuning process benchmarks all candidate kernel configurations for each shape and selects the fastest one. Shapes come from specific model architectures — for example, a Llama 70B model produces different (N, K) pairs than a DeepSeek V3 model. The M dimension corresponds to the batch/token count and varies at runtime, so tuning sweeps M as powers of 2 to cover all realistic batch sizes.

How it fits into the inference stack: Inference frameworks like sglang and vllm call into AITER for their GEMM operations. When AITER encounters a shape that hasn't been tuned, it falls back to a default kernel configuration and logs a warning. The tuning workflow in this skill captures those untuned shapes and finds optimal kernel configurations for them.

Supported Kernel Variants

Each variant follows the same tuning workflow pattern. The table below maps each variant to its key files (all paths relative to the aiter root):

| Variant | Tune Script | Untuned CSV | Tuned CSV | Test File | README |
|---|---|---|---|---|---|
| a8w8 | csrc/ck_gemm_a8w8/gemm_a8w8_tune.py | aiter/configs/a8w8_untuned_gemm.csv | aiter/configs/a8w8_tuned_gemm.csv | op_tests/test_gemm_a8w8.py | csrc/ck_gemm_a8w8/README.md |
| a8w8_blockscale | csrc/ck_gemm_a8w8_blockscale/gemm_a8w8_blockscale_tune.py | aiter/configs/a8w8_blockscale_untuned_gemm.csv | aiter/configs/a8w8_blockscale_tuned_gemm.csv | op_tests/test_gemm_a8w8_blockscale.py | csrc/ck_gemm_a8w8_blockscale/README.md |
| a8w8_bpreshuffle | csrc/ck_gemm_a8w8_bpreshuffle/gemm_a8w8_bpreshuffle_tune.py | aiter/configs/a8w8_bpreshuffle_untuned_gemm.csv | aiter/configs/a8w8_bpreshuffle_tuned_gemm.csv | (none) | csrc/ck_gemm_a8w8_bpreshuffle/README.md |
| a8w8_blockscale_bpreshuffle | csrc/ck_gemm_a8w8_blockscale_bpreshuffle/gemm_a8w8_blockscale_bpreshuffle_tune.py | aiter/configs/a8w8_blockscale_bpreshuffle_untuned_gemm.csv | aiter/configs/a8w8_blockscale_bpreshuffle_tuned_gemm.csv | (none) | csrc/ck_gemm_a8w8_blockscale_bpreshuffle/README.md |
| a4w4_blockscale | csrc/ck_gemm_a4w4_blockscale/gemm_a4w4_blockscale_tune.py | aiter/configs/a4w4_blockscale_untuned_gemm.csv | aiter/configs/a4w4_blockscale_tuned_gemm.csv | op_tests/test_gemm_a4w4.py | csrc/ck_gemm_a4w4_blockscale/README.md |
| batched_a8w8 | csrc/ck_batched_gemm_a8w8/batched_gemm_a8w8_tune.py | aiter/configs/a8w8_untuned_batched_gemm.csv | aiter/configs/a8w8_tuned_batched_gemm.csv | op_tests/test_batched_gemm_a8w8.py | csrc/ck_batched_gemm_a8w8/README.md |
| batched_bf16 | csrc/ck_batched_gemm_bf16/batched_gemm_bf16_tune.py | aiter/configs/bf16_untuned_batched_gemm.csv | aiter/configs/bf16_tuned_batched_gemm.csv | op_tests/test_batched_gemm_bf16.py | csrc/ck_batched_gemm_bf16/README.md |
| moe_2stages | csrc/ck_gemm_moe_2stages_codegen/gemm_moe_tune.py | aiter/configs/untuned_fmoe.csv | aiter/configs/tuned_fmoe.csv | op_tests/test_moe_2stage.py | csrc/ck_gemm_moe_2stages_codegen/README.md |

Log Files

The skill records outputs from Steps 2, 3, and 4 to log files under $AITER_PATH/tune_logs/. Use this naming convention:

$AITER_PATH/tune_logs/<variant>_bench_before_<YYYYMMDD_HHMMSS>.log  # Step 2: baseline benchmark
$AITER_PATH/tune_logs/<variant>_tuning_<YYYYMMDD_HHMMSS>.log        # Step 3: tuning process
$AITER_PATH/tune_logs/<variant>_bench_after_<YYYYMMDD_HHMMSS>.log   # Step 4: post-tune benchmark

For example:

tune_logs/a8w8_blockscale_bench_before_20260321_143022.log
tune_logs/a8w8_blockscale_tuning_20260321_150000.log
tune_logs/a8w8_blockscale_bench_after_20260321_160515.log

Create the tune_logs/ directory if it doesn't exist. For interactive commands (Steps 2 and 4), use 2>&1 | tee <log> to show output in real time while logging. For long-running background jobs (Step 3), redirect output to file directly (> <log> 2>&1).
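
The naming convention above can be sketched as a small helper. This is an illustrative sketch only: `tune_log_path` and `STAGES` are hypothetical names, not part of the skill or of aiter.

```python
from datetime import datetime
from pathlib import Path

# Stage names taken from the naming convention above.
STAGES = ("bench_before", "tuning", "bench_after")


def tune_log_path(aiter_path, variant, stage, now=None):
    """Build <aiter_path>/tune_logs/<variant>_<stage>_<YYYYMMDD_HHMMSS>.log."""
    if stage not in STAGES:
        raise ValueError(f"stage must be one of {STAGES}, got {stage!r}")
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return Path(aiter_path) / "tune_logs" / f"{variant}_{stage}_{stamp}.log"


# Fixed timestamp so the result is reproducible in this example.
example = tune_log_path("/opt/aiter", "a8w8_blockscale", "tuning",
                        now=datetime(2026, 3, 21, 15, 0, 0))
print(example.name)
```

Passing `now` explicitly keeps the before, tuning, and after logs of one session easy to correlate; omitting it uses the current time.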

Workflow

Follow these steps in order. At each step, communicate clearly with the user about what is happening, what you found, and what you plan to do next.


Step 0: Environment Discovery

Before anything else, establish the working environment. Tuning typically runs inside a Docker container on a remote node with AMD GPUs. Ask the user to provide access details upfront:

  1. Target environment access: Ask the user how to reach the tuning environment:
    • Node access: How to SSH into the node (e.g., ssh user@node-hostname)
    • Docker container: The container name or ID to exec into (e.g., docker exec -it <container_name> bash)
    • If the user is already inside the target environment (local or already SSH'd in), that's fine too — just confirm.
    • All subsequent commands (Steps 1–4) should be run inside this environment.
  2. Locate aiter: The pip package may be named aiter or amd-aiter, so use pip list | grep -i aiter to find the exact package name, then pip show <package_name> | grep Location to get its installed path. Do not guess common locations — there may be multiple aiter copies on the system, and only the one registered in pip is the active installation. Verify by checking that csrc/ and aiter/configs/ exist under that path.
  3. Log location: Ask the user where the inference logs are. These could be from sglang, vllm, or another framework. Logs could also be provided directly. Logs may be on the node, inside the container, or on the user's local machine.
  4. Verify aiter installation: Check if aiter is installed in dev mode. If not, warn the user that python3 setup.py develop from the aiter root may be needed.
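
The directory check in item 2 could look like the sketch below. The function name `looks_like_aiter_root` is hypothetical; it only tests for the two directories named above and says nothing about whether the install is in dev mode.

```python
from pathlib import Path


def looks_like_aiter_root(path):
    """Return True if both csrc/ and aiter/configs/ exist under path."""
    root = Path(path)
    return (root / "csrc").is_dir() and (root / "aiter" / "configs").is_dir()


# Example: check the path reported by `pip show <package_name>`.
print(looks_like_aiter_root("/path/from/pip/show"))
```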

Step 1: Capture Shapes & Identify Kernel Type

The goal is to extract the shapes that need tuning and determine which kernel variant to tune.

Option A: Parse from aiter logs (preferred)

AITER logs untuned shapes in two different patterns depending on the kernel type. The bundled script scripts/parse_untuned_shapes.py auto-detects both patterns in a single pass.

Regular GEMM pattern:

shape is M:<value>, N:<value>, K:<value> ... not found tuned config in /tmp/aiter_configs/<variant>_tuned_gemm.csv, will use default config!

Fused MoE pattern (the moe_2stages variant):

[fused_moe] using 1stage default for (cu_num, token, model_dim, inter_dim, expert, topk, 'ActivationType.X', 'torch.dtype', 'torch.dtype', 'torch.dtype', 'QuantType.X', use_g1u1, doweight_stage1)

The key word in the MoE pattern is "default" — it means no tuned config was found and the kernel falls back to heuristics. When a tuned config IS found, the log shows kernel names instead of "default".
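
As an illustration of the regular GEMM pattern, the sketch below extracts unique (N, K) pairs per variant with a regular expression. It is not the bundled `parse_untuned_shapes.py`: the names are hypothetical, the elided middle of the log line is matched loosely, and batched variants (whose CSV names end in `_tuned_batched_gemm.csv`) would need their own pattern.

```python
import re

# Assumed line shape, based on the pattern quoted above. The text between the
# shape and "not found tuned config" is elided in the source, hence ".*?".
GEMM_RE = re.compile(
    r"shape is M:(\d+), N:(\d+), K:(\d+).*?"
    r"not found tuned config in \S*?/(\w+)_tuned_gemm\.csv"
)


def untuned_nk_pairs(log_text):
    """Map variant name -> sorted unique (N, K) pairs from untuned warnings."""
    found = {}
    for _m, n, k, variant in GEMM_RE.findall(log_text):
        found.setdefault(variant, set()).add((int(n), int(k)))
    return {variant: sorted(pairs) for variant, pairs in found.items()}


sample = (
    "shape is M:32, N:12288, K:4096, not found tuned config in "
    "/tmp/aiter_configs/a8w8_blockscale_tuned_gemm.csv, will use default config!\n"
    "shape is M:64, N:12288, K:4096, not found tuned config in "
    "/tmp/aiter_configs/a8w8_blockscale_tuned_gemm.csv, will use default config!\n"
)
print(untuned_nk_pairs(sample))
```

M is deliberately discarded here because the tuning sweep regenerates M values for every (N, K) pair.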

Step 1a: Run the parser to see what's in the log:

python3 <skill_path>/scripts/parse_untuned_shapes.py <log_file>

This prints all variants found. For regular GEMM, it shows unique (N, K) pairs. For moe_2stages, it shows unique MoE configs (model_dim, inter_dim, expert, topk, quant type, etc.) and the token counts seen in the log.

Step 1b: If multiple variants are found, ask the user which to tune. Each variant must be tuned separately (different tune scripts, CSVs, and test files). GEMM and MoE cannot be combined in one CSV — they have entirely different formats.

Step 1c: Generate the untuned CSV for the chosen variant(s):

# Regular GEMM variant with M sweep:
python3 <skill_path>/scripts/parse_untuned_shapes.py <log_file> --variant a8w8_blockscale --csv <output.csv> --m-sweep

# Fused MoE — use actual token values from the log:
python3 <skill_path>/scripts/parse_untuned_shapes.py <log_file> --variant moe_2stages --csv <output.csv>

# Fused MoE — sweep token as powers of 2 (more thorough):
python3 <skill_path>/scripts/parse_untuned_shapes.py <log_file> --variant moe_2stages --csv <output.csv> --token-sweep

Present the results to the user for confirmation before proceeding. If tuning multiple variants, repeat Steps 2–4 for each variant separately.

Option B: Direct user input

The user provides shapes and specifies the kernel variant directly.

Generating sweep values for tuning

Regular GEMM: For each unique (N, K) pair, generate tuning rows by sweeping M as powers of 2:

M = 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768

This produces 16 × number_of_unique_NK_pairs rows for the untuned CSV.
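
A minimal sketch of the row generation, assuming the (N, K) pairs are already known. `M_SWEEP`, `untuned_rows`, and `to_csv` are hypothetical helper names; the bundled parser's `--m-sweep` option does this for you.

```python
import csv
import io

# Powers of 2 from 1 to 32768 (16 values), matching the list above.
M_SWEEP = [2 ** i for i in range(16)]


def untuned_rows(nk_pairs):
    """Return (M, N, K) rows: the full M sweep for each (N, K) pair."""
    return [(m, n, k) for (n, k) in nk_pairs for m in M_SWEEP]


def to_csv(rows):
    """Serialize rows as CSV text with an M,N,K header."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["M", "N", "K"])
    writer.writerows(rows)
    return buf.getvalue()


rows = untuned_rows([(12288, 4096), (24576, 1536)])
print(len(rows))  # 16 rows per (N, K) pair
print(to_csv(rows[:2]))
```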

Fused MoE: For each unique MoE config, either use the actual token values from the log (realistic) or sweep token as powers of 2 with --token-sweep (more thorough). There is no separate M dimension — the token count IS the batch dimension.

Note: The sweep for tuning (powers of 2) is separate from the values used for benchmarking in Steps 2/4. Benchmarking typically uses the test script's default list, which may include non-power-of-2 values. This is normal — we tune with powers of 2 to cover the key points.

Write the untuned CSV

The CSV format depends on the variant type:

Regular GEMM (e.g., a8w8_blockscale):

M,N,K
1,12288,4096
2,12288,4096
...
32768,12288,4096

Fused MoE (moe_2stages):

token,model_dim,inter_dim,expert,topk,act_type,dtype,q_dtype_a,q_dtype_w,q_type,use_g1u1,doweight_stage1
1,4096,512,512,10,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_1x128,1,0
4,4096,512,512,10,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_1x128,1,0
...

Write the CSV into the variant's untuned CSV file path (see the variant table above). Present the full shape list to the user before writing.


Step 2: Baseline Benchmark

Run the unit test for the target kernel variant with the specific shapes from Step 1 to establish baseline performance before tuning. No rebuild is needed at this point.

Pre-benchmark checklist: ck_preshuffle

Some test scripts have a --ck_preshuffle or --preshuffle flag (currently only a8w8_blockscale and moe_2stages). The correct setting can be inferred from the kernel variant detected in Step 1:

  • If the log shows a8w8_blockscale_tuned_gemm.csv (no "bpreshuffle" in the name) → use --ck_preshuffle False
  • If the log shows a8w8_blockscale_bpreshuffle_tuned_gemm.csv → use --ck_preshuffle True

Mention the inferred setting in your response to the user for confirmation, but no need to ask them to specify — the log already tells you.

Other variants (e.g., a8w8, a4w4_blockscale, batched variants) do not have this flag — skip this for them.
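
The inference rule for the a8w8_blockscale family can be written down as a short sketch. The function name is hypothetical, and it covers only the two CSV names listed above; it returns `None` for every other variant rather than guessing.

```python
def infer_ck_preshuffle(tuned_csv_path):
    """Infer the --ck_preshuffle setting from the tuned CSV named in the log.

    Returns True or False for the a8w8_blockscale family, None otherwise.
    """
    name = tuned_csv_path.rsplit("/", 1)[-1]
    if not name.startswith("a8w8_blockscale"):
        return None
    return "bpreshuffle" in name


print(infer_ck_preshuffle("/tmp/aiter_configs/a8w8_blockscale_tuned_gemm.csv"))
print(infer_ck_preshuffle("/tmp/aiter_configs/a8w8_blockscale_bpreshuffle_tuned_gemm.csv"))
```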

Handling argparse choices constraints in test scripts

Test scripts may have argparse choices restrictions on -m and/or -nk that reject values not in their hardcoded lists. Before running, read the argparse section at the bottom of the test file to check for choices constraints. If the shapes or M values you need are not in the choices list, you must modify the test script:

  • For -m: add missing M values (e.g., 16384, 32768) to both the choices and default lists.
  • For -nk: remove the choices parameter entirely (keep default) so any (N,K) pair can be passed.
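
To show what the two modifications amount to, here is a standalone argparse sketch. It is not copied from any aiter test file: the option names mirror the ones above, but the real scripts may declare them differently, so read the actual argparse section before editing.

```python
import argparse

# -m keeps a choices list, extended so that 16384 and 32768 are accepted.
M_CHOICES = [2 ** i for i in range(16)]


def nk_pair(text):
    """Parse "N,K" into an (N, K) tuple of ints."""
    n, k = text.split(",")
    return int(n), int(k)


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("-m", type=int, nargs="*",
                        choices=M_CHOICES, default=M_CHOICES)
    # -nk has no choices restriction, so any (N, K) pair can be passed.
    parser.add_argument("-nk", type=nk_pair, nargs="*",
                        default=[(12288, 4096)])
    return parser


args = build_parser().parse_args(["-m", "16384", "32768", "-nk", "24576,1536"])
print(args.m, args.nk)
```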

CLI argument formats by variant

| Variant | Test File | Shape Args | Example |
|---|---|---|---|
| a8w8_blockscale | test_gemm_a8w8_blockscale.py | -m M1 M2 ... -nk N1,K1 N2,K2 ... | -m 1 2 4 ... 32768 -nk 12288,4096 24576,1536 |
| a8w8 | test_gemm_a8w8.py | -mnk M1,N1,K1 M2,N2,K2 ... | -mnk 1,12288,4096 2,12288,4096 4,12288,4096 |
| a4w4_blockscale | test_gemm_a4w4.py | -mnk M1,N1,K1 M2,N2,K2 ... | -mnk 1,12288,4096 2,12288,4096 4,12288,4096 |
| batched_a8w8 | test_batched_gemm_a8w8.py | -s M1,N1,K1 M2,N2,K2 ... | -s 1,12288,4096 2,12288,4096 4,12288,4096 |
| batched_bf16 | test_batched_gemm_bf16.py | -s M1,N1,K1 M2,N2,K2 ... | -s 1,12288,4096 2,12288,4096 4,12288,4096 |
| moe_2stages | test_moe_2stage.py | -t T1 T2 ... -dim M,I -e E -k K -q Q -a ACT -s DW -p PS | See MoE example below |

For regular GEMM variants that use -mnk or -s (combined M,N,K tuples), generate all combinations of the M sweep with each (N,K) pair. For a8w8_blockscale which takes -m and -nk separately, pass all M values once and all (N,K) pairs once — the test script handles the cross product internally.
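
For the variants that take combined tuples, the cross product can be generated rather than typed by hand. A minimal sketch, with `mnk_args` as a hypothetical helper name:

```python
def mnk_args(m_values, nk_pairs):
    """Return "M,N,K" strings: every M combined with every (N, K) pair."""
    return [f"{m},{n},{k}" for (n, k) in nk_pairs for m in m_values]


m_sweep = [2 ** i for i in range(16)]
tuples = mnk_args(m_sweep, [(12288, 4096)])

# Assemble the command as an argument list (for -s variants, swap the flag).
cmd = ["python3", "op_tests/test_gemm_a8w8.py", "-mnk", *tuples]
print(" ".join(cmd))
```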

Example for a8w8_blockscale with (N,K) pairs (12288,4096) and (24576,1536):

cd $AITER_PATH
mkdir -p tune_logs
python3 op_tests/test_gemm_a8w8_blockscale.py \
  -m 1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768 \
  -nk 12288,4096 24576,1536 \
  --ck_preshuffle False \
  2>&1 | tee tune_logs/a8w8_blockscale_bench_before_$(date +%Y%m%d_%H%M%S).log

Example for a8w8 with (N,K) pair (12288,4096):

cd $AITER_PATH
mkdir -p tune_logs
python3 op_tests/test_gemm_a8w8.py \
  -mnk 1,12288,4096 2,12288,4096 4,12288,4096 8,12288,4096 \
  16,12288,4096 32,12288,4096 64,12288,4096 128,12288,4096 \
  256,12288,4096 512,12288,4096 1024,12288,4096 2048,12288,4096 \
  4096,12288,4096 8192,12288,4096 16384,12288,4096 32768,12288,4096 \
  2>&1 | tee tune_logs/a8w8_bench_before_$(date +%Y%m%d_%H%M%S).log

MoE-specific benchmark (moe_2stages)

The test_moe_2stage.py script has a completely different CLI from the regular GEMM tests. Map the MoE config fields from Step 1 to CLI args:

| MoE Config Field | CLI Arg | Notes |
|---|---|---|
| token | -t T1 T2 ... | Space-separated list of token counts |
| model_dim, inter_dim | -dim M,I | Comma-separated pair |
| expert | -e E | Number of experts |
| topk | -k K | Top-K experts |
| act_type | -a silu or -a gelu | Activation function |
| q_type | -q N | Quant index (see mapping below) |
| doweight_stage1 | -s f or -s t | f=False, t=True |
| preshuffle | -p f or -p t | f=False, t=True |

Quant index (-q) mapping — the -q value maps to (QuantType, q_dtype_a, q_dtype_w):

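| -q | QuantType | q_dtype_a | q_dtype_w | Common Name |
|---|---|---|---|---|
| 0 | No | None | None | a16w16 (no quant) |
| 1 | per_Tensor | fp8 | fp8 | a8w8 per-tensor |
| 2 | per_Token | fp8 | fp8 | a8w8 per-token |
| 3 | per_Token | fp8 | int4 | a8w4 |
| 4 | per_1x32 | fp4x2 | fp4x2 | a4w4 |
| 5 | per_128x128 | fp8 | fp8 | a8w8 blockscale |
| 6 | per_1x32 | bf16 | fp4x2 | a16w4 |
| 7 | per_1x32 | fp8 | fp4x2 | a8w4 |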

To determine the correct -q value, match the q_type, q_dtype_a, and q_dtype_w from the log against this table. The log and the test script name the blockscale FP8 quantization differently: QuantType.per_1x128 in the log corresponds to per_128x128 in the test script. This is a known naming inconsistency, and both refer to the same quantization, so QuantType.per_1x128 with fp8/fp8 maps to -q 5.
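
The table lookup can be sketched as a dictionary keyed on (QuantType, q_dtype_a, q_dtype_w). The helper names are hypothetical, and the mapping from torch dtype strings to the short names used in the table (`DTYPE_SHORT`) is an assumption made for this example, not something the test script defines.

```python
# (QuantType, q_dtype_a, q_dtype_w) -> -q index, copied from the table above.
QUANT_TABLE = {
    ("No", None, None): 0,
    ("per_Tensor", "fp8", "fp8"): 1,
    ("per_Token", "fp8", "fp8"): 2,
    ("per_Token", "fp8", "int4"): 3,
    ("per_1x32", "fp4x2", "fp4x2"): 4,
    ("per_128x128", "fp8", "fp8"): 5,
    ("per_1x32", "bf16", "fp4x2"): 6,
    ("per_1x32", "fp8", "fp4x2"): 7,
}

# The log says per_1x128 where the test script says per_128x128.
QUANT_ALIASES = {"per_1x128": "per_128x128"}

# Assumed mapping from log dtype strings to the table's short names.
DTYPE_SHORT = {
    "torch.float8_e4m3fn": "fp8",
    "torch.float8_e4m3fnuz": "fp8",
    "torch.bfloat16": "bf16",
}


def quant_index(q_type, q_dtype_a, q_dtype_w):
    """Return the -q index for a logged quant config, or raise KeyError."""
    name = q_type.split(".")[-1]  # drop a leading "QuantType."
    name = QUANT_ALIASES.get(name, name)
    key = (name, DTYPE_SHORT.get(q_dtype_a, q_dtype_a),
           DTYPE_SHORT.get(q_dtype_w, q_dtype_w))
    return QUANT_TABLE[key]


print(quant_index("QuantType.per_1x128",
                  "torch.float8_e4m3fn", "torch.float8_e4m3fn"))
```

A `KeyError` here means the logged combination is not in the table, which is a signal to ask the user rather than pick the nearest entry.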

Example for MoE with Qwen3.5 shapes (fp8 blockscale, 512 experts, topk=10):

cd $AITER_PATH
mkdir -p tune_logs
python3 op_tests/test_moe_2stage.py \
  -t 1 4 8 32 64 128 1024 2048 16384 \
  -dim 4096,512 -e 512 -k 10 -q 5 -a silu -s f -p f \
  2>&1 | tee tune_logs/moe_2stages_bench_before_$(date +%Y%m%d_%H%M%S).log

MoE bypass caveat

For moe_2stages with blockscale FP8 (QuantType.per_1x128), there is a bypass in fused_moe.py that skips tuned configs when token * topk <= 128. This means:

  • For topk=10 (e.g., Qwen3.5): tokens 1–12 always use default heuristics regardless of tuning
  • For topk=2 (e.g., Mixtral-style top-2 routing): tokens 1–64 always use default heuristics

Tuning these small token counts still produces valid configs, but they won't be used at inference time for this specific quant type. This is by design — the assembly kernel heuristics perform well enough at very small batch sizes. Focus benchmark attention on token counts above the bypass threshold.
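
The threshold arithmetic is simple enough to check in code. This sketch takes the `token * topk <= 128` condition from the text above at face value; the function names are hypothetical and do not come from fused_moe.py.

```python
BYPASS_LIMIT = 128  # bypass applies when token * topk <= 128


def max_bypassed_token(topk, limit=BYPASS_LIMIT):
    """Largest token count that still falls under the bypass."""
    return limit // topk


def tokens_using_tuned_config(tokens, topk, limit=BYPASS_LIMIT):
    """Keep only token counts above the bypass threshold."""
    return [t for t in tokens if t * topk > limit]


print(max_bypassed_token(10))  # topk=10
print(max_bypassed_token(2))   # topk=2
print(tokens_using_tuned_config([1, 4, 8, 32, 64, 128, 1024, 2048, 16384], topk=10))
```

Filtering the benchmark token list this way shows which rows of the before/after comparison can actually change after tuning.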

Record the baseline log file path — you will need it in Step 4 for comparison.

If the variant has no test file (e.g., a8w8_bpreshuffle), inform the user and ask how they'd like to benchmark.


Step 3: Tune

Check available GPUs

Before tuning, run rocm-smi to check how many GPUs are free. Use --mp <num_free_gpus> to parallelize tuning across all available GPUs; this can cut tuning time substantially (up to roughly 8x with 8 GPUs versus 1, if the work splits evenly).

rocm-smi --showuse | grep "GPU use"
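
Counting idle GPUs from that output could be sketched as below. The line format in `sample` is an assumption about what rocm-smi prints and may differ between ROCm versions, so check it against real output first. The function name and the "0% use means free" rule are also choices made for this example; utilization alone does not show whether another job holds GPU memory.

```python
import re


def count_free_gpus(showuse_text, max_busy_percent=0):
    """Count lines containing "GPU use" whose percentage is <= the threshold."""
    free = 0
    for line in showuse_text.splitlines():
        if "GPU use" not in line:
            continue
        numbers = re.findall(r"\d+", line.split("GPU use", 1)[1])
        if numbers and int(numbers[-1]) <= max_busy_percent:
            free += 1
    return free


# Assumed output shape, for illustration only.
sample = (
    "GPU[0]\t\t: GPU use (%): 0\n"
    "GPU[1]\t\t: GPU use (%): 97\n"
    "GPU[2]\t\t: GPU use (%): 0\n"
)
print(count_free_gpus(sample))
```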

Run the tuning script

The general command pattern is:

cd $AITER_PATH
python3 <tune_script> -i <untuned_csv> -o <tuned_csv> [options]

Tuning is a long-running job (potentially hours). Run it in the background with output redirected to a log file. Use nohup to ensure the process survives if the SSH session disconnects:

Example for a8w8_blockscale with 8 free GPUs:

cd $AITER_PATH
mkdir -p tune_logs
nohup python3 csrc/ck_gemm_a8w8_blockscale/gemm_a8w8_blockscale_tune.py \
  -i aiter/configs/a8w8_blockscale_untuned_gemm.csv \
  -o aiter/configs/a8w8_blockscale_tuned_gemm.csv \
  --libtype both --mp 8 --timeout 600 \
  > tune_logs/a8w8_blockscale_tuning_$(date +%Y%m%d_%H%M%S).log 2>&1 &

Example for moe_2stages with 8 free GPUs:

cd $AITER_PATH
mkdir -p tune_logs
nohup python3 csrc/ck_gemm_moe_2stages_codegen/gemm_moe_tune.py \
  -i aiter/configs/untuned_fmoe.csv \
  -o aiter/configs/tuned_fmoe.csv \
  --mp 8 --timeout 120 \
  > tune_logs/moe_2stages_tuning_$(date +%Y%m%d_%H%M%S).log 2>&1 &

Note: the MoE tuner does not have --libtype. Use --timeout 120 (shorter than GEMM since MoE shapes tune faster).

After launching, verify the process is running and monitor progress:

# Verify tuning process started (do NOT rely on $! — it doesn't work reliably through docker exec layers)
ps aux | grep tune.py | grep -v grep

# Monitor progress by tailing the log file
tail -f tune_logs/a8w8_blockscale_tuning_*.log

Key flags to consider

| Flag | Default | Description |
|---|---|---|
| --libtype | | ck, cktile, or both (recommend both for best results) |
| --mp N | all GPUs | Number of parallel GPU processes; set to the number of free GPUs |
| --batch N | 100 | Shapes per tuning batch |
| --errRatio | 0.05 | Error tolerance threshold |
| -k / --splitK | off | Enable split-K optimization |
| --warmup N | 5 | Warmup iterations before profiling |
| --iters N | 101 | Profiling iterations |
| --timeout N | none | Timeout in seconds per task group (recommend 600) |
| -v | off | Verbose output |
| --all | off | Retune all shapes |

Important warnings to communicate to the user:

  • Tuning can take a very long time (potentially hours) depending on the number of shapes and options
  • Using --libtype both is slower but produces better results
  • Use --mp with all available GPUs to maximize parallelism
  • --timeout is recommended to prevent individual shapes from hanging
  • The first run includes a JIT compilation step that can take several minutes before actual tuning begins

Step 4: Rerun & Compare

After tuning completes, rerun the benchmark to measure improvement. Reuse the exact same command from Step 2 with these changes:

For regular GEMM variants:

  1. Prepend AITER_REBUILD=1 to force aiter to rebuild kernels using the newly tuned CSV
  2. Change the log filename from bench_before to bench_after

For moe_2stages:

  1. Prepend AITER_REBUILD=1 (same as GEMM)
  2. Optionally set AITER_CONFIG_FMOE=<path_to_tuned_csv> if the tuned CSV is in a non-default location
  3. Change the log filename from bench_before to bench_after

This ensures the same shapes and flags are used for an apples-to-apples comparison. Do not re-type the command manually — copy the Step 2 command and apply the changes above.
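
Deriving the Step 4 command mechanically avoids retyping errors. A minimal sketch, assuming the Step 2 command is available as a single string; `after_command` is a hypothetical helper, not part of the skill.

```python
def after_command(before_cmd, tuned_fmoe_csv=None):
    """Turn a Step 2 command string into the matching Step 4 command.

    Prepends AITER_REBUILD=1 (and AITER_CONFIG_FMOE if a path is given) and
    renames the log from bench_before to bench_after.
    """
    if "bench_before" not in before_cmd:
        raise ValueError("expected a Step 2 command that logs to *_bench_before_*")
    env = "AITER_REBUILD=1 "
    if tuned_fmoe_csv:
        env += f"AITER_CONFIG_FMOE={tuned_fmoe_csv} "
    return env + before_cmd.replace("bench_before", "bench_after")


step2 = (
    "python3 op_tests/test_moe_2stage.py -t 1 4 8 -dim 4096,512 -e 512 -k 10 "
    "-q 5 -a silu -s f -p f 2>&1 | tee "
    "tune_logs/moe_2stages_bench_before_$(date +%Y%m%d_%H%M%S).log"
)
print(after_command(step2, tuned_fmoe_csv="aiter/configs/tuned_fmoe.csv"))
```

If the Step 2 command was saved with a literal timestamp instead of `$(date ...)`, the derived command reuses that timestamp in the bench_after filename; adjust it if a fresh timestamp is wanted.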

Example — GEMM, if Step 2 command was:

python3 op_tests/test_gemm_a8w8_blockscale.py \
  -m 1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768 \
  -nk 512,4096 4096,256 8192,4096 12288,4096 17408,4096 \
  --ck_preshuffle False \
  2>&1 | tee tune_logs/a8w8_blockscale_bench_before_20260321_143022.log

Then Step 4 command is:

AITER_REBUILD=1 python3 op_tests/test_gemm_a8w8_blockscale.py \
  -m 1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768 \
  -nk 512,4096 4096,256 8192,4096 12288,4096 17408,4096 \
  --ck_preshuffle False \
  2>&1 | tee tune_logs/a8w8_blockscale_bench_after_$(date +%Y%m%d_%H%M%S).log

Example — MoE after-benchmark:

AITER_REBUILD=1 AITER_CONFIG_FMOE=aiter/configs/tuned_fmoe.csv \
  python3 op_tests/test_moe_2stage.py \
  -t 1 4 8 32 64 128 1024 2048 16384 \
  -dim 4096,512 -e 512 -k 10 -q 5 -a silu -s f -p f \
  2>&1 | tee tune_logs/moe_2stages_bench_after_$(date +%Y%m%d_%H%M%S).log

The AITER_REBUILD=1 flag is essential — without it, old cached kernels will be used and you won't see improvements. The first run after tuning will take extra time for JIT rebuilding.

Compare results using the bundled comparison script:

python3 <skill_path>/scripts/compare_results.py \
  tune_logs/<variant>_bench_before_<timestamp>.log \
  tune_logs/<variant>_bench_after_<timestamp>.log

The script auto-detects the log format (GEMM vs MoE) and selects the appropriate comparison mode:

  • GEMM logs: matches shapes by (M, N, K), default metric is ck TFLOPS (higher is better)
  • MoE logs: matches shapes by (token, model_dim, inter_dim, E, topk), default metric is us (latency in microseconds, lower is better)

Both modes produce:

  • A per-shape comparison table with before/after values and speedup %
  • A summary with average/min/max speedup and improved/regressed counts
  • A per-config breakdown grouped by size category (small/medium/large)

You can override the metric with --metric "ck us" (latency) or --metric "asm TFLOPS".
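
For readers who want to sanity-check the numbers, the sketch below shows one plausible way to compute a speedup percentage and the summary statistics. It is an assumption, not the formula used by compare_results.py: here a positive value always means "better", whichever direction the metric runs.

```python
def speedup_percent(before, after, higher_is_better=True):
    """Percent improvement; positive means the tuned run is better.

    higher_is_better=True for throughput (TFLOPS), False for latency (us).
    """
    ratio = after / before if higher_is_better else before / after
    return (ratio - 1.0) * 100.0


def summarize(speedups):
    """Average/min/max plus improved and regressed counts."""
    return {
        "avg": sum(speedups) / len(speedups),
        "min": min(speedups),
        "max": max(speedups),
        "improved": sum(s > 0 for s in speedups),
        "regressed": sum(s < 0 for s in speedups),
    }


print(speedup_percent(100.0, 125.0))                        # TFLOPS example
print(speedup_percent(50.0, 40.0, higher_is_better=False))  # latency example
print(summarize([25.0, -10.0, 0.0]))
```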

Present the comparison results to the user and tell them where both log files are stored.


Step 5: Generate Report

After completing the comparison, generate a tuning report and save it to $AITER_PATH/tune_logs/<variant>_report_<YYYYMMDD_HHMMSS>.md. The report should contain:

  1. Environment summary: GPU model, aiter version, aiter path
  2. Shapes tuned: the (N, K) pairs or MoE configs, and kernel variant
  3. Tuning configuration: flags used (--libtype, --mp, --timeout, etc.)
  4. Full comparison table: the complete output from compare_results.py — include every shape, not a summary. This is the primary content of the report.
  5. Summary statistics: average/min/max speedup, improved/regressed counts, grouped by size category:
    • GEMM: per-(N,K) breakdown grouped by M category (Small M 1-63 decode, Medium M 64-512, Large M >512 prefill)
    • MoE: per-config breakdown grouped by token category (Small token 1-63 decode, Medium token 64-512, Large token >512 prefill)
  6. Log file locations: paths to all log files (bench_before, tuning, bench_after)
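
The size categories in item 5 use the same boundaries for M and for token counts, so one helper covers both. A minimal sketch with a hypothetical function name:

```python
def size_category(value):
    """Classify an M or token count: small 1-63, medium 64-512, large >512."""
    if value < 1:
        raise ValueError("M/token count must be at least 1")
    if value <= 63:
        return "small"
    if value <= 512:
        return "medium"
    return "large"


# Group the power-of-2 sweep by category.
groups = {}
for m in [2 ** i for i in range(16)]:
    groups.setdefault(size_category(m), []).append(m)
print(groups)
```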

Generate the report by running the comparison script and capturing its output:

python3 <skill_path>/scripts/compare_results.py \
  tune_logs/<variant>_bench_before_<timestamp>.log \
  tune_logs/<variant>_bench_after_<timestamp>.log \
  > /tmp/compare_output.txt

Then assemble the full report as a markdown file. Save the report in two locations:

  1. Remote: $AITER_PATH/tune_logs/<variant>_report_<YYYYMMDD_HHMMSS>.md (inside the tuning environment, alongside the log files)
  2. Local: a copy in the user's current working directory or a location they specify

Present the report to the user and tell them where both copies are saved.


Troubleshooting

If anything fails at any step, check the variant's README at $AITER_PATH/csrc/<kernel_dir>/README.md — it contains variant-specific guidance, known issues, and examples.

Common issues:

  • JIT build appears to hang or fail: The first run may take several minutes while kernels are built via JIT. Wait for the build to finish before treating it as a failure.
  • AITER_REBUILD=1 forgotten in Step 4: Without this flag, old cached kernels will be used, and you won't see tuning improvements.
  • Stale builds with PREBUILD_KERNELS=1: If aiter was installed with PREBUILD_KERNELS=1, you may need to remove build/ and *.so in aiter/jit/ and reinstall aiter to pick up new tuned kernels.
  • Tuning hangs on certain shapes: Use --timeout to skip shapes that take too long.
  • Low accuracy (high errRatio): Tighten --errRatio (e.g., 0.01) to filter out inaccurate kernel candidates.
