Performance Benchmark Pipeline
Overview
Use this skill to systematically profile and score performance of Python + C projects on Linux. It orchestrates industry-standard tools across 4 tiers of increasing depth, scores a 7-dimension rubric, and outputs priority-ordered prescriptions.
This skill is intentionally repo-agnostic when you provide the benchmark entrypoint explicitly. Pytest benchmark autodiscovery is a convenience for Python repos.
Use This Skill When
- You need to profile CPU, cache, branch, or memory behavior of hot code.
- You want to detect algorithmic scaling issues (O(n) vs O(n^2)) before micro-optimizing.
- You need a baseline benchmark for regression comparison.
- You want to identify whether a bottleneck is algorithmic, data-layout, or hardware-level.
- You need ASM-level analysis of compiled code (C extensions, Numba JIT).
Outputs
- benchmark_report.md: unified human-readable report with rubric scorecard, findings, prescriptions.
- benchmark_summary.json: machine-readable scores and raw metrics for baseline/regression.
- tier1/: pytest-benchmark JSON, tracemalloc JSON, GNU time output.
- tier2/: cachegrind + callgrind annotated outputs.
- tier3/: massif heap profile, perf stat counters, and opt-in native perf record/report artifacts.
- tier4/: objdump disassembly, Numba ASM (if --asm-audit).
Workflow
1. Check Prerequisites
The script auto-checks at startup:
- Python >= 3.10
- valgrind (required for Tier 2+)
- perf_event_paranoid (perf stat requires <= 1)
- CPU governor (warns if not performance)
- Cache topology (auto-detected from sysfs for accurate Valgrind simulation)
- Available RAM (warns if insufficient for parallel Valgrind)
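To try a few of these checks by hand, a minimal sketch follows. It is not the pipeline's own code: the function names are hypothetical, the two file paths are the standard Linux locations for these settings (the pipeline may read others), and the cache-topology and RAM checks are left out.

```python
import shutil
import sys
from pathlib import Path

# Standard Linux locations (assumption: the pipeline's own checks may differ).
PARANOID_PATH = Path("/proc/sys/kernel/perf_event_paranoid")
GOVERNOR_PATH = Path("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor")


def read_text_or_none(path):
    """Return stripped file contents, or None if the file is missing or unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def check_prerequisites(paranoid_path=PARANOID_PATH, governor_path=GOVERNOR_PATH):
    """Collect human-readable warnings; an empty list means nothing was flagged."""
    warnings = []
    if sys.version_info < (3, 10):
        warnings.append("Python >= 3.10 is required")
    if shutil.which("valgrind") is None:
        warnings.append("valgrind not found (needed for Tier 2+)")
    paranoid = read_text_or_none(paranoid_path)
    if paranoid is None:
        warnings.append("perf_event_paranoid is not readable")
    elif int(paranoid) > 1:
        warnings.append(f"perf_event_paranoid={paranoid}; perf stat requires <= 1")
    governor = read_text_or_none(governor_path)
    if governor is not None and governor != "performance":
        warnings.append(f"CPU governor is '{governor}', not 'performance'")
    return warnings


if __name__ == "__main__":
    for warning in check_prerequisites():
        print("WARN:", warning)
```

The path parameters exist only so the checks can be pointed at test files; on a real machine the defaults apply.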
2. Discover or Specify Targets
Auto-discovers pytest.mark.benchmark tests and tests/benchmarks/ directories.
Override with --target "cmd {SIZE}" or --binary ./path/to/program.
Use --target or --binary for non-pytest repos.
Multi-size explicit targets must include {SIZE}.
Single-size explicit targets must also include {SIZE} when --sizes is present; otherwise omit --sizes.
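The {SIZE} rules above can be illustrated with a small sketch. The helper name expand_target is hypothetical and this is not the pipeline's implementation; in particular, rejecting a {SIZE} template when no sizes are given is an assumption of this sketch, since the pipeline may apply defaults in that case.

```python
import shlex

PLACEHOLDER = "{SIZE}"


def expand_target(template, sizes=None):
    """Return one argv list per run for a --target style command template."""
    if not sizes:
        # Assumption: with no sizes there is nothing to substitute, so a
        # template that still contains {SIZE} is treated as a mistake here.
        if PLACEHOLDER in template:
            raise ValueError("template uses {SIZE} but no sizes were given")
        return [shlex.split(template)]
    if PLACEHOLDER not in template:
        # Mirrors the rule above: --sizes requires {SIZE} in the target.
        raise ValueError("sizes were given but the template has no {SIZE}")
    return [shlex.split(template.replace(PLACEHOLDER, str(size))) for size in sizes]


if __name__ == "__main__":
    for argv in expand_target("python -m benchmark_entrypoint {SIZE}", [10000, 100000]):
        print(argv)
```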
3. Run Pipeline
python /path/to/perf-benchmark/scripts/perf_benchmark_pipeline.py \
--root /path/to/repo \
--target "python -m benchmark_entrypoint {SIZE}" \
--source-prefix path/to/source/ \
--tier medium \
--sizes 10000,100000 \
--out-dir /tmp/perf-bench
Tier options:
- fast: Tier 1 only (pytest-benchmark + tracemalloc + GNU time). Seconds.
- medium: Tiers 1-2 (+ cachegrind + callgrind). Minutes.
- deep: Tiers 1-3 (+ massif + perf stat). Minutes.
- asm: All tiers including Tier 4 ASM audit.
Native sampled hotspots are opt-in via --perf-record. When enabled and perf
is available, Tier 3 also runs perf record + perf report --stdio and writes
raw artifacts plus a compact hotspot summary.
4. Review Rubric Scores
7 dimensions scored 0-4 each (max 28), in priority order:
| Priority | Dimension | Impact | Tool |
|---|---|---|---|
| 1 (highest) | Algorithmic Scaling | 100-1000x | pytest-benchmark + callgrind |
| 2 | L1 Cache Efficiency | 5-20x | cachegrind |
| 2 | Last-Level Cache | 5-20x | cachegrind |
| 2 | Memory Profile | 5-20x | massif + tracemalloc |
| 3 | Wall-Time Stability | quality gate | pytest-benchmark / time |
| 3 | CPU Efficiency | 2-5x | callgrind + perf stat |
| 3 | Branch Prediction | 2-5x | cachegrind / perf stat |
If Dimension 0 (Algorithmic Scaling) is FAIL, the report prints a STOP warning:
fix algorithmic issues before pursuing hardware optimizations.
If Dimension 0 is N/A, the report lists the missing sub-check evidence.
Full Algorithmic Scaling scoring requires deep or asm because allocation churn comes from massif.
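As a rough illustration of how the scorecard arithmetic works, the sketch below totals the scored dimensions and orders them by the priorities in the table. The function name and the input shape (a dict of dimension name to score, with None for N/A) are assumptions made for this example; the FAIL/WARN thresholds live in references/rubric.md and are deliberately not reproduced here.

```python
# Priority per dimension, copied from the table above (1 = highest).
PRIORITY = {
    "Algorithmic Scaling": 1,
    "L1 Cache Efficiency": 2,
    "Last-Level Cache": 2,
    "Memory Profile": 2,
    "Wall-Time Stability": 3,
    "CPU Efficiency": 3,
    "Branch Prediction": 3,
}
MAX_PER_DIMENSION = 4


def summarize(scores):
    """scores: dict of dimension name -> int in 0..4, or None for N/A.

    Returns (total, max_total, work_order), where N/A dimensions are excluded
    and work_order lists higher priority first, then lower score first.
    """
    scored = {name: value for name, value in scores.items() if value is not None}
    total = sum(scored.values())
    max_total = MAX_PER_DIMENSION * len(scored)
    work_order = sorted(scored, key=lambda name: (PRIORITY[name], scored[name]))
    return total, max_total, work_order
```

With all seven dimensions scored, max_total is 28, matching the maximum stated above.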
5. Apply Prescriptions
Each FAIL/WARN dimension maps to concrete optimization patterns.
See references/rubric.md for thresholds and references/tool-guide.md for
tool selection guidance.
Algorithm Diagnosis Playbook
Use this advisory checklist before dropping into cache, branch, or ASM work:
- Confirm measured growth matches the expected complexity class before tuning constants.
- Prefer a lower asymptotic class or smaller search space before hardware-level work.
- Replace full recomputation with incremental maintenance when updates are local.
- Process the delta, not the full retained history; bound per-update work to changed inputs.
- Remove redundant passes, rereads, and copies of unchanged data.
- Add indexes, partitions, caches, or summaries so queries touch only required state.
- For streaming workloads, check whether per-update work scales with delta size or total state size.
Use references/question-bank.md for the fuller advisory diagnosis prompts.
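One way to carry out the first check in the playbook is to fit a slope on a log-log scale: if time grows like n^k, then log(time) against log(n) is a line with slope k. The sketch below is a generic least-squares estimate, not the pipeline's scoring method, and scaling_exponent is a hypothetical name.

```python
import math


def scaling_exponent(sizes, times):
    """Least-squares slope of log(time) against log(size).

    A slope near 1 suggests linear growth and near 2 quadratic growth;
    n log n shows up as a slope slightly above 1.
    """
    if len(sizes) != len(times) or len(sizes) < 2:
        raise ValueError("need at least two (size, time) pairs")
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("sizes must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    # Illustrative numbers only: a 10x larger input taking 100x longer.
    print(round(scaling_exponent([10000, 100000], [0.02, 2.0]), 2))
```

Two sizes give only a single slope; three or more sizes make it easier to tell a genuine trend from measurement noise.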
6. Regression Comparison (Optional)
python scripts/perf_benchmark_pipeline.py \
--root . \
--out-dir /tmp/bench \
--target "./path/to/benchmark {SIZE}" \
--baseline /path/to/previous/benchmark_summary.json
Any scored dimension dropping >= 1 tier from baseline is surfaced in the report and summary as a regression blocker.
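The comparison rule can be sketched as follows. Two assumptions are made here: each run's scores are available as a flat mapping of dimension name to rubric score (the actual layout of benchmark_summary.json is not shown in this document), and "dropping >= 1 tier" is read as the score falling by at least one level. The name find_regressions is hypothetical.

```python
def find_regressions(baseline_scores, current_scores, min_drop=1):
    """Return {dimension: (baseline, current)} for scores that fell by >= min_drop."""
    regressions = {}
    for name, old in baseline_scores.items():
        new = current_scores.get(name)
        if old is None or new is None:
            continue  # N/A or missing on either side: nothing to compare
        if old - new >= min_drop:
            regressions[name] = (old, new)
    return regressions


if __name__ == "__main__":
    baseline = {"CPU Efficiency": 3, "Branch Prediction": 2}
    current = {"CPU Efficiency": 2, "Branch Prediction": 2}
    print(find_regressions(baseline, current))
```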
Agent Parallelism Opportunities
After the script completes, analysis can be parallelized across sub-agents. Tier 1 itself stays isolated because timing and tracemalloc measurements are noise-sensitive, so the preferred sub-agent split is per artifact or per rubric dimension once the pipeline has finished.
Phase 1 (before script): prerequisites check and target discovery, run in parallel.
Phase 2 (after script): each tool output is parsed by an independent sub-agent:
- parse tier1/pytest_benchmark.json
- parse tier2/cachegrind_annotated.txt
- parse tier2/callgrind_annotated.txt
- parse tier3/massif.out
- parse tier3/perf_stat.txt
- parse tier4/objdump_*.txt
Phase 3: rubric dimensions scored independently in parallel.
Phase 4: prescriptions written independently per dimension.
Sub-agents return structured findings matching references/finding-schema.json.
Framework-Specific Notes
- Numba: Pass --env NUMBA_DISABLE_JIT=1 for coverage; omit for actual JIT benchmarks.
- ctypes/CFFI: C extensions loaded via Python; use --source-prefix to filter Valgrind noise.
- Standalone C: Use --binary ./path/to/program to skip Python entirely.
- Hybrid CPUs (Intel Alder/Raptor Lake): cachegrind simulates P-core cache hierarchy.
Quick Reference
# Fast check (seconds)
python scripts/perf_benchmark_pipeline.py --root . --out-dir /tmp/b --tier fast --target "python -m benchmark_entrypoint {SIZE}" --sizes 10000,100000
# Medium with source filtering
python scripts/perf_benchmark_pipeline.py --root . --out-dir /tmp/b --tier medium --target "./path/to/benchmark {SIZE}" --source-prefix path/to/source/ --sizes 10000,100000
# Deep with regression baseline
python scripts/perf_benchmark_pipeline.py --root . --out-dir /tmp/b --tier deep --target "./path/to/benchmark {SIZE}" --baseline /path/to/previous/benchmark_summary.json --sizes 10000,100000
# Deep with opt-in native hotspot sampling
python scripts/perf_benchmark_pipeline.py --root . --out-dir /tmp/b --tier deep --target "./path/to/benchmark {SIZE}" --sizes 10000,100000 --perf-record
# ASM audit for C binary
python scripts/perf_benchmark_pipeline.py --root . --out-dir /tmp/b --tier asm --binary ./path/to/program --asm-audit
References
- references/rubric.md: 7-dimension scoring rubric.
- references/tool-guide.md: tool selection decision tree.
- references/asm-checklist.md: 6-pattern ASM reading guide.
- references/question-bank.md: performance audit questions.
- references/finding-schema.json: sub-agent return format.
- references/sample-report.md: annotated example output.
Known Limitations
- Valgrind cachegrind simulates 2-level cache (L1 -> LL). No separate L2.
- Valgrind adds 20-50x slowdown. Use --valgrind-size for large inputs.
- perf stat requires perf_event_paranoid <= 1.
- --perf-record is opt-in and also requires perf_event_paranoid <= 1.
- tracemalloc is Python-only. C memory uses massif exclusively.
- Dimension 0 requires benchmarks at >= 2 input sizes.
- callgrind heuristics cannot determine argument identity (memoization needs manual check).