Hardware Performance Counters
Purpose
Guide agents through hardware performance counter analysis: collecting PMU events with perf stat -e, using the PAPI library for portable counter access, interpreting cache miss rates and branch misprediction ratios, computing IPC, and correlating events to source lines with perf annotate.
Triggers
- "How do I measure cache miss rate with perf?"
- "How do I count branch mispredictions?"
- "How do I compute IPC (instructions per clock) with perf?"
- "How do I use the PAPI library for hardware counters?"
- "How do I see which source lines cause the most cache misses?"
- "How do I measure memory bandwidth with performance counters?"
Workflow
- perf stat — basic counter collection
Basic hardware event summary
perf stat ./prog
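Output (abridged and illustrative; the default event set varies by perf version and CPU, and cache-misses usually appears only when requested with -e):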
Performance counter stats for './prog':
1,234,567,890 instructions
456,789,012 cycles
12,345,678 cache-misses # 1.23 % of all cache refs
23,456,789 branch-misses # 2.34 % of all branches
0.456789012 seconds time elapsed
Derived metrics (computed from the output)
IPC = instructions / cycles = 1,234,567,890 / 456,789,012 ≈ 2.70
CPI = cycles / instructions ≈ 0.37
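The IPC and CPI arithmetic above can be scripted rather than done by hand. The sketch below assumes the CSV layout that `perf stat -x,` emits, with the counter value in field 1 and the event name in field 3. `perf_ipc` is a hypothetical helper name, not part of perf.

```shell
# perf_ipc: read `perf stat -x,` CSV from stdin and print IPC and CPI.
# Assumed layout: value,unit,event,... (field 1 = count, field 3 = event name).
perf_ipc() {
  awk -F, '
    $3 ~ /^instructions/ { ins = $1 }   # also matches modifiers such as instructions:u
    $3 ~ /^cycles/       { cyc = $1 }
    END {
      # Non-numeric values such as "<not counted>" evaluate to 0 here.
      if (ins + 0 == 0 || cyc + 0 == 0) {
        print "perf_ipc: instructions or cycles missing" > "/dev/stderr"
        exit 1
      }
      printf "IPC=%.2f CPI=%.2f\n", ins / cyc, cyc / ins
    }'
}

# Usage (-o writes the CSV to a file instead of stderr):
#   perf stat -x, -o stat.csv -e instructions,cycles ./prog
#   perf_ipc < stat.csv
```

Event names can differ on hybrid CPUs (for example a PMU prefix before the event name), in which case the two patterns need adjusting.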
- Specifying PMU events with -e
Specific hardware events
perf stat -e instructions,cycles,cache-misses,branch-misses ./prog
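L1 and last-level cache (LLC) events, using perf's generic aliases
perf stat -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./prog
L2 has no generic alias in perf; pick a CPU-specific event from perf list (for example l2_rqsts.references and l2_rqsts.miss on many Intel cores)
perf stat -e l2_rqsts.references,l2_rqsts.miss ./prog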
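Memory bandwidth (Intel server parts; uncore events are counted system-wide, so pass -a, which usually needs root or a relaxed perf_event_paranoid setting)
perf stat -a -e uncore_imc/cas_count_read/,uncore_imc/cas_count_write/ ./prog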
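To turn the two CAS counts into a bandwidth figure, a minimal sketch follows. It assumes raw CAS counts and 64 bytes (one cache line) transferred per CAS command. Some perf versions already scale these events and report them in MiB, in which case the multiplication by 64 must be dropped. `mem_bw_gbs` is a hypothetical helper name.

```shell
# mem_bw_gbs READ_CAS WRITE_CAS SECONDS
# Prints approximate DRAM bandwidth in GB/s, assuming 64 bytes per CAS command.
mem_bw_gbs() {
  awk -v r="$1" -v w="$2" -v s="$3" 'BEGIN {
    if (s + 0 <= 0) {
      print "mem_bw_gbs: elapsed time must be positive" > "/dev/stderr"
      exit 1
    }
    printf "%.2f\n", (r + w) * 64 / s / 1e9
  }'
}

# Usage with made-up counts: 100M reads + 50M writes over 1 second
#   mem_bw_gbs 100000000 50000000 1.0
```

Compare the result against the platform's theoretical peak bandwidth to judge how close the workload is to saturating memory.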
TLB misses
perf stat -e dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses ./prog
Branch misprediction rate
perf stat -e branches,branch-misses ./prog
Rate = branch-misses / branches × 100%
Available events (varies by CPU)
perf list hardware # generic hardware events
perf list cache # cache events
perf list pmu # raw PMU events for your CPU
- Key metrics and thresholds
| Metric | Formula | Healthy | Concerning |
|---|---|---|---|
| IPC | instructions / cycles | > 2.0 (modern x86) | < 1.0 |
| L1 miss rate | L1-misses / L1-accesses | < 1% | > 5% |
| LLC miss rate | LLC-misses / LLC-accesses | < 1% | > 10% |
| Branch miss rate | branch-misses / branches | < 1% | > 5% |
| MPKI | misses per 1K instructions | n/a | L3 MPKI > 10 = memory bound |
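The rate thresholds in the table can be applied mechanically. The sketch below takes a miss count, a total count, and the two percentage thresholds, then prints the rate with a label. `rate_status` and the "borderline" label for values between the two thresholds are illustrative assumptions, not perf features.

```shell
# rate_status MISSES TOTAL HEALTHY_PCT CONCERNING_PCT
# Prints the miss rate in percent plus healthy / borderline / concerning.
rate_status() {
  awk -v m="$1" -v t="$2" -v lo="$3" -v hi="$4" 'BEGIN {
    if (t + 0 <= 0) {
      print "rate_status: total must be positive" > "/dev/stderr"
      exit 1
    }
    rate = 100 * m / t
    label = (rate < lo + 0) ? "healthy" : ((rate > hi + 0) ? "concerning" : "borderline")
    printf "%.2f%% %s\n", rate, label
  }'
}

# Usage: branch miss rate against the < 1% / > 5% thresholds
#   rate_status "$branch_misses" "$branches" 1 5
```

The thresholds are rules of thumb; what counts as healthy depends on the workload and the microarchitecture.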
Compute MPKI (Misses Per Kilo-Instructions)
perf stat -e instructions,LLC-load-misses ./prog
MPKI = LLC-load-misses / (instructions / 1000)
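The MPKI formula above is a one-liner once the two counts are known. A minimal sketch, with `mpki` as a hypothetical helper name:

```shell
# mpki MISSES INSTRUCTIONS
# Prints misses per 1000 instructions with two decimals.
mpki() {
  awk -v m="$1" -v i="$2" 'BEGIN {
    if (i + 0 <= 0) {
      print "mpki: instruction count must be positive" > "/dev/stderr"
      exit 1
    }
    printf "%.2f\n", m / (i / 1000)
  }'
}

# Usage with made-up counts: 5M LLC load misses over 1G instructions
#   mpki 5000000 1000000000
```

Unlike a miss rate, MPKI is normalized by instructions rather than by accesses, so it stays comparable between runs whose access counts differ.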
- Raw PMU events (CPU-specific)
For events not in the generic aliases, use raw event codes:
Intel: use perf list or look up in Intel SDM
Format: rXXYY where XX=umask, YY=event code
perf stat -e r0124 ./prog # example Intel raw event
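The rXXYY string can be assembled from a umask and an event code, which avoids transposing the two bytes by hand. In this sketch `raw_event` is a hypothetical helper name. The Intel LLC-miss encoding in the usage comment (event 0x2E, umask 0x41) is an assumption to verify against your CPU's manual or perf list.

```shell
# raw_event UMASK EVENT_CODE
# Prints a perf raw event string: "r", then the umask byte, then the event code byte.
raw_event() {
  printf 'r%02x%02x\n' "$1" "$2"
}

# Usage:
#   raw_event 0x01 0x24                             # same string as the r0124 example
#   perf stat -e "$(raw_event 0x41 0x2e)" ./prog    # assumed Intel LLC-miss encoding
```

This covers only the two-byte umask/event form; events that need extra fields (cmask, edge, inv) require the fuller cpu/event=...,umask=.../ syntax.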
List Intel events with ocperf (a perf wrapper from pmu-tools that resolves symbolic Intel event names)
git clone https://github.com/andikleen/pmu-tools
./pmu-tools/ocperf.py list | grep "mem_load"
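Use libpfm4 for event names (showevtinfo is built from libpfm4's examples directory)
showevtinfo | grep "MEM_LOAD"
perf stat --pfm-events MEM_LOAD_RETIRED:L3_MISS ./prog # requires a perf binary built with libpfm4 support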
AMD: similar approach
perf stat -e r04041 ./prog # AMD raw event
- Source-level annotation with perf record/annotate
Record with hardware events
perf record -e LLC-load-misses -g ./prog
Annotate: show source lines sorted by cache miss count
perf annotate --stdio
Interactive (requires debug symbols)
perf report
Press 'a' on a function to annotate it
Combined: record hotspot + annotate
perf record -e cycles:u -g ./prog
perf annotate --symbol=my_function --stdio 2>/dev/null | head -40
Example annotate output:
Percent | Source code
45.23 | for (int i = 0; i < N; i++)
3.12 | sum += data[i]; ← cache miss here (strided access)
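Annotate output for a large function can run to hundreds of lines, so it helps to keep only the hot ones. The sketch below assumes each sample line starts with the percentage as its first whitespace-separated field, as in the example output above. The exact layout differs between perf versions, and `hot_lines` is a hypothetical helper name.

```shell
# hot_lines THRESHOLD
# Reads annotate-style text on stdin; prints lines whose first field is a
# number greater than or equal to THRESHOLD (a percentage).
hot_lines() {
  awk -v min="$1" '$1 ~ /^[0-9]+(\.[0-9]+)?$/ && $1 + 0 >= min + 0 { print }'
}

# Usage:
#   perf annotate --symbol=my_function --stdio 2>/dev/null | hot_lines 5
```

Because of sampling skid, a hot percentage may land on the instruction just after the one that actually missed, so read the neighbouring lines as well.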
- PAPI — Portable API for hardware counters
PAPI provides a portable C API across different CPU architectures:
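#include <papi.h>
#include <stdio.h>

/* Placeholder workload: replace with the code you want to measure. */
static void do_work(void) {
    volatile double sum = 0.0;
    for (int i = 0; i < 10000000; i++) sum += i * 0.5;
}

int main(void) {
    int events[] = {PAPI_TOT_INS, PAPI_TOT_CYC, PAPI_L2_TCM, PAPI_BR_MSP};
    long long values[4];
    int event_set = PAPI_NULL;
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI init failed\n");
        return 1;
    }
    /* Low-level API: PAPI_start_counters/PAPI_stop_counters were removed in PAPI 6.0. */
    if (PAPI_create_eventset(&event_set) != PAPI_OK ||
        PAPI_add_events(event_set, events, 4) != PAPI_OK ||
        PAPI_start(event_set) != PAPI_OK) {
        fprintf(stderr, "PAPI event setup failed (check papi_avail)\n");
        return 1;
    }
    do_work();  /* code to measure */
    if (PAPI_stop(event_set, values) != PAPI_OK) {
        fprintf(stderr, "PAPI_stop failed\n");
        return 1;
    }
    printf("Instructions: %lld\n", values[0]);
    printf("Cycles: %lld\n", values[1]);
    printf("IPC: %.2f\n", (double)values[0] / values[1]);
    printf("L2 cache misses: %lld\n", values[2]);
    printf("Branch mispred: %lld\n", values[3]);
    return 0;
}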
Build with PAPI
gcc -O2 -g -o prog prog.c -lpapi
Available PAPI events on your system
papi_avail -a | head -30
papi_native_avail | grep "L3" # native events with "L3"
Common PAPI presets:
| Preset | Event |
|---|---|
| PAPI_TOT_INS | Total instructions |
| PAPI_TOT_CYC | Total cycles |
| PAPI_L1_DCM | L1 data cache misses |
| PAPI_L2_TCM | L2 total cache misses |
| PAPI_L3_TCM | L3 total cache misses |
| PAPI_BR_MSP | Branch mispredictions |
| PAPI_TLB_DM | Data TLB misses |
| PAPI_FP_INS | Floating point instructions |
| PAPI_VEC_INS | Vector/SIMD instructions |
- Intel PCM (Performance Counter Monitor)
Intel PCM: system-wide counters; typically needs root or equivalent MSR/perf_event access, depending on kernel configuration
git clone --recursive https://github.com/intel/pcm
cd pcm && cmake -S . -B build && cmake --build build
Measure memory bandwidth
./build/bin/pcm-memory 1 # sample every 1 second
Core utilization + IPC
./build/bin/pcm 1
Cache miss breakdown per socket
./build/bin/pcm 1 -csv | head -20
Related skills
- Use skills/profilers/intel-vtune-amd-uprof for guided microarchitecture analysis
- Use skills/profilers/linux-perf for perf record/report and flamegraph generation
- Use skills/low-level-programming/cpu-cache-opt for applying cache optimization patterns
- Use skills/low-level-programming/simd-intrinsics for improving FLOPS/cycle metrics