Hardware Performance Counters

Purpose

Guide agents through hardware performance counter analysis: collecting PMU events with perf stat -e , using the PAPI library for portable counter access, interpreting cache miss rates and branch misprediction ratios, computing IPC, and correlating events to source lines with perf annotate .

Triggers

"How do I measure cache miss rate with perf?"
"How do I count branch mispredictions?"
"How do I compute IPC (instructions per clock) with perf?"
"How do I use the PAPI library for hardware counters?"
"How do I see which source lines cause the most cache misses?"
"How do I measure memory bandwidth with performance counters?"

Workflow

perf stat — basic counter collection

Basic hardware event summary

perf stat ./prog

Output:

Performance counter stats for './prog':

1,234,567,890 instructions

456,789,012 cycles

12,345,678 cache-misses # 1.23 % of all cache refs

23,456,789 branch-misses # 2.34 % of all branches

0.456789012 seconds time elapsed

Derived metrics (computed from the output)

IPC = instructions / cycles = 1,234,567,890 / 456,789,012 ≈ 2.70

CPI = cycles / instructions ≈ 0.37

Specifying PMU events with -e

Specific hardware events

perf stat -e instructions,cycles,cache-misses,branch-misses ./prog

L1/L2/L3 cache events

perf stat -e
L1-dcache-loads,L1-dcache-load-misses,
L2-loads,L2-load-misses,
LLC-loads,LLC-load-misses
./prog

Memory bandwidth (Intel)

perf stat -e
uncore_imc/cas_count_read/,
uncore_imc/cas_count_write/
./prog

TLB misses

perf stat -e dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses ./prog

Branch misprediction rate

perf stat -e branches,branch-misses ./prog

Rate = branch-misses / branches × 100%

Available events (varies by CPU)

perf list hardware # generic hardware events perf list cache # cache events perf list pmu # raw PMU events for your CPU

Key metrics and thresholds

Metric Formula Healthy Concerning

IPC instructions / cycles

2.0 (modern x86) < 1.0

L1 miss rate L1-misses / L1-accesses < 1%

5%

LLC miss rate LLC-misses / LLC-accesses < 1%

10%

Branch miss rate branch-misses / branches < 1%

5%

MPKI misses per 1K instructions — L3 MPKI > 10 = memory bound

Compute MPKI (Misses Per Kilo-Instructions)

perf stat -e instructions,LLC-load-misses ./prog

MPKI = LLC-load-misses / (instructions / 1000)

Raw PMU events (CPU-specific)

For events not in the generic aliases, use raw event codes:

Intel: use perf list or look up in Intel SDM

Format: rXXYY where XX=umask, YY=event code

perf stat -e r0124 ./prog # example Intel raw event

List Intel events with ocperf (OpenCL Perf Events)

pip install ocperf ocperf.py list | grep "mem_load"

Use libpfm4 for event names

pfm_ls | grep "MEM_LOAD" perf stat -e $(pfm_ls | grep "MEM_LOAD_RETIRED.L3_MISS") ./prog

AMD: similar approach

perf stat -e r04041 ./prog # AMD raw event

Source-level annotation with perf record/annotate

Record with hardware events

perf record -e LLC-load-misses -g ./prog

Annotate: show source lines sorted by cache miss count

perf annotate --stdio

Interactive (requires debug symbols)

perf report

Press 'a' on a function to annotate it

Combined: record hotspot + annotate

perf record -e cycles:u -g ./prog perf annotate --symbol=my_function --stdio 2>/dev/null | head -40

Example annotate output:

Percent | Source code

45.23 | for (int i = 0; i < N; i++)

3.12 | sum += data[i]; ← cache miss here (strided access)

PAPI — Portable API for hardware counters

PAPI provides a portable C API across different CPU architectures:

#include <papi.h> #include <stdio.h>

int main(void) { int Events[] = {PAPI_TOT_INS, PAPI_TOT_CYC, PAPI_L2_TCM, PAPI_BR_MSP}; long long values[4];

if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
    fprintf(stderr, "PAPI init failed\n");
    return 1;
}

PAPI_start_counters(Events, 4);

// --- Code to measure ---
do_work();
// -----------------------

PAPI_stop_counters(values, 4);

printf("Instructions:      %lld\n", values[0]);
printf("Cycles:            %lld\n", values[1]);
printf("IPC:               %.2f\n", (double)values[0]/values[1]);
printf("L2 cache misses:   %lld\n", values[2]);
printf("Branch mispred:    %lld\n", values[3]);

return 0;

}

Build with PAPI

gcc -O2 -g -o prog prog.c -lpapi

Available PAPI events on your system

papi_avail -a | head -30 papi_native_avail | grep "L3" # native events with "L3"

Common PAPI presets:

Preset Event

PAPI_TOT_INS

Total instructions

PAPI_TOT_CYC

Total cycles

PAPI_L1_DCM

L1 data cache misses

PAPI_L2_TCM

L2 total cache misses

PAPI_L3_TCM

L3 total cache misses

PAPI_BR_MSP

Branch mispredictions

PAPI_TLB_DM

Data TLB misses

PAPI_FP_INS

Floating point instructions

PAPI_VEC_INS

Vector/SIMD instructions

Intel PCM (Performance Counter Monitor)

Intel PCM — system-wide counters, no root required on modern kernels

git clone https://github.com/intel/pcm cd pcm && cmake -S . -B build && cmake --build build

Measure memory bandwidth

./build/bin/pcm-memory 1 # sample every 1 second

Core utilization + IPC

./build/bin/pcm 1

Cache miss breakdown per socket

./build/bin/pcm 1 -csv | head -20

Related skills

Use skills/profilers/intel-vtune-amd-uprof for guided microarchitecture analysis
Use skills/profilers/linux-perf for perf record/report and flamegraph generation
Use skills/low-level-programming/cpu-cache-opt for applying cache optimization patterns
Use skills/low-level-programming/simd-intrinsics for improving FLOPS/cycle metrics