hardware-counters

Hardware Performance Counters

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "hardware-counters" with this command: npx skills add mohitmishra786/low-level-dev-skills/mohitmishra786-low-level-dev-skills-hardware-counters

Hardware Performance Counters

Purpose

Guide agents through hardware performance counter analysis: collecting PMU events with perf stat -e , using the PAPI library for portable counter access, interpreting cache miss rates and branch misprediction ratios, computing IPC, and correlating events to source lines with perf annotate .

Triggers

  • "How do I measure cache miss rate with perf?"

  • "How do I count branch mispredictions?"

  • "How do I compute IPC (instructions per clock) with perf?"

  • "How do I use the PAPI library for hardware counters?"

  • "How do I see which source lines cause the most cache misses?"

  • "How do I measure memory bandwidth with performance counters?"

Workflow

  1. perf stat — basic counter collection

Basic hardware event summary

perf stat ./prog

Output:

Performance counter stats for './prog':

1,234,567,890 instructions

456,789,012 cycles

12,345,678 cache-misses # 1.23 % of all cache refs

23,456,789 branch-misses # 2.34 % of all branches

0.456789012 seconds time elapsed

Derived metrics (computed from the output)

IPC = instructions / cycles = 1,234,567,890 / 456,789,012 ≈ 2.70

CPI = cycles / instructions ≈ 0.37

  1. Specifying PMU events with -e

Specific hardware events

perf stat -e instructions,cycles,cache-misses,branch-misses ./prog

L1/L2/L3 cache events

perf stat -e
L1-dcache-loads,L1-dcache-load-misses,
L2-loads,L2-load-misses,
LLC-loads,LLC-load-misses
./prog

Memory bandwidth (Intel)

perf stat -e
uncore_imc/cas_count_read/,
uncore_imc/cas_count_write/
./prog

TLB misses

perf stat -e dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses ./prog

Branch misprediction rate

perf stat -e branches,branch-misses ./prog

Rate = branch-misses / branches × 100%

Available events (varies by CPU)

perf list hardware # generic hardware events perf list cache # cache events perf list pmu # raw PMU events for your CPU

  1. Key metrics and thresholds

Metric Formula Healthy Concerning

IPC instructions / cycles

2.0 (modern x86) < 1.0

L1 miss rate L1-misses / L1-accesses < 1%

5%

LLC miss rate LLC-misses / LLC-accesses < 1%

10%

Branch miss rate branch-misses / branches < 1%

5%

MPKI misses per 1K instructions — L3 MPKI > 10 = memory bound

Compute MPKI (Misses Per Kilo-Instructions)

perf stat -e instructions,LLC-load-misses ./prog

MPKI = LLC-load-misses / (instructions / 1000)

  1. Raw PMU events (CPU-specific)

For events not in the generic aliases, use raw event codes:

Intel: use perf list or look up in Intel SDM

Format: rXXYY where XX=umask, YY=event code

perf stat -e r0124 ./prog # example Intel raw event

List Intel events with ocperf (OpenCL Perf Events)

pip install ocperf ocperf.py list | grep "mem_load"

Use libpfm4 for event names

pfm_ls | grep "MEM_LOAD" perf stat -e $(pfm_ls | grep "MEM_LOAD_RETIRED.L3_MISS") ./prog

AMD: similar approach

perf stat -e r04041 ./prog # AMD raw event

  1. Source-level annotation with perf record/annotate

Record with hardware events

perf record -e LLC-load-misses -g ./prog

Annotate: show source lines sorted by cache miss count

perf annotate --stdio

Interactive (requires debug symbols)

perf report

Press 'a' on a function to annotate it

Combined: record hotspot + annotate

perf record -e cycles:u -g ./prog perf annotate --symbol=my_function --stdio 2>/dev/null | head -40

Example annotate output:

Percent | Source code

45.23 | for (int i = 0; i < N; i++)

3.12 | sum += data[i]; ← cache miss here (strided access)

  1. PAPI — Portable API for hardware counters

PAPI provides a portable C API across different CPU architectures:

#include <papi.h> #include <stdio.h>

int main(void) { int Events[] = {PAPI_TOT_INS, PAPI_TOT_CYC, PAPI_L2_TCM, PAPI_BR_MSP}; long long values[4];

if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
    fprintf(stderr, "PAPI init failed\n");
    return 1;
}

PAPI_start_counters(Events, 4);

// --- Code to measure ---
do_work();
// -----------------------

PAPI_stop_counters(values, 4);

printf("Instructions:      %lld\n", values[0]);
printf("Cycles:            %lld\n", values[1]);
printf("IPC:               %.2f\n", (double)values[0]/values[1]);
printf("L2 cache misses:   %lld\n", values[2]);
printf("Branch mispred:    %lld\n", values[3]);

return 0;

}

Build with PAPI

gcc -O2 -g -o prog prog.c -lpapi

Available PAPI events on your system

papi_avail -a | head -30 papi_native_avail | grep "L3" # native events with "L3"

Common PAPI presets:

Preset Event

PAPI_TOT_INS

Total instructions

PAPI_TOT_CYC

Total cycles

PAPI_L1_DCM

L1 data cache misses

PAPI_L2_TCM

L2 total cache misses

PAPI_L3_TCM

L3 total cache misses

PAPI_BR_MSP

Branch mispredictions

PAPI_TLB_DM

Data TLB misses

PAPI_FP_INS

Floating point instructions

PAPI_VEC_INS

Vector/SIMD instructions

  1. Intel PCM (Performance Counter Monitor)

Intel PCM — system-wide counters, no root required on modern kernels

git clone https://github.com/intel/pcm cd pcm && cmake -S . -B build && cmake --build build

Measure memory bandwidth

./build/bin/pcm-memory 1 # sample every 1 second

Core utilization + IPC

./build/bin/pcm 1

Cache miss breakdown per socket

./build/bin/pcm 1 -csv | head -20

Related skills

  • Use skills/profilers/intel-vtune-amd-uprof for guided microarchitecture analysis

  • Use skills/profilers/linux-perf for perf record/report and flamegraph generation

  • Use skills/low-level-programming/cpu-cache-opt for applying cache optimization patterns

  • Use skills/low-level-programming/simd-intrinsics for improving FLOPS/cycle metrics

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

Coding

cmake

No summary provided by upstream source.

Repository SourceNeeds Review
Coding

static-analysis

No summary provided by upstream source.

Repository SourceNeeds Review
Coding

llvm

No summary provided by upstream source.

Repository SourceNeeds Review
Coding

gdb

No summary provided by upstream source.

Repository SourceNeeds Review