ARM / AArch64 Assembly

Purpose

Guide agents through AArch64 (64-bit) and ARM (32-bit Thumb) assembly: registers, calling conventions, inline asm, and NEON/SVE SIMD patterns.

Triggers

"How do I read ARM64 assembly output?"
"What are the AArch64 registers and calling convention?"
"How do I write inline asm for ARM?"
"What is the difference between AArch64 and ARM Thumb?"
"How do I use NEON intrinsics?"

Workflow

Generate ARM assembly

AArch64 (native or cross-compile)

aarch64-linux-gnu-gcc -S -O2 foo.c -o foo.s

32-bit ARM Thumb

arm-linux-gnueabihf-gcc -S -O2 -mthumb foo.c -o foo.s

From objdump

aarch64-linux-gnu-objdump -d -S prog

From GDB on target

(gdb) disassemble /s main

AArch64 registers (AAPCS64)

x0 –x7

— Arguments 1–8 and return values

Indirect result location (struct return)

x9 –x15

— Caller-saved temporaries

x16 –x17

ip0 , ip1

Intra-procedure-call temporaries (used by linker)

x18

Platform register (reserved on some OS)

x19 –x28

— Callee-saved

x29

Frame pointer (callee-saved)

x30

Link register (return address)

— Stack pointer (must be 16-byte aligned at call)

— Program counter (not directly accessible)

xzr

wzr

Zero register (reads as 0, writes discarded)

v0 –v7

q0 –q7

FP/SIMD args and return

v8 –v15

— Callee-saved SIMD (lower 64 bits only)

v16 –v31

— Caller-saved temporaries

Width variants: x0 (64-bit), w0 (32-bit, zero-extends to 64), h0 (16), b0 (8).

AAPCS64 calling convention

Integer/pointer args: x0 –x7

Float/SIMD args: v0 –v7

Return: x0 (int), x0 +x1 (128-bit), v0 (float/SIMD) Callee-saved: x19 –x28 , x29 (fp), x30 (lr), v8 –v15 (lower 64 bits) Caller-saved: everything else

Stack must be 16-byte aligned at any bl or blr instruction.

Common AArch64 instructions

Instruction Effect

mov x0, x1

Copy register

mov x0, #42

Load immediate

movz x0, #0x1234, lsl #16

Move zero-extended with shift

movk x0, #0xabcd

Move with keep (partial update)

ldr x0, [x1]

Load 64-bit from address in x1

ldr x0, [x1, #8]

Load from x1+8

str x0, [x1, #8]

Store x0 to x1+8

ldp x0, x1, [sp, #16]

Load pair (two regs at once)

stp x29, x30, [sp, #-16]!

Store pair, pre-decrement sp

add x0, x1, x2

x0 = x1 + x2

add x0, x1, #8

x0 = x1 + 8

sub x0, x1, x2

x0 = x1 - x2

mul x0, x1, x2

x0 = x1 * x2

sdiv x0, x1, x2

Signed divide

udiv x0, x1, x2

Unsigned divide

cmp x0, x1

Set flags for x0 - x1

cbz x0, label

Branch if x0 == 0

cbnz x0, label

Branch if x0 != 0

bl func

Branch with link (call)

blr x0

Branch with link to address in x0

ret

Return (branch to x30)

ret x0

Return to address in x0

adrp x0, symbol

PC-relative page address

add x0, x0, :lo12:symbol

Low 12 bits of symbol offset

Typical function prologue/epilogue

// Non-leaf function stp x29, x30, [sp, #-32]! // save fp, lr; allocate 32 bytes mov x29, sp // set frame pointer stp x19, x20, [sp, #16] // save callee-saved registers // ... body ... ldp x19, x20, [sp, #16] // restore ldp x29, x30, [sp], #32 // restore fp, lr; deallocate ret

// Leaf function (no calls, no callee-saved regs needed) // Can use red zone (no rsp adjustment) — but AArch64 has no red zone sub sp, sp, #16 // allocate locals // ... body ... add sp, sp, #16 ret

Inline assembly (GCC/Clang)

// Barrier asm volatile ("dmb ish" ::: "memory");

// Load acquire static inline int load_acquire(volatile int *p) { int val; asm volatile ("ldar %w0, %1" : "=r"(val) : "Q"(*p)); return val; }

// Store release static inline void store_release(volatile int *p, int val) { asm volatile ("stlr %w1, %0" : "=Q"(*p) : "r"(val)); }

// Read system counter static inline uint64_t read_cntvct(void) { uint64_t val; asm volatile ("mrs %0, cntvct_el0" : "=r"(val)); return val; }

AArch64-specific constraints:

"Q" — memory operand suitable for exclusive/acquire/release instructions
"r" — any general-purpose register
"w" — any FP/SIMD register

NEON SIMD intrinsics

#include <arm_neon.h>

// Add 4 floats at once float32x4_t a = vld1q_f32(arr_a); // load 4 floats float32x4_t b = vld1q_f32(arr_b); float32x4_t c = vaddq_f32(a, b); vst1q_f32(result, c);

// Horizontal sum float32x4_t sum = vpaddq_f32(c, c); sum = vpaddq_f32(sum, sum); float total = vgetq_lane_f32(sum, 0);

Naming convention: v<op><q>_<type>

q suffix: 128-bit (quad) vector
_f32 : float32, _s32 : int32, _u8 : uint8, etc.

For a register reference, see references/reference.md.

Related skills

Use skills/low-level-programming/assembly-x86 for x86-64 assembly
Use skills/compilers/cross-gcc for cross-compilation toolchain
Use skills/debuggers/gdb for debugging ARM code with gdbserver

assembly-arm

Safety Notice

Copy this and send it to your AI assistant to learn

AArch64 (native or cross-compile)

32-bit ARM Thumb

From objdump

From GDB on target

Source Transparency

Related Skills

cmake

static-analysis

llvm

gdb