ARM / AArch64 Assembly
Purpose
Guide agents through AArch64 (64-bit) and ARM (32-bit Thumb) assembly: registers, calling conventions, inline asm, and NEON/SVE SIMD patterns.
Triggers
-
"How do I read ARM64 assembly output?"
-
"What are the AArch64 registers and calling convention?"
-
"How do I write inline asm for ARM?"
-
"What is the difference between AArch64 and ARM Thumb?"
-
"How do I use NEON intrinsics?"
Workflow
- Generate ARM assembly
AArch64 (native or cross-compile)
aarch64-linux-gnu-gcc -S -O2 foo.c -o foo.s
32-bit ARM Thumb
arm-linux-gnueabihf-gcc -S -O2 -mthumb foo.c -o foo.s
From objdump
aarch64-linux-gnu-objdump -d -S prog
From GDB on target
(gdb) disassemble /s main
- AArch64 registers (AAPCS64)
Register Alias Role
x0 –x7
— Arguments 1–8 and return values
x8
xr
Indirect result location (struct return)
x9 –x15
— Caller-saved temporaries
x16 –x17
ip0 , ip1
Intra-procedure-call temporaries (used by linker)
x18
pr
Platform register (reserved on some OS)
x19 –x28
— Callee-saved
x29
fp
Frame pointer (callee-saved)
x30
lr
Link register (return address)
sp
— Stack pointer (must be 16-byte aligned at call)
pc
— Program counter (not directly accessible)
xzr
wzr
Zero register (reads as 0, writes discarded)
v0 –v7
q0 –q7
FP/SIMD args and return
v8 –v15
— Callee-saved SIMD (lower 64 bits only)
v16 –v31
— Caller-saved temporaries
Width variants: x0 (64-bit), w0 (32-bit, zero-extends to 64), h0 (16), b0 (8).
- AAPCS64 calling convention
Integer/pointer args: x0 –x7
Float/SIMD args: v0 –v7
Return: x0 (int), x0 +x1 (128-bit), v0 (float/SIMD) Callee-saved: x19 –x28 , x29 (fp), x30 (lr), v8 –v15 (lower 64 bits) Caller-saved: everything else
Stack must be 16-byte aligned at any bl or blr instruction.
- Common AArch64 instructions
Instruction Effect
mov x0, x1
Copy register
mov x0, #42
Load immediate
movz x0, #0x1234, lsl #16
Move zero-extended with shift
movk x0, #0xabcd
Move with keep (partial update)
ldr x0, [x1]
Load 64-bit from address in x1
ldr x0, [x1, #8]
Load from x1+8
str x0, [x1, #8]
Store x0 to x1+8
ldp x0, x1, [sp, #16]
Load pair (two regs at once)
stp x29, x30, [sp, #-16]!
Store pair, pre-decrement sp
add x0, x1, x2
x0 = x1 + x2
add x0, x1, #8
x0 = x1 + 8
sub x0, x1, x2
x0 = x1 - x2
mul x0, x1, x2
x0 = x1 * x2
sdiv x0, x1, x2
Signed divide
udiv x0, x1, x2
Unsigned divide
cmp x0, x1
Set flags for x0 - x1
cbz x0, label
Branch if x0 == 0
cbnz x0, label
Branch if x0 != 0
bl func
Branch with link (call)
blr x0
Branch with link to address in x0
ret
Return (branch to x30)
ret x0
Return to address in x0
adrp x0, symbol
PC-relative page address
add x0, x0, :lo12:symbol
Low 12 bits of symbol offset
- Typical function prologue/epilogue
// Non-leaf function stp x29, x30, [sp, #-32]! // save fp, lr; allocate 32 bytes mov x29, sp // set frame pointer stp x19, x20, [sp, #16] // save callee-saved registers // ... body ... ldp x19, x20, [sp, #16] // restore ldp x29, x30, [sp], #32 // restore fp, lr; deallocate ret
// Leaf function (no calls, no callee-saved regs needed) // Can use red zone (no rsp adjustment) — but AArch64 has no red zone sub sp, sp, #16 // allocate locals // ... body ... add sp, sp, #16 ret
- Inline assembly (GCC/Clang)
// Barrier asm volatile ("dmb ish" ::: "memory");
// Load acquire static inline int load_acquire(volatile int *p) { int val; asm volatile ("ldar %w0, %1" : "=r"(val) : "Q"(*p)); return val; }
// Store release static inline void store_release(volatile int *p, int val) { asm volatile ("stlr %w1, %0" : "=Q"(*p) : "r"(val)); }
// Read system counter static inline uint64_t read_cntvct(void) { uint64_t val; asm volatile ("mrs %0, cntvct_el0" : "=r"(val)); return val; }
AArch64-specific constraints:
-
"Q" — memory operand suitable for exclusive/acquire/release instructions
-
"r" — any general-purpose register
-
"w" — any FP/SIMD register
- NEON SIMD intrinsics
#include <arm_neon.h>
// Add 4 floats at once float32x4_t a = vld1q_f32(arr_a); // load 4 floats float32x4_t b = vld1q_f32(arr_b); float32x4_t c = vaddq_f32(a, b); vst1q_f32(result, c);
// Horizontal sum float32x4_t sum = vpaddq_f32(c, c); sum = vpaddq_f32(sum, sum); float total = vgetq_lane_f32(sum, 0);
Naming convention: v<op><q>_<type>
-
q suffix: 128-bit (quad) vector
-
_f32 : float32, _s32 : int32, _u8 : uint8, etc.
For a register reference, see references/reference.md.
Related skills
-
Use skills/low-level-programming/assembly-x86 for x86-64 assembly
-
Use skills/compilers/cross-gcc for cross-compilation toolchain
-
Use skills/debuggers/gdb for debugging ARM code with gdbserver