A Deeper Look at ARM Assembly Language

By Jeff Tranter Wednesday, June 29, 2022

Let's continue our blog series on ARM assembly language by drilling down into some of the basic ARM machine language instructions.

Instruction Formats and Addressing Modes

ARM instructions accept from zero to three (and occasionally more) operands. An optional S suffix can be added to indicate that the result should affect the flags in the status register. The last source operand can usually be either a register or immediate data, and the destination register can usually be the same as a source register.

The most basic instruction is MOV (for move) and takes the form MOV dest, src. Here are some examples:

MOV  R1,R2       Copy the contents of R2 to R1
MOV  R2,#1234    Move immediate value 1234 into R2
MVN  R1,R2       Move Not; copy the 1's complement (bitwise NOT) of R2 into R1
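To make the difference between a plain move and the 1's complement move concrete, here is a minimal C sketch of what each one leaves in the destination register. The function names move_copy and move_not are made up for illustration; they model the result only, not the instruction encoding.

```c
#include <stdint.h>

/* Model of a plain register move: the destination gets a copy of the source. */
static uint32_t move_copy(uint32_t src)
{
    return src;
}

/* Model of the 1's complement move: every bit of the source is inverted. */
static uint32_t move_not(uint32_t src)
{
    return ~src;
}
```

Note that inverting every bit is not the same as negating: in two's complement arithmetic, the inverted value of x equals -x - 1.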

Math Functions

All of the common arithmetic functions are provided. Below are some examples:

Add:
ADD  R0,R1,R2     R0 ← R1 + R2 Add
ADC  R0,R1,R2     R0 ← R1 + R2 + carry flag  Add with carry
ADDS R0,R1,R2     R0 ← R1 + R2 Add, setting flags
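The usual reason to pair a flag-setting add with an add-with-carry is arithmetic wider than one register. The following C sketch models that idea under simple assumptions: adds32, adc32, and add64_from_halves are hypothetical helper names, and only the carry flag is modelled.

```c
#include <stdint.h>

/* Result of a flag-setting add: the 32-bit sum plus the carry-out bit. */
typedef struct {
    uint32_t value;
    unsigned carry;   /* 1 if the unsigned sum did not fit in 32 bits */
} add_result;

/* Model of ADDS: add and record the carry flag. */
static add_result adds32(uint32_t a, uint32_t b)
{
    add_result r;
    r.value = a + b;          /* wraps modulo 2^32 */
    r.carry = r.value < a;    /* a wrapped sum is smaller than either operand */
    return r;
}

/* Model of ADC: add the two operands plus the incoming carry flag. */
static uint32_t adc32(uint32_t a, uint32_t b, unsigned carry_in)
{
    return a + b + carry_in;
}

/* 64-bit add built from 32-bit halves:
   ADDS on the low words, then ADC on the high words. */
static uint64_t add64_from_halves(uint64_t x, uint64_t y)
{
    add_result lo = adds32((uint32_t)x, (uint32_t)y);
    uint32_t hi = adc32((uint32_t)(x >> 32), (uint32_t)(y >> 32), lo.carry);
    return ((uint64_t)hi << 32) | lo.value;
}
```

On a real ARM core the same thing would be two instructions, with the carry passed implicitly through the status register rather than as a function argument.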

Subtract:
SUB R0,R1,R2      R0 ← R1-R2 Subtract
RSB R0,R1,R2      R0 ← R2-R1 Reverse subtract

Multiply, Multiply and Accumulate:
MUL R0,R1,R2      R0 ← R1*R2
MLA R0,R1,R2,R3   R0 ← (R1*R2)+R3

There is also (at least on some ARM platforms) a "long multiply" that produces a 64-bit result.
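As a rough illustration of why a long multiply matters, this C sketch models the three multiply forms. The helper names are invented for the example; umull64 stands in for an unsigned long multiply such as UMULL, which may not exist on every ARM core.

```c
#include <stdint.h>

/* Model of MUL: only the low 32 bits of the product are kept. */
static uint32_t mul32(uint32_t a, uint32_t b)
{
    return a * b;
}

/* Model of MLA: multiply, then add a third value (multiply and accumulate). */
static uint32_t mla32(uint32_t a, uint32_t b, uint32_t acc)
{
    return a * b + acc;
}

/* Model of an unsigned long multiply: the full 64-bit product is kept. */
static uint64_t umull64(uint32_t a, uint32_t b)
{
    return (uint64_t)a * b;
}
```

Multiplying 0x10000 by 0x10000 shows the difference: the 32-bit form wraps to zero, while the long form returns 0x100000000.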

Compare and Logical Instructions

Instructions for comparing values are provided, e.g.

CMP  R1,R2  Compare R1 with R2 (computes R1 - R2), set flags
CMP  R3,#0  Compare R3 with zero, set flags
CMN  R4,#0  Compare negative (computes R4 + 0, i.e. compares R4 with the negated operand), set flags
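A compare is a subtraction whose result is thrown away and whose flags are kept; the compare-negative form does the same with an addition. Here is a small C sketch of how the four flags (N, Z, C, V) come out of each. The struct and function names are illustrative assumptions, not part of any ARM toolchain.

```c
#include <stdint.h>

/* The four condition flags, each 0 or 1. */
typedef struct {
    unsigned n;   /* negative: bit 31 of the result */
    unsigned z;   /* zero: result was 0 */
    unsigned c;   /* carry: for a subtract, set when no borrow occurred */
    unsigned v;   /* overflow: signed result did not fit in 32 bits */
} flags;

/* Model of a compare: flags from a - b, result discarded. */
static flags cmp_flags(uint32_t a, uint32_t b)
{
    uint32_t diff = a - b;
    flags f;
    f.n = diff >> 31;
    f.z = (diff == 0);
    f.c = (a >= b);
    /* Overflow: operands had different signs and the result's sign differs from a. */
    f.v = ((a ^ b) & (a ^ diff)) >> 31;
    return f;
}

/* Model of a compare negative: flags from a + b, result discarded. */
static flags cmn_flags(uint32_t a, uint32_t b)
{
    uint32_t sum = a + b;
    flags f;
    f.n = sum >> 31;
    f.z = (sum == 0);
    f.c = (sum < a);
    /* Overflow: operands had the same sign and the result's sign differs from a. */
    f.v = (~(a ^ b) & (a ^ sum)) >> 31;
    return f;
}
```

Note the ARM convention for subtraction: the carry flag is set when there was no borrow, which is the opposite of some other processors.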

Also provided are logical operations such as AND, ORR (or), and EOR (exclusive or), e.g.

AND R1,R2,R3  R1 ← R2 AND R3

Branching and Conditions

Branching works as it does on most processors, e.g.

BEQ label   Branch to label if Z flag is set
BNE label   Branch to label if Z is not set

In the more general case, most instructions can be made conditional simply by adding a suffix with the condition, e.g.

MOVCS   R0,R1   Move if carry flag set
MOV CS  R0,R1   Same as above, where the assembler accepts a space between mnemonic and condition (the GNU assembler does not)

The conditions supported are the following:

EQ/NE	Equal/Not equal
VS/VC	Overflow set/Overflow clear
AL	Always
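NV	Never (deprecated; avoid in new code)
HI	Higher (unsigned)
LS	Lower or same (unsigned)
PL	Plus (N flag clear)
MI	Minus (N flag set)
CS/HS	Carry set (unsigned higher or same)
CC/LO	Carry clear (unsigned lower)
GE	Greater than or equal (signed)
LT	Less than (signed)
GT	Greater than (signed)
LE	Less than or equal (signed)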
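Each condition in the table is just a test on the N, Z, C, and V flags. The C sketch below spells those tests out and pairs them with a simple model of the flags left by a compare. The names nzcv, flags_after_cmp, and condition_holds are hypothetical, and conditions are passed as strings purely for readability.

```c
#include <stdint.h>
#include <string.h>

/* The four condition flags, each 0 or 1. */
typedef struct { unsigned n, z, c, v; } nzcv;

/* Flags left behind by comparing a with b (computed from a - b). */
static nzcv flags_after_cmp(uint32_t a, uint32_t b)
{
    uint32_t diff = a - b;
    nzcv f;
    f.n = diff >> 31;
    f.z = (diff == 0);
    f.c = (a >= b);                          /* carry set means no borrow */
    f.v = ((a ^ b) & (a ^ diff)) >> 31;      /* signed overflow */
    return f;
}

/* Returns 1 if the condition passes, 0 if it fails, -1 if the name is unknown. */
static int condition_holds(const char *cond, nzcv f)
{
    if (!strcmp(cond, "EQ")) return f.z;
    if (!strcmp(cond, "NE")) return !f.z;
    if (!strcmp(cond, "CS") || !strcmp(cond, "HS")) return f.c;
    if (!strcmp(cond, "CC") || !strcmp(cond, "LO")) return !f.c;
    if (!strcmp(cond, "MI")) return f.n;
    if (!strcmp(cond, "PL")) return !f.n;
    if (!strcmp(cond, "VS")) return f.v;
    if (!strcmp(cond, "VC")) return !f.v;
    if (!strcmp(cond, "HI")) return f.c && !f.z;
    if (!strcmp(cond, "LS")) return !f.c || f.z;
    if (!strcmp(cond, "GE")) return f.n == f.v;
    if (!strcmp(cond, "LT")) return f.n != f.v;
    if (!strcmp(cond, "GT")) return !f.z && f.n == f.v;
    if (!strcmp(cond, "LE")) return f.z || f.n != f.v;
    if (!strcmp(cond, "AL")) return 1;
    if (!strcmp(cond, "NV")) return 0;
    return -1;
}
```

Comparing 0xFFFFFFFF with 1 shows why there are two families of conditions: HI passes because the unsigned value is larger, while LT also passes because the same bits read as -1 when treated as signed.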

Shifts and Rotates

The ARM CPU has a barrel shifter that can shift or rotate a result by up to 32 bit positions at once. Shifts and rotates are only done as part of other instructions and not explicitly with shift or rotate instructions (however, the assembler will accept them as instructions and convert them to a MOV).

The shift or rotate operation is added as an optional modifier on the last source operand. This is supported by instructions for move, add, subtract, compare, and, or, exclusive or, test, and others. The operations are LSL, LSR, ASR, ROR, and RRX (rotate right through the extend/carry bit); ASL is a synonym for LSL. There is no rotate left operation, since rotating left by n bits is the same as ROR by 32 - n. Logical shifts shift in a zero. Arithmetic shifts maintain the sign of the value.

Here are some examples:

MOV R0, R1, LSL #1       R0 ← R1 shifted left by 1 bit position
MOV R0, R1, ROR #4       R0 ← R1 rotated right by 4 bit positions
MOVSCS R0, R1, ASR #2    If carry bit is set, R0 ← R1 shifted right by two positions, maintaining sign and setting flags (written MOVCSS in the older pre-UAL syntax)
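To show what each shift and rotate does to the bits, here is a minimal C model of the operations. It assumes shift amounts from 0 to 31 (the hardware also handles a few special cases not modelled here), and the function names simply mirror the mnemonics for readability.

```c
#include <stdint.h>

/* Logical shift left: zeros enter from the right. Assumes n is 0..31. */
static uint32_t lsl(uint32_t x, unsigned n)
{
    return x << n;
}

/* Logical shift right: zeros enter from the left. Assumes n is 0..31. */
static uint32_t lsr(uint32_t x, unsigned n)
{
    return x >> n;
}

/* Arithmetic shift right: copies of the sign bit enter from the left.
   Written with unsigned arithmetic so the result does not depend on how
   the C compiler shifts negative signed values. Assumes n is 0..31. */
static uint32_t asr(uint32_t x, unsigned n)
{
    uint32_t shifted = x >> n;
    if ((x & 0x80000000u) && n > 0)
        shifted |= ~(0xFFFFFFFFu >> n);   /* fill the top n bits with ones */
    return shifted;
}

/* Rotate right: bits leaving on the right re-enter on the left. */
static uint32_t ror(uint32_t x, unsigned n)
{
    n &= 31;
    return n ? (x >> n) | (x << (32 - n)) : x;
}

/* Rotate right by one through the carry: the old carry becomes bit 31. */
static uint32_t rrx(uint32_t x, unsigned carry_in)
{
    return (x >> 1) | ((uint32_t)(carry_in & 1u) << 31);
}
```

A rotate left can be expressed with the same function: rotating left by 8 is ror(x, 24).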

Assembler Output From C or C++ Compiler

I mentioned the use case of wanting to examine the assembler output of the C or C++ compiler. This can be useful for optimizing code or debugging suspected compiler issues. Let's use this small example which illustrates some typical C code but doesn't do anything meaningful:

int main()
{
    int j, k;

    for (int i = 0; i < 100; i++) {
        j = i * i;
        if (j % 2) {
            k = j;
        } else {
            k = 2 * j;
        }
    }
    return 0;
}

If we compile this with the GNU C compiler and the -S and -fverbose-asm options, we can see the assembler output, e.g.:

gcc -S -fverbose-asm example.c

The output can be enlightening, even if you don't know all the details of assembler programming. It is quite long though, so let's just look at a few highlights. The corresponding lines of C code are shown as comments in the assembler output. Here is the line starting the for loop:

@ example.c:5:     for (int i = 0; i < 100; i++) {
    mov    r3, #0    @ tmp114,
    str    r3, [fp, #-8]    @ tmp114, i

Here we can see R3 set to zero for the loop variable i. Local variables are stored on the stack and addressed relative to the frame pointer, fp (R11). It looks like variable i is at an offset of -8 from the frame pointer, and the initialized value in R3 gets stored there.

@ example.c:6:         j = i * i;
    ldr    r3, [fp, #-8]    @ tmp116, i
    ldr    r2, [fp, #-8]    @ tmp117, i
    mul    r3, r2, r3    @ tmp115, tmp117, tmp116
    str    r3, [fp, #-12]    @ tmp115, j

In the code above we are getting the value of i, again at offset -8 from the frame pointer and putting it in r3. Another copy goes in r2. Then we multiply r2 and r3 and store the result back in r3. Finally, r3 is stored at -12 from the frame pointer, which must correspond to the variable j.

@ example.c:8:             k = j;
    ldr    r3, [fp, #-12]    @ tmp118, j
    str    r3, [fp, #-16]    @ tmp118, k

Here we see a simple assignment, getting variable j from the stack frame and writing it to variable k on the stack. Due to the load/store architecture, we need to do this via a register.

@ example.c:10:             k = 2 * j;
    ldr    r3, [fp, #-12]    @ tmp120, j
    lsl    r3, r3, #1    @ tmp119, tmp120,
    str    r3, [fp, #-16]    @ tmp119, k

In this code, we get variable j from the stack and put it in register r3. The compiler is smart and implements a multiply by 2 as a shift left instruction. The result is stored via the stack frame into variable k.
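The substitution the compiler made can be checked in C itself. This sketch writes the doubling as a one-bit left shift and wraps one iteration of the example loop in a function; double_by_shift and loop_body_k are names invented here, and the shift form assumes j is non-negative and small enough that the doubled value fits in an int, which holds for this loop.

```c
/* Doubling written as a left shift by one bit, as in the compiler output.
   Assumes j is non-negative and 2 * j fits in an int. */
static int double_by_shift(int j)
{
    return j << 1;
}

/* One iteration of the example loop body: returns the value assigned to k. */
static int loop_body_k(int i)
{
    int j = i * i;
    if (j % 2)
        return j;                    /* odd square: k = j */
    return double_by_shift(j);       /* even square: k = 2 * j */
}
```

For every i the loop visits (0 to 99), the shifted value matches 2 * j, so the shift is a safe replacement for the multiply here.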

To see a case where the compiler does use a multiply instruction, look again at the code generated from line 6 of the example:

@ example.c:6:         j = i * i;
        ldr     r3, [fp, #-8]   @ tmp116, i
        ldr     r2, [fp, #-8]   @ tmp117, i
        mul     r1, r2, r3      @ tmp115, tmp117, tmp116
        str     r1, [fp, #-12]  @ tmp115, j

Advanced and Miscellaneous Topics

There are many more ARM instructions and features that we could cover, but not in a reasonable length for a blog post. I would like to mention some topics that you might want to explore in more detail.

Addressing modes: We didn't cover the supported instruction addressing modes. The compiler output showed the use of indirect addressing using the frame pointer. ARM supports indirect, pre-indexed, and post-indexed addressing modes, PC-relative addressing for position-independent code, and more.

Is ARM big-endian or little-endian? It can be set to either via the CPSR! The default is little-endian and most OSes use it in this mode (which is the same as Intel platforms), but in certain cases you could choose to use big-endian for increased efficiency, for example when processing data that is already in big-endian order, such as network protocol headers.
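If you are curious which byte order your own platform is running in, a few lines of C will tell you. This is a minimal sketch: is_little_endian and load_be32 are hypothetical helper names, and the second function shows a common way to read big-endian data that works regardless of the CPU's byte order.

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 if the least significant byte of a word is stored first. */
static int is_little_endian(void)
{
    const uint32_t probe = 1;
    unsigned char first_byte;
    memcpy(&first_byte, &probe, 1);   /* look at the byte at the lowest address */
    return first_byte == 1;
}

/* Reads a 32-bit big-endian value from a byte buffer on any host. */
static uint32_t load_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) |
           ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |
            (uint32_t)p[3];
}
```

On a Raspberry Pi running a standard OS, is_little_endian() would be expected to return 1.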

Debugging: The GNU gdb debugger can display and single-step machine code instructions, set breakpoints, and has other features to support debugging at the machine language level. These features are available from most IDEs that use gdb, including Qt Creator.

C/C++ in-line assembler: GCC supports putting assembler code into C or C++ code. This is often the best solution for small routines or optimizing where you just want to add a few lines of code. You have access to variable names and can use conditional compilation to use different code depending on the platform.
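Here is a minimal sketch of what this can look like with GCC's extended asm syntax. The function name add_ints is made up for the example. Conditional compilation selects a single ARM ADD instruction when building for 32-bit ARM with a GCC-compatible compiler, and falls back to plain C everywhere else, so the same source builds on any platform.

```c
/* Adds two integers. On 32-bit ARM with a GCC-compatible compiler this is
   done with one in-line ADD instruction; elsewhere plain C is used. */
static int add_ints(int a, int b)
{
#if defined(__GNUC__) && defined(__arm__)
    int result;
    __asm__ ("add %0, %1, %2"        /* result = a + b */
             : "=r" (result)         /* output: any general register */
             : "r" (a), "r" (b));    /* inputs: any general registers */
    return result;
#else
    return a + b;                    /* portable fallback */
#endif
}
```

The compiler picks the registers and substitutes them for %0, %1, and %2, which is how the assembler code gets access to C variables by name.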

Floating Point: Most ARM application processors, including those used in the Raspberry Pi, have an onboard Vector Floating-Point unit (VFP). Depending on the VFP version, this provides 16 or 32 64-bit IEEE standard floating-point registers with support for math functions as well as some vector operations. On platforms where it is not present, the operating system can transparently implement it in software (at reduced performance).

Thumb Mode: Thumb is an alternative mode where the processor implements a subset of ARM instructions. The instructions are encoded into 16 bits rather than 32, making the code more compact. You can switch modes at run time and make calls between ARM and Thumb code.

GPIO: Some SOMs (e.g. Raspberry Pi) have registers to control onboard GPIO functions. If you want to run GPIO at the maximum possible speed you can directly access the registers. It is highly hardware-dependent (even varying across models of Raspberry Pi, for example) but extremely fast.
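Direct register access from C usually comes down to reading and writing through a volatile pointer. The sketch below shows that generic pattern only: the helper names are invented, no real register addresses are given because they differ per chip, and many GPIO blocks provide dedicated set and clear registers rather than requiring the read-modify-write shown here.

```c
#include <stdint.h>

/* Sets the bits in mask in a memory-mapped register (read-modify-write). */
static void reg_set_bits(volatile uint32_t *reg, uint32_t mask)
{
    *reg |= mask;
}

/* Clears the bits in mask in a memory-mapped register (read-modify-write). */
static void reg_clear_bits(volatile uint32_t *reg, uint32_t mask)
{
    *reg &= ~mask;
}

/* Returns 1 if the given bit of the register is set, otherwise 0. */
static int reg_test_bit(const volatile uint32_t *reg, unsigned bit)
{
    return (int)((*reg >> bit) & 1u);
}
```

The volatile qualifier tells the compiler that every access matters, so it will not cache the value in a CPU register or remove what look like redundant reads and writes. For experimenting off-target, the functions can be pointed at an ordinary variable instead of a hardware address.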

64-bit Support: Some ARM processors have a 64-bit mode (e.g. those in the Raspberry Pi 3 and 4). In this mode, you have 64-bit registers and can use 64-bit addresses. 32-bit addresses are limited to 4 GB, so this allows addressing more memory and working more efficiently with 64-bit values. The downside is the larger code size. A 64-bit version of the Raspberry Pi OS was recently made available as a supported option for that platform.

Raspberry Pi Pico: If you want to explore ARM-based microcontrollers, the Raspberry Pi Pico is a $4 board built around the RP2040, a dual-core Cortex-M0+ microcontroller with 264 KB of RAM, paired with 2 MB of flash. Other features include 26 GPIO pins, 3 analog inputs, 2 UARTs, 2 SPI interfaces, 2 I2C interfaces, USB, and 8 Programmable I/O (PIO) state machines. This is a good option for learning bare-metal microcontroller programming.

References

Here are a few references I have found useful for learning ARM assembly language programming:

Raspberry Pi Assembly Language, Bruce Smith https://www.brucesmith.info
ARM A32 Assembly Language, Bruce Smith https://www.brucesmith.info
ARM and Thumb-2 Instruction Set Quick Reference Card https://developer.arm.com/documentation/qrc0001/m/
System call table, Marcin Juszkiewicz https://marcin.juszkiewicz.com.pl/download/tables/syscalls.html
ARM architecture, Wikipedia https://en.wikipedia.org/wiki/ARM_architecture

Summary

I hope you found this blog series on ARM assembly language programming interesting and useful. If you want to learn more, I encourage you to try writing and running some small programs of your own on an ARM-based platform such as a Raspberry Pi. If you missed part 1 in our series, read it here.