[Repost] ARM Assembler for iOS

Part-1：Environment Setup

A few weeks ago I needed to optimize some OpenGL code for iOS. There were a lot of matrix and vector calculations going on that could be greatly improved by taking advantage of the NEON coprocessor found in all of the newer iPhone and iPad models. It is a vector-based processor that also supports single-precision floating-point calculations. Taking advantage of this processor in OpenGL applications can make a significant difference, as you often do vector-based floating-point calculations. Nicely, there already exists a library that provides everything needed for vector and matrix operations and leverages the NEON unit by implementing the main functionality directly in assembly code (plain C code gives you no direct access to that coprocessor). My only problem: I always like to understand what I am doing and would sometimes even rather re-implement a library myself than blindly reuse something whose inner workings I don’t understand. Not sure if this is a good quality or a bad one…

This led me to the goal for this series of posts: learn assembly for the ARM chip, which is found in all of Apple’s iOS devices (and also in most other phones that play in the same league as the iPhone). I will try to write down my experience and what I learn in this series to help others get started as well. For this, I will rely heavily on external resources such as documentation, blog posts and tutorials. This “diary” will function as the glue that holds everything together and makes me stick with my goal of learning ARM assembly.

In this part of the series we will focus on setting up the environment and making sure that our toolchain works. Namely, we will do the following:

Install the GNU ARM Assembler toolchain
Build a first Hello-World C-application that we will compile, run and debug with the installed toolchain

Let’s get started!

Installing the GNU ARM toolchain

As iOS applications compiled for the iPhone Simulator are translated to normal x86 code (the simulator does not simulate an ARM chip), we cannot just write assembly code in our iOS project and run it in the simulator. And we certainly don’t want to connect an iOS device for our early fiddling with the ARM chip. Thus we will install the GNU ARM assembler toolchain, which is free and comes with a simulator that allows us to run the developed code without a real device.

Go to the GNU ARM webpage and download the binary for your architecture. As I am focusing on Apple mobile devices, I assume you will download the toolchain for Mac OS X.

After unpacking and installing the download, you should have the following tools installed and callable from your command-line shell:

arm-elf-gcc The gcc compiler to build executables
arm-elf-run To run executables
arm-elf-gdb The debugger to debug built executables
arm-elf-objdump To disassemble/read out sections of the library

As you might notice, these are all the standard GNU tools you already know. They only support (and compile for) the ARM architecture, but if you are familiar with gcc, gdb, etc., you should have no big problems with these tools.

The only small drawback is that these tools build and run ELF-based executables and not the Mach-O executables known from Mac OS X and iOS. So you will have to use objdump here, but otool, for example, on your iOS executables. But that is hardly a justified complaint…

Building a simple Hello-World App for the ARM chip

Let’s build a very simple Hello-World app. We will compile, link, run and debug that app with the tools we just installed.

Create a helloworld.c file with the following content:

#include <stdio.h>

int main(int argc, char *argv[])
{
	char *str = "hello world";
	printf("%s\n", str);

	asm volatile("mov r0,r0");

	return 0;
}

You can see that the file only prints “hello world” and has a simple stub for adding some assembly code to play around with the ARM specifics. The “mov r0,r0” is basically a no-op, and we won’t worry about it for now.

Perform the following set of calls to build, link and run the executable (named “hw”):

arm-elf-gcc -mcpu=arm7 -O2 -g -c helloworld.c  -o helloworld.o
arm-elf-gcc -mcpu=arm7 -o hw helloworld.o -lc
arm-elf-run hw

There should be no problems with these steps, and running the “hw” executable should print “hello world” to the command line. I also want to show you how to start the executable in the debugger:

macbook:arm-asm daniel$ arm-elf-gdb
GNU gdb 6.0
Copyright 2003 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "--host=powerpc-apple-darwin6.8 --target=arm-elf".
(gdb) file hw
Reading symbols from hw...done.
(gdb) target sim
Connected to the simulator.
(gdb) load
Loading section .init, size 0x1c vma 0x8000
Loading section .text, size 0x2d74 vma 0x801c
Loading section .fini, size 0x18 vma 0xad90
Loading section .rodata, size 0x18 vma 0xada8
Loading section .data, size 0x8a8 vma 0xaec0
Loading section .eh_frame, size 0x4 vma 0xb768
Loading section .ctors, size 0x8 vma 0xb76c
Loading section .dtors, size 0x8 vma 0xb774
Loading section .jcr, size 0x4 vma 0xb77c
Start address 0x811c
Transfer rate: 111616 bits in <1 sec.
(gdb) run
Starting program: /Users/daniel/tmp/arm-asm/hw
hello world

Program exited normally.
[Switching to process 0]

As you can see, you first call “file hw” to select the executable, then “target sim”, then “load”, and finally “run” to run the executable. All further commands, like setting breakpoints and stepping through the code, are standard gdb, and you can look them up in any regular tutorial on this toolchain (you can already do a lot with break, stepi, nexti, continue, list and print).

Do you remember the “mov r0,r0” assembly instruction in our code? We can use “arm-elf-objdump -d hw” to disassemble the executable and see it listed as a “nop”:

macbook:arm-asm daniel$ arm-elf-objdump -d hw | grep -A 12 "<main>:"
00008224 <main>:
    8224:	e1a0c00d 	mov	ip, sp
    8228:	e92dd800 	stmdb	sp!, {fp, ip, lr, pc}
    822c:	e24cb004 	sub	fp, ip, #4	; 0x4
    8230:	e59f000c 	ldr	r0, [pc, #12]	; 8244 <main+0x20>
    8234:	eb00023e 	bl	8b34 <puts>
    8238:	e1a00000 	nop			(mov r0,r0)
    823c:	e3a00000 	mov	r0, #0	; 0x0
    8240:	e91ba800 	ldmdb	fp, {fp, sp, pc}
    8244:	0000ada8 	andeq	sl, r0, r8, lsr #27

00008248 <atexit>:
    8248:	e1a0c00d 	mov	ip, sp

Congratulations! You have successfully installed the tool-chain and are now ready to fiddle around with ARM assembly.

Further Reading

As homework for the next part of the series, I will assume you have a look at the following really great resources:

The first tutorial was originally written for the Game Boy Advance, but that should not confuse you. The GBA uses an ARM7TDMI chip, and we can transfer almost everything learned there to the ARM chips in our iOS devices.

You might actually want to play around a little with what you learn in the tutorial. Do you remember the “asm volatile” block in your main function? You could use it to put in some simple assembler code and just play around. Details on how to pass C variables to this inline code and return them back can be found in this very well-written blog post. I don’t assume you go through it in every detail (we will come back to it later), but it will help you if you want to pass values between C and assembly code without having to worry too much about calling conventions. Actually, this will be the topic of Part 2 or 3 of the series.
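To give a first impression of what passing values between C and inline assembly looks like, here is a minimal sketch using GCC's extended asm syntax. The helper name add_with_inline_asm is made up for this illustration and does not come from the linked blog post; the #else branch is only there so the snippet also builds on a non-ARM machine.

```c
/* Hypothetical helper (not from the original post): adds two ints,
   using a single ARM "add" instruction when built for 32-bit ARM. */
static int add_with_inline_asm(int a, int b)
{
    int result;
#if defined(__arm__)
    /* %0 is the output operand, %1 and %2 are the inputs. The compiler
       chooses the registers, so we do not hard-code r0, r1, ... here. */
    __asm__ volatile("add %0, %1, %2"
                     : "=r"(result)
                     : "r"(a), "r"(b));
#else
    /* Plain C fallback so the sketch also compiles on a non-ARM host. */
    result = a + b;
#endif
    return result;
}
```

Calling add_with_inline_asm(71, 29) should give 100 on either path. The "=r" and "r" constraints simply ask the compiler for general-purpose registers, which is why we do not need to know the calling convention yet.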

=======================================================================

Part-2：First Steps

As this series is mainly to write down my own learning experience and recap on it, I assume you have done your homework from Part 1 as well as I have: mainly, reading the Whirlwind tour of ARM Assembly (Sections 23.1 – 23.3 should be enough for now). From here, we will start to write our own first assembly functions and get familiar with the toolchain. In detail, we will learn the following in this part:

Writing a simple assembler function that uses simple instructions like mov and add
Compiling/linking an assembler-file to C-code
Using gdb to debug your assembly code

Recap

Let’s start with a quick summary of what you should have learned from the Whirlwind tour of ARM Assembly:

There are basically three types of assembler instructions: data instructions (e.g. add, sub, mov, cmp), memory instructions (e.g. ldr, str, stmfd) and branch instructions (e.g. b, bl, bx).
All instructions are 32 bits in size (except in THUMB mode, which has only 16-bit instructions). This leads to the problem that immediate values to be mov’ed into a register have only 12 bits left (20 bits are used to encode the rest of the instruction): 8 bits for a number n and 4 bits for a right-rotation r. With this encoding, the represented number is n ror (2*r). Effectively, this allows a call like mov r0, #0x4 but not mov r0, #0xfff0: 0x4 = 0x4 ror (2*0), whereas 0xfff0 cannot be encoded because 0xfff does not even fit into the 8 bits of n.
The above limitation can be worked around by calculating the required value 0xfff0 with multiple assembler instructions or by loading it from memory. ldr r0, =0xfff0 (note the “=”) does this implicitly: if the value is representable as a so-called immediate value, it becomes a direct mov, but in this example it will be converted to a memory load.
There are registers r0 to r15, where r13 to r15 have special purposes and also aliases that can be used in assembly (r13/sp: Stack Pointer, r14/lr: Link Register, r15/pc: Program Counter). Actually, r11 and r12 do too, but we omit that here. r0-r3 and r12 are scratch registers that can be changed within a function call; everything else (r4-r11) has to be restored correctly in the epilog of a called function.
All assembler instructions can be executed conditionally. E.g. cmp r0, r1 followed by movlt r3, #4: movlt is only executed when r0 < r1 (less than). The cmp instruction sets the status flags that are read by the next instruction. Generally, every data instruction (where it makes sense) can also update the status flags when an “s” is appended; e.g. subs r0, r1, r2 (r0=r1-r2 and update status flags). Thus, the general form of a data statement is op{cond}{status}.
The Barrel Shifter can be applied to the last operand of a statement to shift/rotate its bits by a fixed amount or an amount given within a register. E.g. mov r0, r1, lsl #3 (in C-Syntax r1 << 3). This is basically free and is executed with the statement in one processor cycle and is preferable over multiplication operations whenever possible.
Data instructions (mov, add, etc.) can only work on registers and immediate values, not on memory addresses; values in memory need to be loaded into a register via a memory operation first. Also, there is no instruction for division, and multiplications are only possible on registers (not immediate values).
Memory instructions (e.g. ldr r0, [r1]: load into a register; in C syntax: r0=*r1). The memory address defined by the bracketed expression ([]) can consist of a register, register + index register or register + immediate value; the index register can additionally be barrel-shifted. E.g. ldr r0, [r1, r2, lsl #4] (r0=*(r1 + (r2 <<4))). Memory instructions can also load half-words (2 bytes) or a single byte only. Append the following to the instruction for this: h (half-word), sh (signed half-word), b (byte), sb (signed byte).
Memory addresses in an ldr instruction do not fit in the instruction (for the same reason as the above limitation on immediate values: only 12 bits). But the assembler will transparently convert memory labels to so-called PC-relative addresses (the address is calculated by the assembler relative to the current program counter). An alternative is memory pools, which are created when using ldr r0, =labelname (note the “=”); note: this only loads the address of labelname into r0, not the value!
Bulk memory operations like ldm* and stm* can be used to load a vector (array of memory values) from memory into a list of registers and store it back, respectively. These operations are also used for pushing/popping parameters to/from the stack. There are aliases for the different stack implementations. Generally, the full-descending stack is used on the ARM architecture, i.e. the stack pointer (sp) points to the last used stack address and the stack grows toward the beginning of the memory area (lower addresses). You can see in the figure below what an stmfd statement does internally when storing r0 on the stack. Note also that the exclamation mark “!” is the auto-indexing feature of these memory operations: sp is automatically decremented to the next memory location.

Branch instructions are the GOTOs of assembly. They let you change/redirect the flow of your assembly code. Mainly, we have “b LABELNAME” to jump to a labeled position in your assembly, and bl and bx to call subroutines and return again. “bl LABELNAME” will jump to a labeled point in your assembly code and set the lr register to the address of the instruction following the bl; when we want to return from our subroutine (like a “return” in C code), we call “bx lr”, which is basically an alias for “mov pc, lr”, meaning: restore the program counter to the saved value and thus continue with the next assembly instruction after the subroutine call. The only difference between “bx lr” and “mov pc, lr” is that the former is required for interoperability between normal ARM assembly and THUMB mode.
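The immediate-value rule from the recap (an 8-bit number n rotated right by 2*r) is easy to get wrong by hand, so here is a small C sketch that checks whether a 32-bit constant fits that encoding. The function names rotate_left and is_arm_immediate are my own and not part of any toolchain; the check only models the n ror (2*r) rule described above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Rotate a 32-bit value left by `bits` (taken modulo 32). */
static uint32_t rotate_left(uint32_t value, unsigned bits)
{
    bits &= 31u;
    if (bits == 0)
        return value;
    return (value << bits) | (value >> (32u - bits));
}

/* Returns true if `value` can be written as n ror (2*r)
   with an 8-bit n and a 4-bit r (hypothetical helper). */
static bool is_arm_immediate(uint32_t value)
{
    for (unsigned r = 0; r < 16; ++r) {
        /* Undo the right-rotation by 2*r; the rest must fit in 8 bits. */
        if (rotate_left(value, 2u * r) <= 0xFFu)
            return true;
    }
    return false;
}
```

With this, is_arm_immediate(0x4) is true and is_arm_immediate(0xfff0) is false, matching the mov examples in the list. A value like 0xff000000 also passes, because it is 0xff rotated right by 8 bits.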

The first assembler function

Let us write a first simple assembler function based on the knowledge we have gained so far. It will be located in its own assembler file and will be called from the main function of our C code.

Create a file asmlib.s with the following assembly-code:

@ ARM Assembler Test Library

@ int asm_sum(int a, int b)
	.align 2				@ Align to word boundary
	.arm					@ This is ARM code
	.global asm_sum			@ This makes it a real symbol
asm_sum:					@ Start of function definition
	add     r2, r0, r1		@ Add up a (r0) and b (r1) and store result in r2
	mov		r0, r2			@ Store sum (r2) in r0 which stores return-value
	mov		pc, lr			@ Set program counter to lr (was set by caller)

We have defined a very simple assembler function that adds the two arguments a and b (which are handed to the function in registers r0 and r1) and returns the sum in r0 to the caller. We will get into the details of the calling conventions in the next part of the series. For now, just take it for granted that the arguments are handed over in this way and the result is returned in r0. You should know by now that “asm_sum:” is a label for this instruction block; when jumped to, it will first execute the “add”, then the “mov”, and last reset the program counter (pc) to the next statement of the caller’s code (which was stored in lr by the caller).

There are some ARM assembler directives at the beginning of the file to align the function label to the next word boundary (.align x means align to a 2^x byte boundary). In general, the ARM processor should be able to handle unaligned access as well, but the Apple documents specifically state that functions have to be aligned. For details on why aligned access is important, read this post. “.arm” marks this as ARM code and “.global” exposes the label as a global symbol. This is important so we can call this function from our C code.
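If the 2^x rule for .align feels abstract, the following C sketch shows the arithmetic behind it: rounding an address up to the next multiple of 2^x. The function align_up is a made-up helper for illustration; it assumes the exponent is smaller than 32.

```c
#include <stdint.h>

/* Round `address` up to the next multiple of 2^exponent, which is the
   effect ".align exponent" has on the current location in this setup.
   Hypothetical helper; assumes exponent < 32. */
static uint32_t align_up(uint32_t address, unsigned exponent)
{
    uint32_t boundary = (uint32_t)1 << exponent;   /* .align 2 -> 4 bytes */
    return (address + boundary - 1u) & ~(boundary - 1u);
}
```

For example, align_up(0x8225, 2) yields 0x8228, while an address that already sits on a word boundary, such as 0x8228, is left unchanged.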

Some more information is given in the comments starting with “@”. One additional note: we could actually remove the “mov r0, r2” instruction if we had used “add r0, r0, r1” in the first place, but for learning assembly it is not bad to have some more explicit code.

Next, we will call our defined function that returns the sum of its two arguments from our main-method in C-code. We use the following main.c for this:

#include <stdio.h>

extern int asm_sum(int a, int b);

int main(int argc, char *argv[])
{
	printf("== sum ==\n");
	int a = 71;
	int b = 29;
	printf("%d + %d = %d\n", a, b, asm_sum(a, b));

	return 0;
}

The only thing you will notice is that we have to declare the asm_sum function as an external symbol, as we don’t define it here in our C code.

You can now try to compile and link both files to an executable with the following calls (the last line is already for running our executable):

arm-elf-gcc -mcpu=arm7 -O2 -g -c asmlib.s  -o asmlib.o
arm-elf-gcc -mcpu=arm7 -O2 -g -c main.c  -o main.o
arm-elf-gcc -mcpu=arm7 -o armtest *.o -lc
arm-elf-run armtest

Running the program should show you the expected result, i.e. print the output of the printf statements to stdout. Good job!

Debugging

Let’s step through our assembly code with the debugger. First, create a file named “.gdbinit” in the folder where the other two files are located and put in the following lines:

file armtest
target sim
load

This file is loaded when gdb starts up; it selects our executable, sets the target to the ARM simulator and loads the code into memory. We can now do some simple debugging:

macbook:ARMAssembly_Part2 daniel$ arm-elf-gdb
GNU gdb 6.0
Copyright 2003 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "--host=powerpc-apple-darwin6.8 --target=arm-elf".
Connected to the simulator.
Loading section .init, size 0x1c vma 0x8000
Loading section .text, size 0x8a3c vma 0x801c
Loading section .fini, size 0x18 vma 0x10a58
Loading section .rodata, size 0x248 vma 0x10a70
Loading section .data, size 0x8bc vma 0x10db8
Loading section .eh_frame, size 0x4 vma 0x11674
Loading section .ctors, size 0x8 vma 0x11678
Loading section .dtors, size 0x8 vma 0x11680
Loading section .jcr, size 0x4 vma 0x11688
Start address 0x811c
Transfer rate: 306272 bits in <1 sec.
(gdb) break asmlib.s:9
Breakpoint 1 at 0x8228: file asmlib.s, line 9.
(gdb) run
Starting program: /Users/daniel/Dev/Test/ARMAssembly_Part2/armtest
== sum ==

Breakpoint 1, asm_sum () at asmlib.s:9
9		mov		r0, r2			@ Store sum (r2) in r0 which stores return-value
Current language:  auto; currently asm
(gdb) print $r0
$1 = 71
(gdb) print $r1
$2 = 29
(gdb) print $r2
$3 = 100
(gdb) stepi
10		mov		pc, lr			@ Set program counter to lr (was set by caller)
(gdb) continue
Continuing.
71 + 29 = 100

Program exited normally.
[Switching to process 0]
Current language:  auto; currently c
(gdb)

You see that we first define a breakpoint at line 9 of the file asmlib.s, then we “run” the executable. The simulator stops at the breakpoint. “print $regname” allows us to print the content of a register. As expected, the first and second function parameters are stored in r0 and r1 (compare with the values set for a and b in main.c). After that, we use “stepi” to step to the next instruction and then call “continue” so that normal execution proceeds and the program finally ends. These few gdb commands should already allow you to do some simple debugging. For more details on gdb, Google is your friend.

Some more instructions in a nutshell

Let’s write one more assembly function to get familiar with some more instructions. How about a multiply routine that returns the value a*b for two parameters a and b:

@ int asm_mul(int a, int b)
	.align 2
	.arm
	.global asm_mul
asm_mul:
	stmfd   sp!, {r4-r11}   @ If we needed more registers than r0-r3, we would
                            @ first have to save them on the stack (only r0-r3 and r12 are scratch registers)
                            @ Here, we actually don't need them...

	mov     r3, #0          @ Initialize register holding result of multiplication

	movs    r2, r0          @ Move "a" into r2 and set status-flags (mov"s")
	beq     asm_mul_return  @ Immediately return if a==0

	movs    r2, r1          @ Move "b" into r2 and set status-flags (mov"s")
	beq     asm_mul_return  @ Immediately return if b==0

asm_mul_loop:
	add     r3, r3, r0      @ r3 = r3 + r0
	subs    r1, r1, #1      @ r1 = r1 - 1 (decrement)
	bne     asm_mul_loop    @ If the zero-flag is not set (r1 != 0), loop once more

asm_mul_return:
	ldmfd   sp!, {r4-r11}   @ Restore the registers
	mov     r0, r3          @ Store result in r0 (return register)
	mov     pc, lr

Please note that we could have used assembly instructions to do the multiplication for us, but this way we can recap on some instructions we have learned more easily.
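For comparison, here is a C sketch that mirrors the routine instruction by instruction. The function name mul_by_repeated_add is my own, and the local variables are named after the registers they stand for. Unsigned arithmetic is used so that the wrap-around matches what the 32-bit registers do.

```c
#include <stdint.h>

/* C model of asm_mul (hypothetical name); locals mirror the registers. */
static int32_t mul_by_repeated_add(int32_t a, int32_t b)
{
    uint32_t r0 = (uint32_t)a;
    uint32_t r1 = (uint32_t)b;
    uint32_t r3 = 0;                 /* mov  r3, #0                   */

    if (r0 == 0 || r1 == 0)          /* movs + beq, once for a and b  */
        return (int32_t)r3;

    do {
        r3 += r0;                    /* add  r3, r3, r0               */
        r1 -= 1;                     /* subs r1, r1, #1 (sets Z flag) */
    } while (r1 != 0);               /* bne  asm_mul_loop             */

    return (int32_t)r3;              /* mov  r0, r3                   */
}
```

The model also makes one property of the loop visible: it stops when the counter reaches zero, not when it drops below zero. A negative b therefore wraps around and takes a very long time to finish, so it is best to try the routine with non-negative b only.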

The algorithm implemented is basically: result = 0; if (a == 0 or b == 0) return result; else while (b > 0) { result = result + a; b = b - 1; } and should be quite straightforward. Here are some points to take away though:

“subs” not only does a subtraction, but because we append the “s”, the status register will be set as well; specifically, the zero flag. If the result is not zero (ne = not equal: a little confusing, but think of it as “if the result is non-zero, the two operands to sub must be not equal”; or have a look at the table in Section 23.3.4 of the well-known tutorial guiding this series), we branch to the label asm_mul_loop.
You see that the status register can be set by almost any data instruction; “mov” can also be appended with “s” to set it. Here, we again read the zero flag.
We use stmfd to store registers r4-r11 on the stack and restore them at the end of the routine via ldmfd. You would in general always do this if you need to use more than registers r0-r3 in your routine. As r4-r11 are not scratch registers, the convention is that they hold the same values after a function call as before.
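To make the save/restore pair more tangible, here is a toy C model of a full-descending stack. Everything in it (ToyCpu, toy_init, toy_stmfd, toy_ldmfd, STACK_WORDS) is invented for this sketch; it only handles a contiguous register range, does no overflow checking, and uses a word index instead of a byte address for sp.

```c
#include <stdint.h>
#include <string.h>

#define STACK_WORDS 64

/* Toy machine state; all names are hypothetical. */
typedef struct {
    uint32_t r[16];                /* r0..r15                          */
    uint32_t memory[STACK_WORDS];  /* stack area, one entry per word   */
    uint32_t sp;                   /* word index of the last used slot */
} ToyCpu;

static void toy_init(ToyCpu *cpu)
{
    memset(cpu, 0, sizeof *cpu);
    cpu->sp = STACK_WORDS;         /* empty stack: just past the top */
}

/* Models "stmfd sp!, {r<first>-r<last>}". */
static void toy_stmfd(ToyCpu *cpu, int first, int last)
{
    for (int reg = last; reg >= first; --reg) {
        cpu->sp -= 1;                        /* decrement first ...   */
        cpu->memory[cpu->sp] = cpu->r[reg];  /* ... then store        */
    }
}

/* Models "ldmfd sp!, {r<first>-r<last>}". */
static void toy_ldmfd(ToyCpu *cpu, int first, int last)
{
    for (int reg = first; reg <= last; ++reg) {
        cpu->r[reg] = cpu->memory[cpu->sp];  /* load ...              */
        cpu->sp += 1;                        /* ... then increment    */
    }
}
```

After toy_stmfd(&cpu, 4, 11) the stack pointer has moved down by eight words and r4 sits at the lowest address; a routine can then overwrite r4-r11 freely, and toy_ldmfd(&cpu, 4, 11) brings the old values back and returns sp to where it started.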

To test your routine, you will have to add a method-declaration for asm_mul to main.c and call it within your main-method. You can find the full sources and a Makefile of our examples on github.

Further Reading

As we will be covering function-call conventions in iOS in the next part, I recommend the following reading as preparation:

iOS ABI Function Call Guide by Apple

=======================================================================

To be continued