This document is Copyright 1994 ARM Ltd, and has been included on this disc
with their kind permission.

This manual is supplied "as is"; ARM Limited ("ARM") makes no warranty,
express or implied, of the merchantability of this document or its fitness
for any particular purpose. In no circumstances shall ARM be liable for any
damage, loss of profits, or any indirect or consequential loss arising out
of the use of these recipes or inability to use these recipes, even if ARM
has been advised of the possibility of such loss.

---------------------------------------------------------------------------

2. ARM Instruction Set and Processor Features
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ARM instruction set has the following key features, some of which are
common to many other processors, and some of which are not:

    Load/Store architecture (only load and store instructions access
    memory).

    32 bit instructions, 32/8 bit data words/bytes.

    32 bit addresses (26 bit on earlier ARMs).

    15 general purpose 32 bit registers, program counter and program status
    register - a subset of these are banked, to give rapid context
    switching for interrupt and supervisor modes. (See the appropriate ARM
    Data Sheet for details of particular processors).

    Flexible store multiple and load multiple instructions allow any set of
    registers from a single bank to be transferred to/from memory by a
    single instruction.

    There is no single instruction to move an immediate 32 bit value to a
    register (in general, a literal value has to be loaded from memory).
    However, a large set of common 32-bit values can be generated in a
    single instruction.

    All instructions are executed conditionally on the state of the current
    program status register. Only data processing operations with the S bit
    set change the state of the current program status register.

    The second argument to all data-processing and single data-transfer
    operations can be shifted in quite a general way before the operation
    is performed.
    This supports, but is not limited to, scaled addressing, multiplication
    by a small constant, and construction of constants, within a single
    instruction.

    Co-processor instructions support a general way to extend the ARM's
    architecture in a customer-specific manner.

In addition, the ARM processor has:

    Support for Big- or Little-Endian memory.

    A powerful barrel shifter to support ARM's within-instruction shifts.

The recipes in this chapter discuss some of these features in greater
detail.

2.1 Making the Most of Conditional Execution
--------------------------------------------

2.1.1 About this Recipe
-----------------------

In this recipe you learn how conditional execution can eliminate branch
instructions, producing smaller and faster code. Euclid's Greatest Common
Divisor algorithm is used for illustrative purposes.

Specifically, you will learn how to use:

    conditional execution;

    the 'S' bit in ARM data processing instructions.

2.1.2 The ARM's ALU Status Flags
--------------------------------

The ARM's Program Status Register contains, among other flags, copies of
the ALU status flags:

    N    Negative result from ALU flag
    Z    Zero result from ALU flag
    C    ALU operation Carried out
    V    ALU operation oVerflowed

2.1.3 Execution Conditions
--------------------------

Every ARM instruction has a 4 bit field which encodes the conditions under
which it will be executed.
These conditions refer to the state of the ALU N, Z, C and V flags as
follows:

    EQ       Z set (equal)
    NE       Z clear (not equal)
    CS/HS    C set (unsigned >=)
    CC/LO    C clear (unsigned <)
    MI       N set (negative)
    PL       N clear (positive or zero)
    VS       V set (overflow)
    VC       V clear (no overflow)
    HI       C set and Z clear (unsigned >)
    LS       C clear or Z set (unsigned <=)
    GE       N and V the same (signed >=)
    LT       N and V differ (signed <)
    GT       Z clear, N and V the same (signed >)
    LE       Z set, or N and V differ (signed <=)
    AL       Always execute (the default if none is specified)

2.1.4 Setting the ALU Flags in the PSR
--------------------------------------

Data processing instructions change the state of the ALU's N, Z, C and V
status outputs, but these are latched in the PSR's ALU flags only if a
special bit (the 'S' bit) is set in the instruction.

2.1.5 Illustration - Euclid's GCD Algorithm
-------------------------------------------

The following code fragment is extracted from gcd.c, which can be found in
the examples directory.

    while (a != b)
    {
        if (a > b)
            a -= b;
        else
            b -= a;
    }

Without conditional execution this could be naively coded as:

    gcd     CMP     a1, a2
            BEQ     end
            BLT     lessthan
            SUB     a1, a1, a2
            B       gcd
    lessthan
            SUB     a2, a2, a1
            B       gcd
    end

Conditional execution and selective setting of the PSR's ALU flags allow
it to be coded much more compactly as follows (this version can be found in
the examples directory as gcd.s).

    gcd     CMP     a1, a2
            SUBGT   a1, a1, a2
            SUBLT   a2, a2, a1
            BNE     gcd

Two 'tricks' are illustrated:

    The CMP instruction (implicitly) has the 'S' bit set, so the result of
    the comparison sets the PSR ALU status flags. However, the following
    two subtractions do not have the 'S' bit set, so they do not affect the
    PSR ALU status flags, which remain in the state set by the earlier CMP
    instruction when the BNE instruction is executed. The test (a != b) has
    been combined with the branch back to the top of the loop, giving
    shorter code, and in many instances code which runs more quickly.
    The two subtractions are executed only if the condition specified is
    met, so two branches around these instructions can be avoided. In
    addition to the obvious benefit of smaller code, any pipeline refill
    caused by the branches will also have been avoided.

2.1.6 Running the C Example
---------------------------

You can run the C gcd routine shown above under armsd. To do this first set
your current directory to the examples directory.

Compile, link and run the C version of the gcd routine by using the
following commands:

    armcc -c -Ospace gcd.c
    armcc -c gcdtest.c
    armlink -o gcdtest gcd.o gcdtest.o somewhere/armlib.32l
    armsd gcdtest

where somewhere is the directory in which armlib.32l can be found.

Explanation
-----------

The two armcc commands compile the gcd function and the test harness,
creating relocatable object files gcd.o and gcdtest.o.

The armlink command links your relocatable objects with the ARM C library
to create a runnable program (here called gcdtest).

The armsd command invokes the debugger, with gcdtest as the program to be
run.

The -li argument (little-endian memory) used with armasm in section 2.4.4
is not shown in these commands; it can be omitted if the tools have been
configured appropriately.

For more details on running programs under armsd see The ARM Symbolic
Debugger (armsd) starting on page 26 of the User Manual and armsd Command
Language starting on page 87 of the User Manual.

Note that armcc may not produce the hand-optimised instruction sequence
shown above - this example is intended to demonstrate how in some cases
conditional execution and use of the S bit can be hand crafted to produce
extremely efficient code.

2.1.7 Running the Assembler Example
-----------------------------------

You can run the gcd routine shown above under armsd. To do this first set
your current directory to the examples directory.
You can assemble, link and run the assembler gcd routine by using the
following commands:

    armasm gcd.s -o gcd.o
    armcc -c gcdtest.c
    armlink -o gcdtest gcd.o gcdtest.o somewhere/armlib.32l
    armsd gcdtest

where somewhere is the directory in which armlib.32l can be found.

Explanation
-----------

The armasm command assembles the gcd function, creating the relocatable
object file gcd.o.

The armcc command compiles the test harness. The -c flag tells armcc not to
link its output with the C library.

The armlink command links your relocatable objects with the ARM C library
to create a runnable program (here called gcdtest).

The armsd command invokes the debugger, with gcdtest as the program to be
run.

For more details on running programs under armsd see The ARM Symbolic
Debugger (armsd) starting on page 26 of the User Manual and armsd Command
Language starting on page 87 of the User Manual.

2.1.8 Related Topics
--------------------

There are many examples of code which makes good use of the ARM's condition
codes and S bit in recipes in chapter Exploring ARM Assembly Language
starting on page 20.

2.2 Using the Barrel Shifter
----------------------------

2.2.1 About This Recipe
-----------------------

In this recipe you learn:

    how to index into an array efficiently in ARM assembler;

    how to use the barrel shifter in the main ARM instruction classes.

2.2.2 Addressing an Entry in a Table of Words
---------------------------------------------

The following piece of code inefficiently calculates the address of an
entry in a table of words and then loads the desired word:

            ; R0 holds the entry number [0,1,2,...]

            LDR     R1, =StartOfTable
            MOV     R3, #4
            MLA     R1, R0, R3, R1
            LDR     R2, [R1]
            ...
    StartOfTable
            DCD     table data

Loading the desired table entry is performed by first loading the start
address of the table, then moving the immediate constant "4" into a
register, using the multiply and add instruction to calculate the address,
and finally loading the entry.
However, this operation can be performed by the barrel shifter more
efficiently as follows:

            ; R0 holds the entry number [0,1,2,...]

            LDR     R1, =StartOfTable
            LDR     R2, [R1, R0, LSL #2]
            ...
    StartOfTable
            DCD     table data

In this code the barrel shifter shifts R0 left 2 bits (ie. multiplying it
by 4); this intermediate value is then used as the index for the LDR
instruction. Thus a single instruction replaces the MOV, MLA and LDR of the
previous version. Such significant savings can frequently be made by making
good use of the barrel shifter.

2.2.3 The ARM's Barrel Shifter
------------------------------

The ARM core contains a barrel shifter which takes a value to be shifted or
rotated, an amount to shift or rotate by and the type of shift or rotate.
This can be used by various classes of ARM instructions to perform
comparatively complex operations in a single instruction. On ARMs up to and
including the ARM6 family, instructions take no longer to execute by making
use of the barrel shifter, unless the amount to be shifted is specified by
a register, in which case the instruction will take an extra cycle to
complete.

The barrel shifter can perform the following types of operation:

    LSL    shift left by n bits;
    LSR    logical shift right by n bits;
    ASR    arithmetic shift right by n bits (the bits fed into the top end
           of the operand are copies of the original top (or sign) bit);
    ROR    rotate right by n bits;
    RRX    rotate right extended by 1 bit. This is a 33 bit rotate, where
           the 33rd bit is the PSR C flag.

The barrel shifter can be used in several of the ARM's instruction classes.
The options available in each case are described below.

2.2.4 LDR/STR
-------------

The index can be a register shifted by any 5 bit constant. It may also be
an unshifted 12 bit constant.

eg.
            STR     R7, [R0], #24             ; Post-indexed
            LDR     R2, [R0], R4, ASR #4      ; Post-indexed
            STR     R3, [R0, R5, LSL #3]      ; Pre-indexed
            LDR     R6, [R0, R1, ROR #6]!     ; Pre-indexed + Writeback

Explanation
-----------

In all of the above instructions R0 is the base register.

In the pre-indexed instructions the offset is calculated and added to the
base. This address is used for the transfer. If writeback is selected, then
the transfer address is written back into the base register.

In the post-indexed instructions the transfer uses the unmodified base
address; the offset is then calculated and added to the base. The base
register is always updated by post-indexed instructions.

2.2.5 Data Processing Operations
--------------------------------

The last operand (the second for binary operations, and the first for unary
operations) may be:

    an 8 bit constant rotated right through an even number of positions.

    eg.
            ADD     R0, R1, #&C5, 10
            MOV     R5, #&FC000003

    Note that in the second example the assembler is left to work out how
    to split the constant &FC000003 into an 8 bit constant and an even
    shift (in this case "#&FC000003" could be replaced by "#&FF, 6"). See
    Loading Constants into Registers starting on page 15 for more
    information.

    a register (optionally) shifted or rotated either by a 5-bit constant
    or by another register.

    eg.
            ADD     R0, R1, R2
            SUB     R0, R1, R2, LSR #10
            CMP     R1, R2, ROR R5
            MVN     R3, R2, RRX

2.2.6 Program Status Register Transfer Instructions
---------------------------------------------------

For the precise format of these instructions see the appropriate datasheet.

2.2.7 Related Topics
--------------------

For more examples which make good use of the barrel shifter see many of the
recipes in chapter Exploring ARM Assembly Language starting on page 20.

The following cover loading constants into registers, and explain how
armasm can help out the assembly language programmer:

    MOV / MVN starting on page 15;

    LDR Rd, =numeric constant starting on page 16.
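
As an illustration only, the five shift types listed in section 2.2.3 can
be modelled in C. This is a hedged sketch of the results described there,
not a description of the hardware: the function names (lsl, lsr, asr, ror,
rrx) are hypothetical, shift amounts are assumed to lie in the range 1 to
31, and the carry out which the real shifter can produce is ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical C models of the barrel shifter operations.
   All shift amounts n are assumed to be in the range 1..31. */

uint32_t lsl(uint32_t x, unsigned n)
{
    assert(n >= 1 && n <= 31);
    return (uint32_t)(x << n);          /* zeros enter at the bottom */
}

uint32_t lsr(uint32_t x, unsigned n)
{
    assert(n >= 1 && n <= 31);
    return x >> n;                      /* zeros enter at the top */
}

uint32_t asr(uint32_t x, unsigned n)
{
    /* Copies of the original top (sign) bit are fed in at the top. */
    uint32_t fill;
    assert(n >= 1 && n <= 31);
    fill = (x & UINT32_C(0x80000000))
               ? (uint32_t)~(UINT32_C(0xFFFFFFFF) >> n)
               : 0u;
    return (x >> n) | fill;
}

uint32_t ror(uint32_t x, unsigned n)
{
    assert(n >= 1 && n <= 31);
    return (x >> n) | (uint32_t)(x << (32u - n));
}

uint32_t rrx(uint32_t x, unsigned carry)
{
    /* 33 bit rotate right by one: the C flag enters at bit 31. */
    return (x >> 1) | ((uint32_t)(carry & 1u) << 31);
}
```

With these definitions lsl(3, 2) gives 12, the byte offset of entry 3 in a
table of words (section 2.2.2), and ror(0xFF, 6) gives 0xFC000003, the
constant used in section 2.2.5.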
2.3 Flexibility of Load and Store Multiple
------------------------------------------

2.3.1 About this Recipe
-----------------------

In this recipe you learn about:

    the benefits and capabilities of the load and store multiple
    instructions;

    types of stacks supported directly by load and store multiple.

2.3.2 Multiple vs Single Transfers
----------------------------------

The Load and Store Multiple instructions provide a way to efficiently move
the contents of several registers to and from memory. The advantages of
using a single load or store multiple instruction over a series of load or
store single instructions are:

    Smaller code size;

    On Von Neumann architectures such as all ARMs up to the ARM6 family,
    there is only a single instruction fetch overhead, rather than many
    instruction fetches;

    On Von Neumann architectures, only one register write back cycle is
    required for a load multiple, as opposed to one for every load single;

    On uncached ARM processors, the first word of data transferred by a
    load or store multiple will always be a non-sequential memory cycle,
    but all subsequent words transferred can be sequential (faster) memory
    cycles.

2.3.3 The Register List
-----------------------

The registers the load and store multiple instructions transfer are encoded
into the instruction by one bit for each of the registers R0 to R15. A set
bit indicates the register will be transferred, and a clear bit indicates
that it will not be transferred. Thus it is possible to transfer any subset
of the registers in a single instruction.

The subset of registers to be transferred is specified simply by listing
those registers in curly brackets.

eg.
            {R1, R4-R6, R8, R10}

2.3.4 Increment / Decrement, Before / After
-------------------------------------------

The base address for the transfer can either be incremented or decremented
between register transfers, and this can happen either before or after each
register transfer.

eg.
            STMIA   R10, {R1, R3-R5, R8}

The suffix IA could also have been IB, DA or DB, where I indicates
increment, D decrement, A after and B before.

2.3.5 Base Register Writeback
-----------------------------

In the last instruction, although the address of the transfer was changed
after each transfer, the base register was not updated at any point.
Register writeback can be specified so that the base register is updated.
Clearly the base register will change by the same amount whether "before"
or "after" is selected.

An example of a load multiple using base writeback is:

            LDMDB   R11!, {R9, R4-R7}

Note
----

In all cases the lowest numbered register is transferred to or from the
lowest memory address, and the highest numbered register to or from the
highest address. [The order in which the registers are listed in the
register list makes no difference. Also, the ARM always performs sequential
memory accesses in increasing memory address order. Therefore
'decrementing' transfers actually perform a subtraction first and then
increment the transfer address register by register].

2.3.6 Stack Notation
--------------------

Since the load and store multiple instructions have the facility to update
the base register (which for stack operations can be the stack pointer),
these instructions provide single instruction push and pop operations for
any number of registers. Load multiple acts as pop, and store multiple as
push.

There are several types of stack which the Load and Store Multiple
Instructions can be used with:

    Ascending or descending stacks. ie. the stack grows up memory or down
    memory. [Sometimes a pair of stacks, one of which grows up memory and
    one of which grows downwards are used - thus choosing the direction is
    not always just a matter of taste].

    Empty or Full stacks. The stack pointer can either point to the top
    item in the stack (a full stack), or the next free space on the stack
    (an empty stack).
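
The four combinations in the list above (full or empty, ascending or
descending) can be sketched in C by pushing and popping one word at a time.
This is an illustration under stated assumptions rather than a model of the
instructions themselves: the names StackType, Stack, push and pop are
hypothetical, memory is an array of words owned by the caller, the stack
pointer is an index into that array, and no bounds checking is done.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: sp is a word index into mem, not a byte address. */
typedef enum {
    FULL_ASCENDING, EMPTY_ASCENDING,
    FULL_DESCENDING, EMPTY_DESCENDING
} StackType;

typedef struct {
    StackType type;
    uint32_t *mem;   /* word memory, owned by the caller */
    int       sp;    /* stack pointer */
} Stack;

void push(Stack *s, uint32_t value)
{
    assert(s != 0 && s->mem != 0);
    switch (s->type) {
    case FULL_ASCENDING:   s->mem[++s->sp] = value; break; /* move up, then store   */
    case EMPTY_ASCENDING:  s->mem[s->sp++] = value; break; /* store, then move up   */
    case FULL_DESCENDING:  s->mem[--s->sp] = value; break; /* move down, then store */
    case EMPTY_DESCENDING: s->mem[s->sp--] = value; break; /* store, then move down */
    }
}

uint32_t pop(Stack *s)
{
    assert(s != 0 && s->mem != 0);
    switch (s->type) {
    case FULL_ASCENDING:  return s->mem[s->sp--];  /* read top, then move down */
    case EMPTY_ASCENDING: return s->mem[--s->sp];  /* move down, then read     */
    case FULL_DESCENDING: return s->mem[s->sp++];  /* read top, then move up   */
    default:              return s->mem[++s->sp];  /* EMPTY_DESCENDING         */
    }
}
```

In each case pop undoes push exactly: a full stack reads the word the
pointer addresses before moving it, while an empty stack moves the pointer
back onto the last item before reading.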
As stated above, pop and push operations for these stacks can be
implemented directly by load and store multiple instructions. To make it
easier for the programmer, special stack suffixes can be added to the LDM
and STM instructions (as an alternative to the Increment / Decrement and
Before / After suffixes) as follows:

    STMFA   R10!, {R0-R5}   ; Push R0-R5 onto a Full Ascending Stack
    LDMFA   R10!, {R0-R5}   ; Pop R0-R5 from a Full Ascending Stack
    STMFD   R10!, {R0-R5}   ; Push R0-R5 onto a Full Descending Stack
    LDMFD   R10!, {R0-R5}   ; Pop R0-R5 from a Full Descending Stack
    STMEA   R10!, {R0-R5}   ; Push R0-R5 onto an Empty Ascending Stack
    LDMEA   R10!, {R0-R5}   ; Pop R0-R5 from an Empty Ascending Stack
    STMED   R10!, {R0-R5}   ; Push R0-R5 onto an Empty Descending Stack
    LDMED   R10!, {R0-R5}   ; Pop R0-R5 from an Empty Descending Stack

2.3.7 Related Topics
--------------------

For more information on using stacks in assembly language see Stacks in
Assembly Language starting on page 22.

For further discussion of some of the benefits which can be gained by using
LDM and STM see Loop Unrolling starting on page 56.

2.4 Loading Constants into Registers
------------------------------------

2.4.1 About this Recipe
-----------------------

This recipe explains and demonstrates:

    Why loading constants / addresses is an issue on the ARM;

    How to solve it using MOV / MVN;

    How to solve it using LDR Rd, =expression;

    How to solve it using ADR and ADRL.

2.4.2 Why is Loading Constants an Issue?
----------------------------------------

Since all ARM instructions are precisely 32 bits long, and ARM instructions
do not use the instruction stream as data, there is no single instruction
which will load any 32 bit immediate constant into a register without
performing a data load from memory.

However, there are ways to load many commonly used constants into a
register without resorting to a data load from memory.
Of course, a data load from memory allows any 32-bit value to be loaded
into a register, but the added expense of a data load can often be avoided.

The assembler provides several 'instruction extensions', and two pseudo
instructions, to make the efficient loading of constants and addresses
painless.

2.4.3 MOV / MVN
---------------

As described in the recipe Using the Barrel Shifter starting on page 10,
the MOV and MVN instructions allow many constants to be constructed. The
constants which these instructions can construct must be eight bit
constants rotated right through an even number of positions. By using MVN
the bitwise complement of such values can also be constructed.

Having to convert a constant into this form is an onerous task no-one wants
to do. Therefore armasm will do this automatically.

Either MOV or MVN may be used with a constant which can be constructed
using either of these instructions. armasm will choose the correct
instruction and construct the constant. If it is impossible to construct
the desired constant armasm will report this as an error.

To illustrate this, look at the following MOV and MVN instructions. The
instruction listed in the comment is the ARM instruction which is produced
by armasm.

    MOV     R0, #0              ; => MOV R0, #0
    MOV     R1, #&FF000000      ; => MOV R1, #&FF, 8
    MOV     R2, #&FFFFFFFF      ; => MVN R2, #0
    MVN     R0, #1              ; => MVN R0, #1
    MOV     R1, #&FC000003      ; => MOV R1, #&FF, 6
    MOV     R2, #&03FFFFFC      ; => MVN R2, #&FF, 6
    MOV     R3, #&55555555      ; Reports an error (it cannot be
                                ; constructed)

2.4.4 Assembling the Example
----------------------------

The above code is available in loadcon1.s in the examples directory.
To assemble it first set the current directory to examples and then issue
the command:

    armasm loadcon1.s -o loadcon1.o -li

To confirm that armasm produced the correct code, the code area can be
disassembled by looking at the output from:

    decaof -c loadcon1.o

Explanation
-----------

The -li argument can be omitted if the tools have been configured
appropriately. See The ARM Tool Reconfiguration Utility (reconfig) starting
on page 45 of the User Manual for details.

decaof is the ARM Object Format decoder. The -c option requests that decaof
disassemble the code area.

2.4.5 LDR Rd, =numeric constant
-------------------------------

armasm provides a mechanism which, unlike MOV and MVN, can construct any
32-bit numeric constant, but which may not use a data processing operation
to do it. This is the "LDR Rd, =" mechanism.

If the numeric constant can be constructed by using either MOV or MVN, then
this will be the instruction used to load the constant. If this cannot be
done, however, armasm will produce an LDR instruction to read the constant
from a literal pool.

2.4.6 Literal Pools
-------------------

A literal pool is a portion of memory set aside for constants. By default a
literal pool is placed right at the end of the program. However, for large
programs, this literal pool may not be accessible throughout the program
(due to the LDR offset being a 12 bit value), so further literal pools can
be placed using the LTORG directive.

When the "LDR Rd, =" mechanism needs to access a literal in a literal pool,
armasm first checks previously encountered literal pools to see if the
desired constant is already available and addressable. If it is then this
literal is addressed, otherwise armasm will attempt to place the literal in
the next available literal pool. If this literal pool is not addressable
then an error will result, and an additional LTORG should be placed close
to (but after) the failed "LDR Rd, =" instruction.
Although this may sound somewhat complicated, in practice it is simple to
use. Consider the following example, which demonstrates how literal pools
and "LDR Rd, =" work. The instruction listed in the comment is the ARM
instruction which gets produced by armasm. This code is for illustration
purposes only, and is not intended to be executed.

            AREA    Example, CODE, REL
            LDR     R0, =42         ; => MOV R0, #42
            LDR     R1, =&55555555  ; => LDR R1, [PC, #offset to Literal Pool 1]
            LDR     R2, =&FFFFFFFF  ; => MVN R2, #0
            LTORG                   ; Literal Pool 1 contains literal &55555555
            LDR     R3, =&55555555  ; => LDR R3, [PC, #offset to Literal Pool 1]
          ; LDR     R4, =&66666666  ; If this is uncommented it will fail, as
                                    ; Literal Pool 2 is not accessible
                                    ; (out of reach)
    LargeTable2 % 4200
            END                     ; Literal Pool 2 is empty

2.4.7 Assembling the Example
----------------------------

The above code is available in loadcon2.s in the examples directory. To
assemble it first set the current directory to examples and then issue the
command:

    armasm loadcon2.s -o loadcon2.o -li

To confirm that armasm produced the correct code, the code area can be
disassembled by looking at the output from:

    decaof -c loadcon2.o

Explanation
-----------

The -li argument can be omitted if the tools have been configured
appropriately. See The ARM Tool Reconfiguration Utility (reconfig) starting
on page 45 of the User Manual for details.

decaof is the ARM Object Format decoder. The -c option requests that decaof
disassemble the code area.

2.4.8 LDR Rd, =PC relative expression
-------------------------------------

As well as numeric constants, the "LDR Rd, =" mechanism can cope with PC
relative expressions, such as labels. Even if a PC relative ADD or SUB
could be constructed, an LDR will be generated to load the PC relative
expression. Thus if a PC relative ADD or SUB is desired then ADR should be
used instead (see ADR and ADRL starting on page 18).
If no suitable literal is already available, then the literal placed into
the next literal pool will be the offset into the AREA, and an AREA
relative relocation directive will be added to ensure that the constant is
appropriate wherever the containing AREA gets located by the linker. See
The Handling of Relocation Directives starting on page 11 of the Reference
Manual for more information about relocation directives.

As an example consider the code below. The instruction listed in the
comment is the ARM instruction which gets produced by armasm. This code is
for illustration purposes only, and is not intended to be executed.

            AREA    Example, CODE, REL
    Start   LDR     R0, =Start            ; => LDR R0, [PC, #offset to Litpool 1]
            LDR     R1, =DataArea + 12    ; => LDR R1, [PC, #offset to Litpool 1]
            LDR     R2, =DataArea + 6000  ; => LDR R2, [PC, #offset to Litpool 1]
            LTORG                         ; Literal Pool 1 holds three literals
            LDR     R3, =DataArea + 6000  ; => LDR R3, [PC, #offset to Litpool 1]
                                          ; (sharing with previous literal)
          ; LDR     R4, =DataArea + 6004  ; If uncommented will produce an error
                                          ; as Litpool 2 is out of range
    DataArea % 8000
            END                           ; Literal Pool 2 is out of range of
                                          ; the LDR instructions above

2.4.9 Assembling the Example
----------------------------

The above code is available in loadcon3.s in the examples directory. To
assemble it first set the current directory to examples and then issue the
command:

    armasm loadcon3.s -o loadcon3.o -li

To confirm that armasm produced the correct code, the code area can be
disassembled by looking at the output from:

    decaof -c loadcon3.o

Explanation
-----------

The -li argument can be omitted if the tools have been configured
appropriately. See The ARM Tool Reconfiguration Utility (reconfig) starting
on page 45 of the User Manual for details.

decaof is the ARM Object Format decoder. The -c option requests that decaof
disassemble the code area.
2.4.10 ADR and ADRL
-------------------

Sometimes it is important for efficiency purposes that loading an address
does not perform a memory access. The assembler provides two pseudo
instructions which make it easier to do this.

Whereas MOV and MVN only accept numeric constants, ADR and ADRL accept
numeric constants, PC relative expressions (labels within the same area)
and register relative expressions.

ADR will attempt to produce a single data processing instruction to load an
address into a register. This instruction will be one of MOV, MVN, ADD or
SUB, in the same way as the "LDR Rd, =" mechanism produces instructions. If
the desired address cannot be constructed in a single instruction an error
will be produced.

ADRL will attempt to produce two data processing instructions to load an
address into a register. Even if it is possible to produce a single data
processing instruction to load the address into the register, a second,
redundant instruction will be produced (this is a consequence of the strict
two-pass nature of armasm). In cases where it is not possible to construct
the address using two data processing instructions ADRL will produce an
error - the "LDR Rd, =" mechanism is probably the best option in this case.

As an example consider the code below. The instructions listed in the
comments are the ARM instructions which are produced by armasm. This code
is for illustration purposes only, and is not intended to be executed.
            AREA    Example, CODE, REL
    Start   ADR     R0, &8000             ; => MOV R0, #&8000
          ; ADR     R1, &8001             ; This would fail as it cannot be
                                          ; constructed by a MOV or MVN
            ADR     R2, Start             ; => SUB R2, PC, #offset to Start
            ADR     R3, DataArea          ; => ADD R3, PC, #offset to DataArea
          ; ADR     R4, DataArea+4300     ; This would fail as the offset cannot
                                          ; be expressed by operand2 of an ADD
            ADRL    R5, DataArea+4300     ; => ADD R5, PC, #offset1
                                          ;    ADD R5, R5, #offset2
            ADRL    R6, &8001             ; => MOV R6, #1
                                          ;    ADD R6, R6, #&8000
          ; ADRL    R7, &55555555         ; This would fail as the constant cannot
                                          ; be constructed by 2 data processing
                                          ; instructions
    DataArea % 8000
            END

2.4.11 Assembling the Example
-----------------------------

The above code is available in loadcon4.s in the examples directory. To
assemble it first set the current directory to examples and then issue the
command:

    armasm loadcon4.s -o loadcon4.o -li

To confirm that armasm produced the correct code, the code area can be
disassembled by looking at the output from:

    decaof -c loadcon4.o

Explanation
-----------

The -li argument can be omitted if the tools have been configured
appropriately. See The ARM Tool Reconfiguration Utility (reconfig) starting
on page 45 of the User Manual for details.

decaof is the ARM Object Format decoder. The -c option requests that decaof
disassemble the code area.

2.4.12 Related topics
---------------------

For more information on the capabilities of the barrel shifter see Using
the Barrel Shifter starting on page 10.
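
The rule from section 2.4.3, that a data processing immediate must be an
eight bit constant rotated right through an even number of positions, can
be checked mechanically. The following C sketch is an illustration only:
the function names rol32, encode_immediate and mov_or_mvn are hypothetical,
and nothing here is claimed to describe how armasm itself is implemented.
Where a value has more than one valid encoding, the sketch simply reports
the one with the smallest rotation.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate left by n bits; this undoes a rotate right by n. */
uint32_t rol32(uint32_t x, unsigned n)
{
    n &= 31u;
    return n ? (uint32_t)((x << n) | (x >> (32u - n))) : x;
}

/* If value equals imm8 rotated right by an even amount rot, store imm8 and
   rot through the pointers and return 1; otherwise return 0. */
int encode_immediate(uint32_t value, uint32_t *imm8, unsigned *rot)
{
    unsigned r;
    assert(imm8 != 0 && rot != 0);
    for (r = 0; r < 32; r += 2) {
        uint32_t candidate = rol32(value, r);
        if (candidate <= 0xFFu) {
            *imm8 = candidate;
            *rot = r;
            return 1;
        }
    }
    return 0;
}

/* Return 'M' if MOV can construct the value, 'N' if MVN can construct it
   (by encoding its bitwise complement), or 0 if neither can. */
char mov_or_mvn(uint32_t value, uint32_t *imm8, unsigned *rot)
{
    if (encode_immediate(value, imm8, rot))
        return 'M';
    if (encode_immediate((uint32_t)~value, imm8, rot))
        return 'N';
    return 0;
}
```

Applied to the constants in section 2.4.3, this reports 'M' with #&FF, 8
for &FF000000, 'N' with #&FF, 6 for &03FFFFFC, and 0 for &55555555, in
agreement with the comments in that example.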