This document is Copyright 1994 ARM Ltd, and has been included on this disc with their kind permission.

This manual is supplied "as is"; ARM Limited ("ARM") makes no warranty, express or implied, of the merchantability of this document or its fitness for any particular purpose. In no circumstances shall ARM be liable for any damage, loss of profits, or any indirect or consequential loss arising out of the use of these recipes or inability to use these recipes, even if ARM has been advised of the possibility of such loss.

---------------------------------------------------------------------------

5. Programming in C
~~~~~~~~~~~~~~~~~~~

5.1 A Very Simple C Program
---------------------------

5.1.1 About this Recipe
-----------------------

This recipe gives you a simple exercise in using the ARM Software Development Toolkit (the toolkit) to write a program in C. By following it, you will learn how to:

  - use the ARM C compiler armcc to create a runnable program;
  - use the ARM source level debugger armsd to run your program on a (simulated) ARM system;
  - use armcc to compile a C program to an object file;
  - use the ARM linker armlink to create a runnable program from an object file and the ARM C library.

5.1.2 Prerequisites
-------------------

Before you can try this recipe, the toolkit must be properly installed on your computer. Instructions for installation are given in the installation notes distributed with every toolkit. If you experience any difficulties, please refer to these notes.

5.1.3 Making a Simple Runnable Program
--------------------------------------

The "Hello World" program shown below is included in the on-line examples as file hellow.c in the directory examples:

    #include <stdio.h>

    int main( int argc, char **argv )
    {
        printf("Hello World\n");
        return 0;
    }

If you set your working directory to be the examples directory you can compile this program to runnable form in a single step using:

    armcc hellow.c -li -apcs 3/32bit

Explanation
-----------

The argument -li says that the target is little endian and -apcs 3/32bit says that the 32 bit ARM procedure call standard should be used. If the compiler has been configured to use these options by default then these arguments need not be given (see The ARM Tool Reconfiguration Utility (reconfig) starting on page 45 of the User Manual for details). The executable program is left in a file called hellow.

5.1.4 Running the Program
-------------------------

You can run the program (technically an AIF Image) using armsd. You should follow the sample dialog below:

    host-prompt> armsd -li hellow
    A.R.M. Source-level Debugger, version 4.10 (A.R.M.) [Aug 26 1992]
    ARMulator V1.20, 512 Kb RAM, MMU present, Demon 1.01, FPE, Little endian.
    Object program file hellow
    armsd: go
    Hello world
    Program terminated normally at PC = 0x000082a0
    0x000082a0: 0xef000011 .... : > swi 0x11
    armsd: quit
    Quitting
    host-prompt>

Explanation
-----------

The -li argument to armsd tells it to emulate a little endian ARM. If armsd has been configured to be little endian by default then -li can be omitted (see The ARM Tool Reconfiguration Utility (reconfig) starting on page 45 of the User Manual for how to configure the ARM development tools).

When armsd comes up with its "armsd:" prompt and waits for your command, you should type "go" followed by a carriage return (CR). At the next prompt type "quit" followed by CR to exit armsd.

5.1.5 Separate Compiling
------------------------

You can invoke the compiler and the linker separately.
You can use:

    armcc -c hellow.c -li -apcs 3/32bit

to make an object file (in this example called hellow.o, by default).

Explanation
-----------

The -c flag tells the compiler to make an object file but not to link it with the C library.

5.1.6 Separate Linking
----------------------

When you have finished compiling, you can link your object file with the C library to make a runnable program using:

    armlink -o hellow hellow.o somewhere/armlib.32l

Where we have written somewhere, above, you must type the name of the directory containing the ARM C libraries.

Notes
-----

You now have to be very explicit; you must specify:

  - the name of the file which will contain the runnable program (here, hellow);
  - the name of the object file (here, hellow.o);
  - the location and name of the C library you wish to use.

In simple cases, armcc can reduce the need to be so explicit.

5.1.7 Related Topics
--------------------

Please refer to the index to find topics of particular interest.

5.2 Writing Efficient C for the ARM
-----------------------------------

5.2.1 About This Recipe
-----------------------

The ARM C compiler can generate very good machine code if you present it with the right sort of input. From this note, you will learn:

  - what the C compiler compiles well and why;
  - how to help the C compiler to generate excellent machine code.

Some of the rules of thumb presented are quite general; some are quite specific to the ARM or the ARM C compiler. It should be quite clear from context which rules are portable.

The first subsection below is concerned with how to design collections of C functions to maximise low-level efficiency. The following subsection is concerned with the efficiency of larger and more complicated functions.

5.2.2 Function Design Considerations
------------------------------------

Unlike on many earlier CISC processor architectures, function call overhead on the ARM is small and often in proportion to the work done by the called function.
Several features contribute to this:

  - the minimal ARM call-return sequence is BL... MOV pc, lr, which is extremely economical;
  - STM and LDM reduce the cost of entry to and exit from functions which must create a stack frame and/or save registers;
  - the ARM Procedure Call Standard has been carefully designed to allow two very important types of function call to be optimised so that the entry and exit overheads are minimal.

Good general advice is to keep functions small, because function calling overheads are low. In the remainder of this subsection you will learn precisely when function call overhead is very low. In following subsections you will learn how small functions help the ARM C compiler; you will also learn how to assist the C compiler when functions cannot be kept small.

Leaf Functions
--------------

In 'typical' programs, about half of all function calls made are to leaf functions (a leaf function is one which makes no calls from within its body).

Often, a leaf function is rather simple. On the ARM, if it is simple enough to compile using just 5 registers (a1-a4 and ip), it will carry no function entry or exit overhead. A surprising proportion of useful leaf functions can be compiled within this constraint.

Once registers have to be saved, it is efficient to save them using STM. In fact the more you can save at one go, the better. In a leaf function, all and only the registers which need to be saved will be saved by a single STMFD sp!,{regs,lr} on entry and a matching LDMFD sp!,{regs,pc} on exit.

In general, the cost of pushing some registers on entry and popping them on exit is very small compared to the cost of the useful work done by a leaf function which is complicated enough to need more than 5 registers.

Overall, you should expect a leaf function to carry virtually no function entry and exit overhead; and at worst, a small overhead, most likely in proportion to the useful work done by it.
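
As a rough illustration of the kind of simple leaf function discussed above, the sketch below makes no calls from its body and keeps only the argument and a counter live at any time. The function name and the choice of computation are hypothetical, chosen for illustration only; whether a particular compiler version really compiles it within a1-a4 and ip would have to be checked by inspecting its output.

```c
/* Hypothetical example (not from the toolkit): count the set bits in a word.
 * This is a leaf function: it makes no calls from within its body, and only
 * 'x' and 'n' are live at any point. */
unsigned count_bits(unsigned x)
{
    unsigned n = 0;

    while (x != 0) {
        x &= x - 1;     /* clear the lowest set bit */
        ++n;
    }
    return n;
}
```

A caller simply writes count_bits(value); the point of the sketch is the shape of the function, not the particular calculation.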

Veneer Functions (Simple Tail Continued Functions)
--------------------------------------------------

Historically, abstraction veneers have been relatively expensive. The kind of veneer function which merely changes the types of its arguments, or which calls a low-level implementation with an extra argument (say), has often cost much more in entry and exit overhead than it was worth in useful work.

On the ARM, if a function ends with a call to another function, that call can be converted to a tail continuation. In functions which need to save no registers, the effect can be dramatic. Consider, for example:

    extern void *__sys_alloc(unsigned type, unsigned n_words);
    #define NOTGCable  0x80000000
    #define NOTMovable 0x40000000

    void *malloc(unsigned n_bytes)
    {
        return __sys_alloc(NOTGCable+NOTMovable, n_bytes/4);
    }

Here, armcc generates (the version of armcc supplied with this release may produce slightly different output):

    malloc
        MOV     a2,a1,LSR #2
        MOV     a1,#&c0000000
        B       |__sys_alloc|

There is no function entry or exit overhead - just useful work massaging arguments - and the function return has disappeared entirely - return is direct from __sys_alloc to malloc's caller. In this case, the basic call-return cost for the function pair has been reduced from:

    BL + BL + MOV pc,lr + MOV pc,lr

to:

    BL + B + MOV pc,lr

a saving of 25%.

More complicated functions in which the only function calls are immediately before a return, collapse equally well.
An artificial example is:

    extern int f1(int), f2(int, int);

    int f(int a, int b)
    {
        if (b == 0)
            return a;
        else if (b < 0)
            return f2(a, -b);
        else
            return f2(b, a);    /* argument order swapped */
    }

armcc generates the following, wonderfully efficient code (the version of armcc supplied with this release may produce slightly different output):

    f
        CMP     a2,#0
        MOVEQS  pc,lr
        RSBLT   a2,a2,#0
        BLT     f2
        MOV     a3,a1
        MOV     a1,a2
        MOV     a2,a3
        B       f2

Fast Paths and Slow Paths - A Useful Transformation
---------------------------------------------------

Inevitably, not all functions can be leaves or small abstraction functions. And, inevitably, non-leaf functions must carry the cost of establishing a call frame on entry and removing it on exit, perhaps also the cost of saving and restoring some registers. How does this hurt performance? Consider the following example:

    int f(Buffer *b)
    {
        if (b->n > 0)
        {
            /* The usual path through the function... */
            /* 95% of all calls.*/
            /* Simple calculation involving b->buf, b->n, etc.*/
            return ...;
        }
        /* Exceptional path through the function... */
        /* 5% of all calls. */
        /* Complicated calculation involving calls */
        /* to other functions.*/
        return ...;
    }

In this case, the entry and register-save overhead caused by the infrequent heavyweight path through the function applies to the much more frequent lightweight path through it.

To fix this, turn the heavyweight path into a tail call. Yes, introducing another layer of function call yields much more efficient code!

    int f2(Buffer *b)
    {
        /* Exceptional path through the function... */
        /* 5% of all calls. */
        /* Complicated calculation involving calls */
        /* to other functions.*/
        return ...;
    }

    int f(Buffer *b)
    {
        if (b->n > 0)
        {
            /* The usual path through the function... */
            /* 95% of all calls.*/
            /* Simple calculation involving b->buf, b->n, etc.*/
            return ...;
        }
        return f2(b);
    }

If you are lucky, f() will now compile using only a1-a4 and ip and so incur no entry overhead whatsoever.
95% of the time, the overhead on the original f() will be reduced to zero.

This is quite a general source transformation technique and you should look for opportunities to use it and analogous transformations. It works for any processor to some extent; it works particularly well for the ARM because of the careful optimisation of tail continuation in lightweight functions.

Repeated application of this technique to the chain of six or so functions called for every character processed by the preprocessing phase of the ARM C compiler improved the performance of the preprocessor (running on the ARM) by about 30%.

Function Arguments and Argument Passing
---------------------------------------

The final aspect of function design which influences low-level efficiency is argument passing.

Under the ARM Procedure Call Standard, up to four argument words can be passed to a function in registers. Functions of up to four integral (not floating point) arguments are particularly efficient and incur very little overhead beyond that required to compute the argument expressions themselves (there may be a little register juggling in the called function, depending on its complexity).

If more arguments are needed, then the 5th, 6th, etc., words will be passed on the stack. This incurs the cost of an STR in the calling function and an LDR in the called function for each argument word beyond four.

How can argument passing overhead be minimised?
-----------------------------------------------

  - Try to ensure that small functions take four or fewer arguments. These will compile particularly well.
  - If a function needs many arguments, try to ensure that it does a significant amount of work on every call, so that the cost of passing arguments is amortised.
  - Factor out read-mostly global control state and make this static. If it has to be passed as an argument (e.g. to support multiple clients) then wrap it up in a struct and pass a pointer to it.
    The characteristics of such control state are:

      - it's logically global to the compilation unit or program;
      - it's read-mostly, often read-only except in response to user input, and for almost all functions cannot be changed by them or any function called from them;
      - references to it are ubiquitous, but in any function, references are relatively rare (frequent references should be replaced by references to a local, non-static copy).

    Don't confuse such control state with computational arguments, the values of which differ on every call.

  - Collect related data into structs. Decide whether to pass pointers or struct values based on the use of each struct in the called function:

      - If few fields are read or written then passing a pointer is best.
      - The cost of passing a struct via the stack is typically a share in an LDM-STM pair for each word of the struct. This can be better than passing a pointer if (i) on average, each field is used at least once and (ii) the register pressure in the function is high enough to force a pointer to be repeatedly re-loaded.

As a rule of thumb, you can't lose much efficiency if you pass pointers to structs rather than struct values. To gain efficiency by passing struct values rather than pointers usually requires careful study of a function's machine code.

5.2.3 Register Allocation and How To Help It
--------------------------------------------

It is well known that register allocation is critical to the efficiency of code compiled for RISC processors. It is particularly critical for the ARM, which has only 16 registers rather than the 'traditional' 32.

The ARM C compiler has a highly efficient register allocator which operates on complete functions and which tries to allocate the most frequently used variables to registers (taking loop nesting into account). It produces very good results unless the demand for registers seriously outstrips supply.
And it has one shortcoming, namely that it allocates whole variables to registers, not separate live ranges.

As code generation proceeds for a function, new variables are created for expression temporaries. These are never reused in later expressions and cannot be spilled to memory. Usually, this causes no problems. However, a particularly pathological expression could, in principle, occupy most of the allocatable registers, forcing almost all program variables to be spilled to memory. Because the number of registers required to evaluate an expression is a logarithmic function of the number of terms in it, it takes an expression of more than 32 terms to threaten the use of any variable register.

As a rule of thumb, avoid very large expressions (more than 30 terms).

The more serious problem is with long scope program variables. Our allocator either allocates a variable to a chosen register everywhere the variable is live, or it leaves the variable in memory. To help visualise the problem - and to see how to help the allocator - consider the following two program schemata:

    int f()                                 int f()
    {   int i, j, ...;                      {   int j, ...;
                                                {   int i;
        for (i = 0; i < lim; ++i)                   for (i = 0; i < lim; ++i)
        {                                           {
            ...                                         ...
        }                                           }
                                                }
                                                {   int i;
        for (i = 0; i < lim; ++i)                   for (i = 0; i < lim; ++i)
        {   /* register pressure in this            {
               loop forces 'i' to memory */
        }                                           }
                                                }
                                                {   int i;
        for (i = 0; i < lim; ++i)                   for (i = 0; i < lim; ++i)
        {                                           {
            ...                                         ...
        }                                           }
                                                }
    }                                       }

In the left hand case, because the scope of 'i' is the whole function, if 'i' cannot be allocated to a register everywhere then all three loops will suffer their loop index being in memory.

On the other hand, in the right hand case there are three separate variables called 'i', each of which will be allocated separately by the register allocator.

As a rule of thumb, keep variable declarations local, especially in large functions. Use additional block structure as illustrated here (right hand example), if necessary.
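
The right hand schema can be made concrete with a small compilable sketch. The function name and the work done in each loop are assumptions made purely for illustration; the point is only that each loop index is declared in its own block, so each 'i' is a separate variable as far as an allocator of the kind described above is concerned.

```c
/* Hypothetical function: three loops, each with its own block-local index.
 * Because each 'i' has a short scope, a whole-variable register allocator
 * can treat the three indices independently. */
int sum_three_ways(const int *v, int lim)
{
    int total = 0;

    {   int i;
        for (i = 0; i < lim; ++i)
            total += v[i];
    }
    {   int i;
        for (i = 0; i < lim; ++i)
            total += v[i] * 2;
    }
    {   int i;
        for (i = 0; i < lim; ++i)
            total += v[i] * 3;
    }
    return total;
}
```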

On the other hand, if this transformation is carried to excess, there may be bad results. When a local variable is spilled to memory, there is a stack adjustment on each entry to and exit from its containing scope. The ARM C compiler does this to minimise the space used by local variables. Suppose, for example, that in the right hand case above, each block declared a 1KB buffer as well as 'i'. Then adjusting the stack at every scope leads to stack usage of just over 1KB whereas adjusting it only at function entry leads to usage of more than 3KB.

In principle, the compiler could be more intelligent about adjusting the stack locally for large variables and only at function entry for small variables. For the moment, the programmer must be aware of these issues.

So, a modified rule of thumb is to cluster variable declarations into reasonable sub-scopes within large functions and to avoid doing so within the most deeply nested loops. This will most likely help the allocator without introducing unwanted costs associated with local stack adjustment.

5.2.4 Static and Extern Variables - Minimising Access Costs
-----------------------------------------------------------

A variable in a register costs nothing to access: it is just there, waiting to be used. A local (auto) variable is addressed via the sp register, which is always available for the purpose.

A static variable, on the other hand, can only be accessed after the static base for the compilation unit has been loaded. So, the first such use in a function always costs 2 LDRs or an LDR and an STR. However, if there are many uses of static variables within a function then there is a good chance that the static base will become a global common subexpression (CSE) and that, overall, access to static variables will be no more expensive than to auto variables on the stack.

Extern variables are fundamentally more expensive: each has its own base pointer.
Thus each access to an extern is likely to cost 2 LDRs or an LDR and an STR. It is much less likely that a pointer to an extern will become a global CSE - and almost certain that there cannot be several such CSEs - so if a function accesses lots of extern variables, it is bound to incur significant access costs.

A further cost occurs when a function is called: the compiler has to assume - in the absence of inter-procedural data flow analysis - that any non-const static or extern variable could be side-effected by the call. This severely limits the scope across which the value of a static or extern variable can be held in a register.

Sometimes a programmer can do better than a compiler could do, even a compiler that did interprocedural data flow analysis. An example in C is given by the standard streams: stdin, stdout and stderr. These are not pointers to const objects (the underlying FILE structs are modified by I/O operations), nor are they necessarily const pointers (they may be assignable in some implementations). Nonetheless, a function can almost always safely slave a reference to a stream in a local FILE * variable.

It is a common programming paradigm to mimic the standard streams in applications. Consider, for example, the shape of a typical non-leaf printing function:

    extern FILE *out;               extern FILE *out;
    /* the output stream */         /* the output stream */

    void print_it(Thing *t)         void print_it(Thing *t)
    {                               {   FILE *f = out;
        fprintf(out, ...);              fprintf(f, ...);
        print_1(t->first);              print_1(t->first);
        fprintf(out, ...);              fprintf(f, ...);
        print_2(t->second);             print_2(t->second);
        fprintf(out, ...);              fprintf(f, ...);
        ...                             ...
    }                               }

In the left hand case, out has to be re-computed or re-loaded after each call to print_... (and after each fprintf...). In the right hand case, 'f' can be held in a register throughout the function (and probably will be).

Uniform application of this transformation to the disassembly module of the ARM C compiler saved more than 5% of its code space.
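
A self-contained sketch of the right hand form follows. The names out, print_1 and print_pair, and the text printed, are hypothetical stand-ins for the application's own stream and printing functions. The transformation is valid here only because nothing called from print_pair assigns to out; that is an assumption the programmer must check, as the surrounding text explains.

```c
#include <stdio.h>

/* Hypothetical application-wide output stream (an assumption of this sketch);
 * a real program would set it up during initialisation. */
FILE *out;

/* A helper which also writes to 'out' but never reassigns it. */
static void print_1(int x)
{
    fprintf(out, "<%d>", x);
}

/* 'out' is read once into the local 'f'. After that the compiler is free to
 * keep 'f' in a register across the calls to print_1 and fprintf, whereas
 * 'out' itself would have to be assumed changed by every call. */
void print_pair(int first, int second)
{
    FILE *f = out;

    fprintf(f, "(");
    print_1(first);
    fprintf(f, ",");
    print_1(second);
    fprintf(f, ")\n");
}
```

After setting out (for example, out = stdout;), a call such as print_pair(1, 2) writes one line to that stream.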

In general, it is difficult and potentially dangerous to assert that no function you call (or any functions they in turn call) can affect the value of any static or extern variables of which you currently have local copies. However, the rewards can be considerable so it is usually worthwhile to work out at the program design stage which global variables are slavable locally and which are not. Trying to retrofit this improvement to existing code is usually hazardous, except in very simple cases like the above.

5.2.5 The switch() Statement
----------------------------

The switch() statement can be used to transfer control to one of several destinations - conceptually an indexed transfer of control - or to generate a value related to the controlling expression (in effect computing an in-line function of the controlling expression).

In the first role, switch() is hard to improve upon: the ARM C compiler does a good job of deciding when to compile jump tables and when to compile trees of if-then-elses. It is rare for a programmer to be able to improve upon this by writing if-then-else trees explicitly in the source.

In the second role, however, use of switch() is often mistaken. You can probably do better by being more aware of what is being computed and how.

In the example below, which is abstracted from an early version of the disassembly module of the ARM C Compiler, you will learn:

  - the cost of implementing an in-line function using switch();
  - how to implement the same function more economically.

The function below, used for illustrative purposes, maps a 4-bit field of an ARM instruction to a 2-character condition code mnemonic. The real case was more complicated, decoding two 4-bit fields to a 3-char mnemonic, but for illustration the simple example serves just as well. The real case was also embedded in a larger function, but this is irrelevant to the discussion.

    char *cond_of_instr(unsigned instr)
    {
        char *s;
        switch (instr & 0xf0000000)
        {
    case 0x00000000: s = "EQ"; break;
    case 0x10000000: s = "NE"; break;
    ... ... ...
    case 0xF0000000: s = "NV"; break;
        }
        return s;
    }

The compiler handles this code fragment well, generating 276 bytes of code and string literals. But we could do better. If performance were not critical (as it never is in disassembly) then we could look up the code in a table of codes, in something like:

    char *cond_of_instr(unsigned instr)
    {
        static struct {char name[3]; unsigned code;} conds[] = {
            "EQ", 0x00000000,
            "NE", 0x10000000,
            ....
            "NV", 0xf0000000,
        };
        int j;
        for (j = 0; j < sizeof(conds)/sizeof(conds[0]); ++j)
            if ((instr & 0xf0000000) == conds[j].code)
                return conds[j].name;
        return "";
    }

This fragment compiles to 68 bytes of code and 128 bytes of table data. Already this is a 30% improvement on the switch() case, but this schema has other advantages: it copes well with a random code to string mapping and, if the mapping is not random, admits further optimisation. For example, if the code is stored in a byte (char) instead of an unsigned and the comparison is with (instr >> 28) rather than (instr & 0xF0000000) then only 60 bytes of code and 64 bytes of data are generated for a total of 124 bytes.

Another advantage we have heard of for table lookup is that it is possible to share the same table between a disassembler and an assembler - the assembler looks up the mnemonic to obtain the code value, rather than the code value to obtain the mnemonic. Where performance is not critical, the symmetric property of lookup tables can sometimes be exploited to yield significant space savings.
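
The byte-coded variant and the shared-table idea can be sketched together as below. This is an illustrative reconstruction, not the code used in the ARM tools: the function code_of_cond, the macro N_CONDS and the choice of -1 for "not found" are assumptions of this sketch. The mnemonics are listed in the standard ARM condition field order (0 = EQ up to 15 = NV).

```c
#include <string.h>

/* One table shared by both directions of lookup. Each entry is 4 bytes:
 * a 2-character mnemonic plus its NUL, and the 4-bit condition code. */
static const struct { char name[3]; unsigned char code; } conds[] = {
    {"EQ", 0x0}, {"NE", 0x1}, {"CS", 0x2}, {"CC", 0x3},
    {"MI", 0x4}, {"PL", 0x5}, {"VS", 0x6}, {"VC", 0x7},
    {"HI", 0x8}, {"LS", 0x9}, {"GE", 0xA}, {"LT", 0xB},
    {"GT", 0xC}, {"LE", 0xD}, {"AL", 0xE}, {"NV", 0xF}
};
#define N_CONDS (sizeof(conds) / sizeof(conds[0]))

/* Disassembler direction: condition field (top 4 bits) -> mnemonic. */
const char *cond_of_instr(unsigned instr)
{
    unsigned j;
    for (j = 0; j < N_CONDS; ++j)
        if (((instr >> 28) & 0xF) == conds[j].code)
            return conds[j].name;
    return "";
}

/* Assembler direction (hypothetical): mnemonic -> condition code,
 * or -1 if the mnemonic is not in the table. */
int code_of_cond(const char *name)
{
    unsigned j;
    for (j = 0; j < N_CONDS; ++j)
        if (strcmp(name, conds[j].name) == 0)
            return conds[j].code;
    return -1;
}
```

Both functions are linear searches, which is in keeping with the text's premise that performance is not critical here; the saving is that only one copy of the table exists.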

Finally, by exploiting the denseness of the indexing and the uniformity of the returned value it is possible to do better again, both in size and performance, by direct indexing:

    char *cond_of_instr(unsigned instr)
    {
        return "\
    EQ\0\0NE\0\0CS\0\0CC\0\0MI\0\0PL\0\0VS\0\0VC\0\0\
    HI\0\0LS\0\0GE\0\0LT\0\0GT\0\0LE\0\0AL\0\0NV" + (instr >> 28)*4;
    }

This expression of the problem causes a miserly 16 bytes of code and 64 bytes of string literal to be generated and is probably close to what an experienced assembly language programmer would naturally write if asked to code this function. It is the solution finally adopted in the ARM C compiler's disassembler module.

The uniform application of this transformation to the disassembler module of the ARM C compiler saved between 5% and 10% of its code space.

The moral of this tale is to think before using switch() to compute an in-line function, especially if code size is an important consideration. Switch() compiles to high performance code but often table lookup will be smaller; where the function's domain is dense, or piecewise dense, direct indexing into a table will often be both faster and smaller.

5.2.6 Related Topics
--------------------

  - ARM Assembly Programming Performance Issues starting on page 55.
  - Register Usage under the ARM Procedure Call Standard starting on page 62.
  - Passing and Returning structs starting on page 67.

5.3 C Programming for Deeply Embedded Applications
--------------------------------------------------

5.3.1 About this Recipe
-----------------------

In this recipe you will learn about the standalone runtime support system for C programming in deeply embedded applications.
In particular you will discover:

  - what rtstand.s supports;
  - how to make use of it by looking at example programs;
  - how to extend it by adding extra functionality from the C library;
  - the size of the standalone run time library.

5.3.2 Introduction
------------------

The semi hosted ANSI C library provides all the standard C library facilities (and thus is quite large). This is acceptable when running under emulation with plenty of memory available, or maybe even when running on development hardware with access to a real debugging channel and plenty of memory. However, in a deeply embedded application many of the facilities of the C library may no longer be relevant (e.g. file access functions, time and date functions), and the size of the semi hosted ANSI C library may be prohibitive if the memory available is severely limited.

For deeply embedded applications a minimal C runtime system is needed which takes up as little memory as possible, is easily portable to the target hardware, and only supports those functions required for such an application.

The ARM Software Development Toolkit comes with a minimal runtime system in source form. The 'behind the scenes' jobs which it performs are:

  - setting up the initial stack and heap, and calling main;
  - program termination - either automatic (returning from main()) or forced (explicitly calling __rt_exit);
  - simple heap allocation (__rt_alloc);
  - stack limit checking;
  - setjmp and longjmp support;
  - divide and remainder functions (calls to which can be generated by armcc);
  - high level error handler support (__err_handler);
  - optional floating point support, and a means to detect whether floating point support is available or not (__rt_fpavailable).

The source code rtstand.s documents the options which you may want to change for your target. These are not covered in this recipe.

The header file rtstand.h documents the functions which rtstand.s provides to the C programmer.
Note that no support is provided for outputting data down the debugging channel. This can be done, but is specific to the target application. The example C programs described below use the ARM Debug Monitor available under armsd to output messages using in-line SWIs. See ARM Debug Monitor starting on page 104 of the Technical Specifications for full details of the facilities which the ARM Debug Monitor provides, and see In-Line SWIs starting on page 72 for more information about in-line SWIs.

5.3.3 Using the Standalone Runtime System
-----------------------------------------

In this section the main features of the standalone runtime system are demonstrated by example programs.

Before attempting any of the demonstrations below create a working directory, and set this up as your current directory. Copy the contents of the clstand directory into your working directory, and also copy the files fpe*.o from the fpe340 directory of the cl directory into your working directory. You are now ready to experiment with the C standalone runtime system.

In the examples below, the following options are passed to armcc, armasm, and in the first case armsd:

    -li             This specifies that the target is a little endian ARM.

    -apcs 3/32bit   This specifies that the 32 bit variant of APCS 3 should be used. For armasm this is used to set the built in variable {CONFIG} to 32.

These arguments can be changed if the target hardware differs from this configuration. If the ARM Software Tools have been configured as desired then these options may be omitted, as the tools will default to the configuration time values. See The ARM Tool Reconfiguration Utility (reconfig) starting on page 45 of the User Manual for how to configure the ARM Software Tools.

These demonstrations are likely to be most useful if the sources rtstand.s, errtest.c and memtest.c are studied in conjunction with this recipe.

5.3.4 A Simple Program
----------------------

Let us compile the example program errtest.c, and assemble the standalone runtime system. These can then be linked together to provide an executable image, errtest:

    armcc -c errtest.c -li -apcs 3/32bit
    armasm rtstand.s -o rtstand.o -li -apcs 3/32bit
    armlink -o errtest errtest.o rtstand.o

We can then execute this image under armsd as follows:

    > armsd -li errtest
    A.R.M. Source-level Debugger, version 4.10 (A.R.M.) [Aug 26 1992]
    ARMulator V1.20, 512 Kb RAM, MMU present, Demon 1.01, FPE, Little endian.
    Object program file errtest
    armsd: go
    (the floating point instruction-set is not available)
    Using integer arithmetic ...
    10000 / 0X0000000A = 0X000003E8
    10000 / 0X00000009 = 0X00000457
    10000 / 0X00000008 = 0X000004E2
    10000 / 0X00000007 = 0X00000594
    10000 / 0X00000006 = 0X00000682
    10000 / 0X00000005 = 0X000007D0
    10000 / 0X00000004 = 0X000009C4
    10000 / 0X00000003 = 0X00000D05
    10000 / 0X00000002 = 0X00001388
    10000 / 0X00000001 = 0X00002710
    Program terminated normally at PC = 0x00008550
    0x00008550: 0xef000011 .... : > swi 0x11
    armsd: quit
    Quitting
    >

The '>' prompt is the Operating System prompt, and the 'armsd:' prompt is output by armsd to indicate that user input is required.

Already several of the standalone runtime system's facilities have been demonstrated:

  - the C stack and heap have been set up;
  - main has clearly been called;
  - the fact that floating point support is not available has been detected;
  - the integer division functions have been used by the compiler;
  - program termination was successful.

5.3.5 Error Handling
--------------------

The same program, errtest, can also be used to demonstrate error handling, by recompiling errtest.c and predefining the DIVIDE_ERROR macro:

    armcc -c errtest.c -li -apcs 3/32bit -DDIVIDE_ERROR
    armlink -o errtest errtest.o rtstand.o

Again, we can now execute this image under armsd as follows:

    > armsd -li errtest
    A.R.M. Source-level Debugger, version 4.10 (A.R.M.) [Aug 26 1992]
    ARMulator V1.20, 512 Kb RAM, MMU present, Demon 1.01, FPE, Little endian.
    Object program file errtest
    armsd: go
    (the floating point instruction-set is not available)
    Using integer arithmetic ...
    10000 / 0X0000000A = 0X000003E8
    10000 / 0X00000009 = 0X00000457
    10000 / 0X00000008 = 0X000004E2
    10000 / 0X00000007 = 0X00000594
    10000 / 0X00000006 = 0X00000682
    10000 / 0X00000005 = 0X000007D0
    10000 / 0X00000004 = 0X000009C4
    10000 / 0X00000003 = 0X00000D05
    10000 / 0X00000002 = 0X00001388
    10000 / 0X00000001 = 0X00002710
    10000 / 0X00000000 = errhandler called: code = 0X00000001: divide by 0
    caller's pc = 0X00008304
    returning...
    run time error: divide by 0
    program terminated
    Program terminated normally at PC = 0x0000854c
    0x0000854c: 0xef000011 .... : > swi 0x11
    armsd: quit
    Quitting
    >

This time an integer division by zero has been detected by the standalone runtime system, which called __err_handler. __err_handler output the first set of error messages in the above output. Control was then returned to the runtime system which output the second set of error messages and terminated execution.

5.3.6 longjmp and setjmp
------------------------

A further demonstration can be made using errtest by predefining the macro LONGJMP to perform a longjmp out of __err_handler back into the user program, thus catching and dealing with the error. First recompile and link errtest:

    armcc -c errtest.c -li -apcs 3/32bit -DDIVIDE_ERROR -DLONGJMP
    armlink -o errtest errtest.o rtstand.o

Then rerun errtest under armsd. We expect the integer divide by zero to occur once again:

    > armsd -li errtest
    A.R.M. Source-level Debugger, version 4.10 (A.R.M.) [Aug 26 1992]
    ARMulator V1.20, 512 Kb RAM, MMU present, Demon 1.01, FPE, Little endian.
    Object program file errtest
    armsd: go
    (the floating point instruction-set is not available)
    Using integer arithmetic ...
    10000 / 0X0000000A = 0X000003E8
    10000 / 0X00000009 = 0X00000457
    10000 / 0X00000008 = 0X000004E2
    10000 / 0X00000007 = 0X00000594
    10000 / 0X00000006 = 0X00000682
    10000 / 0X00000005 = 0X000007D0
    10000 / 0X00000004 = 0X000009C4
    10000 / 0X00000003 = 0X00000D05
    10000 / 0X00000002 = 0X00001388
    10000 / 0X00000001 = 0X00002710
    10000 / 0X00000000 = errhandler called: code = 0X00000001: divide by 0
    caller's pc = 0X00008310
    returning...
    Returning from __err_handler() with errnum = 0X00000001
    Program terminated normally at PC = 0x00008558
    0x00008558: 0xef000011 .... : > swi 0x11
    armsd: quit
    Quitting
    >

The runtime system detected the integer divide by zero, and as before __err_handler was called, which produced the error messages. However, this time __err_handler used longjmp to return control to the program, rather than the runtime system.

5.3.7 Floating Point Support
----------------------------

Using errtest we can also demonstrate floating point support. You should already have copied the appropriate floating point emulator object code into your working directory. For the configuration used in this example fpe_32l.o is the correct object file. However, in addition to this it is also necessary to link with an fpe stub, which we must compile from the source given (fpestub.s).

    armasm fpestub.s -o fpestub.o -li -apcs 3/32bit
    armlink -o errtest errtest.o rtstand.o fpestub.o fpe_32l.o -d

The resulting executable, errtest, can be run under armsd as before:

    > armsd -li errtest
    A.R.M. Source-level Debugger, version 4.10 (A.R.M.) [Aug 26 1992]
    ARMulator V1.20, 512 Kb RAM, MMU present, Demon 1.01, FPE, Little endian.
    Object program file errtest
    armsd: go
    (the floating point instruction-set is available)
    Using Floating point, but casting to int ...
    10000 / 0X0000000A = 0X000003E8
    10000 / 0X00000009 = 0X00000457
    10000 / 0X00000008 = 0X000004E2
    10000 / 0X00000007 = 0X00000594
    10000 / 0X00000006 = 0X00000682
    10000 / 0X00000005 = 0X000007D0
    10000 / 0X00000004 = 0X000009C4
    10000 / 0X00000003 = 0X00000D05
    10000 / 0X00000002 = 0X00001388
    10000 / 0X00000001 = 0X00002710
    10000 / 0X00000000 = errhandler called: code = 0X80000202: Floating Point Exception : Divide By Zero
    caller's pc = 0XE92DE000
    returning...
    Returning from __err_handler() with errnum = 0X80000202
    Program terminated normally at PC = 0x00008558 (__rt_exit + 0x10)
    +0010 0x00008558: 0xef000011 .... : > swi 0x11
    armsd: quit
    Quitting
    >

This time the floating point instruction set is found to be available, and when a floating point division by zero is attempted, __err_handler is called with the details of the floating point divide by zero exception. Note that if you have compiled errtest.c other than as in longjmp and setjmp starting on page 90, you will not see precisely this dialogue with armsd.

5.3.8 Running Out of Heap
-------------------------

A second example program, memtest.c, demonstrates how the standalone runtime system copes with allocating stack space, and also demonstrates the simple memory allocation function __rt_alloc. Let us first compile this program so that it repeatedly requests more memory, until there is none left:

    armcc -li -apcs 3/32bit memtest.c -c
    armlink -o memtest memtest.o rtstand.o

This can be run under armsd in the usual way:

    > armsd -li memtest
    A.R.M. Source-level Debugger, version 4.10 (A.R.M.) [Aug 26 1992]
    ARMulator V1.20, 512 Kb RAM, MMU present, Demon 1.01, FPE, Little endian.
    Object program file memtest
    armsd: go
    kernel memory management test
    force stack to 4KB
    request 0 words of heap - allocate 256 words at 0X000085A0
    force stack to 8KB
    ..
    force stack to 60KB
    request 33211 words of heap - allocate 33211 words at 0X00049388
    force stack to 64KB
    request 49816 words of heap - allocate 5739 words at 0X00069A74
    memory exhausted, 105376 words of heap, 64KB of stack
    Program terminated normally at PC = 0x0000847c
    0x0000847c: 0xef000011 .... : > swi 0x11
    armsd: quit
    Quitting
    >

This demonstrates that allocating space on the stack is working correctly, and also that the __rt_alloc routine is working as expected. The program terminated because in the end __rt_alloc could not allocate the requested amount of memory.

5.3.9 Stack Overflow Checking
-----------------------------

memtest can also be used to demonstrate stack overflow checking by recompiling with the macro STACK_OVERFLOW defined. In this case the amount of stack required is increased until there is not enough stack available, and stack overflow detection causes the program to be aborted. To recompile and link memtest.c issue the following commands:

    armcc -li -apcs 3/32bit memtest.c -c -DSTACK_OVERFLOW
    armlink -o memtest memtest.o rtstand.o

Running this program under armsd produces the following output:

    > armsd -li memtest
    A.R.M. Source-level Debugger, version 4.10 (A.R.M.) [Aug 26 1992]
    ARMulator V1.20, 512 Kb RAM, MMU present, Demon 1.01, FPE, Little endian.
    Object program file memtest
    armsd: go
    kernel memory management test
    force stack to 4KB
    ...
    force stack to 256KB
    request 1296 words of heap - allocate 1296 words at 0X0000AE20
    force stack to 512KB
    run time error: stack overflow
    program terminated
    Program terminated normally at PC = 0x0000847c
    0x0000847c: 0xef000011 .... : > swi 0x11
    armsd: quit
    Quitting
    >

Clearly stack overflow checking did indeed catch the case where too much stack was required, and caused the runtime system to terminate the program after giving an appropriate diagnostic.

5.3.10 Extending the Standalone Runtime System
----------------------------------------------

For many applications it may be desirable to have access to more of the standard C library than the minimal runtime system provides. This section demonstrates how to take out a part of the standard C library and plug it into the standalone runtime system. The function which we will add to rtstand is memmove. Although this is small, and easily extracted from the C library source, the same methodology can be applied to larger sections of the C library, e.g. the dynamic memory allocation system (malloc, free, etc).

The source of the C library can be found in the cl directory. The source for the memmove function is in string.c. The extracted source for memmove has been put into memmove.c, and the compile time option _copywords has been removed. The function declaration for memmove and a typedef for size_t (extracted from include/stddef.h) have been put into memmove.h. Our memmove module can be compiled as follows:

    armcc -c memmove.c -li -apcs 3/32bit

The output, memmove.o, can be linked with the user's other object modules together with rtstand.o in the normal way (see previous examples in this section).

5.3.11 The Size of the Standalone Runtime Library
-------------------------------------------------

rtstand.s has been separated into several code Areas. The advantage of this is that armlink can detect if any Areas are unreferenced, and then eliminate them from the output image.
The table below shows the typical size of the Areas in rtstand.o:

    Area                         Size (bytes)   Functions
    ----                         ------------   ---------
    C$$data                                 4
    C$$code$$__main                        96   __main, __rt_exit
    C$$code$$__rt_fpavailable               8   __rt_fpavailable
    C$$code$$__rt_trap                    128   __rt_trap
    C$$code$$__rt_alloc                    68   __rt_alloc
    C$$code$$__rt_stkovf                   76   __rt_stkovf_split_*
    C$$code$$__jmp                        100   longjmp, setjmp
    C$$code$$__divide                     256   __rt_sdiv, __rt_udiv, __rt_udiv10,
                                                __rt_sdiv10, __rt_divtest
    All Areas                             736

If floating point support is definitely not required, then the EnsureNoFPSupport variable can be set to {TRUE}, and some extra space will be saved. After making any modifications to rtstand.s, the size of the various areas can be found by using the command:

    decaof -b rtstand.o

From the above table it is clear that for many applications the standalone runtime library will be roughly 0.5Kb.

5.3.12 Related Topics
---------------------

Register Usage under the ARM Procedure Call Standard starting on page 62; In-Line SWIs starting on page 72.

5.4 ARM Shared Libraries
------------------------

5.4.1 About This Recipe
-----------------------

In this recipe you will learn: what an ARM shared library is; how the shared library mechanism works; how to instruct the ARM linker to make a shared library; how to make a toy shared library from the string section of the ANSI C library.

5.4.2 About ARM Shared Libraries
--------------------------------

ARM shared libraries support the sharing of utility, service or library functions between several concurrently executing client applications in a single address space. Such shared code is necessarily reentrant. If a function is reentrant, each of its concurrently active clients must have a separate copy of the data it manipulates for them. The data cannot be associated with the code itself unless the data is read-only. In the ARM shared library architecture, a dedicated register (called sb) is used to address (indirectly) the static data associated with a client.
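
The idea of reentrant code addressing per-client data through sb can be sketched in portable C. This is only an analogy under stated assumptions: the names ClientData and shared_accumulate are hypothetical, and the explicit pointer parameter stands in for the sb register, which the real compiler manages for you.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-client static data. In the real mechanism this data
   belongs to the client and is reached indirectly through sb. */
typedef struct ClientData {
    int calls;   /* number of calls made by this client */
    int total;   /* running total kept on behalf of this client */
} ClientData;

/* Shared, reentrant code: it touches no writable globals, only the data
   reached through 'sb', an explicit stand-in for the sb register. */
int shared_accumulate(ClientData *sb, int value)
{
    assert(sb != NULL);
    sb->calls += 1;
    sb->total += value;
    return sb->total;
}
```

Two clients that each pass their own ClientData can interleave calls to shared_accumulate without disturbing one another, which is the property the shared code needs.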

An ARM shared library is read only, reentrant and usually position independent. A shared library made exclusively from object code compiled by the ARM C compiler will have all three of these attributes. Library components implemented in ARM Assembly Language need not be reentrant and position independent, but in practice, only position independence is inessential. A library with all three of these attributes is an ideal candidate for packing into a system ROM.

Some shared library mechanisms associate a shared library's data with the library itself and put only a place holder in the stub. At run time, a copy of the library's initialised static data is copied into the client's place holder by the dynamic linker or by library initialisation code. The ARM shared library mechanism supports these ways of working provided the data is free of values which require link-time (or run time) relocation. In other words, it can be supported provided the input data areas are free of relocation directives.

5.4.3 How ARM Shared Libraries Work
-----------------------------------

Stubs and Proxy Functions
-------------------------

When a client application is linked with a shared library, it is linked not with the library itself but with a stub object containing: an entry vector; a copy of the library's static data or a place holder for it.

Each member of the entry vector is a proxy for a function in the matching shared library. When a client first calls a proxy function, the call is directed to a dynamic linker. This is a small function (typically about 50-60 ARM instructions) which: locates the matching shared library; if required, copies an initial image of the library's static data from the library to the place holding area in the stub; patches the entry vector so each proxy function points at the corresponding library function; resumes the call.
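
The patch-on-first-call behaviour described above can be sketched in C with a table of function pointers. This is an illustrative model, not the toolkit's dynamic linker: all names (entry_vector, dynamic_link, lib_double and so on) are hypothetical, the library is assumed to be already located, and the static data copy is omitted.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*LibFn)(int);

/* Hypothetical functions standing in for the shared library's code. */
int lib_double(int x) { return 2 * x; }
int lib_negate(int x) { return -x; }

/* Stand-in for the located library's table of entry points. */
static const LibFn library_table[2] = { lib_double, lib_negate };

int dynlink_calls = 0;   /* how many times the dynamic linker has run */

int proxy_double(int x);
int proxy_negate(int x);

/* The stub's entry vector: every slot starts out pointing at a proxy. */
LibFn entry_vector[2] = { proxy_double, proxy_negate };

/* Stand-in for the dynamic linker: patch the whole vector in one go. */
static void dynamic_link(void)
{
    size_t i;
    dynlink_calls++;
    for (i = 0; i < 2; i++)
        entry_vector[i] = library_table[i];
}

/* Each proxy runs the dynamic linker, then resumes the original call. */
int proxy_double(int x) { dynamic_link(); return entry_vector[0](x); }
int proxy_negate(int x) { dynamic_link(); return entry_vector[1](x); }

/* A client calls library functions only through the entry vector. */
int call_entry(size_t slot, int x)
{
    assert(slot < 2);
    return entry_vector[slot](x);
}
```

Only the first call through any slot reaches dynamic_link; after that every slot holds the library function's address, so later calls cost one extra indirection and nothing more.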

Once an entry vector has been patched, all future proxy calls proceed directly to the target library function with only minimal indirection delay and no intervention by the dynamic linker.

Of course, making an inter-link-unit call like this is more expensive than making a straightforward local procedure call, but not much more so. It is also the only supported way to call a function more than 32 MBytes away.

5.4.4 Locating a Library Which Matches the Stub
-----------------------------------------------

Locating a matching shared library is specific to a target system and you must provide code to do the location, but the remainder of the dynamic linking process is generic to all target systems. Consequently, in order to use ARM shared libraries, you have to design and implement a library location mechanism and adapt the dynamic linker to it. In practice, this is quite straightforward: the ARM Linker provides support for parameterising a location mechanism; a basic dynamic linker with neither location nor failure reporting mechanisms is a mere 42 ARM instructions.

Please refer to ARM Shared Library Format starting on page 16 of the Reference Manual for a full explanation of parameter blocks.

How the Dynamic Linker Works
----------------------------

The dynamic linker is entered via a proxy call with r0 pointing at the dynamic linker's 16-byte entry stub. Following this stub code is a copy of the parameter block for the shared library. Stored in the parameter block is the identity of the library - perhaps a 32-bit unique identifier or perhaps a string name. Either way, it can be passed to the library location mechanism. You have to decide how to identify your shared libraries and, hence, what to put in their parameter blocks.

The library location function is required to return the address of the start of the library's offset table. A primitive location mechanism might be to search a ROM for a matching string.
This would identify the start of the parameter block of the matching shared library. Immediately preceding it will be negative offsets to library entry points and a non-negative count word containing the number of entry points. By working backwards through memory and counting, you can be sure you have found the entry vector and can return the address of its count word to the dynamic linker.

More sophisticated location schemes are possible, for example:

You might include in your library a header containing code to execute when the library is first loaded (into RAM) or initialised (in ROM) which registers the library's name with a library manager. Obviously, the library manager has to be locatable without using the library manager, so either its address has to be known or its function has to be supported by an underlying system call.

Acorn's RISC OS operating system supports a module mechanism which is sometimes used to implement shared libraries. A RISC OS module may, by declaring so in its module header, be called when software interrupts (SWIs) in a declared range occur. When such a module is loaded, it extends the range of SWIs interpreted by RISC OS. We can use this mechanism to locate a shared library by storing the identity of a library location SWI in the library's parameter block and by implementing this SWI in the library module's header.

5.4.5 Instructing the Linker to Make a Shared Library
-----------------------------------------------------

Prerequisites
-------------

A shared library can be made from any number of object files, including reentrant stubs of other shared libraries, but two simple rules must be followed: each object file must conform to a reentrant version of the ARM Procedure Call Standard and each code area must have the REENTRANT attribute; there may be no unresolved references resulting from linking together the component objects.

An immediate consequence of the second rule is that it is impossible to make two shared libraries which refer to one another: to make the second library and its stub would require the stub of the first, but to make the first and its stub would require the stub of the second.

The first rule is not 100% necessary and is difficult to enforce. The ARM Linker warns you if it finds a non-reentrant code area in the list of objects to be linked into a shared library but it will build the library and its matching stub anyway. You have to decide whether the warning is real, or merely a formality.

Linker Outputs
--------------

The ARM linker generates a shared library as two files: a plain binary file containing the read-only, reentrant, usually position independent, shared code; an AOF format stub file with which client applications can be linked.

The linker can also generate a reentrant stub suitable for inclusion in another shared library.

The library image file contains, in order: read only code sections from your input objects; if so requested, a read only copy of the initialised static data from the input objects; a table of (negative) offsets from the end of the library to its entry points; if so requested, the size and offset of the static data image; a copy of the library's parameter block.

You request a copy of the initialised static data to be included in a library when you describe to the linker how to make a shared library. If you request this, the linker writes the length and offset of the data image immediately after the entry vector. During linking, armlink defines symbols SHL$$data$$Size and SHL$$data$$Base to have these values; components of your library may refer to these symbols.

Instead of including the static data in the stub, armlink includes a zero initialised place holding area of the same size.
It also writes the length and (relocatable) address of this place holding, zero initialised stub data area immediately after the dynamic linker's entry veneer, giving the dynamic linker sufficient information to initialise the place holder at run time. During linking, the linker symbols SHL$$data$$Size and $$0$$Base describe this length and relocatable address.

Obviously, any data included in your shared library must be free of relocation directives. Please refer to ARM Shared Library Format starting on page 16 of the Reference Manual for a full explanation of what kind of data can be included in a shared library.

You specify a parameter block when you describe to the linker how to make a shared library. You might, for example, include the name of the library in its parameter block, to aid its location. An identical copy of the parameter block is included in the library's entry vector in the stub file.

Describing a Shared Library to the Linker
-----------------------------------------

To describe a shared library to the linker you have to prepare a file which describes: the name of the library; the library parameter block; what data areas to include; what entry points to export.

For precise details of how to do this, please refer to ARM Shared Library Format starting on page 16 of the Reference Manual. Below is an intuitive example you can work with and adapt:

    ; First, give the name of the file to contain the library -
    ; strlib - and its parameter block - the single word 0x40000...
    > strlib \
      0x40000
    ; ...then include all suitable data areas...
    + ()
    ; ... finally export all the entry points...
    ; ... mostly omitted here for brevity of exposition.
    memcpy
    ...
    strtok

The name of this file is passed to armlink as the argument to the -SHL command line option (please refer to The ARM Linker (armlink) starting on page 19 of the User Manual for further details).

5.4.6 Making a Toy String Library
---------------------------------

This section refers to the files collected in the strlib subdirectory of the examples directory of the release.

The header files config.h and interns.h let you compile cl/string.c locally. Little-endian code is assumed. If you want to make a big-endian string library you should edit config.h. Similarly, if you want to alter which functions are included or whether static data is initialised by copying from the library, then you should edit config.h. You do not need to edit interns.h. If you use config.h unchanged you will build a little-endian library which includes a data image and which exports all of its functions.

Compiling the String Library
----------------------------

To compile string.c, use the following command:

    armcc -li -apcs /reent -zps1 -c -I. ../../cl/string.c

The -li flag tells armcc to compile for a little-endian ARM.

The -apcs /reent flag tells armcc to compile reentrant code.

The -zps1 flag turns off software stack limit checking and allows the string library to be independent of all other objects and libraries. With software stack limit checking turned on, the library would depend on the stack limit checking functions which, in turn, depend on other sections of the C run time library. While such dependencies do not much obstruct the construction of full scale, production quality shared libraries, they are major impediments to a simple demonstration of the underlying mechanisms.

The -I. flag tells armcc to look for needed header files in the current directory.

Linking the String Library
--------------------------

To make a shared library and matching stub from string.o, use the following linker command:

    armlink -o strstub.o -shl strshl -s syms string.o

strlib's stub will be put in strstub.o as directed by the -o option. The file strshl contains instructions for making a shared library called strlib.
A shortened version of it was shown in the earlier section Describing a Shared Library to the Linker starting on page 98.

The option -s syms asks for a listing of symbol values in a file called syms. You may later need to look up the value of EFT$$Offset (it will be 0xA38 if you have changed nothing). As supplied, the dynamic linker expects a library's external function table (EFT) to be at the address 0x40000. So, unless you extend the dynamic linker with a library location mechanism (please refer to the discussion in the earlier section How the Dynamic Linker Works starting on page 96), you will have to load strlib at the address 0x40000-EFT$$Offset.

Making the Test Program and Dynamic Linker
------------------------------------------

Now you should assemble the dynamic linker and compile the test code:

    armasm -li dynlink.s dynlink.o
    armcc -li -c strtest.c

You can extend the test code to probe lots of string functions, but this is left as an exercise to help you understand what is going on.

To make the test program you must link together the test code, the dynamic linker, the string library stub and the appropriate ARM C library (so that references to library members other than the string functions can be resolved):

    armlink -d -o strtest strtest.o dynlink.o strstub.o ../../lib/armlib.32l

Running the Test Program with the Shared String Library
-------------------------------------------------------

Now you are ready to try everything under the control of command-line armsd:

    host-prompt armsd strtest
    A.R.M. Source-level Debugger version ...
    ARMulator V1.30, 4 Gb memory, MMU present, Demon 1.1,...
    Object program file strtest
    armsd: getfile strlib 0x40000-0xa38
    armsd: go
    strerror(42) returns unknown shared string-library error 0x0000002A
    Program terminated normally at PC = 0x00008354 (__rt_exit + 0x24)
    +0024 0x00008354: 0xef000011 ....
    : swi 0x11
    armsd: q
    Quitting
    host-prompt

Before starting strtest you must load the shared string library by using:

    getfile strlib 0x40000-0xa38

strlib is the name of the file containing the library; 0x40000 is the hard wired address at which the dynamic linker expects to find the external function table; and 0xa38 is the value of EFT$$Offset, the offset of the external function table from the start of the library.

When strtest runs, it calls strerror(42) which causes the dynamic linker to be entered, the static data to be copied, the stub vector to be patched and the call to be resumed. You can watch this in more detail by setting a breakpoint on __rt_dynlink and single stepping.

5.4.7 Suggested Further Exercises
---------------------------------

Library Location Mechanisms
---------------------------

Locating a library's EFT at 0x40000 is not very satisfactory, so an obvious exercise is to extend the dynamic linker to locate a library by looking for it. Try, for example, adding a header to the start of the library which contains:

    offset to the next loaded library or 0
    the total length of the library
    the offset to the external function table
    the string name of the library

Hint: when you link this area with the other library contents you have to ensure that it will precede all other areas in the library. Please refer to Area Placement and Sorting Rules starting on page 9 of the Reference Manual for further details.

Your dynamic linker could now search a list of libraries loaded at 0x40000 onwards.

Self-Loading Libraries
----------------------

You could extend the header mechanism described in the previous subsection so that a library could copy itself to the next free location above 0x40000. This would allow libraries to be loaded at 0x8000 and 'executed' there. Of course, you would want your header to begin with a branch to the code which will copy the library from 0x8000 to its destination above 0x40000.
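
The header-based search suggested under Library Location Mechanisms might look something like the following C sketch. It is one possible reading of that exercise, not part of the toolkit: the struct layout, the 16-byte name field, the function name find_library_eft and the choice to measure both offsets from the start of the header are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LIB_NAME_MAX 16   /* assumed fixed-size name field */

/* Hypothetical header placed at the very start of each loaded library. */
typedef struct LibHeader {
    uint32_t next_offset;   /* offset to the next loaded library, or 0 */
    uint32_t total_length;  /* total length of the library */
    uint32_t eft_offset;    /* offset to the external function table */
    char     name[LIB_NAME_MAX];
} LibHeader;

/* Walk the chain of libraries starting at 'first' and return the address
   of the named library's external function table, or NULL if no library
   with that name is in the chain. */
const unsigned char *find_library_eft(const unsigned char *first,
                                      const char *name)
{
    const unsigned char *p = first;

    assert(first != NULL && name != NULL);
    for (;;) {
        LibHeader h;
        memcpy(&h, p, sizeof h);   /* copy out: p may not be aligned */
        if (strncmp(h.name, name, LIB_NAME_MAX) == 0)
            return p + h.eft_offset;
        if (h.next_offset == 0)
            return NULL;           /* end of the chain */
        p += h.next_offset;
    }
}
```

In a real dynamic linker the search would start at the address where the first library is loaded (0x40000 in this recipe) and the name would come from the parameter block held in the stub.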

Multiple Shared Libraries
-------------------------

Once you have built location and loading mechanisms, you can build more than one shared library. Try making one of your own and linking a test program with the stubs of two or more libraries.

Inter-Library Calls
-------------------

Once you have multiple libraries working, you can try making one library call functions in another (but remember that if library A refers to library B then library B may not refer to library A). To do this you will have to make a reentrant stub for the library you wish to refer to and link this into the library making the reference.

5.4.8 Related Topics
--------------------

Register Usage under the ARM Procedure Call Standard starting on page 62