This document is Copyright 1994 ARM Ltd, and has been included on this 
disc with their kind permission. This manual is supplied "as is"; ARM 
Limited ("ARM") makes no warranty, express or implied, of the 
merchantability of this document or its fitness for any particular 
purpose. In no circumstances shall ARM be liable for any damage, loss 
of profits, or any indirect or consequential loss arising out of the 
use of these recipes or inability to use these recipes, even if ARM has 
been advised of the possibility of such loss.
---------------------------------------------------------------------------

5. Programming in C
~~~~~~~~~~~~~~~~~~~
5.1 A Very Simple C Program
---------------------------
5.1.1 About this Recipe
-----------------------
This recipe gives you a simple exercise in using the ARM Software 
Development Toolkit (the toolkit) to write a program in C. By following it, 
you will learn how to:

     use the ARM C compiler armcc to create a runnable program;
     use the ARM source level debugger armsd to run your program on a 
     (simulated) ARM system;
     use armcc to compile a C program to an object file;
     use the ARM linker armlink to create a runnable program from an object 
     file and the ARM C library.

5.1.2 Prerequisites
-------------------
Before you can try this recipe, the toolkit must be properly installed on 
your computer. Instructions for installation are given in the installation 
notes distributed with every toolkit. If you experience any difficulties, 
please refer to these notes.

5.1.3 Making a Simple Runnable Program
--------------------------------------
The "Hello World" program shown below is included in the on-line examples 
as file hellow.c in the directory examples:

#include <stdio.h>

int main( int argc, char **argv )
{ printf("Hello World\n");
  return 0;
}

If you set your working directory to be the examples directory you can 
compile this program to runnable form in a single step using:
armcc hellow.c -li -apcs 3/32bit

Explanation
-----------
The argument -li says that the target is little endian and -apcs 3/32bit 
says that the 32 bit ARM procedure call standard should be used. If the 
compiler has been configured to use these options by default then these 
arguments need not be given (see The ARM Tool Reconfiguration Utility 
(reconfig) starting on page 45 of the User Manual for details). The 
executable program is left in a file called hellow.

5.1.4 Running the Program
-------------------------
You can run the program (technically an AIF Image) using armsd. You should 
follow the sample dialog below:

host-prompt> armsd -li hellow
A.R.M. Source-level Debugger, version 4.10 (A.R.M.) [Aug 26 1992]
ARMulator V1.20, 512 Kb RAM, MMU present, Demon 1.01, FPE, Little endian.
Object program file hellow
armsd: go
Hello World
Program terminated normally at PC = 0x000082a0
      0x000082a0: 0xef000011 .... : >  swi     0x11
armsd: quit
Quitting
host-prompt>

Explanation
-----------
The -li argument to armsd tells it to emulate a little endian ARM. If armsd 
has been configured to be little endian by default then -li can be omitted 
(see The ARM Tool Reconfiguration Utility (reconfig) starting on page 45 of 
the User Manual for how to configure the ARM development tools).
When armsd comes up with its "armsd:" prompt and waits for your command, you 
should type "go" followed by Return. At the next prompt type "quit" followed 
by Return to exit armsd.

5.1.5 Separate Compiling
------------------------
You can invoke the compiler and the linker separately. You can use:

armcc -c hellow.c -li -apcs 3/32bit
to make an object file (in this example called hellow.o, by default).

Explanation
-----------
The -c flag tells the compiler to make an object file but not to link it 
with the C library.

5.1.6 Separate Linking
----------------------
When you have finished compiling, you can link your object file with the C 
library to make a runnable program using:

armlink -o hellow hellow.o somewhere/armlib.32l

Where we have written somewhere, above, you must type the name of the 
directory containing the ARM C libraries.

Notes
-----
You now have to be very explicit; you must specify:

     the name of the file which will contain the runnable program (here, 
     hellow);
     the name of the object file (here, hellow.o);
     the location and name of the C library you wish to use.

In simple cases, armcc can reduce the need to be so explicit.

5.1.7 Related Topics
--------------------
Please refer to the index to find topics of particular interest.

5.2 Writing Efficient C for the ARM
-----------------------------------
5.2.1 About This Recipe
-----------------------
The ARM C compiler can generate very good machine code if you present it 
with the right sort of input. From this note, you will learn:

     what the C compiler compiles well and why;
     how to help the C compiler to generate excellent machine code.

Some of the rules of thumb presented are quite general; some are quite 
specific to the ARM or the ARM C compiler. It should be quite clear from 
context which rules are portable.
The first subsection below is concerned with how to design collections of C 
functions to maximise low-level efficiency. The following subsection is 
concerned with the efficiency of larger and more complicated functions.

5.2.2 Function Design Considerations
------------------------------------
Unlike on many earlier CISC processor architectures, function call overhead 
on the ARM is small and often in proportion to the work done by the called 
function. Several features contribute to this:

     the minimal ARM call-return sequence is BL... MOV pc, lr, which is 
     extremely economical;
     STM and LDM reduce the cost of entry to and exit from functions which 
     must create a stack frame and/or save registers;
     the ARM Procedure Call Standard has been carefully designed to allow 
     two very important types of function call to be optimised so that the 
     entry and exit overheads are minimal.

Good general advice is to keep functions small, because function calling 
overheads are low. In the remainder of this subsection you will learn 
precisely when function call overhead is very low. In following subsections 
you will learn how small functions help the ARM C compiler; you will also 
learn how to assist the C compiler when functions cannot be kept small.

Leaf Functions
--------------
In 'typical' programs, about half of all function calls made are to leaf 
functions (a leaf function is one which makes no calls from within its body).
Often, a leaf function is rather simple. On the ARM, if it is simple enough 
to compile using just 5 registers (a1-a4 and ip), it will carry no function 
entry or exit overhead. A surprising proportion of useful leaf functions can 
be compiled within this constraint.
Once registers have to be saved, it is efficient to save them using STM. In 
fact the more you can save at one go, the better. In a leaf function, all 
and only the registers which need to be saved will be saved by a single 
STMFD sp!,{regs,lr} on entry and a matching LDMFD sp!,{regs,pc} on exit.
In general, the cost of pushing some registers on entry and popping them on 
exit is very small compared to the cost of the useful work done by a leaf 
function which is complicated enough to need more than 5 registers.
Overall, you should expect a leaf function to carry virtually no function 
entry and exit overhead; and at worst, a small overhead, most likely in 
proportion to the useful work done by it.
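
As a hedged illustration, the function below is the kind of leaf function 
this subsection has in mind. The name sat_add is hypothetical and is not part 
of the toolkit. It makes no calls and keeps only three values live, so it is 
a plausible candidate for compiling within a1-a4 and ip; whether a given 
release of armcc really emits no entry or exit sequence for it is an 
assumption to be checked against the generated code.

```c
#include <limits.h>

/* Hypothetical leaf function: saturating unsigned addition.
 * It calls nothing and needs only 'a', 'b' and 'sum' at any one time. */
unsigned sat_add(unsigned a, unsigned b)
{
    unsigned sum = a + b;       /* unsigned arithmetic wraps on overflow */

    if (sum < a)                /* a wrapped result is smaller than 'a' */
        return UINT_MAX;        /* clamp to the largest unsigned value */
    return sum;
}
```

A call such as sat_add(x, y) then costs little more than the BL and the 
return, plus the add and compare that do the useful work.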

Veneer Functions (Simple Tail Continued Functions)
--------------------------------------------------
Historically, abstraction veneers have been relatively expensive. The kind 
of veneer function which merely changes the types of its arguments, or which 
calls a low-level implementation with an extra argument (say), has often 
cost much more in entry and exit overhead than it was worth in useful work.
On the ARM, if a function ends with a call to another function, that call 
can be converted to a tail continuation. In functions which need to save no 
registers, the effect can be dramatic. Consider, for example:

extern void *__sys_alloc(unsigned type, unsigned n_words);
#define  NOTGCable   0x80000000
#define  NOTMovable  0x40000000

void *malloc(unsigned n_bytes)
{   return __sys_alloc(NOTGCable+NOTMovable, n_bytes/4);
}

Here, armcc generates (the version of armcc supplied with this release may 
produce slightly different output):

malloc
    MOV     a2,a1,LSR #2
    MOV     a1,#&c0000000
    B       |__sys_alloc|

There is no function entry or exit overhead - just useful work massaging 
arguments - and the function return has disappeared entirely - return is 
direct from __sys_alloc to malloc's caller. In this case, the basic 
call-return cost for the function pair has been reduced from:

 BL + BL + MOV pc,lr + MOV pc,lr

to:

 BL + B  +             MOV pc,lr

a saving of 25%.
More complicated functions in which the only function calls are immediately 
before a return, collapse equally well. An artificial example is:

extern int f1(int), f2(int, int);

int f(int a, int b)
{   if (b == 0)
        return a;
    else if (b < 0)
        return f2(a, -b);
    else
        return f2(b, a);  /* argument order swapped */
}

armcc generates the following, wonderfully efficient code (the version of 
armcc supplied with this release may produce slightly different output):

f   CMP     a2,#0
    MOVEQS  pc,lr
    RSBLT   a2,a2,#0
    BLT     f2
    MOV     a3,a1
    MOV     a1,a2
    MOV     a2,a3
    B       f2

Fast Paths and Slow Paths - A Useful Transformation
---------------------------------------------------
Inevitably, not all functions can be leaves or small abstraction functions. 
And, inevitably, non-leaf functions must carry the cost of establishing a 
call frame on entry and removing it on exit, perhaps also the cost of saving 
and restoring some registers. How does this hurt performance? Consider the 
following example:

int f(Buffer *b)
{    if (b->n > 0)
     {   /* The usual path through the function... */
         /*     95% of all calls.*/
         /* Simple calculation involving b->buf, b->n, etc.*/
         return ...;
     }
     /* Exceptional path through the function... */
     /*     5% of all calls.  */
     /* Complicated calculation involving calls */
     /*     to other functions.*/
     return ...;
}

In this case, the entry and register-save overhead caused by the infrequent 
heavyweight path through the function applies to the much more frequent 
lightweight path through it. To fix this, turn the heavyweight path into a 
tail call. Yes, introducing another layer of function call yields much more 
efficient code!

int f2(Buffer *b)
{    /* Exceptional path through the function... */
     /*     5% of all calls.  */
     /* Complicated calculation involving calls */
     /*     to other functions.*/
     return ...;
}

int f(Buffer *b)
{    if (b->n > 0)
     {   /* The usual path through the function... */
         /*     95% of all calls.*/
         /* Simple calculation involving b->buf, b->n, etc.*/
         return ...;
     }
     return f2(b);
}

If you are lucky, f() will now compile using only a1-a4 and ip and so incur 
no entry overhead whatsoever. 95% of the time, the overhead on the original 
f() will be reduced to zero.
This is quite a general source transformation technique and you should look 
for opportunities to use it and analogous transformations. It works for any 
processor to some extent; it works particularly well for the ARM because of 
the careful optimisation of tail continuation in lightweight functions.
Repeated application of this technique to the chain of six or so functions 
called for every character processed by the preprocessing phase of the ARM 
C compiler, improved the performance of the preprocessor (running on the 
ARM) by about 30%.
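
The schema above can be made concrete as follows. This is a minimal sketch 
only: the Buffer layout, the refill callback and the names buffer_get and 
buffer_get_slow are all assumptions made for illustration, not the code used 
in the ARM C compiler's preprocessor.

```c
#include <stddef.h>

/* Hypothetical buffer type; the field names are assumptions. */
typedef struct Buffer {
    const unsigned char *buf;          /* next unread byte */
    int n;                             /* bytes remaining at buf */
    int (*refill)(struct Buffer *b);   /* resets buf and n, returns new n */
} Buffer;

/* Slow path: rare, calls another function, so it needs a call frame. */
static int buffer_get_slow(Buffer *b)
{
    if (b->refill == NULL || b->refill(b) <= 0)
        return -1;                     /* end of input */
    b->n--;
    return *b->buf++;
}

/* Fast path: the only call is the tail call on the rare path. */
int buffer_get(Buffer *b)
{
    if (b->n > 0) {
        b->n--;
        return *b->buf++;
    }
    return buffer_get_slow(b);         /* tail continuation */
}
```

The common case in buffer_get touches only b and two of its fields, so it is 
a candidate for compiling without saving any registers; the cost of the call 
frame is confined to buffer_get_slow.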

Function Arguments and Argument Passing
---------------------------------------
The final aspect of function design which influences low-level efficiency is 
argument passing.
Under the ARM Procedure Call Standard, up to four argument words can be 
passed to a function in registers. Functions of up to four integral (not 
floating point) arguments are particularly efficient and incur very little 
overhead beyond that required to compute the argument expressions themselves 
(there may be a little register juggling in the called function, depending 
on its complexity).
If more arguments are needed, then the 5th, 6th, etc., words will be passed 
on the stack. This incurs the cost of an STR in the calling function and an 
LDR in the called function for each argument word beyond four.

How can argument passing overhead be minimised?
-----------------------------------------------

     Try to ensure that small functions take four or fewer arguments. These 
     will compile particularly well.
     If a function needs many arguments, try to ensure that it does a 
     significant amount of work on every call, so that the cost of passing 
     arguments is amortised.
     Factor out read-mostly global control state and make this static. If it 
     has to be passed as an argument (e.g. to support multiple clients) then 
     wrap it up in a struct and pass a pointer to it. The characteristics of 
     such control state are:

          it's logically global to the compilation unit or program;
          it's read-mostly, often read-only except in response to user 
          input, and for almost all functions cannot be changed by them or 
          any function called from them;
          references to it are ubiquitous, but in any function, references 
          are relatively rare (frequent references should be replaced by 
          references to a local, non-static copy).

Don't confuse such control state with computational arguments, the values of 
which differ on every call.

     Collect related data into structs. Decide whether to pass pointers or 
     struct values based on the use of each struct in the called function:

          If few fields are read or written then passing a pointer is best.
          The cost of passing a struct via the stack is typically a share in 
          an LDM-STM pair for each word of the struct. This can be better 
          than passing a pointer if (i) on average, each field is used at 
          least once and (ii) the register pressure in the function is high 
          enough to force a pointer to be repeatedly re-loaded.

As a rule of thumb, you can't lose much efficiency if you pass pointers to 
structs rather than struct values. To gain efficiency by passing struct 
values rather than pointers usually requires careful study of a function's 
machine code.
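
The advice on control state can be sketched as follows. The PrintOptions 
struct and the field_width function are hypothetical names chosen for this 
example. Passing the four option values individually alongside value and 
level would need six argument words, two of them via the stack; passing a 
pointer to the struct needs three, all of which fit in registers.

```c
/* Hypothetical read-mostly control state shared by many functions. */
typedef struct PrintOptions {
    int indent;        /* columns per nesting level */
    int width;         /* maximum line width */
    int radix;         /* 10 or 16 */
    int show_sign;     /* non-zero to print '+' for positive values */
} PrintOptions;

/* Columns needed to print 'value' at nesting depth 'level'.
 * Three argument words, so nothing is passed on the stack. */
int field_width(const PrintOptions *opts, int value, int level)
{
    unsigned magnitude = (value < 0) ? 0u - (unsigned)value
                                     : (unsigned)value;
    int columns = opts->indent * level;

    if (value < 0 || opts->show_sign)
        columns++;                       /* room for '-' or '+' */
    do {                                 /* count digits in the radix */
        columns++;
        magnitude /= (unsigned)opts->radix;
    } while (magnitude != 0);

    return (columns > opts->width) ? opts->width : columns;
}
```

Only value and level differ from call to call; the options are set up once 
and a single pointer to them is handed on from function to function.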

5.2.3 Register Allocation and How To Help It
--------------------------------------------
It is well known that register allocation is critical to the efficiency of 
code compiled for RISC processors. It is particularly critical for the ARM, 
which has only 16 registers rather than the 'traditional' 32.
The ARM C compiler has a highly efficient register allocator which operates 
on complete functions and which tries to allocate the most frequently used 
variables to registers (taking loop nesting into account). It produces very 
good results unless the demand for registers seriously outstrips supply. And 
it has one shortcoming, namely that it allocates whole variables to 
registers, not separate live ranges.
As code generation proceeds for a function, new variables are created for 
expression temporaries. These are never reused in later expressions and 
cannot be spilled to memory. Usually, this causes no problems. However, a 
particularly pathological expression could, in principle, occupy most of the 
allocatable registers, forcing almost all program variables to be spilled to 
memory. Because the number of registers required to evaluate an expression 
is a logarithmic function of the number of terms in it, it takes an expression 
of more than 32 terms to threaten the use of any variable register.
As a rule of thumb, avoid very large expressions (more than 30 terms).
The more serious problem is with long scope program variables. Our allocator 
either allocates a variable to a chosen register everywhere the variable is 
live, or it leaves the variable in memory. To help visualise the problem - 
and to see how to help the allocator - consider the following two program 
schemata:

int f()                            int f()
{   int i, j, ...;                 {   int j, ...;
                                     { int i;
    for (i = 0;  i < lim;  ++i)        for (i = 0;  i < lim;  ++i)
    {                                  {
        ...                               ...
    }                                  }
                                     }
                                     { int i;
    for (i = 0;  i < lim;  ++i)        for (i = 0;  i < lim;  ++i)
    {  /* register pressure in this    {
       loop forces 'i' to memory */
    }                                  }
                                     }
                                     { int i;
    for (i = 0;  i < lim;  ++i)        for (i = 0;  i < lim;  ++i)
    {                                  {
        ...                                ...
    }                                  }
                                     }
}                                  }

In the left hand case, because the scope of 'i' is the whole function, if 
'i' cannot be allocated to a register everywhere then all three loops will 
suffer their loop index being in memory. On the other hand, in the right 
hand case there are three separate variables called 'i', each of which will 
be allocated separately by the register allocator.
As a rule of thumb, keep variable declarations local, especially in large 
functions. Use additional block structure as illustrated here (right hand 
example), if necessary.
On the other hand, if this transformation is carried to excess, there may be 
bad results. When a local variable is spilled to memory, there is a stack 
adjustment on each entry to and exit from its containing scope. The ARM C 
compiler does this to minimise the space used by local variables. Suppose, 
for example, that in the right hand case above, each block declared a 1KB 
buffer as well as 'i'. Then adjusting the stack at every scope leads to 
stack usage of just over 1KB whereas adjusting it only at function entry 
leads to usage of more than 3KB.
In principle, the compiler could be more intelligent about adjusting the 
stack locally for large variables and only at function entry for small 
variables. For the moment, the programmer must be aware of these issues.
So, a modified rule of thumb is to cluster variable declarations into 
reasonable sub-scopes within large functions and to avoid doing so within 
the most deeply nested loops. This will most likely help the allocator 
without introducing unwanted costs associated with local stack adjustment.

5.2.4 Static and Extern Variables - Minimising Access Costs
-----------------------------------------------------------
A variable in a register costs nothing to access: it is just there, waiting 
to be used. A local (auto) variable is addressed via the sp register, which 
is always available for the purpose.
A static variable, on the other hand, can only be accessed after the static 
base for the compilation unit has been loaded. So, the first such use in a 
function always costs 2 LDRs or an LDR and an STR. However, if there are 
many uses of static variables within a function then there is a good chance 
that the static base will become a global common subexpression (CSE) and 
that, overall, access to static variables will be no more expensive than to 
auto variables on the stack.
Extern variables are fundamentally more expensive: each has its own base 
pointer. Thus each access to an extern is likely to cost 2 LDRs or an LDR 
and an STR. It is much less likely that a pointer to an extern will become a 
global CSE - and almost certain that there cannot be several such CSEs - so 
if a function accesses lots of extern variables, it is bound to incur 
significant access costs.
A further cost occurs when a function is called: the compiler has to 
assume - in the absence of inter-procedural data flow analysis - that any 
non-const static or extern variable could be side-effected by the call. 
This severely limits the scope across which the value of a static or extern 
variable can be held in a register.
Sometimes a programmer can do better than a compiler could do, even a 
compiler that did interprocedural data flow analysis. An example in C is 
given by the standard streams: stdin, stdout and stderr. These are not 
pointers to const objects (the underlying FILE structs are modified by I/O 
operations), nor are they necessarily const pointers (they may be assignable 
in some implementations). Nonetheless, a function can almost always safely 
slave a reference to a stream in a local FILE * variable.
It is a common programming paradigm to mimic the standard streams in 
applications. Consider, for example, the shape of a typical non-leaf 
printing function:

extern FILE *out;                  extern FILE *out;
    /* the output stream */            /* the output stream */

void print_it(Thing *t)            void print_it(Thing *t)
{                                  {   FILE *f = out;
    fprintf(out, ...);                 fprintf(f, ...);
    print_1(t->first);                 print_1(t->first);
    fprintf(out, ...);                 fprintf(f, ...);
    print_2(t->second);                print_2(t->second);
    fprintf(out, ...);                 fprintf(f, ...);
    ...                                ...
}                                  }

In the left hand case, out has to be re-computed or re-loaded after each 
call to print_... (and after each fprintf...). In the right hand case, 'f' 
can be held in a register throughout the function (and probably will be).
Uniform application of this transformation to the disassembly module of the 
ARM C compiler saved more than 5% of its code space.
In general, it is difficult and potentially dangerous to assert that no 
function you call (or any functions they in turn call) can affect the value 
of any static or extern variables of which you currently have local copies. 
However, the rewards can be considerable so it is usually worthwhile to 
work out at the program design stage which global variables are slavable 
locally and which are not. Trying to retrofit this improvement to existing 
code is usually hazardous, except in very simple cases like the above.
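
The hazard can be seen in a small sketch. All names here (out, redirect_to, 
print_pair and print_then_redirect) are hypothetical. In the first function 
nothing that is called assigns to out, so a local copy is safe; in the 
second, a called function changes out, so a local copy taken on entry would 
go stale.

```c
#include <stdio.h>

FILE *out;                           /* hypothetical output stream */

/* Hypothetical helper that redirects output as a side effect. */
static void redirect_to(FILE *f) { out = f; }

/* Safe to slave: nothing called here assigns to 'out'. */
int print_pair(int a, int b)
{
    FILE *f = out;                   /* local copy, one load of 'out' */
    int n = fprintf(f, "%d ", a);

    n += fprintf(f, "%d\n", b);
    return n;                        /* characters written */
}

/* Not safe to slave: redirect_to() changes 'out' between the uses. */
int print_then_redirect(FILE *next, int a, int b)
{
    int n = fprintf(out, "%d ", a);

    redirect_to(next);
    n += fprintf(out, "%d\n", b);    /* 'out' must be re-loaded here */
    return n;
}
```

Deciding at design time that only a few functions like redirect_to may 
assign to out is what makes the local copy in print_pair trustworthy.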

5.2.5 The switch() Statement
----------------------------
The switch() statement can be used to transfer control to one of several 
destinations - conceptually an indexed transfer of control - or to generate 
a value related to the controlling expression (in effect computing an 
in-line function of the controlling expression).
In the first role, switch() is hard to improve upon: the ARM C compiler does 
a good job of deciding when to compile jump tables and when to compile 
trees of if-then-elses. It is rare for a programmer to be able to improve 
upon this by writing if-then-else trees explicitly in the source.
In the second role, however, use of switch() is often mistaken. You can 
probably do better by being more aware of what is being computed and how.
In the example below, which is abstracted from an early version of the 
disassembly module of the ARM C Compiler, you will learn:

     the cost of implementing an in-line function using switch();
     how to implement the same function more economically. 

The function below used for illustrative purposes maps a 4-bit field of an 
ARM instruction to a 2-character condition code mnemonic. The real case was 
more complicated, decoding two 4-bit fields to a 3-char mnemonic, but for 
illustration the simple example serves just as well. The real case was also 
embedded in a larger function, but this is irrelevant to the discussion.

char *cond_of_instr(unsigned instr)
{   char *s;
    switch (instr & 0xf0000000)
    {
case 0x00000000:  s = "EQ";  break;
case 0x10000000:  s = "NE";  break;
     ...          ...        ...
case 0xF0000000:  s = "NV";  break;
    }
    return s;
}

The compiler handles this code fragment well, generating 276 bytes of code 
and string literals. But we could do better. If performance were not 
critical (as it never is in disassembly) then we could look up the code in a 
table of codes, in something like:

char *cond_of_instr(unsigned instr)
{
    static struct {char name[3];  unsigned code;}
        conds[] = {
            "EQ", 0x00000000,
            "NE", 0x10000000,
            ....
            "NV", 0xf0000000,
        };
    int j;
    for (j = 0;  j < sizeof(conds)/sizeof(conds[0]);  ++j)
        if ((instr & 0xf0000000) == conds[j].code)
            return conds[j].name;
    return "";
}

This fragment compiles to 68 bytes of code and 128 bytes of table data. 
Already this is a 30% improvement on the switch() case, but this schema has 
other advantages: it copes well with a random code to string mapping and if 
the mapping is not random admits further optimisation. For example, if the 
code is stored in a byte (char) instead of an unsigned and the comparison is 
with (instr >> 28) rather than (instr & 0xF0000000) then only 60 bytes of 
code and 64 bytes of data are generated for a total of 124 bytes.
Another advantage we have heard of for table lookup is that it is possible 
to share the same table between a disassembler and an assembler - the 
assembler looks up the mnemonic to obtain the code value, rather than the 
code value to obtain the mnemonic. Where performance is not critical, the 
symmetric property of lookup tables can sometimes be exploited to yield 
significant space savings.
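
The byte-coded table and its use in both directions can be sketched as 
below. The names conds, cond_name and cond_code are hypothetical, unsigned is 
assumed to be 32 bits wide, and the byte counts quoted above were measured on 
the original code, not on this sketch.

```c
#include <string.h>

/* One 4-byte entry per condition: 2-character mnemonic, NUL, 4-bit code. */
static const struct { char name[3]; unsigned char code; } conds[] = {
    {"EQ", 0x0}, {"NE", 0x1}, {"CS", 0x2}, {"CC", 0x3},
    {"MI", 0x4}, {"PL", 0x5}, {"VS", 0x6}, {"VC", 0x7},
    {"HI", 0x8}, {"LS", 0x9}, {"GE", 0xa}, {"LT", 0xb},
    {"GT", 0xc}, {"LE", 0xd}, {"AL", 0xe}, {"NV", 0xf}
};

#define N_CONDS (sizeof(conds) / sizeof(conds[0]))

/* Disassembler direction: instruction word to mnemonic ("" if not found). */
const char *cond_name(unsigned instr)
{
    unsigned code = instr >> 28;     /* assumes a 32-bit unsigned */
    unsigned j;

    for (j = 0; j < N_CONDS; ++j)
        if (conds[j].code == code)
            return conds[j].name;
    return "";
}

/* Assembler direction: mnemonic to 4-bit code, or -1 if unknown. */
int cond_code(const char *mnemonic)
{
    unsigned j;

    for (j = 0; j < N_CONDS; ++j)
        if (strcmp(conds[j].name, mnemonic) == 0)
            return conds[j].code;
    return -1;
}
```

Both functions search the same 64 bytes of table, which is where the space 
saving from sharing comes from.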
Finally, by exploiting the denseness of the indexing and the uniformity of 
the returned value it is possible to do better again, both in size and 
performance, by direct indexing:

char *cond_of_instr(unsigned instr)
{
    return "\
EQ\0\0NE\0\0CS\0\0CC\0\0MI\0\0PL\0\0VS\0\0VC\0\0\
HI\0\0LS\0\0GE\0\0LT\0\0GT\0\0LE\0\0AL\0\0NV" + (instr >> 28)*4;
}

This expression of the problem causes a miserly 16 bytes of code and 64 
bytes of string literal to be generated and is probably close to what an 
experienced assembly language programmer would naturally write if asked to 
code this function. It is the solution finally adopted in the ARM C 
compiler's disassembler module.
The uniform application of this transformation to the disassembler module of 
the ARM C compiler saved between 5% and 10% of its code space.
The moral of this tale is to think before using switch() to compute an 
in-line function, especially if code size is an important consideration. 
A switch() compiles to high performance code but often table lookup will be 
smaller; where the function's domain is dense, or piecewise dense, direct 
indexing into a table will often be both faster and smaller.

5.2.6 Related Topics
--------------------
      ARM Assembly Programming Performance Issues starting on page 55.
      Register Usage under the ARM Procedure Call Standard starting on page 62.
      Passing and Returning structs starting on page 67.

5.3 C Programming for Deeply Embedded Applications
--------------------------------------------------
5.3.1 About this Recipe
-----------------------
In this recipe you will learn about the standalone runtime support system 
for C programming in deeply embedded applications.  In particular you will 
discover:

     what rtstand.s supports;
     how to make use of it by looking at example programs;
     how to extend it by adding extra functionality from the C library;
     the size of the standalone run time library.

5.3.2 Introduction
------------------
The semi hosted ANSI C library provides all the standard C library 
facilities (and thus is quite large).  This is acceptable when running  
under emulation with plenty of memory available, or maybe even when running 
on development hardware with access to a real debugging channel and plenty 
of memory. However, in a deeply embedded application many of the facilities 
of the C library may no longer be relevant, e.g. file access functions, time 
and date functions, and the size of the semi hosted ANSI C library may be 
prohibitive if the memory available is severely limited.
For deeply embedded applications a minimal C runtime system is needed which 
takes up as little memory as possible, is easily portable to the target 
hardware, and only supports those functions required for such an application.
The ARM Software Development Toolkit comes with a minimal runtime system in 
source form.  The "behind the scenes" jobs which it performs are:

     setting up the initial stack and heap, and calling main;
     program termination - either automatic (returning from main()) or 
     forced (explicitly calling __rt_exit);
     simple heap allocation (__rt_alloc);
     stack limit checking;
     setjmp and longjmp support;
     divide and remainder functions (calls to which can be generated by 
     armcc);
     high level error handler support (__err_handler);
     optional floating point support, and a means to detect whether floating 
     point support is available or not (__rt_fpavailable).

The source code rtstand.s documents the options which you may want to change 
for your target.  These are not covered in this recipe.  The header file 
rtstand.h documents the functions which rtstand.s provides to the C 
programmer.
Note that no support is provided for outputting data down the debugging 
channel.  This can be done, but is specific to the target application.  The 
example C programs described below use the ARM Debug Monitor available under 
armsd to output messages using in-line SWIs.  See ARM Debug Monitor starting 
on page 104 of the Technical Specifications for full details of the 
facilities which the ARM Debug Monitor provides, and see In-Line SWIs 
starting on page 72 for more information about in-line SWIs.

5.3.3 Using the Standalone Runtime System
-----------------------------------------
In this section the main features of the standalone runtime system are 
demonstrated by example programs.
Before attempting any of the demonstrations below create a working 
directory, and set this up as your current directory.  Copy the contents of 
the clstand directory into your working directory, and also copy the files 
fpe*.o from the fpe340 directory of the cl directory into your working 
directory.  You are now ready to experiment with the C standalone runtime 
system.
In the examples below, the following options are passed to armcc, armasm, 
and in the first case armsd:

-li            This specifies that the target is a little endian ARM.
-apcs 3/32bit  This specifies that the 32 bit variant of APCS 3 should be 
               used.  For armasm this is used to set the built in variable 
               {CONFIG} to 32.

These arguments can be changed if the target hardware differs from this 
configuration.  If the ARM Software Tools have been configured as desired 
then these options may be omitted, as the tools will default to the 
configuration time values.  See The ARM Tool Reconfiguration Utility 
(reconfig) starting on page 45 of the User Manual for how to configure the 
ARM Software Tools.
These demonstrations are likely to be most useful if the sources rtstand.s, 
errtest.c and memtest.c are studied in conjunction with this recipe.

5.3.4 A Simple Program
----------------------
Let us compile the example program errtest.c, and assemble the standalone 
runtime system.  These can then be linked together to provide an executable 
image, errtest:

armcc -c errtest.c -li -apcs 3/32bit
armasm rtstand.s -o rtstand.o -li -apcs 3/32bit
armlink -o errtest errtest.o rtstand.o

We can then execute this image under armsd as follows:

> armsd -li errtest
A.R.M. Source-level Debugger, version 4.10 (A.R.M.) [Aug 26 1992]
ARMulator V1.20, 512 Kb RAM, MMU present, Demon 1.01, FPE, Little endian.
Object program file errtest
armsd: go
(the floating point instruction-set is not available)
Using integer arithmetic ...
10000 / 0X0000000A = 0X000003E8
10000 / 0X00000009 = 0X00000457
10000 / 0X00000008 = 0X000004E2
10000 / 0X00000007 = 0X00000594
10000 / 0X00000006 = 0X00000682
10000 / 0X00000005 = 0X000007D0
10000 / 0X00000004 = 0X000009C4
10000 / 0X00000003 = 0X00000D05
10000 / 0X00000002 = 0X00001388
10000 / 0X00000001 = 0X00002710
Program terminated normally at PC = 0x00008550
      0x00008550: 0xef000011 .... : >  swi     0x11
armsd: quit
Quitting
> 

The '>' prompt is the Operating System prompt, and the 'armsd:' prompt is 
output by armsd to indicate that user input is required.

Already several of the standalone runtime system's facilities have been 
demonstrated:

     the C stack and heap have been set up;
     main has clearly been called;
     the fact that floating point support is not available has been detected;
     the integer division functions have been used by the compiler;
     program termination was successful.
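
The arithmetic in the transcript can be checked on a host machine.  The 
following is a minimal sketch in hosted C99, not the source of errtest.c: 
the function name format_division is hypothetical, and the output format is 
simply copied from the lines printed above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: formats one line in the style of the transcript.
   Returns 0 on success, -1 if the divisor is zero or the buffer is too
   small. */
int format_division(char *buf, size_t len, unsigned dividend, unsigned divisor)
{
    int n;

    assert(buf != NULL);
    if (divisor == 0)
        return -1;      /* the standalone runtime reports this as an error */
    n = snprintf(buf, len, "%u / 0X%08X = 0X%08X",
                 dividend, divisor, dividend / divisor);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

For example, a dividend of 10000 and a divisor of 10 give the first line of 
the transcript, since 10000 / 10 = 1000 = 0x3E8.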

5.3.5 Error Handling
--------------------
The same program, errtest, can also be used to demonstrate error handling, 
by recompiling errtest.c and predefining the DIVIDE_ERROR macro:

armcc -c errtest.c -li -apcs 3/32bit -DDIVIDE_ERROR
armlink -o errtest errtest.o rtstand.o

Again, we can now execute this image under the armsd as follows:

> armsd -li errtest
A.R.M. Source-level Debugger, version 4.10 (A.R.M.) [Aug 26 1992]
ARMulator V1.20, 512 Kb RAM, MMU present, Demon 1.01, FPE, Little endian.
Object program file errtest
armsd: go
(the floating point instruction-set is not available)
Using integer arithmetic ...
10000 / 0X0000000A = 0X000003E8
10000 / 0X00000009 = 0X00000457
10000 / 0X00000008 = 0X000004E2
10000 / 0X00000007 = 0X00000594
10000 / 0X00000006 = 0X00000682
10000 / 0X00000005 = 0X000007D0
10000 / 0X00000004 = 0X000009C4
10000 / 0X00000003 = 0X00000D05
10000 / 0X00000002 = 0X00001388
10000 / 0X00000001 = 0X00002710
10000 / 0X00000000 = errhandler called: code = 0X00000001: divide by 0
caller's pc = 0X00008304
returning...

run time error: divide by 0
program terminated

Program terminated normally at PC = 0x0000854c
      0x0000854c: 0xef000011 .... : >  swi     0x11
armsd: quit
Quitting
> 

This time an integer division by zero has been detected by the standalone 
runtime system, which called __err_handler.  __err_handler output the first 
set of error messages in the above output.  Control was then returned to the 
runtime system which output the second set of error messages and terminated 
execution.

5.3.6 longjmp and setjmp
------------------------
A further demonstration can be made using errtest by predefining the macro 
LONGJMP to perform a longjmp out of __err_handler back into the user 
program, thus catching and dealing with the error.  First recompile and link 
errtest:

armcc -c errtest.c -li -apcs 3/32bit -DDIVIDE_ERROR -DLONGJMP
armlink -o errtest errtest.o rtstand.o

Then rerun errtest under armsd.  We expect the integer divide by zero to 
occur once again:

> armsd -li errtest
A.R.M. Source-level Debugger, version 4.10 (A.R.M.) [Aug 26 1992]
ARMulator V1.20, 512 Kb RAM, MMU present, Demon 1.01, FPE, Little endian.
Object program file errtest
armsd: go
(the floating point instruction-set is not available)
Using integer arithmetic ...
10000 / 0X0000000A = 0X000003E8
10000 / 0X00000009 = 0X00000457
10000 / 0X00000008 = 0X000004E2
10000 / 0X00000007 = 0X00000594
10000 / 0X00000006 = 0X00000682
10000 / 0X00000005 = 0X000007D0
10000 / 0X00000004 = 0X000009C4
10000 / 0X00000003 = 0X00000D05
10000 / 0X00000002 = 0X00001388
10000 / 0X00000001 = 0X00002710
10000 / 0X00000000 = errhandler called: code = 0X00000001: divide by 0
caller's pc = 0X00008310
returning...

Returning from __err_handler() with errnum = 0X00000001

Program terminated normally at PC = 0x00008558
      0x00008558: 0xef000011 .... : >  swi     0x11
armsd: quit
Quitting
> 

The runtime system detected the integer divide by zero, and as before 
__err_handler was called, which produced the error messages.  However, this 
time __err_handler used longjmp to return control to the program, rather 
than the runtime system.
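
The control flow described here can be sketched in standard C.  This is an 
illustration only and not the code in errtest.c: the names on_error, 
checked_div and try_divide are hypothetical, and the error code 1 is taken 
from the transcript above.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

static jmp_buf recover_point;       /* where the handler jumps back to */
static volatile int last_error;     /* error code recorded by the handler */

/* Hypothetical handler: instead of returning to its caller, it records
   the code and unwinds to recover_point. */
static void on_error(int code)
{
    last_error = code;
    longjmp(recover_point, 1);
}

/* Hypothetical checked division: reports divide by zero via on_error. */
static int checked_div(int a, int b)
{
    if (b == 0)
        on_error(1);                /* 1 is the "divide by 0" code above */
    return a / b;
}

/* Returns 0 and stores the quotient, or returns the error code caught. */
int try_divide(int a, int b, int *quotient)
{
    assert(quotient != NULL);
    if (setjmp(recover_point) != 0)
        return last_error;          /* arrived here via longjmp */
    *quotient = checked_div(a, b);
    return 0;
}
```

The point of the design is that the handler never returns to the code that 
raised the error, so the program, not the runtime system, decides what 
happens next.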

5.3.7 Floating Point Support
----------------------------
Using errtest we can also demonstrate floating point support.  You should 
already have copied the appropriate floating point emulator object code into 
your working directory.  For the configuration used in this example 
fpe_32l.o is the correct object file.
However, in addition to this it is also necessary to link with an fpe stub, 
which we must compile from the source given (fpestub.s).

armasm fpestub.s -o fpestub.o -li -apcs 3/32bit
armlink -o errtest errtest.o rtstand.o fpestub.o fpe_32l.o -d

The resulting executable, errtest, can be run under armsd as before:

> armsd -li errtest
A.R.M. Source-level Debugger, version 4.10 (A.R.M.) [Aug 26 1992]
ARMulator V1.20, 512 Kb RAM, MMU present, Demon 1.01, FPE, Little endian.
Object program file errtest
armsd: go
(the floating point instruction-set is available)
Using Floating point, but casting to int ...
10000 / 0X0000000A = 0X000003E8
10000 / 0X00000009 = 0X00000457
10000 / 0X00000008 = 0X000004E2
10000 / 0X00000007 = 0X00000594
10000 / 0X00000006 = 0X00000682
10000 / 0X00000005 = 0X000007D0
10000 / 0X00000004 = 0X000009C4
10000 / 0X00000003 = 0X00000D05
10000 / 0X00000002 = 0X00001388
10000 / 0X00000001 = 0X00002710
10000 / 0X00000000 = errhandler called: code = 0X80000202: Floating Point
Exception : Divide By Zero

caller's pc = 0XE92DE000
returning...

Returning from __err_handler() with errnum = 0X80000202

Program terminated normally at PC = 0x00008558 (__rt_exit + 0x10)
+0010 0x00008558: 0xef000011 .... : >  swi     0x11
armsd: quit
Quitting
> 

This time the floating point instruction set is found to be available, and 
when a floating point division by zero is attempted, __err_handler is called 
with the details of the floating point divide by zero exception.
Note that unless you compiled errtest.c as described in longjmp and setjmp 
starting on page 90 (that is, with both -DDIVIDE_ERROR and -DLONGJMP), you 
will not see precisely this dialogue with armsd.

5.3.8 Running Out of Heap
-------------------------
A second example program, memtest.c, demonstrates how the standalone runtime 
system copes with allocating stack space, and also demonstrates the simple 
memory allocation function __rt_alloc.  Let us first compile this program so 
that it should repeatedly request more memory, until there is none left:

armcc -li -apcs 3/32bit memtest.c -c
armlink -o memtest memtest.o rtstand.o

This can be run under armsd in the usual way:

> armsd -li memtest
A.R.M. Source-level Debugger, version 4.10 (A.R.M.) [Aug 26 1992]
ARMulator V1.20, 512 Kb RAM, MMU present, Demon 1.01, FPE, Little endian.
Object program file memtest
armsd: go
kernel memory management test
force stack to 4KB
request 0 words of heap - allocate 256 words at 0X000085A0
force stack to 8KB
..
force stack to 60KB
request 33211 words of heap - allocate 33211 words at 0X00049388
force stack to 64KB
request 49816 words of heap - allocate 5739 words at 0X00069A74
memory exhausted, 105376 words of heap, 64KB of stack
Program terminated normally at PC = 0x0000847c
      0x0000847c: 0xef000011 .... : >  swi     0x11
armsd: quit
Quitting
> 

This demonstrates that allocating space on the stack is working correctly, 
and also that the __rt_alloc routine is working as expected.  The program 
terminated because in the end __rt_alloc could not allocate the requested 
amount of memory.
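
The last request in the transcript asked for 49816 words but was granted 
only 5739.  An allocator with that behaviour can be sketched as follows.  
This is a toy under stated assumptions, not the __rt_alloc routine from 
rtstand.s: the name toy_alloc, its interface and the 16-word arena are all 
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define ARENA_WORDS 16u             /* deliberately tiny for illustration */

static unsigned arena[ARENA_WORDS];
static size_t   next_free;          /* index of the first unused word */

/* Hypothetical allocator: grants up to want_words words, fewer if the
   arena is nearly full, and returns the number of words granted. */
size_t toy_alloc(size_t want_words, unsigned **block)
{
    size_t remaining = ARENA_WORDS - next_free;
    size_t granted = want_words < remaining ? want_words : remaining;

    assert(block != NULL);
    *block = granted ? &arena[next_free] : NULL;
    next_free += granted;
    return granted;
}
```

A caller detects exhaustion by comparing the number of words granted with 
the number requested.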

5.3.9 Stack Overflow Checking
-----------------------------
memtest can also be used to demonstrate stack overflow checking by 
recompiling with the macro STACK_OVERFLOW defined.  In this case the amount 
of stack required is increased until there is not enough stack available, 
and stack overflow detection causes the program to be aborted.

To recompile and link memtest.c issue the following commands:

armcc -li -apcs 3/32bit memtest.c -c -DSTACK_OVERFLOW
armlink -o memtest memtest.o rtstand.o

Running this program under armsd produces the following output:

> armsd -li memtest
A.R.M. Source-level Debugger, version 4.10 (A.R.M.) [Aug 26 1992]
ARMulator V1.20, 512 Kb RAM, MMU present, Demon 1.01, FPE, Little endian.
Object program file memtest
armsd: go
kernel memory management test
force stack to 4KB
...
force stack to 256KB
request 1296 words of heap - allocate 1296 words at 0X0000AE20
force stack to 512KB

run time error: stack overflow
program terminated

Program terminated normally at PC = 0x0000847c
      0x0000847c: 0xef000011 .... : >  swi     0x11
armsd: quit
Quitting
> 

Clearly stack overflow checking did indeed catch the case where too much 
stack was required, and caused the runtime system to terminate the program 
after giving an appropriate diagnostic.
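
The test at the heart of software stack limit checking can be sketched in 
C.  This is an illustration only, assuming a stack that grows downwards; 
frame_fits and its parameters are hypothetical names and do not come from 
rtstand.s.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical check for a stack that grows downwards: sp is the current
   stack pointer, limit the lowest usable address, frame the number of
   bytes needed.  Returns 1 if the frame fits, 0 if allocating it would
   overflow the stack. */
int frame_fits(uintptr_t sp, uintptr_t limit, uintptr_t frame)
{
    assert(sp >= limit);        /* otherwise the stack has already overflowed */
    return frame <= sp - limit; /* written this way to avoid wrap-around */
}
```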

5.3.10 Extending the Standalone Runtime System
----------------------------------------------
For many applications it may be desirable to have access to more of the 
standard C library than just the minimal runtime system provides.  This 
section demonstrates how to take out a part of the standard C library and 
plug it into the standalone runtime system.
The function which we will add to rtstand is memmove.  Although this is 
small, and easily extracted from the C library source, the same methodology 
can be applied to larger sections of the C library, e.g. the dynamic memory 
allocation system (malloc, free, etc).
The source of the C library can be found in the cl directory.  The source 
for the memmove function is in string.c.  The extracted source for memmove 
has been put into memmove.c, and the compile time option _copywords has been 
removed.  The function declaration for memmove and a typedef for size_t 
(extracted from include/stddef.h) have been put into memmove.h.
Our memmove module can be compiled as follows.

armcc -c memmove.c -li -apcs 3/32bit

The output, memmove.o, can be linked with the user's other object modules 
together with rtstand.o in the normal way (see previous examples in this 
section).
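
For readers without the library source to hand, the following sketch shows 
what a function of this kind has to do.  It is not the code in string.c or 
memmove.c; my_memmove is a hypothetical name and the byte-at-a-time copy is 
chosen for clarity rather than speed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative byte-wise move that tolerates overlapping regions.
   Comparing pointers into different objects is not strictly portable C,
   but it is the usual way to pick the copy direction. */
void *my_memmove(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if (n == 0 || d == s)
        return dst;
    assert(dst != NULL && src != NULL);
    if (d < s) {
        while (n--)             /* destination is lower: copy forwards */
            *d++ = *s++;
    } else {
        d += n;                 /* destination is higher: copy backwards so */
        s += n;                 /* the source is not overwritten too early */
        while (n--)
            *--d = *--s;
    }
    return dst;
}
```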

5.3.11 The Size of the Standalone Runtime Library
-------------------------------------------------
rtstand.s has been separated into several code Areas.  The advantage of this 
is that armlink can detect if any Areas are unreferenced, and then eliminate 
them from the output image.
The table below shows the typical size of the Areas in rtstand.o:
  
Area                    Size (bytes)       Functions
----                    ------------       ---------
C$$data                      4
C$$code$$__main              96            __main, __rt_exit
C$$code$$__rt_fpavailable    8             __rt_fpavailable
C$$code$$__rt_trap           128           __rt_trap
C$$code$$__rt_alloc          68            __rt_alloc
C$$code$$__rt_stkovf         76            __rt_stkovf_split_*
C$$code$$__jmp               100           longjmp, setjmp
C$$code$$__divide            256           __rt_sdiv, __rt_udiv, __rt_udiv10,
                                           __rt_sdiv10, __rt_divtest
All Areas                    736

If floating point support is definitely not required, then the 
EnsureNoFPSupport variable can be set to {TRUE}, and some extra space will 
be saved.  After making any modifications to rtstand.s, the size of the 
various areas can be found by using the command:

decaof -b rtstand.o

From the above table it is clear that for many applications the standalone 
runtime library will be roughly 0.5Kb.
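
The effect of area elimination on image size can be estimated by adding up 
the sizes of the areas that remain.  The sketch below does this for the 
figures in the table above.  Which areas a given program actually 
references is an assumption supplied by the caller; image_bytes and the 
bit-mask convention are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Area sizes in bytes, copied from the table above. */
struct area { const char *name; unsigned size; };

static const struct area areas[] = {
    { "C$$data",                     4 },
    { "C$$code$$__main",            96 },
    { "C$$code$$__rt_fpavailable",   8 },
    { "C$$code$$__rt_trap",        128 },
    { "C$$code$$__rt_alloc",        68 },
    { "C$$code$$__rt_stkovf",       76 },
    { "C$$code$$__jmp",            100 },
    { "C$$code$$__divide",         256 },
};
#define AREA_COUNT (sizeof areas / sizeof areas[0])

/* Sum the sizes of the areas whose bit is set in keep_mask
   (bit i corresponds to areas[i]). */
unsigned image_bytes(unsigned keep_mask)
{
    unsigned total = 0;
    size_t i;

    assert(AREA_COUNT <= 8 * sizeof keep_mask);
    for (i = 0; i < AREA_COUNT; i++)
        if (keep_mask & (1u << i))
            total += areas[i].size;
    return total;
}
```

With every area kept (mask 0xFF) the total is 736 bytes, matching the table. 
If, for example, __rt_alloc and the longjmp/setjmp area were unreferenced 
(mask 0xAF), 568 bytes would remain.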

5.3.12 Related Topics
---------------------

     Register Usage under the ARM Procedure Call Standard starting on page 62;
     In-Line SWIs starting on page 72.

5.4 ARM Shared Libraries
------------------------
5.4.1 About This Recipe
-----------------------
In this recipe you will learn:

     what an ARM shared library is;
     how the shared library mechanism works;
     how to instruct the ARM linker to make a shared library;
     how to make a toy shared library from the string section of the ANSI C 
     library.

5.4.2 About ARM Shared Libraries
--------------------------------
ARM shared libraries support the sharing of utility, service or library 
functions between several concurrently executing client applications in a 
single address space. Such shared code is necessarily reentrant.
If a function is reentrant, each of its concurrently active clients must 
have a separate copy of the data it manipulates for them. The data cannot be 
associated with the code itself unless the data is read-only. In the ARM 
shared library architecture, a dedicated register (called sb) is used to 
address (indirectly) the static data associated with a client.
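
The idea can be illustrated in portable C by passing each client's data 
explicitly.  This is only an analogy: in the real mechanism the compiler 
addresses the data through the sb register, whereas here the hypothetical 
parameter sb plays that role.

```c
#include <assert.h>

/* Per-client static data; in the shared library scheme this would be
   reached through sb rather than through an explicit argument. */
struct client_data { unsigned calls; };

/* Reentrant: touches only the caller's data, never a global. */
unsigned count_call(struct client_data *sb)
{
    assert(sb != 0);
    return ++sb->calls;
}
```

Two clients that call count_call with their own client_data see independent 
counts, even though they share one copy of the code.
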
An ARM shared library is read only, reentrant and usually position 
independent. A shared library made exclusively from object code compiled by 
the ARM C compiler will have all three of these attributes. Library 
components implemented in ARM Assembly Language need not be reentrant and 
position independent, but in practice, only position independence is 
inessential.
A library with all three of these attributes is an ideal candidate for 
packing into a system ROM.
Some shared library mechanisms associate a shared library's data with the 
library itself and put only a place holder in the stub. At run time, a copy 
of the library's initialised static data is copied into the client's place 
holder by the dynamic linker or by library initialisation code.
The ARM shared library mechanism supports these ways of working provided the 
data is free of values which require link-time (or run time) relocation. In 
other words, it can be supported provided the input data areas are free of 
relocation directives.

5.4.3 How ARM Shared Libraries Work
-----------------------------------
Stubs and Proxy Functions
-------------------------
When a client application is linked with a shared library, it is linked not 
with the library itself but with a stub object containing:

     an entry vector;
     a copy of the library's static data or a place holder for it.

Each member of the entry vector is a proxy for a function in the matching 
shared library.
When a client first calls a proxy function, the call is directed to a 
dynamic linker. This is a small function (typically about 50-60 ARM 
instructions) which:

     locates the matching shared library;
     if required, copies an initial image of the library's static data from 
     the library to the place holding area in the stub;
     patches the entry vector so each proxy function points at the 
     corresponding library function;
     resumes the call.

Once an entry vector has been patched, all future proxy calls proceed 
directly to the target library function with only minimal indirection delay 
and no intervention by the dynamic linker.
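
The patching scheme can be modelled in C with a table of function pointers. 
This is a sketch of the idea only, not the ARM stub or dynamic linker code: 
the one-entry vector and the names proxy_strlen, resolve_then_call and 
library_strlen are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef size_t (*strlen_fn)(const char *);

static size_t resolve_then_call(const char *s);  /* plays the dynamic linker */

/* One-entry "entry vector": initially the proxy leads to the resolver. */
static strlen_fn entry_vector[1] = { resolve_then_call };
unsigned resolver_runs;             /* counts entries to the resolver */

/* Stand-in for the library function the proxy should reach. */
static size_t library_strlen(const char *s) { return strlen(s); }

static size_t resolve_then_call(const char *s)
{
    resolver_runs++;
    entry_vector[0] = library_strlen;   /* patch the vector */
    return entry_vector[0](s);          /* resume the original call */
}

/* What client code calls; it always goes through the vector. */
size_t proxy_strlen(const char *s)
{
    assert(s != NULL);
    return entry_vector[0](s);
}
```

The first call to proxy_strlen runs the resolver once; every later call goes 
through the patched vector directly to library_strlen.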
Of course, making an inter-link-unit call like this is more expensive than 
making a straightforward local procedure call, but not by much. It is also 
the only supported way to call a function more than 32MBytes away.

5.4.4 Locating a Library Which Matches the Stub
-----------------------------------------------
Locating a matching shared library is specific to a target system and you 
must provide code to do the location, but the remainder of the dynamic 
linking process is generic to all target systems. Consequently, in order to 
use ARM shared libraries, you have to design and implement a library 
location mechanism and adapt the dynamic linker to it. In practice, this is 
quite straightforward:

     the ARM Linker provides support for parameterising a location mechanism;
     a basic dynamic linker with neither location nor failure reporting 
     mechanisms is a mere 42 ARM instructions.

Please refer to ARM Shared Library Format starting on page 16 of the 
Reference Manual for a full explanation of parameter blocks.

How the Dynamic Linker Works
----------------------------
The dynamic linker is entered via a proxy call with r0 pointing at the 
dynamic linker's 16-byte entry stub. Following this stub code is a copy of 
the parameter block for the shared library.
Stored in the parameter block is the identity of the library - perhaps a  
32-bit unique identifier or perhaps a string name. Either way, it can be 
passed to the library location mechanism. You have to decide how to identify 
your shared libraries and, hence, what to put in their parameter blocks.
The library location function is required to return the address of the start 
of the library's offset table.
A primitive location mechanism might be to search a ROM for a matching 
string. This would identify the start of the parameter block of the matching 
shared library. Immediately preceding it will be negative offsets to library 
entry points and a non-negative count word containing the number of entry 
points. By working backwards through memory and counting, you can be sure 
you have found the entry vector and can return the address of its count word 
to the dynamic linker.
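
The backwards scan just described can be sketched in C as follows.  The 
sketch assumes that the offset table immediately precedes the parameter 
block, that is, that no static data size and offset words were requested; 
the function name find_count_word and the use of array indices in place of 
addresses are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* words[param_index] is the first word of a parameter block that has
   already been matched.  Walk backwards over the negative offsets that
   precede it; the first non-negative word should be the count, and it
   must equal the number of offsets skipped.  Returns the index of the
   count word, or -1 if the layout does not check out. */
long find_count_word(const int32_t *words, size_t param_index)
{
    size_t i = param_index;
    int32_t skipped = 0;

    assert(words != NULL);
    while (i > 0 && words[i - 1] < 0) {
        i--;
        skipped++;
    }
    if (i == 0 || words[i - 1] != skipped)
        return -1;
    return (long)(i - 1);
}
```

Checking the count against the number of offsets skipped is what guards 
against a chance match of the string elsewhere in the ROM.
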
More sophisticated location schemes are possible, for example:

    You might include in your library a header containing code to execute 
    when the library is first loaded (into RAM) or initialised (in ROM) 
    which registers the library's name with a library manager. Obviously, 
    the library manager has to be locatable without using the library 
    manager, so either its address has to be known or its function has to 
    be supported by an underlying system call.

    Acorn's RISC OS operating system supports a module mechanism which is 
    sometimes used to implement shared libraries. A RISC OS module may, by 
    declaring so in its module header, be called when software interrupts 
    (SWIs) in a declared range occur. When such a module is loaded, it 
    extends the range of SWIs interpreted by RISC OS. We can use this 
    mechanism to locate a shared library by storing the identity of a 
    library location SWI in the library's parameter block and by 
    implementing this SWI in the library module's header.

5.4.5 Instructing the Linker to Make a Shared Library
-----------------------------------------------------
Prerequisites
-------------
A shared library can be made from any number of object files, including 
reentrant stubs of other shared libraries, but two simple rules must be 
followed:

     each object file must conform to a reentrant version of the ARM 
     Procedure Call Standard and each code area must have the REENTRANT 
     attribute;
     there may be no unresolved references resulting from linking together 
     the component objects.

An immediate consequence of the second rule is that it is impossible to make 
two shared libraries which refer to one another: to make the second library 
and its stub would require the stub of the first, but to make the first and 
its stub would require the stub of the second.
The first rule is not 100% necessary and is difficult to enforce. The ARM 
Linker warns you if it finds a non-reentrant code area in the list of 
objects to be linked into a shared library but it will build the library and 
its matching stub anyway. You have to decide whether the warning is real, or 
merely a formality.

Linker Outputs
--------------
The ARM linker generates a shared library as two files:

     a plain binary file containing the read-only, reentrant, usually 
     position independent, shared code;
     an AOF format stub file with which client applications can be linked.

The linker can also generate a reentrant stub suitable for inclusion in 
another shared library.
The library image file contains, in order:

     read only code sections from your input objects;
     if so requested, a read only copy of the initialised static data from 
     the input objects;
     a table of (negative) offsets from the end of the library to its entry 
     points;
     if so requested, the size and offset of the static data image;
     a copy of the library's parameter block.

You request a copy of the initialised static data to be included in a 
library when you describe to the linker how to make a shared library. If you 
request this, the linker writes the length and offset of the data image 
immediately after the entry vector. During linking, armlink defines symbols 
SHL$$data$$Size and SHL$$data$$Base to have these values; components of your 
library may refer to these symbols. Instead of including the static data in 
the stub, armlink includes a zero initialised place holding area of the same 
size. It also writes the length and (relocatable) address of this place 
holding, zero initialised stub data area immediately after the dynamic 
linker's entry veneer, giving the dynamic linker sufficient information to 
initialise the place holder at run time. During linking, the linker symbols 
SHL$$data$$Size and $$0$$Base describe this length and relocatable address.
Obviously, any data included in your shared library must be free of 
relocation directives. Please refer to ARM Shared Library Format starting on 
page 16 of the Reference Manual for a full explanation of what kind of data 
can be included in a shared library.
You specify a parameter block when you describe to the linker how to make a 
shared library. You might, for example, include the name of the library in 
its parameter block, to aid its location. An identical copy of the parameter 
block is included in the library's entry vector in the stub file.

Describing a Shared Library to the Linker
-----------------------------------------
To describe a shared library to the linker you have to prepare a file which 
describes:

     the name of the library;
     the library parameter block;
     what data areas to include;
     what entry points to export.

For precise details of how to do this, please refer to ARM Shared Library 
Format starting on page 16 of the Reference Manual. Below is an intuitive 
example you can work with and adapt:

; First, give the name of the file to contain the library -
; strlib - and its parameter block - the single word 0x40000...
> strlib \
  0x40000
; ...then include all suitable data areas...
+ ()
; ... finally export all the entry points...
; ... mostly omitted here for brevity of exposition.
memcpy
...
strtok

The name of this file is passed to armlink as the argument to the -SHL 
command line option (please refer to The ARM Linker (armlink) starting on 
page 19 of the User Manual for further details).

5.4.6 Making a Toy String Library
---------------------------------
This section refers to the files collected in the strlib subdirectory of the 
examples directory of the release.
The header files config.h and interns.h let you compile cl/string.c locally. 
Little-endian code is assumed. If you want to make a big-endian string 
library you should edit config.h. Similarly, if you want to alter which 
functions are included or whether static data is initialised by copying from 
the library, then you should edit config.h. You do not need to edit 
interns.h. If you use config.h unchanged you will build a little-endian 
library which includes a data image and which exports all of its functions.

Compiling the String Library
----------------------------
To compile string.c, use the following command:

armcc -li -apcs /reent -zps1 -c -I. ../../cl/string.c

The -li flag tells armcc to compile for a little-endian ARM.
The -apcs /reent flag tells armcc to compile reentrant code.
The -zps1 flag turns off software stack limit checking and allows the string 
library to be independent of all other objects and libraries. With software 
stack limit checking turned on, the library would depend on the stack limit 
checking functions which, in turn, depend on other sections of the C run 
time library. While such dependencies do not much obstruct the construction 
of full scale, production quality shared libraries, they are major 
impediments to a simple demonstration of the underlying mechanisms.
The -I. flag tells armcc to look for needed header files in the current 
directory.

Linking the String Library
--------------------------
To make a shared library and matching stub from string.o, use the following 
linker command:

armlink -o strstub.o -shl strshl -s syms string.o

strlib's stub will be put in strstub.o as directed by the -o option.
The file strshl contains instructions for making a shared library called 
strlib. A shortened version of it was shown in the earlier section 
Describing a Shared Library to the Linker starting on page 98.
The option -s syms asks for a listing of symbol values in a file called 
syms. You may later need to look up the value of EFT$$Offset (it will be 
0xA38 if you have changed nothing). As supplied, the dynamic linker expects 
a library's external function table (EFT) to be at the address 0x40000. So, 
unless you extend the dynamic linker with a library location mechanism 
(please refer to the discussion in the earlier section How the Dynamic 
Linker Works starting on page 96), you will have to load strlib at the 
address 0x40000-EFT$$Offset.

Making the Test Program and Dynamic Linker
------------------------------------------
Now you should assemble the dynamic linker and compile the test code:

armasm -li dynlink.s dynlink.o
armcc -li -c strtest.c

You can extend the test code to probe lots of string functions, but this is 
left as an exercise to help you understand what is going on.
To make the test program you must link together the test code, the dynamic 
linker, the string library stub and the appropriate ARM C library (so that 
references to library members other than the string functions can be 
resolved):

armlink -d -o strtest strtest.o dynlink.o strstub.o ../../lib/armlib.32l

Running the Test Program with the Shared String Library
-------------------------------------------------------
Now you are ready to try everything under the control of command-line armsd:

host-prompt armsd strtest
A.R.M. Source-level Debugger version ...
ARMulator V1.30, 4 Gb memory, MMU present, Demon 1.1,...
Object program file strtest
armsd: getfile strlib 0x40000-0xa38
armsd: go

strerror(42) returns unknown shared string-library error 0x0000002A

Program terminated normally at PC = 0x00008354 (__rt_exit + 0x24)
+0024 0x00008354: 0xef000011 .... :    swi      0x11
armsd: q
Quitting
host-prompt

Before starting strtest you must load the shared string library by using:
getfile strlib 0x40000-0xa38
strlib is the name of the file containing the library; 0x40000 is the hard 
wired address at which the dynamic linker expects to find the external 
function table; and 0xa38 is the value of EFT$$Offset, the offset of the 
external function table from the start of the library.
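
The load address is simply the difference of these two values.  As a small 
worked check in C (library_load_address is a hypothetical name used only 
for this illustration):

```c
#include <assert.h>

/* Address at which to load the library so that its external function
   table lands at eft_address. */
unsigned long library_load_address(unsigned long eft_address,
                                   unsigned long eft_offset)
{
    assert(eft_offset <= eft_address);  /* otherwise the result would wrap */
    return eft_address - eft_offset;
}
```

With the values quoted here, 0x40000 and 0xa38, the result is 0x3F5C8.
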
When strtest runs, it calls strerror(42) which causes the dynamic linker to 
be entered, the static data to be copied, the stub vector to be patched and 
the call to be resumed. You can watch this in more detail by setting a 
breakpoint on __rt_dynlink and single stepping.

5.4.7 Suggested Further Exercises
---------------------------------
Library Location Mechanisms
---------------------------
Locating a library's EFT at 0x40000 is not very satisfactory, so an obvious 
exercise is to extend the dynamic linker to locate a library by looking for 
it. Try, for example, adding a header to the start of the library which 
contains:

     offset to the next loaded library or 0
     the total length of the library
     the offset to the external function table
     the string name of the library

Hint: when you link this area with the other library contents you have to 
ensure that it will precede all other areas in the library. Please refer to 
Area Placement and Sorting Rules starting on page 9 of the Reference Manual 
for further details.

Your dynamic linker could now search a list of libraries loaded at 0x40000 
onwards.
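
As a starting point, the header and the search might look something like 
the sketch below.  Everything in it is an assumption made for illustration: 
the struct layout, the fixed 16-byte name field, the choice to measure 
offsets in bytes from the start of the header, and the name find_eft.  A 
real dynamic linker would be written in ARM assembly language.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical header layout following the list above. */
struct lib_header {
    uint32_t next_offset;   /* bytes from this header to the next, or 0 */
    uint32_t total_length;  /* total length of the library in bytes */
    uint32_t eft_offset;    /* bytes from this header to the EFT */
    char     name[16];      /* NUL-terminated library name */
};

/* Walk the chain starting at first; return the EFT address of the
   library called name, or NULL if no loaded library matches. */
const unsigned char *find_eft(const struct lib_header *first, const char *name)
{
    const struct lib_header *h = first;

    assert(name != NULL);
    while (h != NULL) {
        if (strncmp(h->name, name, sizeof h->name) == 0)
            return (const unsigned char *)h + h->eft_offset;
        h = h->next_offset
            ? (const struct lib_header *)
                  ((const unsigned char *)h + h->next_offset)
            : NULL;
    }
    return NULL;
}
```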

Self-Loading Libraries
----------------------
You could extend the header mechanism described in the previous subsection 
so that a library could copy itself to the next free location above 0x40000. 
This would allow libraries to be loaded at 0x8000 and "executed" there. Of 
course, you would want your header to begin with a branch to the code which 
will copy the library from 0x8000 to its destination above 0x40000.

Multiple Shared Libraries
-------------------------
Once you have built location and loading mechanisms, you can build more than 
one shared library. Try making one of your own and linking a test program 
with the stubs of two or more libraries.

Inter-Library Calls
-------------------
Once you have multiple libraries working, you can try making one library 
call functions in another (but remember that if library A refers to library 
B then library B may not refer to library A). To do this you will have to 
make a reentrant stub for the library you wish to refer to and link this 
into the library making the reference.

5.4.8 Related Topics
--------------------
      Register Usage under the ARM Procedure Call Standard starting on 
      page 62