Origins of the SWP instruction

From: john@acorn.co.uk (John Bowler)
Subject: Re: Multiprocessing Archimedes??
Date: 16 Aug 91 11:10:50 GMT

torq@GNU.AI.MIT.EDU (Andrew Mell) writes:
>I notice that the Arm3 has a new instruction over the Arm2 which is
>SWP. It swaps a byte or a word between register and external memory.
>(uninterruptible between the read and write)
  ^^^^^^^^^^^^^^^
Indeed, but not necessarily not interleavable with other memory operations
(sorry about the double negative :-).  In particular, to fully support the
SWP on a system with multiple memory bus masters the memory control logic
which decides which bus master has access to the memory next would have to
force an interlock between the memory read and memory write of the SWP
instruction.  Now, the ARM3 has a LOCK pin for this, but to support
multi-processors you need to connect it to something :-).

>All very interesting you might say, but it intrigues me as this sort
>of instruction is usually only used in multiprocessor systems as a
>software semaphore.
>
>Why did Acorn add this instruction to the Arm3?

Because a long time ago, when we were very young (;-) we tried to write a
multi-threaded OS (ARX) and we ``found'' (sic, thought) that it was
spending a lot of time going into supervisor mode and disabling interrupts
so that it could implement mutexes (for user mode code - including the OS,
which ran in user mode too).
In theory SWP allows user code to implement mutexes efficiently.

As far as I am concerned the MP aspects of SWP are bonuses (clearly these
were considered at the same time - or the LOCK pin wouldn't be there).
Notice that SWP always bypasses the cache; again this is MP support, however
there is an omission here in that it is impossible to do a (reliable) read
from external memory (you might get the cache contents instead!)

John Bowler (jbowler@acorn.co.uk)

From: john@acorn.co.uk (John Bowler)
Subject: Re: Multiprocessing Archimedes??
Date: 19 Aug 91 16:25:33 GMT

julian@bridge.welly.gen.nz writes:
>john@acorn.co.uk (John Bowler) writes:
>
>> Notice that SWP always bypasses the cache; again this is MP support, however
>> there is an omission here in that it is impossible to do a (reliable) read
>> from external memory (you might get the cache contents instead!)
>
>If you're using it to implement semaphores, this is not a problem, as you'd
>never need to access the semaphore with any instruction other than SWP.

Yes; there is no problem with the semaphore, but the semaphore must be
protecting some state which is shared.  When a processor has claimed that
semaphore it probably needs to read the state and to obtain consistent
results when it reads it.  If the data is in cacheable memory the only way
it can do that is to use sequences of the form:-

             SWP rx, rx, [raddr]         ; read a value out
             STR rx, [raddr]             ; and put it back... :-(

The alternative is to allocate shared data in uncacheable memory.  This
requires some OS intervention (a user program cannot simply allocate
shareable data structures out of its own heap unless the whole heap is
uncacheable) and uncacheable data obviously has a performance hit.

>BTW. You wouldn't happen to know the instruction format for SWP, by any
>     chance?
>     If a software emulator can be written for it for ARM2 machines
>     (like the FPE - or even add it to the FPE) then we can all start using
>     it.

RISC iX 1.2 emulates the SWP instruction on machines which do not support
it.  RISC OS doesn't.  The assembler syntax is:-

         SWP{cond}{B}      Rd, Rm, [Rn]

the semantics (except for the cache behaviour and so on) are:-

         MOV              <temp>, Rm
         LDR{cond}{B}     Rd, [Rn]
         STR{cond}{B}     <temp>, [Rn]

(ie the SWP Rx, Rx, [Raddr] example above *does* store the *old* Rx value
in [Raddr]... :-).

The instruction format is:-

     bit 31                                                        bit 0
       c.o.n.d.0.0.0.1 0.B.0.0.n.n.n.n d.d.d.d.0.0.0.0 1.0.0.1.m.m.m.m

       c.o.n.d - the condition
       B       - 0 = swap word
                 1 = swap byte
       n.n.n.n - Rn
       d.d.d.d - Rd
       m.m.m.m - Rm

Data aborts (from the memory manager) leave Rd/Rm as they were before.
SWP bypasses the ARM3 cache, although the write operation still updates
the cache (if the address is cached).  I don't know whether the read
will cause the rest of that part of the cache to be updated (I assume
not, and the programmer should not care :-)

John Bowler (jbowler@acorn.co.uk)

From: dseal@armltd.co.uk (David Seal)
Subject: Re: ARM3 instructions.
Date: 4 Sep 92 15:01:12 GMT

In article <4422@gos.ukc.ac.uk> amsh1@ukc.ac.uk (Brian May#2) writes:

>  I don't have an Archie myself but have used them quite a lot in the past.
>I was recently mucking about with a friend's A5000, trying to find the new
>instructions that turned the cache on and off. I found them, they were
>co-processor instructions with the processor itself as (I think) number 0.

Coprocessor 15, in fact.

>  Anyway, as I was disassembling away I found a new instruction (well, I had
>never come across it before). It was 'SWP' and I imagine it swaps registers
>with registers, maybe with memory as well? I can't remember.
>If it does reg<->mem as well, and is uninterruptible, perhaps it is for
>use as a semaphore in multi-processor systems?

The SWP instruction was new to the ARM2as macrocell. I believe ARM3 was the
first full chip which contained it. More recent macrocells and chips like
ARM6, ARM60, ARM600 and ARM610 also contain it.

It only swaps a register with a memory location (either a byte or a word),
and not two registers. It can however read the new contents of the memory
location from one register, and write the old contents of the memory
location to another register - i.e. it doesn't have to do a pure swap. This
may be the source of your idea that it can swap two registers. It is indeed
uninterruptible, and yes, it is intended for semaphores.

>  Of course I won't be the first person to notice this so I wondered, could
>someone post some info on this, and also on the co-processor instructions
>relevant to the CPU itself?

The SWP instruction:

  Bits 31..28: Usual condition field
  Bits 27..23: 00010
  Bit 22:      0 for a word swap, 1 for a byte swap
  Bits 21..20: 00
  Bits 19..16: Base register (addresses the memory location involved)
  Bits 15..12: Destination register (where the old memory contents go)
  Bits 11..4:  00001001
  Bits 3..0:   Source register (where the new memory contents come from)

  Byte swaps use the bottom byte of the source and destination registers,
  and clear the top three bytes of the destination register. There are
  various rules about how R15 works in each register position, similar to
  those for LDR and STR instructions. The destination and source registers
  are allowed to be the same for a pure swap. I don't know offhand what
  would happen if the base register were equal to one or both of the others,
  but I don't think I'd recommend doing it!

  Assembler syntax is (using <> around optional sections):

    SWP<cond><B> Rdest,Rsrc,[Rbase]

The ARM3 cache control registers are all coprocessor 15 registers, accessed
by MRC and MCR instructions in non-user modes.
(They will produce invalid operation traps in user mode.)

Coprocessor 15 register 0 is read only and identifies the chip - e.g.:

  Bits 31..24: &41 - designer code for ARM Ltd.
  Bits 23..16: &56 - manufacturer code for VLSI Technology Inc.
  Bits 15..8:  &03 - identifies chip as an ARM3.
  Bits 7..0:   &00 - revision of chip.

Coprocessor 15 register 1 is simply a write-sensitive location - writing any
value to it flushes the cache.

Coprocessor 15 register 2: a miscellaneous control register.

  Bit 0 turns the cache on (if 1) or off (if 0).
  Bit 1 determines whether user mode and non-user modes use the same address
    mapping. Bit 1 is 1 if they do, 0 if they have separate address
    mappings. It should be 1 for use with MEMC.
  Bit 2 is 0 for normal operation, 1 for a special "monitor mode" in which
    the processor is always run at memory speed and all addresses and data
    are put on the external pins, even if the memory request was satisfied
    by the cache. This allows external hardware like a logic analyser to
    trace the program properly.
  Other bits are reserved for future expansion. Code which is trying to set
    the whole control register (e.g. at system initialisation time) should
    write these bits as zeros to ensure compatibility with any such future
    expansions. Code which is just trying to change one or two bits (e.g.
    turn the cache on or off) should read this register, modify the bits
    concerned and write it back: this ensures that it won't have unexpected
    side effects in the future like turning as-yet-undefined features off.
  This register is reset to all zeros when the ARM3 is reset.

Coprocessor 15 register 3: controls whether areas of memory are cacheable,
    in 2 megabyte chunks. All accesses to an uncacheable area of memory go
    to the real memory and not to the cache - this is a suitable setting
    e.g. for areas containing memory-mapped IO, or for doubly mapped areas
    of memory.
  Bit 0 is 1 if virtual addresses &0000000-&01FFFFF are cacheable, 0 if they
    are not.
  Bit 1 is 1 if virtual addresses &0200000-&03FFFFF are cacheable, 0 if they
    are not.
  :
  :
  Bit 31 is 1 if virtual addresses &3E00000-&3FFFFFF are cacheable, 0 if
    they are not.

Coprocessor 15 register 4: controls whether areas of memory are updateable,
    in 2 megabyte chunks. All write accesses to a non-updateable area of
    memory go to the real memory only, not to the cache - this is a suitable
    setting for areas of memory that contain ROMs, for instance, since you
    don't want the cached values to be altered by an attempt to write to the
    ROM. (Or, as in MEMC, by an attempt to write to write-only locations
    that share an address with the read-only ROMs.)
  Bit 0 is 1 if virtual addresses &0000000-&01FFFFF are updateable, 0 if
    they are not.
  Bit 1 is 1 if virtual addresses &0200000-&03FFFFF are updateable, 0 if
    they are not.
  :
  :
  Bit 31 is 1 if virtual addresses &3E00000-&3FFFFFF are updateable, 0 if
    they are not.

Coprocessor 15 register 5: controls whether areas of memory are disruptive,
    in 2 megabyte chunks. Any write access to a disruptive area of memory
    will cause the cache to be flushed. This is a suitable setting for areas
    of memory which if written, could cause cache contents to become invalid
    in some way. E.g. on MEMC, writing to the physically addressed memory at
    addresses &2000000-&2FFFFFF will also usually change a virtually
    addressed location's contents: if this location is in cache, a
    subsequent attempt to read it would read the old value. To avoid this
    problem, the physically addressed memory should be marked as disruptive
    in a MEMC system. Similarly, any remapping of memory on a MEMC or other
    memory controller should act disruptively, since the cache contents are
    liable to have become invalid.
  Bit 0 is 1 if virtual addresses &0000000-&01FFFFF are disruptive, 0 if
    they are not.
  Bit 1 is 1 if virtual addresses &0200000-&03FFFFF are disruptive, 0 if
    they are not.
  :
  :
  Bit 31 is 1 if virtual addresses &3E00000-&3FFFFFF are disruptive, 0 if
    they are not.

Coprocessor 15 registers 3-5 are in an undefined state after power-up: they
must be programmed correctly before the cache is turned on.

Note that you should check the identity code in coprocessor 15 register 0
identifies the chip as an ARM3 before assuming that the other registers can
be used as stated above, unless you are absolutely certain your code can
only ever be run on an ARM3. Otherwise you are likely to run into problems
with other chips - e.g. an ARM600 uses the same coprocessor 15 registers to
control its cache and MMU, but in a completely different way. Just about the
only thing they do have in common is that coprocessor 15 register 0 contains
an identification code as described above.

David Seal
dseal@armltd.co.uk

All opinions are mine only...

From: mhardy@acorn.co.uk (Michael Hardy)
Subject: Re: Risc-OS Documentation
Date: 15 Aug 91 09:45:14 GMT
Organization: Acorn Computers Ltd, Cambridge, England

ARM3 SUPPORT
============

Introduction and Overview
=========================

The ARM3Support module provides commands to control the use of the ARM3
processor's cache, where one is fitted to a machine. The module will
immediately kill itself if you try to run it on a machine that only has an
ARM2 processor fitted.

Summary of facilities
---------------------

* Commands are provided: one to configure whether or not the cache is
enabled at a power-on or reset, and the other to independently turn the
cache on or off.

There is also a SWI to turn the cache on or off. A further SWI forces the
cache to be flushed. Finally, there is also a set of SWIs that control how
various areas of memory interact with the cache.

The default setup is such that all RISC OS programs should run unchanged
with the ARM3's cache enabled.
Consequently, you are unlikely to need to use the SWIs (beyond, possibly,
turning the cache on or off).

Notes
-----

A few poorly-written programs may not work correctly with ARM3 processors,
because they make assumptions about processor timing or clock rates.

Finding out more
----------------

For more details of the ARM3 processor, see the Acorn RISC Machine family
Data Manual. VLSI Technology Inc. (1990) Prentice-Hall, Englewood Cliffs,
NJ, USA: ISBN 0-13-781618-9.

SWI Calls
=========

Cache_Control (SWI &280)
========================

Turns the cache on or off

On entry
--------
R0 = EOR mask
R1 = AND mask

On exit
-------
R0 = old state (0 => cacheing was disabled, 1 => cacheing was enabled)

Interrupts
----------
Interrupts are disabled
Fast interrupts are enabled

Processor mode
--------------
Processor is in SVC mode

Re-entrancy
-----------
Not defined

Use
---
This call turns the cache on or off. Bit 0 of the ARM3's control register 2
is altered by being masked with R1 and then exclusive ORd with R0: ie new
value = ((old value AND R1) XOR R0). Bit 1 of the control register is also
set, forcing the memory controller to use the same translation table for
both User and Supervisor Modes (as indeed the MEMC chip should). Other bits
of the control register are set to zero.

Related SWIs
------------
None

Related vectors
---------------
None

Cache_Cacheable (SWI &281)
==========================

Controls which areas of memory may be cached

On entry
--------
R0 = EOR mask
R1 = AND mask

On exit
-------
R0 = old value (bit n set => 2MBytes starting at n*2MBytes are cacheable)

Interrupts
----------
Interrupts are disabled
Fast interrupts are enabled

Processor mode
--------------
Processor is in SVC mode

Re-entrancy
-----------
Not defined

Use
---
This call controls which areas of memory may be cached (ie are cacheable).
The ARM3's control register 3 is altered by being masked with R1 and then
exclusive ORd with R0: ie new value = ((old value AND R1) XOR R0).
If bit n of the control register is set, the 2MBytes starting at n*2MBytes
are cacheable.

The default value stored is &FC007FFF, so ROM, the RAM disc and logical
non-screen RAM are cacheable, but I/O space, physical memory and logical
screen memory are not.

(You may find a value of &FC007CFF - which disables cacheing the RAM disc -
gives better performance.)

Related SWIs
------------
Cache_Updateable (SWI &282), Cache_Disruptive (SWI &283)

Related vectors
---------------
None

Cache_Updateable (SWI &282)
===========================

Controls which areas of memory will be automatically updated in the cache

On entry
--------
R0 = EOR mask
R1 = AND mask

On exit
-------
R0 = old value (bit n set => 2MBytes starting at n*2MBytes are updateable)

Interrupts
----------
Interrupts are disabled
Fast interrupts are enabled

Processor mode
--------------
Processor is in SVC mode

Re-entrancy
-----------
Not defined

Use
---
This call controls which areas of memory will be automatically updated in
the cache when the processor writes to that area (ie are updateable). The
ARM3's control register 4 is altered by being masked with R1 and then
exclusive ORd with R0: ie new value = ((old value AND R1) XOR R0).
If bit n of the control register is set, the 2MBytes starting at n*2MBytes
are updateable.

The default value stored is &00007FFF, so logical non-screen RAM is
updateable, but ROM/CAM/DAG, I/O space, physical memory and logical screen
memory are not.

Related SWIs
------------
Cache_Cacheable (SWI &281), Cache_Disruptive (SWI &283)

Related vectors
---------------
None

Cache_Disruptive (SWI &283)
===========================

Controls which areas of memory cause automatic flushing of the cache on a
write

On entry
--------
R0 = EOR mask
R1 = AND mask

On exit
-------
R0 = old value (bit n set => 2MBytes starting at n*2MBytes are disruptive)

Interrupts
----------
Interrupts are disabled
Fast interrupts are enabled

Processor mode
--------------
Processor is in SVC mode

Re-entrancy
-----------
Not defined

Use
---
This call controls which areas of memory cause automatic flushing of the
cache when the processor writes to that area (ie are disruptive). The
ARM3's control register 5 is altered by being masked with R1 and then
exclusive ORd with R0: ie new value = ((old value AND R1) XOR R0). If bit n
of the control register is set, the 2MBytes starting at n*2MBytes are
disruptive.

The default value stored is &F0000000, so the CAM map is disruptive, but
ROM/DAG, I/O space, physical memory and logical memory are not.
This causes automatic flushing whenever MEMC's page mapping is altered,
which allows programs written for the ARM2 (including RISC OS itself) to
run unaltered, but at the expense of unnecessary flushing on page swaps.

Related SWIs
------------
Cache_Cacheable (SWI &281), Cache_Updateable (SWI &282)

Related vectors
---------------
None

Cache_Flush (SWI &284)
======================

Flushes the cache

On entry
--------
-

On exit
-------
-

Interrupts
----------
Interrupts are disabled
Fast interrupts are enabled

Processor mode
--------------
Processor is in SVC mode

Re-entrancy
-----------
Not defined

Use
---
This call flushes the cache by writing to the ARM3's control register 1.

Related SWIs
------------
None

Related vectors
---------------
None

* Commands
==========

*Cache
======

Turns the cache on or off, or gives the cache's current state

Syntax
------
*Cache [On|Off]

Parameters
----------
On or Off

Use
---
*Cache turns the cache on or off. With no parameter, it gives the cache's
current state.

Example
-------
*Cache Off

Related commands
----------------
*Configure Cache

Related SWIs
------------
Cache_Control (SWI &280)

Related vectors
---------------
None

*Configure Cache
================

Sets the configured cache state to be on or off

Syntax
------
*Configure Cache On|Off

Parameters
----------
On or Off

Use
---
*Configure Cache sets the configured cache state to be on or off.

Example
-------
*Configure Cache On

Related commands
----------------
*Cache

Related SWIs
------------
Cache_Control (SWI &280)

Related vectors
---------------
None

******************************************************************************

I hope this helps.

- Michael J Hardy           Email:      mhardy@acorn.co.uk
  Acorn Computers Ltd       Telephone:  +44 223 214411
  Cambridge TechnoPark      Fax:        +44 223 214382
  645 Newmarket Road        Telex:      81152 ACNNMR G
  Cambridge CB5 8PB
  England                   Disclaimer: All opinions are my own, not Acorn's

From: osmith@acorn.co.uk (Owen Smith)
Subject: Re: Risc-OS Documentation
Date: 13 Aug 91 15:06:19 GMT

The ARM3 SWIs really aren't all that interesting, and I've just totally
failed to find a documentation file for them. However, as a tester, here
is a bit of BASIC (courtesy of Brian Brunswick) which marks the RAM disk
area as not cacheable. This in fact makes it go faster.

SYS "Cache_Cacheable", 0, &fffffcff
SYS "Cache_Updateable", 0, &fffffcff

The reason it goes faster is that because such large amounts of data are
being slurped around, the memory copy loop tends to get flushed out of
the cache, particularly since it is a long piece of loop unrolled code
(for speed on an ARM2). So you end up with a cache full of data, very little
of which is ever accessed again before it gets flushed out of the cache by
some more data. The loop does an LDM and STM 10 registers at a time in
RamFS, so in theory there are two words that get cached (ARM3 reads 4 words
at a time), but this saving is swallowed up by the cache synchronisation
delays.

You have to be careful though. Brian has his own re-sizing ram disk
which uses the system sprite area. Marking the system sprite area as not
cacheable makes it go slower. We (Brian and I) think this is because he
uses the C function memcpy(), in which the LDM and STM is 4 registers
at a time. Since this is a multiple of four, it hits the ARM bug where
it loads 5 words and then throws the fifth one away, which results in
loading 8 words on an ARM3 (it always reads 4 word chunks even with the
cache off). So with the cache off, you load 8 then throw 4 away, load the
next 8 (including the 4 you just threw away) and throw 4 away etc. So
you are effectively reading all the data twice. With the cache on this
goes down to once. Yes the code will probably get flushed out, but it
is a tight loop (not unrolled) so it is not very likely and the cost of
reloading the code is less than the saving on the data loads.

The moral of this is to be careful with the ARM3 SWIs, and don't just
think that it ought to go faster, do timings, in lots of different screen
modes.

Owen.
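
The mask arithmetic behind the two SYS calls above can be checked with a
short Python sketch. It is an illustration only and not part of the original
posts: the helper names apply_masks and chunks_set are made up here, and the
starting value is assumed to be the &FC007FFF default quoted in the module
documentation (a real machine may hold something else).

```python
MASK32 = 0xFFFFFFFF
CHUNK = 2 * 1024 * 1024  # each register bit covers one 2 MByte chunk


def apply_masks(old, eor_mask, and_mask):
    """Hypothetical helper: new value = ((old value AND R1) XOR R0)."""
    return ((old & and_mask) ^ eor_mask) & MASK32


def chunks_set(value):
    """Hypothetical helper: (start, end) addresses for each bit set in value."""
    return [(n * CHUNK, (n + 1) * CHUNK - 1)
            for n in range(32) if (value >> n) & 1]


# Assumed starting point: the documented default for Cache_Cacheable.
default_cacheable = 0xFC007FFF

# SYS "Cache_Cacheable", 0, &fffffcff  ->  R0 (EOR mask) = 0, R1 (AND mask) = &FFFFFCFF
new_cacheable = apply_masks(default_cacheable, 0, 0xFFFFFCFF)

# Bits that were set before the call and are clear after it.
cleared = chunks_set(default_cacheable & ~new_cacheable & MASK32)

print(f"new value: &{new_cacheable:08X}")
for start, end in cleared:
    print(f"no longer cacheable: &{start:07X}-&{end:07X}")
```

With R0 = 0 and R1 = &FFFFFCFF the call clears bits 8 and 9 and leaves every
other bit alone, so the assumed default becomes &FC007CFF, the value the
module documentation suggests for not cacheing the RAM disc. The two cleared
chunks cover &1000000-&13FFFFF.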

poppy@poppyfields.net