Re: Simple ARM code generation

Richard Earnshaw <Richard.Earnshaw@xxxxxxxxxxxxxxxxxxxxxxx> · Mon, 21 Aug 2006 19:22:34 +0100

On Sun, 20 Aug 2006 15:43:03 PDT, Steve Freeland wrote:
> 
> > From: Richard Earnshaw <Richard.Earnshaw@xxxxxxxxxxxxxxxxxxxxxxx>
> > On Mon, 14 Aug 2006 15:15:46 PDT, Steve Freeland wrote:
> > > Hello,
> > > 
> > > I'm attempting to use gcc (the WinARM build, see -v output below) to
> > > compile the following stub program:
> > > 
> > > int AEEMod_Load(IShell *pIShell, void *ph, IModule **ppMod)
> > > {
> > >     return ENOMEMORY; /* defined to "2" */
> > > }
> > > 
> > > The output disassembles to the following:
> > > 
> > > Disassembly of section .text:
> > > 
> > > 00000000 <AEEMod_Load>:
> > >    0:   e1a0c00d        mov     ip, sp
> > >    4:   e92dd800        stmdb   sp!, {fp, ip, lr, pc}
> > >    8:   e24cb004        sub     fp, ip, #4      ; 0x4
> > >    c:   e24dd00c        sub     sp, sp, #12     ; 0xc
> > >   10:   e50b0010        str     r0, [fp, #-16]
> > >   14:   e50b1014        str     r1, [fp, #-20]
> > >   18:   e50b2018        str     r2, [fp, #-24]
> > >   1c:   e3a03002        mov     r3, #2  ; 0x2
> > >   20:   e1a00003        mov     r0, r3
> > >   24:   e24bd00c        sub     sp, fp, #12     ; 0xc
> > >   28:   e89da800        ldmia   sp, {fp, sp, pc}
> > > 
> > > This code crashes the device I'm working with.
> > > 
> > > There seems to be a problem with the use of stmdb and ldmia to save and
> > > restore the register context to the stack.  The stmdb instruction saves 4
> > > registers, and the ldmia only restores 3 of them, one of which (sp) isn't
> > > in the original 4.  This trick seems to be common, but I don't understand
> > > how it works.  What order are the registers saved and loaded in?  As far
> > > as I can tell, at the end the pc register ends up with either the
> > > original fp or sp.
> > > 
> > > What needs to happen is for the original sp to be restored, and the
> > > original lr to be loaded into pc (the "return").
> > > 
> > > If I change the two last lines to the following:
> > > 
> > >   24:   e24bd004        sub     sp, fp, #4      ; 0x4
> > >   28:   e12fff1e        bx      lr
> > > 
> > > The code works correctly.  Can anyone explain this to me?
> 
> First of all, thanks for the response!
> 
> > Hmm, the code looks correct to me (note that although IP is saved, this is
> > just a copy of SP).
> 
> Aaah yes, I missed that.  Although the issue is tangential to my immediate
> problem, I'd really appreciate it if you could explain how that stmdb/ldmia
> trick works.  My understanding is that registers are saved in order (r0
> first, r15/pc last) and loaded in reverse order.  But that would mean the
> original value of r15/pc that gets saved onto the stack at 0x4 gets *loaded*
> right back into r15/pc at 0x28.  Obviously that can't be right...  what am I
> missing?
> 

ldm and stm can normally be matched in pairs, so for example stmdb 
(D_ecrement B_efore) can be matched with ldmia (I_ncrement A_fter).  The 
registers in both instructions are always transferred in the same order, 
with the lowest numbered register at the lowest address in memory and the 
rest following at increasing addresses.  Nothing is loaded in reverse.
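
As a rough illustration of that ordering rule, the C sketch below works out 
where each register in a list ends up for a decrement-before store with 
writeback.  The names reglist_count and stmdb_slot are made up for this 
example, and the model is only plain address arithmetic, not anything the 
toolchain provides.

```c
#include <assert.h>
#include <stdint.h>

/* Count the registers named in a 16-bit register list (bit n = rn). */
static unsigned reglist_count(uint16_t mask)
{
    unsigned n = 0;
    for (unsigned r = 0; r < 16; r++)
        if (mask & (1u << r))
            n++;
    return n;
}

/* Address at which register `reg` is stored by "stmdb base!, {mask}".
   The block occupies [base - 4*count, base); the lowest-numbered
   register goes at the lowest address, whatever order the list was
   written in.  (Hypothetical helper, for illustration only.) */
static uint32_t stmdb_slot(uint32_t base, uint16_t mask, unsigned reg)
{
    assert(reg < 16 && (mask & (1u << reg)));
    uint32_t addr = base - 4u * reglist_count(mask);
    for (unsigned r = 0; r < reg; r++)
        if (mask & (1u << r))
            addr += 4;
    return addr;
}
```

For the list {fp, ip, lr, pc} (mask 0xd800, the low half of the e92dd800 
encoding) and a base of 0x1000, this puts fp at 0xff0, ip at 0xff4, lr at 
0xff8 and pc at 0xffc.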

To make things a bit easier when talking about stacks, you can also 
describe the stack layout in the instruction itself, by writing your ldm 
and stm instructions with the stack mnemonics.  The most common layout is 
a 'full-descending' stack (the stack grows by moving to a lower address 
and the bottom, addressed, word contains data -- it's full).  So

	stmfd sp!, {r4, r5, r6}
	ldmfd sp!, {r4, r5, r6}

(which are just other names for stmdb and ldmia) will push r4-r6 onto the 
stack and then pop them off again.

In fact, the above idiom is so common on ARM processors that in Thumb mode 
the equivalent instructions are known literally as push and pop[1], and 
for the above you would simply write

	push {r4, r5, r6}
	pop  {r4, r5, r6}

Now, going back to your original example, you will see that the compiler 
pushes 4 words onto the stack at the start of the function, but at the end 
it only pops 3 words off.  How does this work and not leave the stack 
corrupted?

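The answer is that GCC is also saving the stack pointer on the stack, in 
the form of IP (which we copied from SP before we started messing with the 
stack at all).  Just before the final ldmia, SP is set to FP - 12, which 
is the address of the lowest of the four saved words.  The ldmia then 
loads the saved FP into FP, the saved IP (the original SP) into SP, and 
the saved LR into PC.  The saved PC in the fourth word is simply never 
read back, so the original value of the stack pointer is restored directly 
and the function returns in the same instruction.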
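
If it helps to see this traced through, the following C sketch models the 
prologue and epilogue from your listing on a toy machine.  Everything in 
it (the Cpu struct, stmdb_wb, ldmia, call_and_return, the 64-word stack) 
is hypothetical scaffolding for illustration; it is not how the hardware 
or GCC is implemented, and the value stored for PC is just whatever the 
model holds in r15.

```c
#include <assert.h>
#include <stdint.h>

enum { FP = 11, IP = 12, SP = 13, LR = 14, PC = 15 };

/* Toy machine: 16 registers and 64 words of memory at byte address 0. */
typedef struct {
    uint32_t r[16];
    uint32_t mem[64];
} Cpu;

/* stmdb base!, {mask}: store below base, lowest register at the lowest
   address, then write the new (lower) address back to base. */
static void stmdb_wb(Cpu *c, unsigned base, uint16_t mask)
{
    unsigned n = 0;
    for (unsigned i = 0; i < 16; i++)
        n += (mask >> i) & 1u;
    uint32_t addr = c->r[base] - 4u * n;
    uint32_t new_base = addr;
    assert(addr / 4 + n <= 64);          /* stay inside the toy memory */
    for (unsigned i = 0; i < 16; i++)
        if (mask & (1u << i)) {
            c->mem[addr / 4] = c->r[i];
            addr += 4;
        }
    c->r[base] = new_base;
}

/* ldmia base, {mask}: load upwards from base, no writeback. */
static void ldmia(Cpu *c, unsigned base, uint16_t mask)
{
    uint32_t addr = c->r[base];          /* latched before any load */
    for (unsigned i = 0; i < 16; i++)
        if (mask & (1u << i)) {
            assert(addr / 4 < 64);
            c->r[i] = c->mem[addr / 4];
            addr += 4;
        }
}

/* The prologue and epilogue of AEEMod_Load, one line per instruction
   (the three argument stores are omitted as they don't affect sp). */
static void call_and_return(Cpu *c)
{
    c->r[IP] = c->r[SP];           /* mov   ip, sp                 */
    stmdb_wb(c, SP, 0xd800);       /* stmdb sp!, {fp, ip, lr, pc}  */
    c->r[FP] = c->r[IP] - 4;       /* sub   fp, ip, #4             */
    c->r[SP] -= 12;                /* sub   sp, sp, #12            */
    c->r[0] = 2;                   /* mov   r0, #2 (via r3)        */
    c->r[SP] = c->r[FP] - 12;      /* sub   sp, fp, #12            */
    ldmia(c, SP, 0xa800);          /* ldmia sp, {fp, sp, pc}       */
}
```

Starting with sp = 0x80, fp = 0x1234 and lr = 0x8000, the model finishes 
with sp and fp back at their original values and pc holding the old lr, 
which is exactly the return you were asking for.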

Another question you are probably asking is 'why do we push all those 
registers which are never needed?'  The answer to that one is twofold:
1) The compiler is setting up a stack frame.  These are useful when trying 
to debug your code, particularly on older debuggers.  The stack frame 
allows a debugger to print out the call back-trace if things go wrong.  
With modern debug technology and debug descriptions such as DWARF a lot of 
this information is no longer strictly necessary and the debugger can work 
out what happened with much less help from the compiled code -- and indeed 
it will do better if you address point 2 below ;-)
2) You didn't turn the optimizer on.  Use -O or -O2 (or even -Os) and the 
compiler will simplify the generated code significantly.

> Also, this:
> http://en.wikibooks.org/wiki/ARM/Programmer%27s_Model#Program_Counter
> claims that r15/pc can't always be manipulated like any other register.
> Is that correct?  If so, is using ldmia into r15/pc always ok?
> 

R15 is the program counter, and it's true that you can't treat it entirely 
like a normal register, but you can load and store its value; you can copy 
it to other registers; and you can use it in some simple addressing 
operations (either to generate an address value in a register, or directly 
in a pc-relative load instruction).  For example, it's perfectly 
acceptable to write

	ldr	r0, [pc, #32]

and this will load the value that is 40 bytes beyond the address of the 
current instruction (the PC always reads 8 higher than the current 
address).  However, the above is somewhat hard to remember, so the 
assembler will normally allow you to write

	ldr	r0, L32
	...
L32:
	.word	0x12345678

and it will work out the offset for you (provided your label is within 4KB 
of the instruction).
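
The arithmetic the assembler does here can be sketched in C as below.  The 
function name pc_relative_offset is invented for the example, and the 
range check assumes the usual 12-bit immediate offset of the ARM-state ldr 
(up to 4095 bytes in either direction).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* In ARM state the PC reads as the instruction's own address plus 8. */
#define ARM_PC_READ_AHEAD 8

/* Work out the immediate for "ldr rX, [pc, #imm]" so that an
   instruction at address `insn` reaches the word at `target`.
   Returns false if the target is out of range.  (Hypothetical helper,
   a sketch of the calculation rather than the assembler's code.) */
static bool pc_relative_offset(uint32_t insn, uint32_t target, int32_t *imm)
{
    assert(imm != 0);
    int64_t off = (int64_t)target - ((int64_t)insn + ARM_PC_READ_AHEAD);
    if (off < -4095 || off > 4095)
        return false;
    *imm = (int32_t)off;
    return true;
}
```

For an instruction at 0x100 and a literal at 0x128 (40 bytes further on) 
this yields an immediate of 32, matching the hand-written form above.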

> > Let me take a guess.  You are using something like an ARM920 (or an
> > ARM7TDMI) device, and you are calling this function from Thumb code.  If
> > so, then you need to compile your function with -mthumb-interwork, then it
> > will generate a return sequence that switches correctly back to Thumb.
> 
> You're correct that it's an ARM7TDMI device.  The code which calls my code
> is in firmware, so I'm not entirely sure whether it's Thumb or not...  If
> you're right, then presumably the calling procedure puts the appropriate
> flag in the LSB of r14/lr and expects the procedure being called to use bx
> and not ldmia to return.  Is it the case, then, that the switch to or from
> Thumb mode can't be done by modifying pc with ldmia?  If so, that would
> certainly explain the crash.  I'll try that as soon as I get the chance.

The ARM7TDMI is an implementation of version 4T of the ARM architecture, 
often termed ARMv4T.  This was the first revision of the architecture to 
support Thumb, and support for building your application as a mixture of 
both ARM and Thumb code (termed interworking) was fairly limited: the only 
way you could switch states was by using the BX instruction.  The next 
revision of the architecture added support for state switching to ldr and 
ldm instructions as well, which makes interworking much more efficient.  
If you compile your original example with -mthumb-interwork you'll 
probably see a return sequence something like

   24:   e24bd00c        sub     sp, fp, #12     ; 0xc
   28:   e89da800        ldmia   sp, {fp, sp, lr}
   2c:   e12fff1e        bx      lr
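
To see why the two return sequences behave differently when the caller is 
Thumb code, here is a small C model of where execution resumes.  The type 
and function names are invented for the example.  The BX behaviour (bit 0 
of the target selects the state) is as described above; the masking of 
the low two bits in the ldm case is an assumption about ARMv4T that you 
should check against the architecture manual.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t pc;     /* address at which execution continues */
    bool     thumb;  /* true if the core ends up in Thumb state */
} ExecState;

/* bx lr: bit 0 of the target selects the state and is then cleared. */
static ExecState return_with_bx(uint32_t lr)
{
    ExecState s = { lr & ~UINT32_C(1), (lr & 1u) != 0 };
    /* An ARM-state target is expected to be word aligned. */
    assert(s.thumb || (s.pc & 2u) == 0);
    return s;
}

/* ldm ..., {..., pc} run in ARM state on ARMv4T: the load cannot
   change state, so the core stays in ARM state.  Assumption: the low
   two bits of the loaded value are simply dropped. */
static ExecState return_with_ldm_v4t(uint32_t saved_lr)
{
    ExecState s = { saved_lr & ~UINT32_C(3), false };
    return s;
}
```

With a Thumb caller whose return address is 0x8001, return_with_bx ends up 
at 0x8000 in Thumb state, whereas return_with_ldm_v4t ends up at the same 
address still in ARM state, so the core would start decoding Thumb 
instructions as ARM ones.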

> 
> Thanks again, and it'd be great if you could sort me out with respect to the 
> stmdb/ldmia trick.
> 

Anyway, I've probably bamboozled you with more than enough information by 
now, so I'd better stop.  Hope the above helps,

R.

[1] The latest version of the ARM state assembly syntax has introduced 
this idiom too, but you'll need a very recent version of the GNU assembler 
to use it, and GCC still uses the old syntax at present.