RE: Help needed: Optimization of bytecode interpreter for ARM platform

"de Brebisson, Cyrille (Calculator Division)" <cyrille@xxxxxx> · Fri, 8 Dec 2006 12:30:04 -0700

Hello,

Lots of questions are being asked that I can probably answer better if I
try to explain what I am doing and why I am doing it (believe it or not,
there is method in my madness!).

BTW, I am using arm-elf-gcc (GCC) 4.0.2. That was in my original post,
but got deleted afterward; sorry about that.

So, let me try to answer the main question.

I need to create an interpreter/simulator for an old (wired) CPU (with
1024 instructions) that will reside in a small embedded system.
Therefore, memory (executable size) is an issue as well as speed. 

My first version was, of course, a large switch statement:
switch (*pc++)
{
  case xxx: execute; continue;
}

But this costs 1 extra jump (3 cycles) per loop iteration plus an extra
range test per iteration (is the number to switch on too large?). All
together, the 'loop and execute the next instruction' code was over 12
cycles. For comparison, the most used instructions when executing the
bytecode (i.e. virtual code) are jumps, which take 2 cycles to emulate,
so the overhead of the switch/loop is extremely significant!

So I tried to use a table of jump locations (code in previous messages)
in order to replace the poor code generated by the switch with 2
instructions (5 cycles):
  ldrh instr, [instr_pc], #2
  ldr pc, [jump_table, instr, asl #2]
which should provide a 60% speed increase on the most executed
instructions! After all, this is the exact reason why tables of labels
were introduced in gcc (see help files!).

But the compiler is fighting me: it does not like the jump in the inline
asm (it basically does not see the jumps and optimizes out all the
code!).

Replacing the assembly with a goto *jump[pc]; does help a bit, but the
code generated is not optimal (and makes the whole program too large to
fit in memory!) because it loads the jump address into a register first
and then moves it into PC instead of loading it directly into PC.

The 2nd problem is that one of the most used variables (the jump table
address) is moved onto the stack instead of being kept in a register,
probably because the optimizer uses the register for some other local
optimizations (knowing my luck, for instructions that are pretty much
never emulated) at the expense of a much more effective global
optimization.
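
One way to keep the table address out of the stack is GCC's global
register variable extension, sketched below. The choice of r6, the names
jump_base and run_pinned, and the two-opcode program are assumptions for
illustration. On targets other than ARM with GCC, the sketch falls back
to an ordinary static pointer so that it still compiles. Note that a
global register variable reserves the register only in translation units
that see the declaration; other code and libraries would need to be
built to leave it alone (for example with -ffixed-r6).

```c
#include <assert.h>
#include <stddef.h>

#if defined(__GNUC__) && !defined(__clang__) && \
    defined(__arm__) && !defined(__thumb__)
/* GCC global register variable: r6 holds the table address. */
register const void *const *jump_base asm("r6");
#else
/* Portable fallback so the sketch builds on other targets. */
static const void *const *jump_base;
#endif

/* Hypothetical 2-opcode program: 0 = increment, 1 = halt.
   Assumes every opcode in prog is 0 or 1 (no bounds check). */
int run_pinned(const unsigned short *prog)
{
    static const void *const table[] = { &&op_inc, &&op_halt };
    int acc = 0;

    assert(prog != NULL);
    jump_base = table;          /* load the table address once */

    goto *jump_base[*prog++];
op_inc:
    acc++;
    goto *jump_base[*prog++];
op_halt:
    return acc;
}
```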

The best way for me to solve this problem (which is due to the fact that
I am doing definitely non-standard code) would be to allow me to specify
to the compiler where I want global optimization turned on or off...
then I could let the compiler optimize local things, but would turn it
off for the main loops (where I write my own code).
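
For reference, later GCC releases (4.4 and newer, so not the 4.0.2 used
here) offer a per-function optimize attribute that comes close to this
wish. The sketch below is only an illustration under that assumption:
the macro DISPATCH_OPT and the function sum_opcodes are hypothetical,
the attribute is dropped on compilers that lack it, and the GCC manual
describes it as a debugging aid rather than something to rely on in
production code.

```c
#include <assert.h>

/* Use the attribute only where GCC is known to support it (4.4+). */
#if defined(__GNUC__) && !defined(__clang__) && \
    (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 4))
#define DISPATCH_OPT __attribute__((optimize("O1")))
#else
#define DISPATCH_OPT
#endif

/* Stand-in for a main loop that should be compiled at a lower
   optimization level than the rest of the file. */
DISPATCH_OPT int sum_opcodes(const unsigned short *prog, int count)
{
    int total = 0;

    assert(count >= 0);
    while (count--)
        total += *prog++;
    return total;
}
```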

So, is there any hope for me?

If needed, I can provide the full code (in order to simplify things, I
have only put an example that would show the problem in my messages, the
real code being 1000 lines long). The main difference is that with the
real code the jump table address is put on the stack, while with the
example it is not...

Thanks for your help, Cyrille

-----Original Message-----
From: Richard Earnshaw [mailto:rearnsha@xxxxxxx] 
Sent: 08 December 2006 11:12
To: Andrew Haley
Cc: de Brebisson, Cyrille (Calculator Division); gcc-help@xxxxxxxxxxx
Subject: RE: Help needed: Optimization of bytecode interpreter for
ARMpaltform

On Fri, 2006-12-08 at 17:21 +0000, Andrew Haley wrote:
> de Brebisson, Cyrille (Calculator Division) writes:
> 
>  > [snip] trying to re-code, using inline assembly goto *jump[*progc++]
>  > I used inline assembly to do:
>  > ldrh instr, [progc], #2       // note that in most cases, there is an
>  >                               // extra instruction here that allows to
>  >                               // cancel the waitstate caused by the use
>  >                               // of register instr on the next
>  >                               // instruction
>  > ldr pc, [jump, instr, asl #2]
>  > 
>  > because the compiler generates the highly unoptimized (and too large
>  > for the memory in my device)
>  > 	ldrh	r1, [r4], #2
>  > 	ldr	r8, .L2691+4
>  > 	ldr	fp, [r8, r1, asl #2]
>  > 	mov	pc, fp	@ indirect register jump
>  > [/snip]
>  > 
>  > >This is the crucial mistake: you can't jump out of an inline asm.
>  > 
>  > So, how can I optimize my code? Is there a way to force the compiler to
>  > 1: put a variable in a register? As the asm ("register"); constraint
>  > does not seem to do a lot of forcing
> 
> Definitely: if declaring a global register variable doesn't work,
> that's a bug.  What exactly did you try?
> 
>  > 2: get the compiler to condense the last 2 instructions in 1?
> 
> I'm not sure why gcc generates that sequence.  Forwarding to Richard
> Earnshaw for comment.

First of all, you don't mention which version of the compiler you are
using, so it's hard to know precisely why you get the code you do.
GCC-4.1 is used in my example below.

Trying to second guess the compiler is rarely profitable, but it's not
clear to me why the address of the jump table is not being hoisted out
of the loop.  There is a hack that will effectively force this in this
instance.  By loading a global variable (or you could pass it in as an
additional parameter such that it is always zero), we force the address
calculation into a local variable that the compiler can't (easily)
optimize away.  For the following test-case:

int offset = 0;

void runprog(unsigned short *prog, int count)
{
    __label__ code0, code1, code2, code3;
    static const void* const jump[4] = 
	{
	    &&code0, &&code1, &&code2, &&code3
	};
    const void* const* interp = jump+offset;

    while (count--)
	{
	    goto *interp[*prog++];
    code0:
	    foo();
	    continue;
    code1:
	    bar();
	    continue;
    code2:
	    wibble();
	    continue;
    code3:
	    wombat();
	    break;
	}
}

The critical part of the loop then compiles to:

        ldrh    r3, [r5], #2
        ldr     pc, [r6, r3, asl #2]    @ indirect memory jump

which looks fine to me.  Note, however, that if your 'switch' statement
is large, then you'll quite probably get spilling of variables.  The
value of interp is highly likely to be a candidate here because it's used
exactly once per iteration, so you'll then be back where you started.

I'm somewhat confused as to why you haven't just used a switch table for
this, though.  The equivalent code:

void runprog(unsigned short *prog, int count)
{
    while (count--)
	{
	    switch(*prog++)
		{
		case 0:
		    foo();
		    continue;
		case 1:
		    bar();
		    continue;
		case 2:
		    wibble();
		    continue;
		case 3:
		    wombat();
		    goto done;
		}
	}
 done:
    ;
}

is much easier to understand and much more amenable to the standard
optimizer framework.

R.