Hello, Lots of questions are being ask that can probably better answer if I try to explain what I am doing and why I am doing it (believe it or not, there is method in my madness!) BTW, I am using arm-elf-gcc (GCC) 4.0.2, that was in my original post, but got deleted afterward, sorry about that So, let me try to answer the main question. I need to create an interpreter/simulator for an old (wired) CPU (with 1024 instructions) that will reside in a small embedded system. Therefore, memory (executable size) is an issue as well as speed. My first version was of course a large switch case Switch (*pc++) { Case xxx: execute; continue; } But, this would cause 1 extra jump (3 cycles) for each loop plus extra testing at each loop (is the number to switch to too large?), all together, the 'loop and execute the next instruction' code was over 12 cycles. For comparison, the most used instructions when executing the bytecode (ie: virtual code) are jumps which take 2 cycle to emulate, so the overhead of the switch case/loop is extremely significant! So I tried to use table of jump locations (code in previous messages) in order to replace the poor code generated by the switch by 2 instructions (5 cycles ldrh instr, [instr_pc] #2 mov pc, [jump_table, instr, asl #2] which should provide a 60% speed increase on the most executed instructions! After all this is the exact reason why table of labels were introduced in gcc (see help files!). But, the compiler if fighting me, not liking the jump in the inline code (it basically does not see the jumps and optimizes out all the code! Replacing the assembly by a goto *jump[pc]; does help a bit, but the code generated is not optimal (and makes the whole program too large to fit in memory!). (because it loads the jump address in a register first and then moves it in PC instead of loading directly in PC. The 2nd problem is the fact that one of the most used variable (the jump table address) is moved on the stack instead of being kept in a register. Probably because the optimizer uses the register for some other local optimizations (knowing my luck for instructions that are pretty much never emulated) at the expanse of a much more effective global optimization. The best way for me to solve this problem (which is due to the fact that I am doing definitely non-standard code) would be to allow me to specify to the compiler where I want global optimization turned on or off... then I could let the compiler optimize local things, but would turn it off for the main loops (where I write my own code). So, is there any hope for me? If needed I can provide the full code (in order to simplify things, I have only put an example that would show the problem in my messages, the real code being 1000 lines long). The main different is that with the real code the jump table address is put on the stack while with the example, it is not... Thanks for your help, Cyrille -----Original Message----- From: Richard Earnshaw [mailto:rearnsha@xxxxxxx] Sent: 08 December 2006 11:12 To: Andrew Haley Cc: de Brebisson, Cyrille (Calculator Division); gcc-help@xxxxxxxxxxx Subject: RE: Help needed: Optimization of bytecode interpreter for ARMpaltform On Fri, 2006-12-08 at 17:21 +0000, Andrew Haley wrote: > de Brebisson, Cyrille (Calculator Division) writes: > > > [snip] trying to re-code, using inline assembly goto *jump[*progc++] > > I used inline assembly to do: > > Ldrh instr, [progc], #2 // note that in most cases, there is an > > // extra instruction here that allows to > > // cancel the waitstate caused by the use > > // of register instr on the next > > instruction > > ldr pc, [jump, instr, asl #2] > > > > because the compiler generates the highly unoptimized (and too large for > > the memory in my device) > > ldrh r1, [r4], #2 > > ldr r8, .L2691+4 > > ldr fp, [r8, r1, asl #2] > > mov pc, fp @ indirect register jump > > [/snip] > > > > >This is the crucial mistake: you can't jump out of an inline asm. > > > > So, how can I optimize my code? Is there a way to force the compiler to > > 1: put a variable in a register? As the asm ("register"); constraint > > does not seem to do a lot of forcing > > Definitely: if declaring a global register variable doesn't work, > that's a bug. What exactly did you try? > > > 2: get the compiler to condense the last 2 instructions in 1? > > I'm not sure why gcc generates that sequence. Forwarding to Richard > Earnshaw for comment. First of all, you don't mention which version of the compiler you are using, so it's hard to know precisely why you get the code you do. GCC-4.1 is used in my example below. Trying to second guess the compiler is rarely profitable, but it's not clear to me why the address of the jump table is not being hoisted out of the loop. There is a hack that will effectively force this in this instance. By loading a global variable (or you could pass it in as an additional parameter such that it is always zero), we force the address calculation into a local variable that the compiler can't (easily) optimize away. For the following test-case: int offset = 0; void runprog(unsigned short *prog, int count) { __label__ code0, code1, code2, code3; static const void* const jump[4] = { &&code0, &&code1, &&code2, &&code3 }; const void* const* interp = jump+offset; while (count--) { goto *interp[*prog++]; code0: foo(); continue; code1: bar(); continue; code2: wibble(); continue; code3: wombat(); break; } } The critical part of the loop then compiles to: ldrh r3, [r5], #2 ldr pc, [r6, r3, asl #2] @ indirect memory jump which looks fine to me. Note, however, that if your 'switch' statement is large, then you'll quite probably get spilling of variables. The value of interp is higly likely to be a candidate here because it's used exactly once per iteration, so you'll then be back to where you started. I'm somewhat confused as to why you haven't just used a switch table for this, though. The equivalent code: void runprog(unsigned short *prog, int count) { while (count--) { switch(*prog++) { case 0: foo(); continue; case 1: bar(); continue; case 2: wibble(); continue; case 3: wombat(); goto done; } } done: ; } is much easier to understand and much more ammenable to the standard optimizer framework. R.