On Thu, Mar 13, 2008 at 1:56 AM, Ian Lance Taylor <iant@xxxxxxxxxx> wrote:
> Since you mention the number of registers you are using, note that
> that only matters if they are inputs or outputs.  If you need a
> temporary register, just pick one, and add it to the clobber list.  But
> if you really have that many inputs and outputs, then you are stuck.

I'm using inputs and outputs because I want the compiler to pick the
registers and I want to have named values.  The inline block looks
something like:

    asm (
        "... bunch-o-vmx code ..."
        : [rIn0] "=rv" (rIn0), [gIn0] "=rv" (gIn0), [bIn0] "=rv" (bIn0), ...
        : [rpix] "r" (rpix), [gpix] "r" (gpix), [bpix] "r" (bpix), ...
        : "memory"
    );

Writing this type of code using %0 %1 ... %n would be very painful and
unpleasant to maintain.

If gcc 4.2.x did a sane job of scheduling the C intrinsic version of
this code, I wouldn't need to use inline assembly.  I wrote the C
version in an order that shouldn't have any stalls.  But the compiler
reorders the code, and it takes the offset constants [values like
0, 16, 32, ... 112] and recomputes them inside the loop.  With all the
write/read stalls and extra addi instructions, the C intrinsic version
runs at >5 cycles per instruction, and overall the asm version is ~10x
faster.  Ouch.

--Clem
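
For illustration, here is a minimal sketch of the two ideas in this exchange: named operands such as %[a], for which the compiler picks the registers, and a hard-coded temporary register that appears only in the clobber list. The function scaled_sum and its operand names are hypothetical and are not taken from the VMX code discussed above. The PowerPC branch assumes GCC-style extended asm with r0 as the scratch register; the x86-64 branch and the plain C fallback are there only so the sketch builds on other machines.

```c
#include <assert.h>

/* Hypothetical helper (not from the original post): returns a + 16*b.
 * [sum], [a] and [b] are named operands, so the compiler chooses their
 * registers.  The scratch register is picked by hand and declared in
 * the clobber list, so it does not count as an input or an output. */
long scaled_sum(long a, long b)
{
    long sum;
#if defined(__GNUC__) && (defined(__powerpc__) || defined(__powerpc64__))
    __asm__ ("mulli 0,%[b],16\n\t"      /* r0 = b * 16 (scratch) */
             "add   %[sum],%[a],0"      /* sum = a + r0          */
             : [sum] "=r" (sum)
             : [a] "r" (a), [b] "r" (b)
             : "r0");                   /* temporary goes here   */
#elif defined(__GNUC__) && defined(__x86_64__) && defined(__LP64__)
    __asm__ ("mov %[b], %%rcx\n\t"      /* rcx = b (scratch)     */
             "shl $4, %%rcx\n\t"        /* rcx = b * 16          */
             "add %[a], %%rcx\n\t"      /* rcx = a + b * 16      */
             "mov %%rcx, %[sum]"        /* sum = rcx             */
             : [sum] "=r" (sum)
             : [a] "r" (a), [b] "r" (b)
             : "rcx", "cc");            /* temporary and flags   */
#else
    sum = a + 16 * b;                   /* portable fallback     */
#endif
    return sum;
}

/* Small self-check; aborts if any path computes the wrong value. */
void scaled_sum_selftest(void)
{
    assert(scaled_sum(1, 2) == 33);
    assert(scaled_sum(0, 0) == 0);
}
```

Because the output is written only by the last instruction, after both inputs have been read, no early-clobber modifier is needed here. A block that writes an output before it has finished reading its inputs would need "=&r" instead.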