> Jay Groven <grovenjl@xxxxxxxxxxxxxxx> writes:
>
>> All of the SSE functions listed in the gcc manual return a value, i.e.
>> we have
>>
>>   v4sf __builtin_ia32_mulps (v4sf, v4sf)
>>
>> which returns the component-wise product of the two given vectors.
>> However, the actual sse instruction mulps is accumulator-based. This
>> seems to make gcc use quite a few temp registers when you call mulps,
>> since it's trying to give a return value for an instruction that
>> doesn't work that way.
>
> The register allocator should do this for you. Do you have concrete
> code where this doesn't work?
>
> --
> Falk
>

Specifically, I'm using the following dot product function (grabbed this
off opengl.org, and I don't really understand it too well yet, but it
seems to work):

    inline void dot(float *a, float *b, float *dot)
    {
        v4sf r;
        r = __builtin_ia32_mulps(__builtin_ia32_loadaps(a),
                                 __builtin_ia32_loadaps(b));
        r = __builtin_ia32_addps(__builtin_ia32_movhlps(r, r), r);
        r = __builtin_ia32_addss(__builtin_ia32_shufps(r, r, 1), r);
        __builtin_ia32_storeups(dot, r);
    }

If I only compile using gcc -msse -S dotprod.c, the function looks like:

    dot:
        pushl   %ebp
        movl    %esp, %ebp
        subl    $24, %esp
        movl    8(%ebp), %eax
        movaps  (%eax), %xmm1
        movl    12(%ebp), %eax
        movaps  (%eax), %xmm0
        mulps   %xmm0, %xmm1
        movaps  %xmm1, %xmm0
        movaps  %xmm0, -24(%ebp)
        movaps  -24(%ebp), %xmm1
        movaps  -24(%ebp), %xmm0
        movhlps %xmm0, %xmm1
        movaps  %xmm1, %xmm0
        addps   -24(%ebp), %xmm0
        movaps  %xmm0, -24(%ebp)
        movaps  -24(%ebp), %xmm0
        shufps  $1, -24(%ebp), %xmm0
        addss   -24(%ebp), %xmm0
        movaps  %xmm0, -24(%ebp)
        movl    16(%ebp), %eax
        movaps  -24(%ebp), %xmm0
        movups  %xmm0, (%eax)
        leave
        ret

It looks to me like the compiler is only using xmm0 and xmm1, and it's
doing quite a bit of unnecessary shuffling between them to make things
work out.

If I do gcc -O3 -S -msse dotprod.c, I get:

    dot:
        pushl   %ebp
        movl    %esp, %ebp
        movl    8(%ebp), %eax
        movaps  (%eax), %xmm3
        movl    12(%ebp), %eax
        movaps  (%eax), %xmm0
        movl    16(%ebp), %eax
        mulps   %xmm0, %xmm3
        movaps  %xmm3, %xmm2
        movhlps %xmm3, %xmm2
        addps   %xmm3, %xmm2
        movaps  %xmm2, %xmm1
        shufps  $1, %xmm2, %xmm2
        addss   %xmm1, %xmm2
        movups  %xmm2, (%eax)
        popl    %ebp
        ret

This is a lot shorter, and it appears to be doing what one would expect;
no registers are being shuffled, and they're all being used. So,
obviously one could just use optimization, but I would think that gcc
would use all registers even without specifying an optimization level.
I'm also curious why the gcc team decided to make the builtin functions
return a value, rather than just working the way that the processor
works.
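
For comparison, here is a minimal sketch of the same reduction written
with the <xmmintrin.h> intrinsics instead of the __builtin_ia32_*
functions. It is an illustration under assumptions, not the code from
opengl.org: the name dot4 is hypothetical, the result is returned as a
float rather than stored through a pointer, and the alignment check is
my own addition because the aligned load requires 16-byte-aligned
inputs.

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics; build with gcc -msse */

/* dot4 is a hypothetical name: 4-element dot product of a and b.
   Both inputs must be 16-byte aligned because _mm_load_ps is used. */
static inline float dot4(const float *a, const float *b)
{
    assert(((uintptr_t)a & 15) == 0 && ((uintptr_t)b & 15) == 0);

    /* r = (a0*b0, a1*b1, a2*b2, a3*b3) */
    __m128 r = _mm_mul_ps(_mm_load_ps(a), _mm_load_ps(b));

    /* movhlps copies lanes 2,3 into lanes 0,1, so after the add
       lane 0 = r0 + r2 and lane 1 = r1 + r3. */
    r = _mm_add_ps(_mm_movehl_ps(r, r), r);

    /* The shuffle with immediate 1 puts lane 1 into lane 0; addss then
       adds only the low lanes, leaving the full sum in lane 0. */
    r = _mm_add_ss(_mm_shuffle_ps(r, r, 1), r);

    float result;
    _mm_store_ss(&result, r);  /* write only the low lane */
    return result;
}
```

Calling dot4 on the aligned arrays {1, 2, 3, 4} and {5, 6, 7, 8} gives
70. Note one difference from the function above: this sketch writes a
single float, whereas the storeups in the posted version writes all four
lanes (16 bytes), so its dot argument has to point at room for four
floats.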