Re: X86 Built-in Functions

grovenjl@xxxxxxxxxxxxxxx · Mon, 2 Feb 2004 19:42:16 -0500 (EST)

> Jay Groven <grovenjl@xxxxxxxxxxxxxxx> writes:
>
>> All of the SSE functions listed in the gcc manual return a value, i.e.
>> we have
>>
>> v4sf __builtin_ia32_mulps (v4sf, v4sf)
>>
>> which returns the component-wise product of the two given vectors.
>> However, the actual SSE instruction mulps is accumulator-based.  This
>> seems to make gcc use quite a few temp registers when you call mulps,
>> since it's trying to give a return value for an instruction that
>> doesn't work that way.
>
> The register allocator should do this for you. Do you have concrete
> code where this doesn't work?
>
> --
> 	Falk
>

Specifically, I'm using the following dot product function (grabbed this
off opengl.org, and I don't really understand it too well yet, but it
seems to work):

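inline void dot(float *a, float *b, float *dot)
{
        v4sf r;

        /* aligned loads (a and b must be 16-byte aligned), then a[i]*b[i] */
        r = __builtin_ia32_mulps(__builtin_ia32_loadaps(a),
                                 __builtin_ia32_loadaps(b));
        /* lanes 0 and 1 now hold r0+r2 and r1+r3 */
        r = __builtin_ia32_addps(__builtin_ia32_movhlps(r, r), r);
        /* move lane 1 down to lane 0 and add: lane 0 = r0+r1+r2+r3 */
        r = __builtin_ia32_addss(__builtin_ia32_shufps(r, r, 1), r);

        /* unaligned store of all four lanes; the result is dot[0] */
        __builtin_ia32_storeups(dot, r);
}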
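
As far as I can tell, the lane arithmetic works out as in the plain-C
model below.  This is only a sketch: dot_model is a name I made up for
illustration, and the lane orderings for movhlps and shufps are my reading
of the instruction descriptions, not something I have checked against the
gcc sources.

```c
/* Plain-C model of the lane arithmetic in dot(); no SSE required.
   dot_model is a hypothetical name used only for this sketch. */
static void dot_model(const float a[4], const float b[4], float out[4])
{
        float r[4], s[4];
        int i;

        /* mulps: component-wise product */
        for (i = 0; i < 4; i++)
                r[i] = a[i] * b[i];

        /* movhlps(r, r) gives (r2, r3, r2, r3); addps adds r lane by lane */
        s[0] = r[2] + r[0];
        s[1] = r[3] + r[1];
        s[2] = r[2] + r[2];
        s[3] = r[3] + r[3];

        /* shufps(s, s, 1) gives (s1, s0, s0, s0); addss adds lane 0 only */
        out[0] = s[1] + s[0];
        out[1] = s[0];
        out[2] = s[0];
        out[3] = s[0];
}
```

With a = {1, 2, 3, 4} and b = {5, 6, 7, 8}, out[0] comes out as 70 and the
other three lanes hold the leftover partial sum 26, so the pointer passed
as the third argument of dot() has to have room for four floats.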

If I compile with just gcc -msse -S dotprod.c (no optimization flags), the
function looks like:

dot:
        pushl   %ebp
        movl    %esp, %ebp
        subl    $24, %esp
        movl    8(%ebp), %eax
        movaps  (%eax), %xmm1
        movl    12(%ebp), %eax
        movaps  (%eax), %xmm0
        mulps   %xmm0, %xmm1
        movaps  %xmm1, %xmm0
        movaps  %xmm0, -24(%ebp)
        movaps  -24(%ebp), %xmm1
        movaps  -24(%ebp), %xmm0
        movhlps %xmm0, %xmm1
        movaps  %xmm1, %xmm0
        addps   -24(%ebp), %xmm0
        movaps  %xmm0, -24(%ebp)
        movaps  -24(%ebp), %xmm0
        shufps  $1, -24(%ebp), %xmm0
        addss   -24(%ebp), %xmm0
        movaps  %xmm0, -24(%ebp)
        movl    16(%ebp), %eax
        movaps  -24(%ebp), %xmm0
        movups  %xmm0, (%eax)
        leave
        ret

It looks to me like the compiler is only using xmm0 and xmm1, and it's
doing quite a bit of unnecessary shuffling between them and the stack slot
at -24(%ebp) to make things work out.  If I do gcc -O3 -S -msse dotprod.c,
I get:

dot:
        pushl   %ebp
        movl    %esp, %ebp
        movl    8(%ebp), %eax
        movaps  (%eax), %xmm3
        movl    12(%ebp), %eax
        movaps  (%eax), %xmm0
        movl    16(%ebp), %eax
        mulps   %xmm0, %xmm3
        movaps  %xmm3, %xmm2
        movhlps %xmm3, %xmm2
        addps   %xmm3, %xmm2
        movaps  %xmm2, %xmm1
        shufps  $1, %xmm2, %xmm2
        addss   %xmm1, %xmm2
        movups  %xmm2, (%eax)
        popl    %ebp
        ret

This is a lot shorter, and it appears to be doing what one would expect;
nothing goes through the stack, the only copies left are two
register-to-register movaps, and xmm0 through xmm3 are all being used.  So,
obviously one could just use optimization, but I would think that gcc
would use all registers even without specifying an optimization level.
I'm also curious why the gcc team decided to make the builtin functions
return a value, rather than just working the way that the processor works.
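
To make the question concrete, here is a rough sketch of the two styles I
have in mind.  It uses gcc's generic vector_size extension instead of the
ia32 builtins, so it is not tied to SSE, and mul_value and mul_acc are
names I made up for illustration; neither is a real builtin.

```c
/* Four packed floats, declared with gcc's generic vector extension. */
typedef float v4sf __attribute__ ((vector_size (16)));

/* Value-returning style, like __builtin_ia32_mulps: both inputs are
   left untouched and the product comes back as a new value. */
static v4sf mul_value(v4sf a, v4sf b)
{
        return a * b;
}

/* Accumulator style, like the mulps instruction itself: the
   destination operand is overwritten with the product. */
static void mul_acc(v4sf *dst, v4sf src)
{
        *dst = *dst * src;
}
```

With the first form, a is still available after the call; with the second,
I would have to copy it myself first if I needed the old value again.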