Re: Can a 32 bit GCC running on a 64 bit Intel be made to emit SSE SIMD instructions automatically?

Georg GCC user <georggcc@xxxxxxxxxxxxxx> · Sun, 20 May 2012 12:12:18 +0200

On Sun, May 20, 2012 at 4:03 AM, Tim Prince <n8tm@xxxxxxx> wrote:
> On 5/19/2012 7:34 PM, Georg GCC user wrote:

>> VECTOR f(double a, double b, double c, double d)
>> {
>>    return (VECTOR) { a OP c, b OP d };
>> }
>>
>> then 32 bit GCC 4.6.x running on Intel 64bit/SSE hardware can
>> (cannot) translate the two OPs into a single SIMD instruction
>> that computes [a OP c, b OP d] in one parallel step.
>> (At least not without explicitly calling intrinsics.)
>>
>> Is one of these statements known to be true?
>>
>> Georg
>
> I don't have experience with the 32-bit block vectorization, but the
> alignment differences could be the clue.  Where the x86_64 OS specifies
> 16-byte default alignments, the 32-bit OS has smaller defaults (presumably
> for space saving).  You may be able to overcome this difference by
> assume_aligned data declarations.  Gcc rightly avoids unaligned simd loads
> and stores on Intel architectures prior to AVX (where it will avoid AVX-256
> unaligned).
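
A minimal sketch of the alignment point, not the code actually used in
this thread. It assumes GCC's aligned attribute and, for the hint on a
pointer, __builtin_assume_aligned, which as far as I know only exists
from GCC 4.7 on, so with 4.6.x only the attribute part would apply.
The names pair, is_aligned16 and sum_pair are made up for illustration.

```c
#include <stdint.h>

/* Hypothetical data: the attribute requests 16-byte alignment, which a
   32-bit target would not necessarily give a double array by default. */
static double pair[2] __attribute__((aligned(16))) = { 1.0, 2.0 };

/* Returns 1 if p sits on a 16-byte boundary, 0 otherwise. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p % 16) == 0;
}

/* Promises the compiler that p is 16-byte aligned. The caller has to
   guarantee this; the builtin performs no check of its own. */
double sum_pair(const double *p)
{
    const double *q = __builtin_assume_aligned(p, 16);
    return q[0] + q[1];
}
```

Whether the vectorizer then picks aligned packed loads still depends on
the GCC version and the options in effect.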

Alignment seems to be a clue. I managed to get 128-bit alignment, and
there are no more complaints about unaligned accesses. Alas, 32 bit GCC
now generates two xxxsd plus one xxxpd, plus the corresponding
unpacking and moving, for what appears to be an internal rewrite
of f along the lines

VECTOR  f'(double a, double b, double c, double d)
{
   double x = a OP c;  // -> xxxsd
   double y = b OP d;  // -> xxxsd
   VECTOR return_value = { x, y };  // -> xxxpd

   return return_value;
}

This doubly slows down the program. Does an effect like this qualify
as a bug (enhancement class)?
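
For comparison, here is a sketch of the two source forms side by side,
using GCC's vector extension. It rests on assumptions: OP is taken to be
+ (the thread leaves it open), and f_scalar, f_packed, same, lo and hi
are names invented here. The second form packs the operands first and
applies OP to whole vectors, which states the intended packed operation
directly. I have not verified which instructions a given 32 bit GCC
emits for either form; inspecting the output of something like
-m32 -msse2 -mfpmath=sse -O2 -S would tell.

```c
/* GCC vector extension: two doubles in one 16-byte vector. */
typedef double VECTOR __attribute__((vector_size(16)));

#define OP +   /* assumption: stands in for the unspecified operator */

/* Form from the thread: two scalar results placed into a vector. */
VECTOR f_scalar(double a, double b, double c, double d)
{
    return (VECTOR) { a OP c, b OP d };
}

/* Alternative form: build both operand vectors, then one vector OP. */
VECTOR f_packed(double a, double b, double c, double d)
{
    VECTOR lo = { a, b };
    VECTOR hi = { c, d };
    return lo OP hi;
}

/* Element-wise comparison helper; returns 1 if both lanes match. */
int same(VECTOR u, VECTOR v)
{
    return u[0] == v[0] && u[1] == v[1];
}
```

Both functions compute the same values, so any difference would show up
only in the generated code.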

Georg