On Sun, May 20, 2012 at 4:03 AM, Tim Prince <n8tm@xxxxxxx> wrote:
> On 5/19/2012 7:34 PM, Georg GCC user wrote:
>> VECTOR f(double a, double b, double c, double d)
>> {
>>     return (VECTOR) { a OP c, b OP d };
>> }
>>
>> then 32 bit GCC 4.6.x running on Intel 64bit/SSE hardware can
>> (cannot) translate the two OPs into a single SIMD instruction
>> that computes [a OP c, b OP d] in one parallel step.
>> (At least not without explicitly calling intrinsics.)
>>
>> Is one of these statements known to be true?
>>
>> Georg
>
> I don't have experience with the 32-bit block vectorization, but the
> alignment differences could be the clue. Where the x86_64 OS specifies
> 16-byte default alignments, the 32-bit OS has smaller defaults (presumably
> for space saving). You may be able to overcome this difference by
> assume_aligned data declarations. Gcc rightly avoids unaligned simd loads
> and stores on Intel architectures prior to AVX (where it will avoid AVX-256
> unaligned).

Alignment seems to be a clue. I managed to make this 128 bits, and there
are no more complaints about unaligned data. But, alas, 32 bit GCC now
generates two xxxsd plus one xxxpd, plus the corresponding unpacking and
moving, for what appears to be an internal rewrite of f along these lines:

VECTOR f'(double a, double b, double c, double d)
{
    double x = a OP c;               // -> xxxsd
    double y = b OP d;               // -> xxxsd
    VECTOR return_value = { x, y };  // -> xxxpd
    return return_value;
}

This doubly slows down the program. Does an effect like this qualify as a
bug (enhancement class)?

Georg
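
For anyone who wants to reproduce this, here is a minimal sketch of the two
source forms in question. It rests on assumptions the thread does not
confirm: VECTOR is taken to be a GCC vector-extension type of two doubles
(16 bytes), OP is taken to be +, and the names f_scalar and f_packed are
made up for illustration.

```c
/* Assumption: VECTOR is a GCC vector-extension type holding two doubles
   (16 bytes). The thread does not show its actual definition. */
typedef double VECTOR __attribute__((vector_size(16)));

/* Assumption: OP is an ordinary arithmetic operator; + is used here. */
#define OP +

/* Form from the thread: OP is written twice on scalars, and the two
   results are then collected into a vector. (Hypothetical name.) */
VECTOR f_scalar(double a, double b, double c, double d)
{
    return (VECTOR) { a OP c, b OP d };
}

/* Alternative form: build both operands as vectors first, then write OP
   once on the whole vectors, so the packed operation is explicit in the
   source. (Hypothetical name.) */
VECTOR f_packed(double a, double b, double c, double d)
{
    VECTOR lhs = { a, b };
    VECTOR rhs = { c, d };
    return lhs OP rhs;
}
```

Both functions return the same values. Which instructions a particular GCC
version emits for each form on a 32-bit target is not something this sketch
settles; comparing the output of gcc -S (with -msse2 on 32-bit x86) for the
two functions is probably the quickest way to find out.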