Re: is -O2 breaking sse2 alignment?

Brian Dessent <brian@xxxxxxxxxxx> · Wed, 12 Mar 2008 17:28:15 -0700

JP Fournier wrote:

> In the example below, compiling with -O2 results in incorrect output
> from the program.  -O seems OK.  Am I missing something alignment wise
> (or otherwise) or is -O2 breaking my alignment?

If it was an alignment problem you'd most likely be getting a
segmentation fault.  The __m128i type should already include the proper
alignment so you don't need the __attribute__((aligned (16))) stuff.

>         // array of 2 8 byte ints
>         long int *a  = _mm_malloc(16, 16);
>         long int *b  = _mm_malloc(16, 16);
>         long int *c  = _mm_malloc(16, 16);
> 
>         __m128i ai __attribute__ ((aligned (16)));
>         __m128i bi __attribute__ ((aligned (16)));
>         __m128i ci __attribute__ ((aligned (16)));
> 
>         a[0] = a[1] = 1;
>         b[0] = b[1] = 1;
>         c[0] = c[1] = 0;
> 
>         ai = _mm_load_si128( (__m128i *) (void*)a );
>         bi = _mm_load_si128( (__m128i *) (void*)b );
> 
>         ci = _mm_add_epi8( ai, bi );
>         _mm_store_si128( (__m128i *) (void*)c, ci );
>         printf("c0=%ld c1=%ld\n", c[0], c[1] );
> }

You're violating the C aliasing rules.  You can't store through a cast
pointer like that.  You also don't have to do the load/store; the
compiler knows what you want when you use a union instead:

  union { __m128i v; long l[2]; } a, b, c;

   a.l[0] = a.l[1] = 1;
   b.l[0] = b.l[1] = 1;

   c.v = _mm_add_epi8 (a.v, b.v);
   printf("c0=%ld c1=%ld\n", c.l[0], c.l[1]);
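For reference, here is the union approach as a compilable sketch. The names vec128 and add_bytes are made up for illustration, and int64_t stands in for long so that the lanes are 8 bytes wide on every platform. It assumes an x86 target with SSE2 available.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (x86 / x86-64 only) */
#include <stdint.h>

/* One 128-bit value viewed either as a vector or as two 64-bit lanes.
   (vec128 is a hypothetical name, not something from emmintrin.h.) */
typedef union {
    __m128i v;
    int64_t l[2];
} vec128;

/* Bytewise add of two 128-bit values: _mm_add_epi8 adds each of the
   16 bytes independently, so no carry moves from one byte to the next. */
static void add_bytes(const int64_t a[2], const int64_t b[2], int64_t out[2])
{
    vec128 x, y, z;

    x.l[0] = a[0];  x.l[1] = a[1];
    y.l[0] = b[0];  y.l[1] = b[1];

    z.v = _mm_add_epi8(x.v, y.v);   /* no casted pointers, no explicit load/store */

    out[0] = z.l[0];
    out[1] = z.l[1];
}
```

Calling add_bytes with a = {1, 1} and b = {1, 1} gives {2, 2}.  Since the add is per byte, 0xFF + 1 in the low byte wraps to 0 rather than carrying into the next byte.  On 32-bit x86 you would likely need -msse2 on the command line.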

There's an even more natural way to do this, though: use gcc's built-in
vector extensions, without any of the Intel mmintrin.h stuff.  Code
written this way will vectorize to altivec, sse2, spu, or whatever the
machine supports; it's not hardware specific:

  typedef int v4si __attribute__ ((vector_size (16)));

  v4si a = { 1, 2, 3, 4 }, b = { 5, 6, 7, 8 }, c;

  c = a + b;
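Wrapped in a function, the snippet above might look like the sketch below.  The name add4 is only for illustration; nothing here needs a header, since vector_size is a compiler extension (gcc, and as far as I know clang too).

```c
/* Four 32-bit ints packed into one 16-byte vector (compiler extension). */
typedef int v4si __attribute__ ((vector_size (16)));

/* Element-wise add.  The compiler picks the instructions for the target,
   or falls back to scalar code if the target has no suitable SIMD unit. */
static v4si add4(v4si a, v4si b)
{
    return a + b;
}
```

With a = { 1, 2, 3, 4 } and b = { 5, 6, 7, 8 }, add4 returns a vector holding 6, 8, 10, 12.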

You can use all the normal C operators like + and * as if they were
scalars but they will be compiled using the corresponding SIMD
instructions.  See
<http://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html> for more.  If
you want access to the individual parts you can again use the union,
e.g.

  union { v4si v; int i[4]; } u;

  u.v = a + b;

  printf ("%d,%d,%d,%d\n", u.i[0], u.i[1], u.i[2], u.i[3]);
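Putting the operators and the union together, a self-contained sketch could look like this.  The names v4si_lanes and format_muladd are hypothetical, and it writes into a buffer with snprintf instead of printing directly so the result is easy to check.

```c
#include <stdio.h>

typedef int v4si __attribute__ ((vector_size (16)));

/* Same 16 bytes viewed as one vector or as four ints. */
typedef union {
    v4si v;
    int  i[4];
} v4si_lanes;

/* Computes a * b + c element-wise with ordinary C operators, then reads
   the four lanes back through the union and formats them as "w,x,y,z".
   Returns whatever snprintf returns. */
static int format_muladd(v4si a, v4si b, v4si c, char *buf, size_t len)
{
    v4si_lanes u;

    u.v = a * b + c;

    return snprintf(buf, len, "%d,%d,%d,%d",
                    u.i[0], u.i[1], u.i[2], u.i[3]);
}
```

For a = { 1, 2, 3, 4 }, b = { 5, 6, 7, 8 } and c = { 1, 1, 1, 1 } the buffer ends up holding "6,13,22,33".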

Brian