JP Fournier wrote:

> In the example below, compiling with -O2 results in incorrect output
> from the program. -O seems OK. Am I missing something alignment wise
> (or otherwise) or is -O2 breaking my alignment?

If it were an alignment problem you'd most likely be getting a
segmentation fault. The __m128i type should already include the proper
alignment, so you don't need the __attribute__((aligned (16))) stuff.

> // array of 2 8 byte ints
> long int *a = _mm_malloc(16, 16);
> long int *b = _mm_malloc(16, 16);
> long int *c = _mm_malloc(16, 16);
>
> __m128i ai __attribute__ ((aligned (16)));
> __m128i bi __attribute__ ((aligned (16)));
> __m128i ci __attribute__ ((aligned (16)));
>
> a[0] = a[1] = 1;
> b[0] = b[1] = 1;
> c[0] = c[1] = 0;
>
> ai = _mm_load_si128( (__m128i *) (void*)a );
> bi = _mm_load_si128( (__m128i *) (void*)b );
>
> ci = _mm_add_epi8( ai, bi );
> _mm_store_si128( (__m128i *) (void*)c, ci );
> printf("c0=%ld c1=%ld\n", c[0], c[1] );
> }

You're violating the C aliasing rules. You can't store through a cast
pointer like that. You also don't have to do the load/store; the
compiler knows what you want when you use a union instead:

    union
    {
      __m128i v;
      long l[2];
    } a, b, c;

    a.l[0] = a.l[1] = 1;
    b.l[0] = b.l[1] = 1;
    c.v = _mm_add_epi8 (a.v, b.v);
    printf("c0=%ld c1=%ld\n", c.l[0], c.l[1]);

There's an even more natural way to do this, though: use gcc's built-in
vector extensions without any of the Intel mmintrin.h stuff. Code
written this way vectorizes to altivec, sse2, spu, or whatever the
machine supports; it's not hardware specific:

    typedef int v4si __attribute__ ((vector_size (16)));

    v4si a = { 1, 2, 3, 4 }, b = { 5, 6, 7, 8 }, c;

    c = a + b;

You can use all the normal C operators like + and * as if the operands
were scalars, but they will be compiled using the corresponding SIMD
instructions. See
<http://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html> for more.

If you want access to the individual parts you can again use the
union, e.g.

    union
    {
      v4si v;
      int i[4];
    } u;

    u.v = a + b;
    printf ("%d,%d,%d,%d\n", u.i[0], u.i[1], u.i[2], u.i[3]);

Brian