Re: is -O2 breaking sse2 alignment?

Maximillian Murphy <m@xxxxxxxxxxxxxxxx> · Sat, 15 Mar 2008 22:29:51 +0000

> > 
> > What I am _really_ trying to do is to implement is the addition of 
> > elements of two arrays.
> > 
> > Is there a more efficient way of doing this than this way?:
> > 
> 

Dear All,

Regarding the earlier code: the vector add there is 8 bits wide (_mm_add_epi8), even though we are loading the registers with 64-bit longs.  If we only want to add bytes, we can do eight times as many at a shot (sixteen bytes per 128-bit register against two longs); if we want to add longs, the instruction needs to change to _mm_add_epi64.
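
To make the difference concrete, here is a minimal sketch (the function names are mine, not from the earlier code) that adds the same pair of 64-bit values both ways.  With _mm_add_epi8 each byte is added on its own and the carry out of a byte is lost, so 255 + 1 comes out as 0 rather than 256.  It assumes a little-endian x86 machine with SSE2, and uses unaligned loads so that it works on any buffer.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Treat the 128 bits as sixteen independent bytes.
   A carry out of one byte is NOT propagated into the next. */
static void add_pairs_epi8(const int64_t a[2], const int64_t b[2], int64_t out[2])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi8(va, vb));
}

/* Treat the 128 bits as two 64-bit integers: an ordinary add of longs. */
static void add_pairs_epi64(const int64_t a[2], const int64_t b[2], int64_t out[2])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi64(va, vb));
}
```

Feeding both functions a = {255, 1} and b = {1, 2} shows the two results disagree in the first element and agree in the second.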

I toyed with 64-bit adds and one of your codes.  I created three arrays of 10000000 longs and added them together, first without SSE, then using just one SSE load, add, store sequence, then using two load, add, store sequences in parallel.  In the results, the first column is the number of runs that took the stated time.  The answers that you ladies and gentlemen have been waiting for are:

Without SSE:
      3  In 1.600000e+05 jiffies
     97  In 1.700000e+05 jiffies
    181  In 1.800000e+05 jiffies
     23  In 1.900000e+05 jiffies
      1  In 2.000000e+05 jiffies
With one SSE load add cycle:
     51  In 2.200000e+05 jiffies
    204  In 2.300000e+05 jiffies
     50  In 2.400000e+05 jiffies
With two SSE load add cycles:
      1  In 2.000000e+05 jiffies
     56  In 2.100000e+05 jiffies
    177  In 2.200000e+05 jiffies
     69  In 2.300000e+05 jiffies
      1  In 2.400000e+05 jiffies
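
For anyone wanting to compare notes, the three variants could look roughly like the sketch below.  This is a reconstruction under my own assumptions, not the exact code that was timed: the function names are made up, the arrays are taken to hold 64-bit integers (int64_t), and unaligned loads and stores are used so that it runs on any buffer.  With 16-byte aligned arrays, _mm_load_si128 and _mm_store_si128 could be substituted.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Plain C loop: the "without SSE" baseline. */
static void add_scalar(const int64_t *a, const int64_t *b, int64_t *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* One load/add/store sequence per iteration: two 64-bit lanes at a time. */
static void add_sse_x1(const int64_t *a, const int64_t *b, int64_t *c, size_t n)
{
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(c + i), _mm_add_epi64(va, vb));
    }
    for (; i < n; i++)          /* leftover element when n is odd */
        c[i] = a[i] + b[i];
}

/* Two independent load/add/store sequences per iteration. */
static void add_sse_x2(const int64_t *a, const int64_t *b, int64_t *c, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i va0 = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb0 = _mm_loadu_si128((const __m128i *)(b + i));
        __m128i va1 = _mm_loadu_si128((const __m128i *)(a + i + 2));
        __m128i vb1 = _mm_loadu_si128((const __m128i *)(b + i + 2));
        _mm_storeu_si128((__m128i *)(c + i),     _mm_add_epi64(va0, vb0));
        _mm_storeu_si128((__m128i *)(c + i + 2), _mm_add_epi64(va1, vb1));
    }
    for (; i < n; i++)          /* up to three leftover elements */
        c[i] = a[i] + b[i];
}
```

All three should produce identical output arrays; only the timing differs.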

As we have eight registers, we could have four add operations going on simultaneously; however, it's clearly not going to beat vanilla code that ignores SSE: the plain loop mostly finishes in about 1.8e5 jiffies, against 2.2e5 to 2.3e5 for both SSE variants. (On my machine anyway and with one particular code.  If anyone can do better, please speak up so that we can compare notes!)

As you can tell, the clock is fairly coarse: the times come out in steps of 1e4 jiffies.  Repeating the tests makes up for that though.

Doing masses of 16-bit multiplies, I can beat the standard gcc-compiled code by a small factor.
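
A minimal sketch of what I mean by 16-bit multiplies, assuming only the low 16 bits of each product are wanted (that is what _mm_mullo_epi16 returns).  The function name is made up for this post, and unaligned loads are used so that it works on any buffer.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* c[i] = low 16 bits of a[i] * b[i], eight elements per SSE instruction. */
static void mul_u16_sse2(const uint16_t *a, const uint16_t *b, uint16_t *c, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(c + i), _mm_mullo_epi16(va, vb));
    }
    /* Scalar tail; multiply as 32-bit unsigned to avoid signed overflow. */
    for (; i < n; i++)
        c[i] = (uint16_t)((uint32_t)a[i] * b[i]);
}
```

Here each 128-bit register holds eight values rather than two, so there is more arithmetic per byte loaded than in the 64-bit add.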

I'm curious as to what is limiting the SSE computation.  Is it load time, in which case it's only worth using SSE if there are several operations to do on each loaded value, or is it the width of the compute engine?  The latter seems unlikely; after all, width is what SSE is all about.
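
One way to probe the question might be a kernel that does a variable number of adds on each loaded pair, as sketched below.  The function and its reps parameter are hypothetical, invented for this experiment.  Note that in the plain add every 8-byte result costs 24 bytes of memory traffic (two loads and one store) for a single add.  If the run time barely moves as reps grows, memory traffic rather than arithmetic is probably the bottleneck, though the compiler may simplify the inner loop, so the generated assembly is worth checking.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* c[i] = a[i] + reps * b[i], computed as reps successive adds per load.
   The memory traffic is the same for every value of reps. */
static void add_repeated(const int64_t *a, const int64_t *b, int64_t *c,
                         size_t n, int reps)
{
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128i acc = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb  = _mm_loadu_si128((const __m128i *)(b + i));
        for (int r = 0; r < reps; r++)
            acc = _mm_add_epi64(acc, vb);
        _mm_storeu_si128((__m128i *)(c + i), acc);
    }
    for (; i < n; i++)          /* leftover element when n is odd */
        c[i] = a[i] + (int64_t)reps * b[i];
}
```

Timing this for reps = 1, 2, 4, 8 on the same arrays would separate the cost of the loads from the cost of the adds.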

Regards, A.N. Ewbie.