Re: using vector extension in gcc slows down my code

Brian Budge <brian.budge@xxxxxxxxx> · Wed, 10 Feb 2010 08:13:22 -0800

This makes a difference because, for a loop like that one, the SSE
unit does two loads, one add, and one store per vector, and that
pattern pipelines easily.  The ratio of load/store to math is not
ideal, but compared with the work needed to handle 2 doubles in scalar
code (4 loads, 2 adds, and 2 stores), it's still beneficial.  You're
also using unaligned loads and stores, which are very slow on some
architectures and are usually slower than aligned loads and stores.
Moreover, in your case it's not just the loads and stores, but also
all the integer math to calculate array indices, as well as the
unions, which keep the results from staying in registers.  All of
that makes for a not-very-optimal result.  Note that if you are
running on 64-bit, you are likely already using SSE in the first
version of your code, but it's using the scalar path (only the first
entry of each register).
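
To make the scalar vs. packed difference concrete, here is a minimal
sketch of the simple a[i] = b[i] + c[i] loop written both ways.  It is
not your code: the function names are made up for illustration, and
the packed version assumes an x86 target with SSE2, 16-byte aligned
arrays, and an even element count.

```c
#include <emmintrin.h>  /* SSE2 intrinsics: __m128d, _mm_load_pd, ... */
#include <stddef.h>

/* Scalar path: one double per iteration (2 loads, 1 add, 1 store). */
static void add_scalar(double *a, const double *b, const double *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}

/* Packed path: two doubles per iteration (2 loads, 1 add, 1 store).
 * Assumption: a, b and c are 16-byte aligned and n is even.
 * _mm_load_pd/_mm_store_pd require alignment; the unaligned
 * counterparts are _mm_loadu_pd/_mm_storeu_pd. */
static void add_packed(double *a, const double *b, const double *c, size_t n)
{
    for (size_t i = 0; i < n; i += 2) {
        __m128d vb = _mm_load_pd(b + i);          /* aligned load of b[i], b[i+1] */
        __m128d vc = _mm_load_pd(c + i);          /* aligned load of c[i], c[i+1] */
        _mm_store_pd(a + i, _mm_add_pd(vb, vc));  /* packed add, aligned store */
    }
}
```

Note that the values stay in __m128d variables the whole time, with no
union in between, so the compiler is free to keep them in registers.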

The code is pretty confusing.  If I could understand what it's doing,
I'd write you a version that uses the Intel SSE intrinsics (see
emmintrin.h and friends) and has a more appropriate data layout.
Note that I'm simply assuming that this is possible, but there may be
some valid reason why you cannot lay your data out in a SIMD-friendly
way.
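
As a rough illustration of what a SIMD-friendly layout means, the
sketch below contrasts an array-of-structs with a struct-of-arrays.
The record (one double plus one int) and all names are hypothetical,
since I don't know what your real data looks like.

```c
#include <stddef.h>

enum { N = 256 };

/* Array-of-structs: each double is followed by an int (plus padding),
 * so consecutive x values are sizeof(struct particle_aos) bytes apart
 * and a vector load cannot pick up two of them at once. */
struct particle_aos {
    double x;
    int    id;
};

/* Struct-of-arrays: all doubles are packed together and all ints are
 * packed together, so the doubles can be read with unit stride. */
struct particles_soa {
    double x[N];
    int    id[N];
};

/* Strided access: touches every other 8 bytes of the array. */
static void shift_aos(struct particle_aos *p, size_t n, double dx)
{
    for (size_t i = 0; i < n; i++)
        p[i].x += dx;
}

/* Contiguous access: the same shape as a[i] = b[i] + c[i]. */
static void shift_soa(struct particles_soa *p, size_t n, double dx)
{
    for (size_t i = 0; i < n; i++)
        p->x[i] += dx;
}
```

Both functions compute the same thing; only the second one walks
memory in the contiguous pattern that packed loads and stores want.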

  Brian

On Wed, Feb 10, 2010 at 7:58 AM, Da Zheng <zhengda1936@xxxxxxxxx> wrote:
> Hi,
>
> On 10-2-10 10:57 PM, Brian Budge wrote:
>> Hi -
>>
>> To me it is not at all surprising.  These hairy strides and mods
>> certainly aren't going to help.  You're doing very little math vs
>> load/store which means that you're not going to get much out of the
> This is what my code needs to do. I cannot change it. I see GCC can
> auto-vectorize the code like:
>  for (i=0; i<256; i++){
>    a[i] = b[i] + c[i];
>  }
> It has even less math, yet vectorization must improve performance there,
> since GCC chooses to do it.
>> vector units.  Really you need more of a struct-of-arrays type layout
>> (pack your doubles together so you can load them in a less strided
>> fashion, and pack your ints together).  This may have the extra benefit
>> of unobfuscating the code :)
> I don't understand. What do you mean by a less strided fashion? Do you mean
> all elements in the array should be of the v2df type, and then I access each
> element in the loop by i++? Why would this make a difference?
>
> Best regards,
> Zheng Da
>