This makes a difference because the SSE unit can process each pair of
doubles with two loads, an add, and a store, and that sequence pipelines
easily.  The ratio of load/store to math is not ideal, but compared with
the work needed to handle 2 doubles one at a time (4 loads, 2 adds, and
2 stores), it's still beneficial.

You're also using unaligned loads and stores, which are very slow on
some architectures and are usually slower than aligned loads and stores.
Moreover, in your case it's not just the loads and stores: there is also
all the integer math to calculate array indices, plus the unions, which
keep the results from staying in registers.  All of that makes for a
far-from-optimal result.

Note that if you are running on 64-bit, you are likely already using SSE
in the first version of your code, but it's using the scalar path (only
the first entry of each register).

The code is pretty confusing.  If I could understand what it's doing,
I'd write you a version using the Intel SSE intrinsics (see emmintrin.h
and friends) that has a more appropriate data layout.  Note that I'm
simply assuming that this is possible; there may be some valid reason
why you cannot lay your data out in a SIMD-friendly way.

  Brian

On Wed, Feb 10, 2010 at 7:58 AM, Da Zheng <zhengda1936@xxxxxxxxx> wrote:
> Hi,
>
> On 10-2-10 10:57 PM, Brian Budge wrote:
>> Hi -
>>
>> To me it is not at all surprising.  These hairy strides and mods
>> certainly aren't going to help.  You're doing very little math vs
>> load/store which means that you're not going to get much out of the
> This is what my code needs to do. I cannot change it. I see GCC can
> auto-vectorize the code like:
> for (i=0; i<256; i++){
>     a[i] = b[i] + c[i];
> }
> It has even less math, but vectorization should achieve better performance in
> the code since GCC does it.
>> vector units.  Really you need more of a struct-of-arrays type layout
>> (pack your doubles together so you can load them in a less strided
>> fashion, and pack your ints together.  This may have the extra benefit
>> of unobfuscating the code :)
> I don't understand. What do you mean by less strided fashion? Do you mean all
> elements in the array should be of the v2df type and then I access each element
> in the loop by i++? Why will this make difference?
>
> Best regards,
> Zheng Da
>
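
As a rough illustration of the packed path and the "less strided" layout
discussed above (not a rewrite of the code under discussion, which is
not shown in this message), here is a minimal sketch.  The struct names,
the function name add_packed, and the size N are all made up for the
example, and it assumes the arrays can be 16-byte aligned.  It falls
back to plain scalar code when SSE2 is not available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#ifdef __SSE2__
#include <emmintrin.h>   /* SSE2 intrinsics: __m128d, _mm_load_pd, ... */
#endif

#define N 256            /* hypothetical element count */

/* Array-of-structs (hypothetical): each double is separated from the
 * next by an int, so two consecutive doubles cannot be fetched with a
 * single packed load. */
struct item_aos { double x; int k; };

/* Struct-of-arrays (hypothetical): the doubles sit next to each other,
 * as do the ints, so x[i] and x[i+1] fit one 16-byte load. */
struct items_soa { double x[N]; int k[N]; };

/* out[i] = b[i] + c[i] for i in [0, n).
 * Assumption: out, b and c are all 16-byte aligned. */
void add_packed(double *out, const double *b, const double *c, size_t n)
{
    size_t i = 0;

    assert((uintptr_t)out % 16 == 0);
    assert((uintptr_t)b % 16 == 0);
    assert((uintptr_t)c % 16 == 0);

#ifdef __SSE2__
    /* Two doubles per iteration: 2 aligned loads, 1 add, 1 store. */
    for (; i + 2 <= n; i += 2) {
        __m128d vb = _mm_load_pd(b + i);
        __m128d vc = _mm_load_pd(c + i);
        _mm_store_pd(out + i, _mm_add_pd(vb, vc));
    }
#endif

    /* Scalar tail (or the whole loop when SSE2 is unavailable). */
    for (; i < n; i++)
        out[i] = b[i] + c[i];
}
```

With the struct-of-arrays layout, a call would look something like
add_packed(dst.x, s1.x, s2.x, N), provided the x arrays are 16-byte
aligned.  On 32-bit x86 with GCC this needs -msse2; on x86-64, SSE2 is
enabled by default.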