On 2/27/2017 12:26 AM, Yifei wrote:
> Hi there,
> I have a class wrapping __m256d that represents a 3-d vector. For comparison I also have a legacy scalar version built on double[3].
> The AVX version looks like this:
>
> struct vector {
>     __m256d V;
>     // ctor
>     vector operator+(const vector& rhs) const {
>         return {V[0] + rhs.V[0] ... V[2] + rhs.V[2]};
>     }
>     vector abs() const {
>         // clear sign bit (vandpd)
>     }
> };
>
> I compile with -O2 -march=native. For the vector version, g++ emits a lot of vextractf128 and vinsertf128, and I don't understand why those are necessary. For the scalar version, g++ simply emits vmovsd and vaddsd.
>
> The thing is, the scalar version seems to perform better than the AVX version, so I'm turning back to the scalar one.
> I tried explicit intrinsics, and also reinterpret_cast (that one was even worse: lots of slow instructions plus vunpcklpd, around 6x slower). Both attempts failed; g++ persists in blindly relying on vextractf128, which is incredibly slow and performs around 2x worse than the scalar version. I even tried inline assembly, but I don't really know how to allocate stack space manually in assembly.
>
> I had also tried vaddpd directly, yet it was still slightly slower than the scalar version, perhaps due to slightly more memory access, but I'm not sure.

If you are using unaligned data and targeting a CPU older than Haswell, performance problems with 256-bit memory access, and the need for vinsertf128 and vextractf128, are expected.

-- Tim Prince
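[Editor's note: for reference, here is a sketch of the two operations written with explicit intrinsics rather than per-lane subscripting, so the compiler can keep the whole vector in a single ymm register (one vaddpd, one vandnpd) instead of extracting lanes. This is my own illustration, not code from the thread; the struct name vec3 and the target("avx") attribute (used only so the file compiles without -mavx) are assumptions.]

```cpp
#include <immintrin.h>

struct vec3 {
    __m256d V;  // lanes 0..2 hold x, y, z; lane 3 is unused padding

    // Whole-register add: compiles to a single vaddpd, no lane extraction.
    __attribute__((target("avx")))
    vec3 operator+(const vec3& rhs) const {
        return { _mm256_add_pd(V, rhs.V) };
    }

    // Absolute value by clearing each lane's sign bit:
    // andnot with a -0.0 mask computes (~sign_mask) & V, i.e. vandnpd.
    __attribute__((target("avx")))
    vec3 abs() const {
        return { _mm256_andnot_pd(_mm256_set1_pd(-0.0), V) };
    }
};
```

Note that _mm256_set_pd takes its arguments highest lane first, so vec3{_mm256_set_pd(0.0, z, y, x)} builds the vector; whether this actually beats the scalar double[3] version still depends on alignment and the target CPU, as noted above.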