Hi there,

I have a class wrapping __m256d that represents a 3-d vector. For comparison, I also have a legacy array version built on double[3]. The AVX version looks like this:

```cpp
struct vector {
    __m256d V;

    vector(double x, double y, double z)
        : V(_mm256_set_pd(0.0, z, y, x)) {}  // ctor

    vector operator+(const vector& rhs) const {
        // per-lane access on the __m256d (GNU vector subscripting)
        return {V[0] + rhs.V[0], V[1] + rhs.V[1], V[2] + rhs.V[2]};
    }

    vector abs() const;  // clear sign bit, ideally one vandpd
};
```

I'm compiling with -O2 -march=native. For the vector version, g++ emits a lot of vextractf128 and vinsertf128, which, ugh, I don't understand why that's necessary, but so far, okay. For the scalar version, g++ simply emits vmovsd and vaddsd. The thing is, the scalar version seems to perform better than the AVX vector version, so I'm falling back to the scalar one.

I tried explicit intrinsics, and I tried reinterpret_cast (that one is even worse: a lot of slow instructions plus vunpcklpd, around 6x slower). Both failed: g++ persists and blindly relies on vextractf128, which is incredibly slow and performs around 2x worse than the scalar version. I even tried inline assembly, but I don't really know how to manually allocate stack space in assembly. I had also tried vaddpd directly, yet it's slightly slower than the scalar version as well, maybe due to slightly more memory accesses, but I'm not sure.

Any thoughts?

Thanks,
Yifei