Hi there,

I have a class wrapping __m256d that represents a 3-d vector. For comparison, I also have a legacy array version built on double[3]. The AVX version looks like this:

```cpp
struct vector {
    __m256d V;

    vector(double x, double y, double z)
        : V(_mm256_set_pd(0.0, z, y, x)) {}  // ctor

    vector operator+(const vector& rhs) const {
        // per-lane access on the __m256d (GNU vector subscripting)
        return {V[0] + rhs.V[0], V[1] + rhs.V[1], V[2] + rhs.V[2]};
    }

    vector abs() const;  // clear sign bit, ideally one vandpd
};
```

I'm compiling with -O2 -march=native. For the vector version, g++ emits a lot of vextractf128 and vinsertf128, which, ugh, I don't understand why that's necessary, but so far, okay. For the scalar version, g++ simply emits vmovsd and vaddsd. The thing is, the scalar version seems to perform better than the AVX vector version, so I'm falling back to the scalar one.

I tried explicit intrinsics, and I tried reinterpret_cast (that one is even worse: a lot of slow instructions plus vunpcklpd, around 6x slower). Both failed: g++ persists and blindly relies on vextractf128, which is incredibly slow and performs around 2x worse than the scalar version. I even tried inline assembly, but I don't really know how to manually allocate stack space in assembly. I had also tried vaddpd directly, yet it's slightly slower than the scalar version as well, maybe due to slightly more memory accesses, but I'm not sure.

Any thoughts?

Thanks,
Yifei