On 2/27/2017 12:26 AM, Yifei wrote:
> Hi there,
> I have a class wrapping __m256d that represents a 3-d vector. For comparison I also have a legacy scalar version built on double[3].
> The AVX version looks like this:
>
> struct vector {
>     __m256d V;
>     // ctor
>     vector operator+(const vector& rhs) const {
>         return {V[0] + rhs.V[0] ... V[2] + rhs.V[2]};
>     }
>     vector abs() const {
>         // clear sign bit (vandpd)
>     }
> };
>
> I compile with -O2 -march=native. For the vector version, g++ emits a lot of vextractf128 and vinsertf128, and I don't understand why those are necessary. For the scalar version, g++ simply emits vmovsd and vaddsd.
>
> The thing is, the scalar version seems to perform better than the AVX version, so I'm turning back to the scalar one.
> I tried explicit intrinsics, and also reinterpret_cast (that one was even worse: lots of slow instructions plus vunpcklpd, around 6x slower). Both attempts failed; g++ persists in blindly relying on vextractf128, which is incredibly slow and performs around 2x worse than the scalar version. I even tried inline assembly, but I don't really know how to allocate stack space manually in assembly.
>
> I had also tried vaddpd directly, yet it was still slightly slower than the scalar version, perhaps due to slightly more memory access, but I'm not sure.

If you are using unaligned data and targeting a CPU older than Haswell, performance problems with 256-bit memory access, and the need for vinsertf128 and vextractf128, are expected.

-- Tim Prince
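[Editor's note: for reference, here is a sketch of the two operations written with explicit intrinsics rather than per-lane subscripting, so the compiler can keep the whole vector in a single ymm register (one vaddpd, one vandnpd) instead of extracting lanes. This is my own illustration, not code from the thread; the struct name vec3 and the target("avx") attribute (used only so the file compiles without -mavx) are assumptions.]

```cpp
#include <immintrin.h>

struct vec3 {
    __m256d V;  // lanes 0..2 hold x, y, z; lane 3 is unused padding

    // Whole-register add: compiles to a single vaddpd, no lane extraction.
    __attribute__((target("avx")))
    vec3 operator+(const vec3& rhs) const {
        return { _mm256_add_pd(V, rhs.V) };
    }

    // Absolute value by clearing each lane's sign bit:
    // andnot with a -0.0 mask computes (~sign_mask) & V, i.e. vandnpd.
    __attribute__((target("avx")))
    vec3 abs() const {
        return { _mm256_andnot_pd(_mm256_set1_pd(-0.0), V) };
    }
};
```

Note that _mm256_set_pd takes its arguments highest lane first, so vec3{_mm256_set_pd(0.0, z, y, x)} builds the vector; whether this actually beats the scalar double[3] version still depends on alignment and the target CPU, as noted above.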