Re: Why does GCC store XMM registers into RAM then load them back instead of using them directly?

On Tue, 2 May 2017, Liu Hao wrote:

This can be observed from the following example:
(For your reference: https://godbolt.org/g/toFOVc )

```c++
#include <emmintrin.h>

double my_fmax_1(double x, double y){
   return _mm_cvtsd_f64(_mm_max_sd(_mm_set_sd(x), _mm_set_sd(y)));
}
double my_fmax_2(double x, double y){
   double r;
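   __asm__ (
       // Use operand placeholders rather than hard-coded registers, so the
       // instruction follows whatever registers the constraints select.
       "maxsd   %2, %0"
       : "=x"(r)
       : "0"(x), "x"(y)
   );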
   return r;
}
```

After being compiled with `-O3`, this snippet results in the following assembly:

```assembly
my_fmax_1(double, double):
       movsd   %xmm0, -24(%rsp)
       movsd   %xmm1, -16(%rsp)
       movsd   -24(%rsp), %xmm0
       movsd   -16(%rsp), %xmm1
       maxsd   %xmm1, %xmm0
       ret
my_fmax_2(double, double):
       maxsd   %xmm1, %xmm0
       ret
```

The first function seems very inefficient. Are there any particular reasons why GCC doesn't optimize it as well as the second function?

`_mm_set_sd` is not a no-op: it sets the upper half of the SSE register to 0. Recent GCC versions do this with `movq`, but older versions go through the stack. To optimize that away, the compiler needs to know that the upper half of the registers is ignored. (`maxsd` itself does not ignore it; it is the `_mm_cvtsd_f64` afterwards that drops anything that depended on it.) But the `maxsd` operation is largely opaque to the compiler for now (it is modeled in an unnaturally complicated way), so GCC does not notice this. Clang does a better job there. Feel free to file a bug report at https://gcc.gnu.org/bugzilla/ if you don't already see a similar one in the database.

--
Marc Glisse
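
To make the zeroing behaviour described in the reply concrete, here is a minimal sketch, assuming an x86-64 target with SSE2. The helper names `upper_lane_after_set_sd` and `my_fmax_3` are hypothetical and do not come from the thread. The first helper reads back the upper lane that `_mm_set_sd` writes. The second expresses the same selection rule as `maxsd` in plain C++, which avoids the intrinsics and so never asks for the upper lane to be zeroed. Whether a particular GCC version emits a single `maxsd` for it is an assumption worth checking on the compiler in question.

```cpp
#include <emmintrin.h>

// Hypothetical helper: returns the upper 64-bit lane of _mm_set_sd(x).
// _mm_set_sd places x in the lower lane and zeroes the upper lane,
// so this should return 0.0 for any x.
double upper_lane_after_set_sd(double x) {
    __m128d v = _mm_set_sd(x);
    // unpackhi(v, v) yields {v[1], v[1]}; cvtsd_f64 then extracts v[1].
    return _mm_cvtsd_f64(_mm_unpackhi_pd(v, v));
}

// Hypothetical alternative to my_fmax_1: the same selection rule as
// maxsd (return y unless x > y, so y is also returned when either
// input is NaN), written without intrinsics.
double my_fmax_3(double x, double y) {
    return x > y ? x : y;
}
```

Because `my_fmax_3` only ever deals with scalar doubles, the compiler has no upper lane to preserve or clear, which is the information it was missing in `my_fmax_1`.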


