Re: Why does GCC store XMM registers into RAM then load them back instead of using them directly?

Marc Glisse <marc.glisse@xxxxxxxx> · Tue, 2 May 2017 08:41:10 +0200 (CEST)

On Tue, 2 May 2017, Liu Hao wrote:

This can be observed from the following example:
(For your reference: https://godbolt.org/g/toFOVc )

```c++
#include <emmintrin.h>

double my_fmax_1(double x, double y){
   return _mm_cvtsd_f64(_mm_max_sd(_mm_set_sd(x), _mm_set_sd(y)));
}
double my_fmax_2(double x, double y){
   double r;
   __asm__ (
       "maxsd   %2, %0"
       : "=x"(r)
       : "0"(x), "x"(y)
   );
   return r;
}
```

After being compiled with `-O3`, this snippet results in the following 
assembly:

```assembly
my_fmax_1(double, double):
       movsd   %xmm0, -24(%rsp)
       movsd   %xmm1, -16(%rsp)
       movsd   -24(%rsp), %xmm0
       movsd   -16(%rsp), %xmm1
       maxsd   %xmm1, %xmm0
       ret
my_fmax_2(double, double):
       maxsd   %xmm1, %xmm0
       ret
```

The first function seems very inefficient. Are there any particular reasons 
why GCC doesn't optimize it well (like the second function)?

_mm_set_sd is not a no-op: it sets the upper half of the SSE register to 0. 
Recent versions of GCC do this with movq, but older versions go through the 
stack. To optimize that away, the compiler needs to know that the upper 
half of the registers is ignored. It isn't ignored by max itself; it is the 
_mm_cvtsd_f64 afterwards that drops anything that depended on it. But the 
maxsd operation is largely opaque to the compiler for now (it is modeled in 
an unnaturally complicated way), so GCC does not notice this. Clang does a 
better job there... Feel free to file a bug report at 
https://gcc.gnu.org/bugzilla/ if you don't already see a similar one in 
the database.
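
To make the two points about the upper half concrete, here is a minimal 
sketch. The helper names are hypothetical, made up for illustration only. 
The first helper reads back the upper lane that _mm_set_sd produces; the 
second shows that _mm_max_sd passes the upper lane of its first operand 
through to the result, which is why max itself cannot be said to ignore it.

```c++
#include <emmintrin.h>

// Hypothetical helper: returns the upper lane of _mm_set_sd(x).
// _mm_unpackhi_pd(v, v) moves the upper lane of v into the lower lane,
// where _mm_cvtsd_f64 can read it.
double upper_lane_of_set_sd(double x){
   __m128d v = _mm_set_sd(x);
   return _mm_cvtsd_f64(_mm_unpackhi_pd(v, v));
}

// Hypothetical helper: returns the upper lane of _mm_max_sd(a, b).
// _mm_set_pd takes the upper element first, then the lower one.
double upper_lane_of_max_sd(double a_hi, double b_hi){
   __m128d a = _mm_set_pd(a_hi, 1.0);
   __m128d b = _mm_set_pd(b_hi, 2.0);
   __m128d m = _mm_max_sd(a, b);   // lower lane: max; upper lane: from a
   return _mm_cvtsd_f64(_mm_unpackhi_pd(m, m));
}
```

The first helper returns 0.0 for any input, and the second returns a_hi 
regardless of b_hi. In my_fmax_1 that upper lane is only discarded by the 
final _mm_cvtsd_f64.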

--
Marc Glisse