On Tue, 2 May 2017, Liu Hao wrote:
This can be observed from the following example:
(For your reference: https://godbolt.org/g/toFOVc )
```c++
#include <emmintrin.h>
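// Version 1: SSE2 intrinsics.
double my_fmax_1(double x, double y){
    return _mm_cvtsd_f64(_mm_max_sd(_mm_set_sd(x), _mm_set_sd(y)));
}
// Version 2: inline assembly. Operand placeholders are used instead of
// hard-coded register names, so this does not rely on x and y happening
// to live in %xmm0 and %xmm1.
double my_fmax_2(double x, double y){
    double r;
    __asm__ (
        "maxsd %2, %0"
        : "=x"(r)
        : "0"(x), "x"(y)
    );
    return r;
}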
```
Compiled with `-O3`, this snippet produces the following assembly:
```assembly
my_fmax_1(double, double):
        movsd   %xmm0, -24(%rsp)
        movsd   %xmm1, -16(%rsp)
        movsd   -24(%rsp), %xmm0
        movsd   -16(%rsp), %xmm1
        maxsd   %xmm1, %xmm0
        ret
my_fmax_2(double, double):
        maxsd   %xmm1, %xmm0
        ret
```
The first function seems very inefficient. Are there any particular reasons
why GCC doesn't optimize it well (like the second function)?
_mm_set_sd is not a no-op: it sets the upper half of the SSE register to 0,
which recent versions of GCC do with movq but older versions do through the
stack. To optimize that away, the compiler needs to know that the upper
half of the register is ignored (max itself does not ignore it; it is the
_mm_cvtsd_f64 afterwards that drops anything that depended on it). But the
maxsd operation is largely opaque to the compiler for now (it is modeled in
an unnaturally complicated way), so GCC does not notice. Clang does a
better job there. Feel free to file a bug report at
https://gcc.gnu.org/bugzilla/ if you don't already see a similar one in
the database.
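
To make the point about the upper half concrete, here is a small sketch. The helper names (`lower_lane`, `upper_lane`, `fmax_via_intrinsics`, `fmax_with_dirty_upper`, `check_upper_lane_is_irrelevant`) are made up for illustration; only the `_mm_*` intrinsics come from `<emmintrin.h>`. It assumes an x86 target with SSE2. It shows that `_mm_set_sd` zeroes the upper lane, that `_mm_max_sd` passes the upper lane of its first operand through, and that `_mm_cvtsd_f64` returns the same value whatever the upper lanes held.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (x86 only)
#include <cassert>

// Hypothetical helpers for looking at the two 64-bit lanes of a register.
double lower_lane(__m128d v) { return _mm_cvtsd_f64(v); }
double upper_lane(__m128d v) { return _mm_cvtsd_f64(_mm_unpackhi_pd(v, v)); }

// Same shape as my_fmax_1: both inputs have their upper lane zeroed first.
double fmax_via_intrinsics(double x, double y) {
    return _mm_cvtsd_f64(_mm_max_sd(_mm_set_sd(x), _mm_set_sd(y)));
}

// Same computation, but with arbitrary non-zero values in the upper lanes.
// _mm_set_pd(hi, lo) puts its first argument in the upper lane.
double fmax_with_dirty_upper(double x, double y) {
    __m128d a = _mm_set_pd(123.0, x);
    __m128d b = _mm_set_pd(456.0, y);
    return _mm_cvtsd_f64(_mm_max_sd(a, b));
}

void check_upper_lane_is_irrelevant() {
    // _mm_set_sd zeroes the upper lane, and _mm_max_sd copies the upper
    // lane of its first operand into the result.
    __m128d clean = _mm_max_sd(_mm_set_sd(1.0), _mm_set_sd(2.0));
    assert(upper_lane(clean) == 0.0);

    __m128d dirty = _mm_max_sd(_mm_set_pd(7.0, 1.0), _mm_set_sd(2.0));
    assert(upper_lane(dirty) == 7.0);

    // The lower lane, which is all _mm_cvtsd_f64 returns, is the same.
    assert(lower_lane(clean) == lower_lane(dirty));
}
```

Since the two variants agree on the lower lane, the zeroing done by `_mm_set_sd` is dead work in `my_fmax_1`; that is the fact the compiler would have to prove in order to drop it.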
--
Marc Glisse