Re: slowdown with -std=gnu18 with respect to -std=c99

Alexander Monakov via Gcc-help <gcc-help@xxxxxxxxxxx> · Wed, 11 May 2022 16:26:05 +0300 (MSK)

On Fri, 6 May 2022, Alexander Monakov wrote:

> The primary issue here is a false dependency on the vcvtss2sd instruction. In the
> snippet shown in Stéphane's email, the slower variant begins with
> 
>     vcvtss2sd   -0x4(%rsp),%xmm1,%xmm1
> 
> The cvtss2sd instruction is specified to take the upper bits of SSE register
> unmodified, so here it merges high bits of xmm1 with results of float->double
> conversion (in low bits) into new xmm1. Unless the CPU can track dependencies
> separately for vector register components, it has to delay this instruction
> until the previous computation that modified xmm1 has completed (AMD Zen2 is
> an example of a microarchitecture that apparently can).

For future reference, my statement in parentheses was a bit inaccurate: Zen 2
avoids the false dependency only when xmm1 carries all-zeroes in its high bits
after being idiomatically zeroed (i.e. via pxor). Thanks to Andreas Abel for
pointing out this limitation.

(Nevertheless, the "blessed" state seemingly survives context switches, so
it's quite useful in practice, including for this testcase.)
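
To make the zeroing idiom concrete, here is a minimal sketch in C using SSE2
intrinsics. The helper name widen_with_zeroed_dest is hypothetical (it is not
from the testcase in this thread), and I'm assuming an x86 target with SSE2;
on other targets the sketch falls back to a plain cast. The idea is to hand
the conversion a freshly zeroed register as the source of the upper bits,
rather than whatever value the destination register held before. Whether the
compiler actually emits a pxor/xorps followed by cvtss2sd depends on the
compiler version and flags, so inspect the output of "gcc -O2 -S" rather than
taking it on faith.

```c
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: convert float to double, taking the upper bits of the
 * result from an explicitly zeroed register instead of from the previous
 * contents of the destination register. */
static double widen_with_zeroed_dest(float x)
{
#if defined(__SSE2__)
    /* Zeroing idiom; compilers usually emit pxor or xorps for this. */
    __m128d zero = _mm_setzero_pd();
    /* Low lane: (double)x. High lane: copied from 'zero'. */
    __m128d r = _mm_cvtss_sd(zero, _mm_set_ss(x));
    return _mm_cvtsd_f64(r);
#else
    /* Assumption: no SSE2 available, so just do the plain conversion. */
    return (double)x;
#endif
}
```

Calling widen_with_zeroed_dest(1.5f) returns 1.5, same as a plain cast; only
the dependency behaviour of the generated code may differ.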

Alexander