On Fri, 6 May 2022, Paul Zimmermann via Gcc-help wrote:

> here are latency metrics (still on i5-4590):
>
>             | gcc-9 | gcc-10 | gcc-11 |
> ------------|-------|--------|--------|
> -std=c99    | 70.8  | 70.3   | 70.2   |
> -std=gnu18  | 59.5  | 59.5   | 59.5   |
>
> It thus seems the issue only appears for the reciprocal throughput.

Thanks. The primary issue here is a false dependency on the vcvtss2sd
instruction. In the snippet shown in Stéphane's email, the slower variant
begins with

    vcvtss2sd -0x4(%rsp),%xmm1,%xmm1

The cvtss2sd instruction is specified to leave the upper bits of the SSE
register unmodified, so here it merges the high bits of xmm1 with the
result of the float->double conversion (in the low bits) into a new xmm1.
Unless the CPU can track dependencies separately for vector register
components (AMD Zen2 is an example of a microarchitecture that apparently
can), it has to delay this instruction until the previous computation that
modified xmm1 has completed. This limits the degree to which separate
cr_log10f calls can overlap, hurting throughput. In latency measurements
the calls are already serialized by the dependency over xmm0, so the
additional false dependency does not matter.

(So fma is a "red herring": it's just that, depending on compiler version
and flags, register allocation will place the last assignment into xmm1
differently.)

If you want to experiment, you can hand-edit the assembly to replace the
problematic instruction with variants that avoid the false dependency,
such as

    vcvtss2sd %xmm0, %xmm0, %xmm1

or

    vpxor %xmm1, %xmm1, %xmm1
    vcvtss2sd -0x4(%rsp),%xmm1,%xmm1

GCC has code to do this automatically, but for some reason it doesn't work
for your function. I have reported it to Bugzilla:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105504

Alexander