thank you very much Alexander for your analysis and the bugzilla report!

Paul

> Date: Fri, 6 May 2022 12:27:39 +0300 (MSK)
> From: Alexander Monakov <amonakov@xxxxxxxxx>
>
> On Fri, 6 May 2022, Paul Zimmermann via Gcc-help wrote:
>
> > here are latency measurements (still on i5-4590):
> >
> >             | gcc-9 | gcc-10 | gcc-11 |
> > ------------|-------|--------|--------|
> > -std=c99    |  70.8 |   70.3 |   70.2 |
> > -std=gnu18  |  59.5 |   59.5 |   59.5 |
> >
> > It thus seems the issue only appears for the reciprocal throughput.
>
> Thanks.
>
> The primary issue here is a false dependency on the vcvtss2sd
> instruction. In the snippet shown in Stéphane's email, the slower
> variant begins with
>
>   vcvtss2sd -0x4(%rsp),%xmm1,%xmm1
>
> The cvtss2sd instruction is specified to leave the upper bits of the
> SSE register unmodified, so here it merges the high bits of xmm1 with
> the result of the float->double conversion (in the low bits) into a
> new xmm1. Unless the CPU can track dependencies separately for vector
> register components (AMD Zen 2 is an example of a microarchitecture
> that apparently can), it has to delay this instruction until the
> previous computation that modified xmm1 has completed.
>
> This limits the degree to which separate cr_log10f calls can overlap,
> reducing throughput. In the latency measurements, the calls are
> already serialized by the dependency on xmm0, so the additional false
> dependency does not matter.
>
> (So fma is a "red herring": depending on compiler version and flags,
> register allocation simply places the last assignment into xmm1
> differently.)
>
> If you want to experiment, you can hand-edit the assembly to replace
> the problematic instruction with a variant that avoids the false
> dependency, such as
>
>   vcvtss2sd %xmm0, %xmm0, %xmm1
>
> or
>
>   vpxor     %xmm1, %xmm1, %xmm1
>   vcvtss2sd -0x4(%rsp), %xmm1, %xmm1
>
> GCC has code to do this automatically, but for some reason it does not
> work for your function. I have reported it to Bugzilla:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105504
>
> Alexander
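
A small addendum for anyone who wants to see the distinction between
the two measurement modes at the source level. Below is a minimal,
self-contained sketch (mine, not the harness behind the numbers above;
plain log10f from libm stands in for cr_log10f, and N is an arbitrary
iteration count). In the first loop each call consumes the previous
result, so the calls serialize on the return register exactly as
Alexander describes and a false dependency costs nothing extra; in the
second loop the inputs are independent, so overlap is limited only by
dependencies such as the false one on the destination of vcvtss2sd.

  #include <math.h>
  #include <stdio.h>

  #define N 10000000

  int main(void) {
      /* Latency mode: each call consumes the previous result, forming
         one serial dependency chain (through xmm0 in the x86-64 SysV
         ABI), so the calls cannot overlap, false dependency or not. */
      float x = 0.5f;
      for (long i = 0; i < N; i++)
          x = log10f(x + 2.0f);        /* keeps the argument > 0 */

      /* Throughput mode: independent inputs, so consecutive calls may
         overlap in the pipeline; here a false dependency on the
         vcvtss2sd destination register limits that overlap. */
      float sum = 0.0f;
      for (long i = 0; i < N; i++)
          sum += log10f(1.0f + (float)i);

      printf("%g %g\n", x, sum);       /* keep both results live */
      return 0;
  }

Timing each loop separately (e.g. with clock_gettime) and dividing by
N gives the per-call latency and the reciprocal throughput
respectively, which is why only the second figure is sensitive to the
vcvtss2sd issue.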