On Fri, 6 May 2022, Paul Zimmermann via Gcc-help wrote:

> here are latency metrics (still on i5-4590):
>
>             | gcc-9 | gcc-10 | gcc-11 |
> ------------|-------|--------|--------|
> -std=c99    | 70.8  | 70.3   | 70.2   |
> -std=gnu18  | 59.5  | 59.5   | 59.5   |
>
> It thus seems the issue only appears for the reciprocal throughput.

Thanks. The primary issue here is a false dependency on the vcvtss2sd
instruction. In the snippet shown in Stéphane's email, the slower variant
begins with

    vcvtss2sd -0x4(%rsp),%xmm1,%xmm1

The cvtss2sd instruction is specified to leave the upper bits of the SSE
register unmodified, so here it merges the high bits of xmm1 with the
result of the float->double conversion (in the low bits) into a new xmm1.
Unless the CPU can track dependencies separately for vector register
components (AMD Zen2 is an example of a microarchitecture that apparently
can), it has to delay this instruction until the previous computation that
modified xmm1 has completed. This limits the degree to which separate
cr_log10f calls can overlap, hurting throughput. In latency measurements
the calls are already serialized by the dependency over xmm0, so the
additional false dependency does not matter.

(So fma is a "red herring": it's just that, depending on compiler version
and flags, register allocation will place the last assignment into xmm1
differently.)

If you want to experiment, you can hand-edit the assembly to replace the
problematic instruction with variants that avoid the false dependency,
such as

    vcvtss2sd %xmm0, %xmm0, %xmm1

or

    vpxor %xmm1, %xmm1, %xmm1
    vcvtss2sd -0x4(%rsp),%xmm1,%xmm1

GCC has code to do this automatically, but for some reason it doesn't work
for your function. I have reported it to Bugzilla:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105504

Alexander