On Fri, 6 May 2022, Alexander Monakov wrote:

> The primary issue here is false dependency on vcvtss2sd instruction. In the
> snippet shown in Stéphane's email, the slower variant begins with
>
>     vcvtss2sd -0x4(%rsp),%xmm1,%xmm1
>
> The cvtss2sd instruction is specified to take the upper bits of SSE register
> unmodified, so here it merges high bits of xmm1 with results of float->double
> conversion (in low bits) into new xmm1. Unless the CPU can track dependencies
> separately for vector register components, it has to delay this instruction
> until the previous computation that modified xmm1 has completed (AMD Zen2 is
> an example of a microarchitecture that apparently can).

For future reference, my statement in parentheses was a bit inaccurate: Zen 2
avoids the false dependency provided that xmm1 carries all-zeroes in high bits
after being idiomatically zeroed (i.e. via pxor). Thanks to Andreas Abel for
pointing out there's a limitation.

(Nevertheless, the "blessed" state seemingly survives context switches, so it's
quite useful, including for this testcase.)

Alexander
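
For readers who want to see the merge semantics discussed above, here is a
minimal C sketch using SSE2 intrinsics. It is an illustration only, not the
testcase from this thread: the helper names (cvt_merge, cvt_zeroed, low_lane,
high_lane, self_check) are made up for this example, and it assumes an x86
target with SSE2 enabled. It shows only the architectural result (the high
lane is carried over from the destination); which instructions a compiler
emits for it, and how a particular CPU tracks the dependency, is not
something the code can demonstrate.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics: _mm_cvtss_sd, _mm_setzero_pd, ... */

/* Merging form: the low lane receives (double)f, the high lane is copied
   from `prev`. The result therefore depends on whatever produced `prev`. */
static __m128d cvt_merge(__m128d prev, float f)
{
    return _mm_cvtss_sd(prev, _mm_set_ss(f));
}

/* Same conversion, but merging into a freshly zeroed register, which is
   the situation the pxor zeroing idiom sets up. */
static __m128d cvt_zeroed(float f)
{
    return _mm_cvtss_sd(_mm_setzero_pd(), _mm_set_ss(f));
}

/* Helpers (hypothetical names) to read back the two double lanes. */
static double low_lane(__m128d v)  { return _mm_cvtsd_f64(v); }
static double high_lane(__m128d v) { return _mm_cvtsd_f64(_mm_unpackhi_pd(v, v)); }

/* Small self-check of the lane behaviour. */
static void self_check(void)
{
    __m128d prev = _mm_set_pd(7.0, 3.0);   /* high = 7.0, low = 3.0 */
    __m128d m = cvt_merge(prev, 1.5f);
    assert(low_lane(m) == 1.5);            /* converted value */
    assert(high_lane(m) == 7.0);           /* carried over from prev */

    __m128d z = cvt_zeroed(1.5f);
    assert(low_lane(z) == 1.5);
    assert(high_lane(z) == 0.0);           /* nothing stale to merge */
}
```

Calling self_check() exercises both variants. The two differ only in where
the high lane comes from, and that input is exactly what creates (or avoids)
the dependency on the earlier writer of the register.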