Re: [RFC V2 0/5] Introduce AVX512 optimized crypto algorithms

Dave Hansen <dave.hansen@xxxxxxxxx> · Fri, 7 May 2021 09:22:39 -0700

Hi Andy,

Here are a few answers to your questions.  Sorry for the delay.  There's
more of this kind of stuff to come, so stay tuned.

On 1/24/21 8:23 AM, Andy Lutomirski wrote:
> What is the impact of using an AVX-512 instruction on the logical
> thread, its siblings, and other cores on the package?

There’s a frequency penalty on the core using AVX-512, which affects
both hyperthreads on that core. The penalty lasts longer on Skylake than
on Cascade Lake, and longer on Cascade Lake than on Icelake.

There’s no direct penalty to the other cores.  They do all share an
overall heat budget of course, and on systems with insufficient fans,
heat can impact turbo range performance.

> Does the impact depend on whether it’s a 512-bit insn or a shorter EVEX insn?

The impact is incurred when ZMM-specific registers are used; this is not
dependent on the encoding.

On Icelake, the size of the drop depends on the type of instruction:
mov-like instructions incur little to no penalty, while the VFMA family
is the heaviest and incurs the largest penalty.

> What is the impact on subsequent shorter EVEX, VEX, and legacy
> SSE(2,3, etc) insns?

There’s a “shadow” in time (hysteresis) even after the last ZMM-using
instruction, so the frequency impact persists for a while regardless of
which instructions follow.

> How does VZEROUPPER figure in?  I can find an enormous amount of
> misinformation online, but nothing authoritative.

VZEROUPPER exists to clear the upper AVX2 (and AVX-512) register state
so that subsequent SSE operations don’t get false data dependencies.
It’s not related to the frequency impact.

> What is the effect of the AVX-512 states (5-7) being “in use”?  As far
> as I can tell, the only operations that clear XINUSE[5-7] are XRSTOR
> and its variants.  Is this correct?

XINUSE only impacts XSAVE*/XRSTOR*.  Just having XINUSE[5-7]=0x7 will
not incur the frequency impact.  In other words, the XSAVE*/XRSTOR*
“use” of ZMM-specific register state does not incur the frequency penalty.
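
To make the XINUSE[5-7] notation concrete, here is a minimal sketch in C
that picks the three AVX-512 state components out of an XINUSE-style
bitmap. The component numbers (5 = opmask, 6 = ZMM_Hi256, 7 = Hi16_ZMM)
are the architectural ones; the enum and helper names are illustrative
assumptions, not kernel identifiers. How the bitmap is obtained is
deliberately left out of the sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural XSAVE state-component numbers for AVX-512.
 * The identifiers themselves are hypothetical, chosen for this sketch. */
enum {
    XCOMP_OPMASK    = 5, /* k0-k7 mask registers */
    XCOMP_ZMM_HI256 = 6, /* upper 256 bits of ZMM0-ZMM15 */
    XCOMP_HI16_ZMM  = 7, /* ZMM16-ZMM31 */
};

/* Extract bits 5..7 of the bitmap as a 3-bit field, so that
 * "XINUSE[5-7]=0x7" corresponds to a return value of 0x7. */
static inline unsigned int avx512_xinuse_field(uint64_t xinuse)
{
    return (unsigned int)((xinuse >> XCOMP_OPMASK) & 0x7u);
}

/* True when one specific state component is marked as in use. */
static inline bool xcomp_in_use(uint64_t xinuse, unsigned int comp)
{
    return ((xinuse >> comp) & 1u) != 0;
}
```

For example, a bitmap of 0xE7 (x87, SSE, AVX and all three AVX-512
components) gives a field value of 0x7.  Per the answer above, that value
only affects what XSAVE*/XRSTOR* do; it says nothing about whether the
frequency penalty is being incurred.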

> On AVX-512 capable CPUs, do we ever get a penalty for executing a
> non-VEX insn followed by a large-width EVEX insn without an
> intervening VZEROUPPER?  The docs suggest no, since Broadwell and
> before don’t support EVEX, but I’d like to know for sure.

It’s the other way around: the false dependency is on the non-VEX
instruction side.  A non-VEX instruction is required to preserve the
YMM/ZMM “upper half” state, which makes it depend on whatever wrote that
state earlier.  An instruction cannot depend on a future instruction, so
a non-VEX instruction followed by an (E)VEX instruction has no false
dependency, and no VZEROUPPER is needed.