Re: [RFC V2 0/5] Introduce AVX512 optimized crypto algorithms

Andy Lutomirski <luto@xxxxxxxxxx> · Wed, 24 Feb 2021 09:42:21 -0800

On Tue, Feb 23, 2021 at 4:54 PM Dey, Megha <megha.dey@xxxxxxxxx> wrote:
>
> Hi Andy,
>
> On 1/24/2021 8:23 AM, Andy Lutomirski wrote:
> > On Fri, Jan 22, 2021 at 11:29 PM Megha Dey <megha.dey@xxxxxxxxx> wrote:
> >> Optimize crypto algorithms using AVX512 instructions - VAES and VPCLMULQDQ
> >> (first implemented on Intel's Icelake client and Xeon CPUs).
> >>
> >> These algorithms take advantage of the AVX512 registers to keep the CPU
> >> busy and increase memory bandwidth utilization. They provide substantial
> >> (2-10x) improvements over existing crypto algorithms when update data size
> >> is greater than 128 bytes and do not have any significant impact when used
> >> on small amounts of data.
> >>
> >> However, these algorithms may also incur a frequency penalty and cause
> >> collateral damage to other workloads running on the same core (co-scheduled
> >> threads). These frequency drops are also known as bin drops, where 1 bin
> >> drop is around 100MHz. With the SpecCPU and ffmpeg benchmarks, a 0-1 bin
> >> drop (0-100MHz) is observed on Icelake desktop and 0-2 bin drops (0-200MHz)
> >> are observed on the Icelake server.
> >>
> >> The AVX512 optimizations are disabled by default to avoid impact on other
> >> workloads. In order to use these optimized algorithms:
> >> 1. At compile time:
> >>     a. User must enable CONFIG_CRYPTO_AVX512 option
> >>     b. Toolchain (assembler) must support VPCLMULQDQ and VAES instructions
> >> 2. At run time:
> >>     a. User must set module parameter use_avx512 at boot time
> >>     b. Platform must support VPCLMULQDQ and VAES features
> >>
> >> N.B. It is unclear whether these coarse-grained controls (a global module
> >> parameter) would meet all user needs. Perhaps some per-thread control might
> >> be useful? Looking for guidance here.
> >
> > I've just been looking at some performance issues with in-kernel AVX,
> > and I have a whole pile of questions that I think should be answered
> > first:
> >
> > What is the impact of using an AVX-512 instruction on the logical
> > thread, its siblings, and other cores on the package?
> >
> > Does the impact depend on whether it’s a 512-bit insn or a shorter EVEX insn?
> >
> > What is the impact on subsequent shorter EVEX, VEX, and legacy
> > SSE (2, 3, etc.) insns?
> >
> > How does VZEROUPPER figure in?  I can find an enormous amount of
> > misinformation online, but nothing authoritative.
> >
> > What is the effect of the AVX-512 states (5-7) being “in use”?  As far
> > as I can tell, the only operations that clear XINUSE[5-7] are XRSTOR
> > and its variants.  Is this correct?
> >
> > On AVX-512 capable CPUs, do we ever get a penalty for executing a
> > non-VEX insn followed by a large-width EVEX insn without an
> > intervening VZEROUPPER?  The docs suggest no, since Broadwell and
> > before don’t support EVEX, but I’d like to know for sure.
> >
> >
> > My current opinion is that we should not enable AVX-512 in-kernel
> > except on CPUs that we determine have good AVX-512 support.  Based on
> > some reading, that seems to mean Ice Lake Client and not anything
> > before it.  I also think a bunch of the above questions should be
> > answered before we do any of this.  Right now we have a regression of
> > unknown impact in regular AVX support in-kernel, we will have
> > performance issues in-kernel depending on what user code has done
> > recently, and I'm still trying to figure out what to do about it.
> > Throwing AVX-512 into the mix without real information is not going to
> > improve the situation.
>
> We are currently working on providing you with answers to the questions
> you have raised regarding AVX.

Thanks!
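
For reference, the run-time platform requirement quoted above (VPCLMULQDQ
and VAES support) can be checked from user space. A minimal Python sketch,
assuming the Linux /proc/cpuinfo flag names avx512f, vaes and vpclmulqdq;
the required set and the helper names are illustrative and are not taken
from the patch series:

```python
# Sketch only: REQUIRED_FLAGS is an assumption based on the requirements
# quoted above, not the check the patch series actually performs.
REQUIRED_FLAGS = {"avx512f", "vaes", "vpclmulqdq"}


def cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_flags(cpuinfo_text, required=REQUIRED_FLAGS):
    """Return the required flags that the CPU does not advertise, sorted."""
    return sorted(set(required) - cpu_flags(cpuinfo_text))


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = ""  # not Linux, or /proc unavailable: everything is "missing"
    missing = missing_flags(text)
    print("missing:", ", ".join(missing) if missing else "none")
```

This only reports whether the CPU advertises the features. It says nothing
about the frequency impact discussed in this thread.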