Re: [RFC V2 0/5] Introduce AVX512 optimized crypto algorithms

"Dey, Megha" <megha.dey@xxxxxxxxx> · Tue, 23 Feb 2021 16:54:31 -0800

Hi Andy,

On 1/24/2021 8:23 AM, Andy Lutomirski wrote:
On Fri, Jan 22, 2021 at 11:29 PM Megha Dey <megha.dey@xxxxxxxxx> wrote:
Optimize crypto algorithms using AVX512 instructions - VAES and VPCLMULQDQ
(first implemented on Intel's Icelake client and Xeon CPUs).

These algorithms take advantage of the AVX512 registers to keep the CPU
busy and increase memory bandwidth utilization. They provide substantial
(2-10x) improvements over existing crypto algorithms when update data size
is greater than 128 bytes and do not have any significant impact when used
on small amounts of data.

However, these algorithms may also incur a frequency penalty and cause
collateral damage to other workloads running on the same core(co-scheduled
threads). These frequency drops are also known as bin drops where 1 bin
drop is around 100MHz. With the SpecCPU and ffmpeg benchmark, a 0-1 bin
drop(0-100MHz) is observed on Icelake desktop and 0-2 bin drops (0-200Mhz)
are observed on the Icelake server.

The AVX512 optimization are disabled by default to avoid impact on other
workloads. In order to use these optimized algorithms:
1. At compile time:
    a. User must enable CONFIG_CRYPTO_AVX512 option
    b. Toolchain(assembler) must support VPCLMULQDQ and VAES instructions
2. At run time:
    a. User must set module parameter use_avx512 at boot time
    b. Platform must support VPCLMULQDQ and VAES features

N.B. It is unclear whether these coarse grain controls(global module
parameter) would meet all user needs. Perhaps some per-thread control might
be useful? Looking for guidance here.

I've just been looking at some performance issues with in-kernel AVX,
and I have a whole pile of questions that I think should be answered
first:

What is the impact of using an AVX-512 instruction on the logical
thread, its siblings, and other cores on the package?

Does the impact depend on whether it’s a 512-bit insn or a shorter EVEX insn?

What is the impact on subsequent shorter EVEX, VEX, and legacy
SSE(2,3, etc) insns?

How does VZEROUPPER figure in?  I can find an enormous amount of
misinformation online, but nothing authoritative.

What is the effect of the AVX-512 states (5-7) being “in use”?  As far
as I can tell, the only operations that clear XINUSE[5-7] are XRSTOR
and its variants.  Is this correct?

On AVX-512 capable CPUs, do we ever get a penalty for executing a
non-VEX insn followed by a large-width EVEX insn without an
intervening VZEROUPPER?  The docs suggest no, since Broadwell and
before don’t support EVEX, but I’d like to know for sure.

My current opinion is that we should not enable AVX-512 in-kernel
except on CPUs that we determine have good AVX-512 support.  Based on
some reading, that seems to mean Ice Lake Client and not anything
before it.  I also think a bunch of the above questions should be
answered before we do any of this.  Right now we have a regression of
unknown impact in regular AVX support in-kernel, we will have
performance issues in-kernel depending on what user code has done
recently, and I'm still trying to figure out what to do about it.
Throwing AVX-512 into the mix without real information is not going to
improve the situation.

We are currently working on providing you with answers on the questions 
you have raised regarding AVX.

Thanks,

Megha