Re: [RFC V2 0/5] Introduce AVX512 optimized crypto algorithms

"Dey, Megha" <megha.dey@xxxxxxxxx> · Mon, 31 Jan 2022 10:43:12 -0800

Hi all,

On 2/24/2021 9:42 AM, Andy Lutomirski wrote:
On Tue, Feb 23, 2021 at 4:54 PM Dey, Megha <megha.dey@xxxxxxxxx> wrote:
Hi Andy,

On 1/24/2021 8:23 AM, Andy Lutomirski wrote:
On Fri, Jan 22, 2021 at 11:29 PM Megha Dey <megha.dey@xxxxxxxxx> wrote:
Optimize crypto algorithms using AVX512 instructions - VAES and VPCLMULQDQ
(first implemented on Intel's Icelake client and Xeon CPUs).

These algorithms take advantage of the AVX512 registers to keep the CPU
busy and increase memory bandwidth utilization. They provide substantial
(2-10x) improvements over existing crypto algorithms when update data size
is greater than 128 bytes and do not have any significant impact when used
on small amounts of data.

However, these algorithms may also incur a frequency penalty and cause
collateral damage to other workloads running on the same core(co-scheduled
threads). These frequency drops are also known as bin drops where 1 bin
drop is around 100MHz. With the SpecCPU and ffmpeg benchmark, a 0-1 bin
drop(0-100MHz) is observed on Icelake desktop and 0-2 bin drops (0-200Mhz)
are observed on the Icelake server.

The AVX512 optimization are disabled by default to avoid impact on other
workloads. In order to use these optimized algorithms:
1. At compile time:
     a. User must enable CONFIG_CRYPTO_AVX512 option
     b. Toolchain(assembler) must support VPCLMULQDQ and VAES instructions
2. At run time:
     a. User must set module parameter use_avx512 at boot time
     b. Platform must support VPCLMULQDQ and VAES features

N.B. It is unclear whether these coarse grain controls(global module
parameter) would meet all user needs. Perhaps some per-thread control might
be useful? Looking for guidance here.
I've just been looking at some performance issues with in-kernel AVX,
and I have a whole pile of questions that I think should be answered
first:

What is the impact of using an AVX-512 instruction on the logical
thread, its siblings, and other cores on the package?

Does the impact depend on whether it’s a 512-bit insn or a shorter EVEX insn?

What is the impact on subsequent shorter EVEX, VEX, and legacy
SSE(2,3, etc) insns?

How does VZEROUPPER figure in?  I can find an enormous amount of
misinformation online, but nothing authoritative.

What is the effect of the AVX-512 states (5-7) being “in use”?  As far
as I can tell, the only operations that clear XINUSE[5-7] are XRSTOR
and its variants.  Is this correct?

On AVX-512 capable CPUs, do we ever get a penalty for executing a
non-VEX insn followed by a large-width EVEX insn without an
intervening VZEROUPPER?  The docs suggest no, since Broadwell and
before don’t support EVEX, but I’d like to know for sure.

My current opinion is that we should not enable AVX-512 in-kernel
except on CPUs that we determine have good AVX-512 support.  Based on
some reading, that seems to mean Ice Lake Client and not anything
before it.  I also think a bunch of the above questions should be
answered before we do any of this.  Right now we have a regression of
unknown impact in regular AVX support in-kernel, we will have
performance issues in-kernel depending on what user code has done
recently, and I'm still trying to figure out what to do about it.
Throwing AVX-512 into the mix without real information is not going to
improve the situation.
We are currently working on providing you with answers on the questions
you have raised regarding AVX.
Thanks!

We had submitted this patch series last year which uses AVX512F, VAES, 
VPCLMULQDQ instructions and ZMM(512 bit) registers to optimize certain 
crypto algorithms. As concluded, this approach could introduce a 
frequency drop of 1-2 bins for sibling threads running on the same core 
(512L instructions). The behavior is explained in article [1]. [2] 
covers similar topic as [1] but it focuses on client processors.

Since then, we have worked on new AES-GCM implementation using AVX512VL, 
VAES, VCLMUQLDQ instructions using only 256-bit YMM registers. With this 
implementation, we see a 1.5X improvement on ICX/ICL for 16KB buffers 
compared to the existing kernel AES-GCM implementation that works on 
128-bit XMM registers. Instructions used in the new GCM implementation 
classify as 256L ones. 256L class maps onto Core License 2 resulting in 
no frequency reduction (Figure 6 in [1]) and execute at the same 
frequency as an SSE code.

Before we start work on any upstream worthy patch, we would want to 
solicit any feedback to see if this implementation approach receives 
interest from the community.

Please note that AES-GCM is the predominant cipher suite for TLS and 
IPSEC. Having its efficient/performant implementation in the kernel will 
help customers and applications that rely on KTLS (like CDN/TLS proxy) 
or kernel IPSEC tunneling services.

[1] 
https://www.intel.com/content/www/us/en/architecture-and-technology/crypto-acceleration-in-xeon-scalable-processors-wp.html 

[2] https://travisdowns.github.io/blog/2020/08/19/icl-avx512-freq.html

Thanks,

Megha