On Fri, 18 Dec 2020 at 22:07, Megha Dey <megha.dey@xxxxxxxxx> wrote:
>
> From: Kyung Min Park <kyung.min.park@xxxxxxxxx>
>
> Optimize GHASH computations with the 512 bit wide VPCLMULQDQ instructions.
> The new instruction allows to work on 4 x 16 byte blocks at the time.
> For best parallelism and deeper out of order execution, the main loop of
> the code works on 16 x 16 byte blocks at the time and performs reduction
> every 48 x 16 byte blocks. Such approach needs 48 precomputed GHASH subkeys
> and the precompute operation has been optimized as well to leverage 512 bit
> registers, parallel carry less multiply and reduction.
>
> VPCLMULQDQ instruction is used to accelerate the most time-consuming
> part of GHASH, carry-less multiplication. VPCLMULQDQ instruction
> with AVX-512F adds EVEX encoded 512 bit version of PCLMULQDQ instruction.
>
> The glue code in ghash_clmulni_intel module overrides existing PCLMULQDQ
> version with the VPCLMULQDQ version when the following criteria are met:
> At compile time:
> 1. CONFIG_CRYPTO_AVX512 is enabled
> 2. toolchain(assembler) supports VPCLMULQDQ instructions
> At runtime:
> 1. VPCLMULQDQ and AVX512VL features are supported on a platform (currently
>    only Icelake)
> 2. If compiled as built-in module, ghash_clmulni_intel.use_avx512 is set at
>    boot time or /sys/module/ghash_clmulni_intel/parameters/use_avx512 is set
>    to 1 after boot.
>    If compiled as loadable module, use_avx512 module parameter must be set:
>    modprobe ghash_clmulni_intel use_avx512=1
>
> With new implementation, tcrypt ghash speed test shows about 4x to 10x
> speedup improvement for GHASH calculation compared to the original
> implementation with PCLMULQDQ when the bytes per update size is 256 Bytes
> or above. Detailed results for a variety of block sizes and update
> sizes are in the table below. The test was performed on Icelake based
> platform with constant frequency set for CPU.
>
> The average performance improvement of the AVX512 version over the current
> implementation is as follows:
> For bytes per update >= 1KB, we see the average improvement of 882%(~8.8x).
> For bytes per update < 1KB, we see the average improvement of 370%(~3.7x).
>
> A typical run of tcrypt with GHASH calculation with PCLMULQDQ instruction
> and VPCLMULQDQ instruction shows the following results.
>
> ---------------------------------------------------------------------------
> |            |            |         cycles/operation         |            |
> |            |            |      (the lower the better)      |            |
> |    byte    |   bytes    |----------------------------------| percentage |
> |   blocks   | per update |   GHASH test   |   GHASH test    | loss/gain  |
> |            |            | with PCLMULQDQ | with VPCLMULQDQ |            |
> |------------|------------|----------------|-----------------|------------|
> |      16    |      16    |       144      |       233       |    -38.0   |
> |      64    |      16    |       535      |       709       |    -24.5   |
> |      64    |      64    |       210      |       146       |     43.8   |
> |     256    |      16    |      1808      |      1911       |     -5.4   |
> |     256    |      64    |       865      |       581       |     48.9   |
> |     256    |     256    |       682      |       170       |    301.0   |
> |    1024    |      16    |      6746      |      6935       |     -2.7   |
> |    1024    |     256    |      2829      |       714       |    296.0   |
> |    1024    |    1024    |      2543      |       341       |    645.0   |
> |    2048    |      16    |     13219      |     13403       |     -1.3   |
> |    2048    |     256    |      5435      |      1408       |    286.0   |
> |    2048    |    1024    |      5218      |       685       |    661.0   |
> |    2048    |    2048    |      5061      |       565       |    796.0   |
> |    4096    |      16    |     40793      |     27615       |     47.8   |
> |    4096    |     256    |     10662      |      2689       |    297.0   |
> |    4096    |    1024    |     10196      |      1333       |    665.0   |
> |    4096    |    4096    |     10049      |      1011       |    894.0   |
> |    8192    |      16    |     51672      |     54599       |     -5.3   |
> |    8192    |     256    |     21228      |      5284       |    301.0   |
> |    8192    |    1024    |     20306      |      2556       |    694.0   |
> |    8192    |    4096    |     20076      |      2044       |    882.0   |
> |    8192    |    8192    |     20071      |      2017       |    895.0   |
> ---------------------------------------------------------------------------
>
> This work was inspired by the AES GCM mode optimization published
> in Intel Optimized IPSEC Cryptographic library.
> https://github.com/intel/intel-ipsec-mb/lib/avx512/gcm_vaes_avx512.asm
>
> Co-developed-by: Greg Tucker <greg.b.tucker@xxxxxxxxx>
> Signed-off-by: Greg Tucker <greg.b.tucker@xxxxxxxxx>
> Co-developed-by: Tomasz Kantecki <tomasz.kantecki@xxxxxxxxx>
> Signed-off-by: Tomasz Kantecki <tomasz.kantecki@xxxxxxxxx>
> Signed-off-by: Kyung Min Park <kyung.min.park@xxxxxxxxx>
> Co-developed-by: Megha Dey <megha.dey@xxxxxxxxx>
> Signed-off-by: Megha Dey <megha.dey@xxxxxxxxx>

Hello Megha,

What is the purpose of this separate GHASH module? GHASH is only used
in combination with AES-CTR to produce GCM, and this series already
contains a GCM driver.

Do cores exist that implement PCLMULQDQ but not AES-NI?

If not, I think we should be able to drop this patch (and remove the
existing PCLMULQDQ GHASH driver as well)
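
For anyone who wants to check the PCLMULQDQ-without-AES-NI question on
a machine they have at hand, a rough sketch follows. It is not part of
the patch: the helper names are made up for illustration, and it
assumes the usual Linux /proc/cpuinfo flag names "pclmulqdq" and "aes".
It only reports on the CPU it runs on, so it cannot show that no such
core exists.

```python
def parse_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line as a set.

    Assumes the Linux x86 /proc/cpuinfo layout: 'flags : a b c ...'.
    """
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def clmul_without_aesni(flags):
    """True if PCLMULQDQ is advertised but AES-NI is not."""
    return "pclmulqdq" in flags and "aes" not in flags


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            flags = parse_flags(f.read())
    except OSError:
        # Not Linux, or /proc is unavailable: nothing to report.
        flags = set()
    print("PCLMULQDQ without AES-NI:", clmul_without_aesni(flags))
```

A result of False on one machine says nothing about other parts, so
this is only useful for collecting data points.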