Re: [RFC V1 3/7] crypto: ghash - Optimized GHASH computations

"Dey, Megha" <megha.dey@xxxxxxxxx> · Fri, 15 Jan 2021 16:14:40 -0800

Hi Ard,

On 12/19/2020 9:03 AM, Ard Biesheuvel wrote:
On Fri, 18 Dec 2020 at 22:07, Megha Dey <megha.dey@xxxxxxxxx> wrote:
From: Kyung Min Park <kyung.min.park@xxxxxxxxx>

Optimize GHASH computations with the 512-bit wide VPCLMULQDQ instructions.
The new instruction allows working on 4 x 16-byte blocks at a time.
For best parallelism and deeper out-of-order execution, the main loop of
the code works on 16 x 16-byte blocks at a time and performs a reduction
every 48 x 16-byte blocks. Such an approach needs 48 precomputed GHASH
subkeys, and the precompute operation has been optimized as well to leverage
512-bit registers, parallel carry-less multiplication, and reduction.

The VPCLMULQDQ instruction is used to accelerate the most time-consuming
part of GHASH, carry-less multiplication. VPCLMULQDQ with AVX-512F adds
an EVEX-encoded 512-bit version of the PCLMULQDQ instruction.
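
To illustrate why precomputed subkeys (powers of H) allow several blocks to be
multiplied independently and reduced only once, here is a minimal Python sketch
of the algebra. It is not the patch's assembly: the function names are
hypothetical, the batch size of 4 is chosen only for readability (the patch
describes a 16-block main loop and 48 subkeys), and the carry-less multiply is
modeled bit by bit instead of with VPCLMULQDQ.

```python
POLY = (1 << 128) | 0x87  # GCM field polynomial x^128 + x^7 + x^2 + x + 1


def rev128(x):
    """Reverse the bit order of a 128-bit integer (GCM stores bits reflected)."""
    return int(f"{x:0128b}"[::-1], 2)


def clmul(a, b):
    """Carry-less multiply: polynomial product over GF(2), no reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def reduce(p):
    """Reduce a polynomial of any degree modulo POLY."""
    for i in range(p.bit_length() - 1, 127, -1):
        if (p >> i) & 1:
            p ^= POLY << (i - 128)
    return p


def ghash_serial(h, blocks):
    """Horner form: one multiply and one reduction per block."""
    hn, y = rev128(h), 0
    for x in blocks:
        y = reduce(clmul(y ^ rev128(x), hn))
    return rev128(y)


def ghash_aggregated(h, blocks, batch=4):
    """Multiply `batch` blocks by precomputed powers of H, XOR the
    unreduced products, and reduce once per batch."""
    hn = rev128(h)
    powers = [hn]  # powers[k] holds H^(k+1)
    for _ in range(batch - 1):
        powers.append(reduce(clmul(powers[-1], hn)))
    y = 0
    for start in range(0, len(blocks), batch):
        chunk = [rev128(x) for x in blocks[start:start + batch]]
        chunk[0] ^= y  # fold the running hash into the first block
        n = len(chunk)
        acc = 0
        for k, x in enumerate(chunk):
            acc ^= clmul(x, powers[n - 1 - k])  # independent multiplies
        y = reduce(acc)  # single deferred reduction for the whole batch
    return rev128(y)
```

Both functions return the same value for any input because reduction is linear
over XOR; the aggregated form simply trades extra key storage for fewer
reductions and more independent multiplies.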

The glue code in the ghash_clmulni_intel module overrides the existing
PCLMULQDQ version with the VPCLMULQDQ version when the following criteria
are met:
At compile time:
1. CONFIG_CRYPTO_AVX512 is enabled
2. The toolchain (assembler) supports VPCLMULQDQ instructions
At runtime:
1. VPCLMULQDQ and AVX512VL features are supported on the platform (currently
    only Icelake)
2. If compiled as a built-in module, ghash_clmulni_intel.use_avx512 is set at
    boot time or /sys/module/ghash_clmulni_intel/parameters/use_avx512 is set
    to 1 after boot.
    If compiled as a loadable module, the use_avx512 module parameter must be
    set:
    modprobe ghash_clmulni_intel use_avx512=1
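
The runtime criteria above can be checked from user space with a short Python
sketch. This is only an illustration, not part of the patch: the function names
are hypothetical, the feature flags are assumed to appear in /proc/cpuinfo as
"vpclmulqdq" and "avx512vl", and the use_avx512 parameter exists only on a
kernel carrying this RFC series. Since the parameter's type is not stated here,
the sketch accepts both "1" and "Y" as enabled.

```python
from pathlib import Path

# Path taken from the commit message; present only with this RFC applied.
PARAM = Path("/sys/module/ghash_clmulni_intel/parameters/use_avx512")
REQUIRED_FLAGS = {"vpclmulqdq", "avx512vl"}


def cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return set(line.partition(":")[2].split())
    return set()


def avx512_ghash_eligible(cpuinfo_text, use_avx512_param):
    """True if both CPU features are present and the parameter is enabled."""
    enabled = use_avx512_param.strip() in ("1", "Y", "y")
    return REQUIRED_FLAGS <= cpu_flags(cpuinfo_text) and enabled


def check_this_machine():
    """Apply the check to the running system (Linux only)."""
    cpuinfo = Path("/proc/cpuinfo").read_text()
    param = PARAM.read_text() if PARAM.exists() else "0"
    return avx512_ghash_eligible(cpuinfo, param)
```

Keeping the parsing separate from the file reads makes the logic easy to
exercise with captured /proc/cpuinfo text from another machine.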

With the new implementation, the tcrypt ghash speed test shows about a 4x
to 10x speedup for GHASH calculation compared to the original PCLMULQDQ
implementation when the bytes-per-update size is 256 bytes or above.
Detailed results for a variety of block sizes and update sizes are in the
table below. The test was performed on an Icelake-based platform with the
CPU frequency held constant.

The average performance improvement of the AVX512 version over the current
implementation is as follows:
For bytes per update >= 1KB, we see an average improvement of 882% (~8.8x).
For bytes per update < 1KB, we see an average improvement of 370% (~3.7x).

A typical tcrypt run of the GHASH calculation with the PCLMULQDQ instruction
and with the VPCLMULQDQ instruction shows the following results.

---------------------------------------------------------------------------
|            |            |         cycles/operation         |            |
|            |            |       (the lower the better)     |            |
|    byte    |   bytes    |----------------------------------| percentage |
|   blocks   | per update |   GHASH test   |   GHASH test    | loss/gain  |
|            |            | with PCLMULQDQ | with VPCLMULQDQ |            |
|------------|------------|----------------|-----------------|------------|
|      16    |     16     |       144      |        233      |   -38.0    |
|      64    |     16     |       535      |        709      |   -24.5    |
|      64    |     64     |       210      |        146      |    43.8    |
|     256    |     16     |      1808      |       1911      |    -5.4    |
|     256    |     64     |       865      |        581      |    48.9    |
|     256    |    256     |       682      |        170      |   301.0    |
|    1024    |     16     |      6746      |       6935      |    -2.7    |
|    1024    |    256     |      2829      |        714      |   296.0    |
|    1024    |   1024     |      2543      |        341      |   645.0    |
|    2048    |     16     |     13219      |      13403      |    -1.3    |
|    2048    |    256     |      5435      |       1408      |   286.0    |
|    2048    |   1024     |      5218      |        685      |   661.0    |
|    2048    |   2048     |      5061      |        565      |   796.0    |
|    4096    |     16     |     40793      |      27615      |    47.8    |
|    4096    |    256     |     10662      |       2689      |   297.0    |
|    4096    |   1024     |     10196      |       1333      |   665.0    |
|    4096    |   4096     |     10049      |       1011      |   894.0    |
|    8192    |     16     |     51672      |      54599      |    -5.3    |
|    8192    |    256     |     21228      |       5284      |   301.0    |
|    8192    |   1024     |     20306      |       2556      |   694.0    |
|    8192    |   4096     |     20076      |       2044      |   882.0    |
|    8192    |   8192     |     20071      |       2017      |   895.0    |
---------------------------------------------------------------------------
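
The loss/gain column appears to follow (old cycles / new cycles - 1) * 100.
The table does not state the formula, so this is an inference; the Python
sketch below (hypothetical function name) reproduces a few rows to within
about one percentage point, which is consistent with run-to-run noise or
rounding in the original measurements.

```python
def gain_percent(pclmulqdq_cycles, vpclmulqdq_cycles):
    """Relative gain of the VPCLMULQDQ run; negative values mean a loss."""
    return (pclmulqdq_cycles / vpclmulqdq_cycles - 1.0) * 100.0


# (block size, bytes per update, PCLMULQDQ cycles, VPCLMULQDQ cycles)
rows = [
    (16, 16, 144, 233),
    (256, 256, 682, 170),
    (8192, 8192, 20071, 2017),
]

for blocks, update, old, new in rows:
    print(f"{blocks:5d} {update:5d} {gain_percent(old, new):8.1f}")
```

Under this formula, small updates (16 bytes) show a loss because the wider
code path cannot fill its registers, while large updates show the gains
reported in the table.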

This work was inspired by the AES GCM mode optimization published
in the Intel Optimized IPSEC Cryptographic library.
https://github.com/intel/intel-ipsec-mb/lib/avx512/gcm_vaes_avx512.asm

Co-developed-by: Greg Tucker <greg.b.tucker@xxxxxxxxx>
Signed-off-by: Greg Tucker <greg.b.tucker@xxxxxxxxx>
Co-developed-by: Tomasz Kantecki <tomasz.kantecki@xxxxxxxxx>
Signed-off-by: Tomasz Kantecki <tomasz.kantecki@xxxxxxxxx>
Signed-off-by: Kyung Min Park <kyung.min.park@xxxxxxxxx>
Co-developed-by: Megha Dey <megha.dey@xxxxxxxxx>
Signed-off-by: Megha Dey <megha.dey@xxxxxxxxx>

Hello Megha,

What is the purpose of this separate GHASH module? GHASH is only used
in combination with AES-CTR to produce GCM, and this series already
contains a GCM driver.

Do cores exist that implement PCLMULQDQ but not AES-NI?

If not, I think we should be able to drop this patch (and remove the
existing PCLMULQDQ GHASH driver as well)

AFAIK, dm-verity (an authenticated but unencrypted file system) is one
use case for authentication only.

However, I am not sure whether it specifically uses GHASH or SHA.

Also, I do not know of any cores that implement PCLMULQDQ and not AES-NI.

Megha