RE: [PATCH v2 09/19] crypto: x86 - use common macro for FPU limit

"Elliott, Robert (Servers)" <elliott@xxxxxxx> · Tue, 18 Oct 2022 00:06:02 +0000

> -----Original Message-----
> From: Jason A. Donenfeld <Jason@xxxxxxxxx>
> Sent: Thursday, October 13, 2022 8:27 PM
> Subject: Re: [PATCH v2 09/19] crypto: x86 - use common macro for FPU
> limit
> 
> On Thu, Oct 13, 2022 at 3:48 PM Elliott, Robert (Servers)
> <elliott@xxxxxxx> wrote:
> > Perhaps we should declare a time goal like "30 us," measure the actual
> > speed of each algorithm with a tcrypt speed test, and calculate the
> > nominal value assuming some slow x86 CPU core speed?
> 
> Sure, pick something reasonable with good margin for a reasonable CPU.
> It doesn't have to be perfect, but just vaguely right for supported
> hardware.
> 
> > That could be further adjusted at run-time based on the supposed
> > minimum CPU frequency (e.g., as reported in
> > /sys/devices/system/cpu/cpufreq/policy0/scaling_min_freq).
> 
> Oh no, please no. Not another runtime knob. That also will make the
> loop less efficient.

Here are some stats measuring the time in CPU cycles between
kernel_fpu_begin() and kernel_fpu_end() for every x86 crypto
module using those function calls. This is before any
patches to enforce any new limits.

Driver                               boot tcrypt-sweep average
======                               ==== ============ =======
aegis128_aesni                       6240 |       8214     433
aesni_intel                         22218 |     150558      68
aria_aesni_avx_x86_64                   0 >      95560    1282
camellia_aesni_avx2                 52300        52300    4300
camellia_aesni_avx_x86_64           20920        20920    5915
camellia_x86_64                         0            0       0
cast5_avx_x86_64                    41854 |     108996    6602
cast6_avx_x86_64                    39270 |     119476   10596
chacha_x86_64                        3516 |      58112     349
crc32c_intel                         1458 |       2702     235
crc32_pclmul                         1610 |       3130     210
crct10dif_pclmul                     1928 |       2096      82
ghash_clmulni_intel                  9154 |      56632     336
libblake2s_x86_64                    7514         7514     897
nhpoly1305_avx2                      1360 |       5408     301
poly1305_x86_64                     20656 |      21688     409
polyval_clmulni                     13972        13972      34
serpent_avx2                        45686 |      74824    4185
serpent_avx_x86_64                  47436        47436    7120
serpent_sse2_x86_64                 38492        38492    7400
sha1_ssse3                          20950 |      38310     512
sha256_ssse3                        46554 |      57162    1201
sha512_ssse3                    157051800    157051800  167728
sm3_avx_x86_64                      82372        82372    2017
sm4_aesni_avx_x86_64                66350        66350    2019
twofish_avx_x86_64                 104598 |     163894    6633
twofish_x86_64_3way                     0            0       0

Comparing a few of the hash functions with tcrypt test 16
(4 KiB of data with 1 update) shows a 35x difference from the
fastest to slowest:
crc32c         695 cycles/operation
crct10dif     2197
sha1-avx2     8825
sha224-avx2  24816
sha256-avx2  21179
sha384-avx2  14939
sha512-avx2  14584

Test notes
==========
Measurement points:
- after booting, with
  - CONFIG_MODULE_SIG_SHA512=y (use SHA-512 for module signing)
  - CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y (compares results
    with generic module during init)
  - # CONFIG_CRYPTO_MANAGER_DISABLE_TESTS is not set
    (run self-tests during module load)
- after sweeping through tcrypt test modes 1 to 999 
  - except 0, 300, and 400 which run combinations of the others
- measured on a system with Intel Cascade Lake CPUs at 2.2 GHz

This run did not report any RCU stalls.

The main problem is the hash function used for module signature
checking (sha512 in this config), which is handed huge buffers.
sha1 or sha256 would face the same problem if they had been
selected.
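
For scale, the cycle counts can be converted to time at the
2.2 GHz clock noted above and compared with the "30 us" goal
floated earlier in the thread. This is only a rough sketch: the
helper name is made up for illustration, and it assumes the
cycle counter ticks at the nominal 2.2 GHz rate, which may not
hold exactly on a given system.

```python
# Convert worst-case cycle counts from the table into time.
# Assumption: the cycle counter runs at the nominal 2.2 GHz rate
# of the test system.
CPU_HZ = 2.2e9    # Cascade Lake test system
GOAL_US = 30.0    # the "30 us" goal suggested earlier in the thread


def cycles_to_us(cycles, hz=CPU_HZ):
    """Return the duration of `cycles` in microseconds at `hz`."""
    return cycles / hz * 1e6


# Worst-case (tcrypt-sweep column) values copied from the table.
worst_case = {
    "sha512_ssse3": 157051800,
    "twofish_avx_x86_64": 163894,
    "aesni_intel": 150558,
    "crc32c_intel": 2702,
}

for name, cycles in worst_case.items():
    us = cycles_to_us(cycles)
    print(f"{name:20s} {us:12.1f} us  ({us / GOAL_US:8.1f}x the goal)")
```

Under that assumption the sha512_ssse3 worst case works out to
roughly 71 ms, while crc32c_intel stays near 1 us.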

The self-tests are limited to 2 * PAGE_SIZE, so they don't stress
the drivers anywhere near as much as booting does. This run did
include the tcrypt patch to call cond_resched during speed
tests, so the problem induced by the speed tests is out of the way.

aria_aesni_avx_x86_64    0 > 95560  1282

This run didn't have the patch to load aria based on the
device table, so it wasn't loaded until tcrypt asked for it.

camellia_x86_64       0 0 0
twofish_x86_64_3way   0 0 0

Those use the ecb_cbc_helper macros, but pass along -1 to
not use kernel_fpu_begin/end, so the debug instrumentation
is there but unused.

Next steps
==========
I'll try to add a test with long data, and work on scaling the
loops based on relative performance (e.g., if sha512 needs
4 KiB, then crc32c should be fine with 80 KiB).
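
The scaling idea can be sketched from the tcrypt test 16 numbers
above. This is illustrative only: the function name is
hypothetical, the 4 KiB reference for sha512 is just the example
figure from the previous paragraph (not a measured limit), and
it assumes cost scales linearly with length, ignoring per-call
overhead.

```python
# Scale a per-call byte limit by relative speed, using the tcrypt
# test 16 results (cycles per 4 KiB operation with 1 update).
CYCLES_PER_4K = {
    "crc32c": 695,
    "crct10dif": 2197,
    "sha1-avx2": 8825,
    "sha224-avx2": 24816,
    "sha256-avx2": 21179,
    "sha384-avx2": 14939,
    "sha512-avx2": 14584,
}


def scaled_limit(alg, ref_alg="sha512-avx2", ref_limit=4096):
    """Bytes `alg` can process in the time `ref_alg` needs for
    `ref_limit` bytes, assuming cost is linear in length."""
    return ref_limit * CYCLES_PER_4K[ref_alg] // CYCLES_PER_4K[alg]


for alg in CYCLES_PER_4K:
    print(f"{alg:12s} {scaled_limit(alg) / 1024:6.1f} KiB")
```

With these inputs crc32c lands a little above 80 KiB, and
sha224/sha256 come out below the 4 KiB sha512 reference.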