RE: [PATCH v2 00/19] crypto: x86 - fix RCU stalls

> -----Original Message-----
> From: Elliott, Robert (Servers) <elliott@xxxxxxx>
> Sent: Wednesday, October 12, 2022 4:59 PM
> To: herbert@xxxxxxxxxxxxxxxxxxx; davem@xxxxxxxxxxxxx;
> tim.c.chen@xxxxxxxxxxxxxxx; ap420073@xxxxxxxxx; ardb@xxxxxxxxxx; linux-
> crypto@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx
> Cc: Elliott, Robert (Servers) <elliott@xxxxxxx>
> Subject: [PATCH v2 00/19] crypto: x86 - fix RCU stalls
> 
> This series fixes the RCU stalls triggered by the x86 crypto
> modules discussed in
> https://lore.kernel.org/all/MW5PR84MB18426EBBA3303770A8BC0BDFAB759@MW5PR84MB1842.NAMPRD84.PROD.OUTLOOK.COM/

I've instrumented all the x86 crypto modules, including ways to
experiment with different loop sizes. Here are some results with
the hash functions.

Key:
    calls = number of kernel_fpu_begin()/end() calls made by the module
     cost = number of CPU cycles consumed by those calls (overhead)
maxcycles = number of CPU cycles between those calls in FPU context
      bpf = bytes_per_fpu loop size
      KiB = bpf expressed in KiB
   maxlen = maximum number of bytes per loop via update()
  maxlen2 = maximum number of bytes per loop via finup()

This is on a 2.2 GHz Cascade Lake CPU, where each cycle is nominally
0.45 ns.  The CPU does not support SHA-NI instructions, so those
results are missing.

Here are the results from a boot with the avx2 bytes_per_fpu values set
to 0 (unlimited - original behavior).

Booting includes:
  - processing 2.3 GB of SHA-512 kernel module hashes
  - crypto self-tests
  - crypto extra self-tests (CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y)

   calls        cost    maxcycles      bpf  KiB   maxlen  maxlen2                algorithm                       module
======== =========== ============ ======== ==== ======== ======== ======================== ============================
    3641      177182        10230        0    0     4096        0      __ghash-pclmulqdqni          ghash_clmulni_intel
    2242      150516         1684        0    0     8112        0             crc32-pclmul                 crc32_pclmul
    1008       43800        22404        0    0     8068     8105             crc32c-intel                 crc32c_intel
    2565      179734         4286        0    0     7791     8027         crct10dif-pclmul             crct10dif_pclmul
    1603       77112         2414        0    0     8132        0          nhpoly1305-avx2              nhpoly1305_avx2
    1671       81108         9390   203776  199     8109        0          nhpoly1305-sse2              nhpoly1305_sse2
    1977      103598         5314        0    0     8112        0            poly1305-simd              poly1305_x86_64
   26744     1251756         2046        0    0     8096        0          polyval-clmulni              polyval_clmulni
   14669      682428        65462    30720   30      251     8096                 sha1-avx                   sha1_ssse3
   14669      682428        65462        0    0     7170        0                sha1-avx2                   sha1_ssse3
   14669      682428        65462    34816   34        0        0               sha1-shani                   sha1_ssse3
   14669      682428        65462    26624   26     8089     8164               sha1-ssse3                   sha1_ssse3
   26768     1230100       144902    11264   11     8130     8159               sha224-avx                 sha256_ssse3
   26768     1230100       144902    13312   13     8078     8146              sha224-avx2                 sha256_ssse3
   26768     1230100       144902    13312   13        0        0             sha224-shani                 sha256_ssse3
   26768     1230100       144902    11264   11     8068     8168             sha224-ssse3                 sha256_ssse3
   26768     1230100       144902    11264   11     8130     8159               sha256-avx                 sha256_ssse3
   26768     1230100       144902    13312   13     8078     8146              sha256-avx2                 sha256_ssse3
   26768     1230100       144902    13312   13        0        0             sha256-shani                 sha256_ssse3
   26768     1230100       144902    11264   11     8068     8168             sha256-ssse3                 sha256_ssse3
   29157     2044882    164510724    17408   17        0     8127               sha384-avx                 sha512_ssse3
   29157     2044882    164510724        0    0        0 48175432              sha384-avx2                 sha512_ssse3
   29157     2044882    164510724    17408   17        0     8055             sha384-ssse3                 sha512_ssse3
   29157     2044882    164510724    17408   17        0     8127               sha512-avx                 sha512_ssse3
   29157     2044882    164510724        0    0        0 48175432              sha512-avx2                 sha512_ssse3
   29157     2044882    164510724    17408   17        0     8055             sha512-ssse3                 sha512_ssse3
    4314      193456       124918        0    0     7672     8101                  sm3-avx               sm3_avx_x86_64

The self-tests only use small data sets (even the extra tests
limit themselves to PAGE_SIZE * 2), so only the sha512_ssse3
module was stressed with large requests.

The cost of the kernel_fpu_begin()/end() calls (2044882 cycles) was
929 us, and the longest time in FPU context (164510724) was 75 ms.
I think the biggest file it encounters is:
-rw-r--r--. 1 root root 48186713 Nov  1 13:14 /lib/modules/6.0.0+/kernel/fs/xfs/xfs.ko
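
As a cross-check on the conversions above, here is a minimal sketch of the
cycles-to-time arithmetic. It assumes the nominal 2.2 GHz clock mentioned
earlier; the helper name cycles_to_us is only illustrative, and actual
clock rates vary with turbo and power states.

```python
# Nominal clock rate from this message (2.2 GHz); the real frequency
# varies, so these figures are estimates only.
CPU_HZ = 2.2e9


def cycles_to_us(cycles, hz=CPU_HZ):
    """Convert a cycle count to microseconds at a nominal clock rate."""
    return cycles / hz * 1e6


# Total kernel_fpu_begin()/end() overhead for sha512_ssse3 during boot.
overhead_us = cycles_to_us(2044882)
# Longest single stretch in FPU context, reported in milliseconds.
longest_ms = cycles_to_us(164510724) / 1000

print(f"overhead: {overhead_us:.0f} us, longest: {longest_ms:.0f} ms")
# prints "overhead: 929 us, longest: 75 ms"
```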


I added tcrypt tests to exercise each driver ten times with 1 MiB data,
and that exposes all the drivers to larger requests.

bigbuf tests with no limits:
   calls        cost    maxcycles      bpf  KiB   maxlen  maxlen2                algorithm                       module
======== =========== ============ ======== ==== ======== ======== ======================== ============================
    1000      156354      1484434        0    0  1048576        0      __ghash-pclmulqdqni          ghash_clmulni_intel
    1000      150386       221710        0    0  1048576        0             crc32-pclmul                 crc32_pclmul
    1000      104890       114000        0    0  1048576        0             crc32c-intel                 crc32c_intel
    1000      169596       182904        0    0  1048576        0         crct10dif-pclmul             crct10dif_pclmul
    1000      122842       267568        0    0  1048576        0          nhpoly1305-avx2              nhpoly1305_avx2
    1000      190530       453118        0    0  1048576        0          nhpoly1305-sse2              nhpoly1305_sse2
    1000      134682       431264        0    0  1048576        0            poly1305-simd              poly1305_x86_64
    8000      387206       215922        0    0  1048576        0          polyval-clmulni              polyval_clmulni
    6000      562932      2831190        0    0  1048576        0                 sha1-avx                   sha1_ssse3
    6000      562932      2831190        0    0  1048576        0                sha1-avx2                   sha1_ssse3
    6000      562932      2831190    34816   34        0        0               sha1-shani                   sha1_ssse3
    6000      562932      2831190        0    0  1048576        0               sha1-ssse3                   sha1_ssse3
   12000     1212742      6558712        0    0  1048576        0               sha224-avx                 sha256_ssse3
   12000     1212742      6558712        0    0  1048576        0              sha224-avx2                 sha256_ssse3
   12000     1212742      6558712    13312   13        0        0             sha224-shani                 sha256_ssse3
   12000     1212742      6558712        0    0  1048576        0             sha224-ssse3                 sha256_ssse3
   12000     1212742      6558712        0    0  1048576        0               sha256-avx                 sha256_ssse3
   12000     1212742      6558712        0    0  1048576        0              sha256-avx2                 sha256_ssse3
   12000     1212742      6558712    13312   13        0        0             sha256-shani                 sha256_ssse3
   12000     1212742      6558712        0    0  1048576        0             sha256-ssse3                 sha256_ssse3
   12006     1250296      4621038        0    0  1048576        0               sha384-avx                 sha512_ssse3
   12006     1250296      4621038        0    0  1048576  1037416              sha384-avx2                 sha512_ssse3
   12006     1250296      4621038        0    0  1048576        0             sha384-ssse3                 sha512_ssse3
   12006     1250296      4621038        0    0  1048576        0               sha512-avx                 sha512_ssse3
   12006     1250296      4621038        0    0  1048576  1037416              sha512-avx2                 sha512_ssse3
   12006     1250296      4621038        0    0  1048576        0             sha512-ssse3                 sha512_ssse3
    2000      221468      6236756        0    0  1048576        0                  sm3-avx               sm3_avx_x86_64

Setting bpf limits based on those results narrows the maxcycles in
FPU context. I've seen results vary from 81912 cycles (37 us) to about
102 us - not very tight, but much better than ranging up to 75 ms.

bigbuf tests with bytes_per_fpu limits as shown:
   calls        cost    maxcycles      bpf  KiB   maxlen  maxlen2                algorithm                       module
======== =========== ============ ======== ==== ======== ======== ======================== ============================
   21000     1002372       138558    51200   50    51200        0      __ghash-pclmulqdqni          ghash_clmulni_intel
    2000      220666       226806   646912  631   646912        0             crc32-pclmul                 crc32_pclmul
    2000      255110       105968   895232  874   895232        0             crc32c-intel                 crc32c_intel
    2000      218942       107930   626944  612   626944        0         crct10dif-pclmul             crct10dif_pclmul
    4000      208170       141356   345088  337   345088        0          nhpoly1305-avx2              nhpoly1305_avx2
    6000      285286       105072   203520  198   203520        0          nhpoly1305-sse2              nhpoly1305_sse2
    5000      368866       162262   222976  217   222976        0            poly1305-simd              poly1305_x86_64
   10000      457010       142362   402688  393   402688        0          polyval-clmulni              polyval_clmulni
  108000     6048076       160670    30720   30    30720        0                 sha1-avx                   sha1_ssse3
  108000     6048076       160670    34816   34    34816        0                sha1-avx2                   sha1_ssse3
  108000     6048076       160670    27392   26    27392        0               sha1-ssse3                   sha1_ssse3
  520000    23646576       196462    11520   11    11520        0               sha224-avx                 sha256_ssse3
  520000    23646576       196462    14080   13    14080        0              sha224-avx2                 sha256_ssse3
  520000    23646576       196462    11776   11    11776        0             sha224-ssse3                 sha256_ssse3
  520000    23646576       196462    11520   11    11520        0               sha256-avx                 sha256_ssse3
  520000    23646576       196462    14080   13    14080        0              sha256-avx2                 sha256_ssse3
  520000    23646576       196462    11776   11    11776        0             sha256-ssse3                 sha256_ssse3
  356156    18242860       226538    17152   16    17152        0               sha384-avx                 sha512_ssse3
  356156    18242860       226538    20480   20    20480    20480              sha384-avx2                 sha512_ssse3
  356156    18242860       226538    17408   17    17408        0             sha384-ssse3                 sha512_ssse3
  356156    18242860       226538    17152   16    17152        0               sha512-avx                 sha512_ssse3
  356156    18242860       226538    20480   20    20480    20480              sha512-avx2                 sha512_ssse3
  356156    18242860       226538    17408   17    17408        0             sha512-ssse3                 sha512_ssse3
   93000     4537164       138924    11520   11    11520        0                  sm3-avx               sm3_avx_x86_64
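
The call counts in the table above are consistent with each 1 MiB request
being split into pieces of at most bytes_per_fpu bytes, one FPU section per
piece. The following is only a user-space model of that splitting, not the
code from the patch series; the function name fpu_chunks is hypothetical,
and treating 0 as "unlimited" follows the convention described earlier.

```python
def fpu_chunks(length, bytes_per_fpu):
    """Yield the size of each piece processed in one FPU section.

    bytes_per_fpu == 0 is treated as unlimited (one section per request).
    """
    if bytes_per_fpu == 0:
        if length > 0:
            yield length
        return
    remaining = length
    while remaining > 0:
        chunk = min(remaining, bytes_per_fpu)
        yield chunk
        remaining -= chunk


ONE_MIB = 1048576

# ghash with a 51200-byte limit: 21 sections per 1 MiB request, which
# matches the jump from 1000 to 21000 calls between the two bigbuf tables.
print(len(list(fpu_chunks(ONE_MIB, 51200))))   # prints 21
# crc32-pclmul with a 646912-byte limit: 2 sections (1000 -> 2000 calls).
print(len(list(fpu_chunks(ONE_MIB, 646912))))  # prints 2
```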

If I reboot with sha512-avx2 set to 20 KiB, the longest sha512-avx2
stretch in FPU context (maxcycles) can still be long (e.g., 2 ms). That's
much better than the original 75 ms, but still not in the 50 us range.

I set /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor to
"performance" in .bash_profile, but that's not effective during
boot, so maybe that is the source of variability.

Example boot with 20 KiB limit:
   calls        cost    maxcycles      bpf  KiB   maxlen  maxlen2                algorithm                       module
======== =========== ============ ======== ==== ======== ======== ======================== ============================
  161011    16232280      4049644    20480   20        0    20480              sha512-avx2                 sha512_ssse3

Limiting it to 1 KiB does reduce maxcycles to the us range, but
the cost of all the extra calls soars.
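
To give a rough feel for that trade-off, the sketch below estimates how
many FPU sections one large request needs at each limit and scales that by
the average overhead per call from the 20 KiB boot row above. It assumes
the per-call overhead stays constant as the limit shrinks, which was not
measured here; chunk_count is a hypothetical helper.

```python
def chunk_count(length, bytes_per_fpu):
    """Number of FPU sections needed for one request (ceiling division)."""
    return -(-length // bytes_per_fpu)


# Average overhead per begin/end pair, from the 20 KiB boot row:
# cost / calls. Assumed (not measured) to hold for other limits too.
cycles_per_call = 16232280 / 161011

# Largest maxlen2 seen for sha512-avx2 during the unlimited boot.
largest = 48175432

for bpf in (20480, 1024):
    calls = chunk_count(largest, bpf)
    est_cycles = round(calls * cycles_per_call)
    print(f"bpf={bpf}: {calls} sections, ~{est_cycles} overhead cycles")
```

With these inputs the 1 KiB limit needs 47047 sections for that one
request versus 2353 at 20 KiB, so the estimated overhead grows by about
the same factor of 20.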

So, for v3 of the series, I plan to propose values ranging from:
  - 11 to 20 KiB for sha* and sm3
  - 200 to 400 KiB for *poly*
  - 600 to 800 KiB for crc*
v3 will only cover the hash functions - skcipher and aead
have some unique challenges that we can tackle later.




