Re: [PATCH v4] crypto: x86/aes-ctr - rewrite AESNI+AVX optimized CTR and add VAES support

Ard Biesheuvel <ardb@xxxxxxxxxx> · Tue, 11 Feb 2025 09:06:33 +0100



On Mon, 10 Feb 2025 at 17:51, Eric Biggers <ebiggers@xxxxxxxxxx> wrote:
>
> From: Eric Biggers <ebiggers@xxxxxxxxxx>
>
> Delete aes_ctrby8_avx-x86_64.S and add a new assembly file
> aes-ctr-avx-x86_64.S which follows a similar approach to
> aes-xts-avx-x86_64.S in that it uses a "template" to provide AESNI+AVX,
> VAES+AVX2, VAES+AVX10/256, and VAES+AVX10/512 code, instead of just
> AESNI+AVX.  Wire it up to the crypto API accordingly.
>
> This greatly improves the performance of AES-CTR and AES-XCTR on
> VAES-capable CPUs.  The best case is AMD Zen 5, where throughput on long
> messages increases by over 230%.  Performance on
> non-VAES-capable CPUs remains about the same, and the non-AVX AES-CTR
> code (aesni_ctr_enc) is also kept as-is for now.  There are some slight
> regressions (less than 10%) on some short message lengths on some CPUs;
> these are difficult to avoid, given how the previous code was so heavily
> unrolled by message length, and they are not particularly important.
> Detailed performance results are given in the tables below.
>
> Both CTR and XCTR support is retained.  The main loop remains
> 8-vector-wide, which differs from the 4-vector-wide main loops that are
> used in the XTS and GCM code.  A wider loop is appropriate for CTR and
> XCTR since they have fewer other instructions (such as vpclmulqdq) to
> interleave with the AES instructions.
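
To put the "8-vector-wide" main loop in concrete terms, here is a rough
sketch of how much data one loop iteration would cover for each variant.
It assumes each iteration handles exactly 8 full vectors and uses the
architectural register widths (128-bit for AESNI+AVX, 256-bit for
VAES+AVX2 and VAES+AVX10/256, 512-bit for VAES+AVX10/512).  The function
names are made up for illustration and do not appear in the patch.

```python
AES_BLOCK_SIZE = 16          # bytes per AES block
VECTORS_PER_ITERATION = 8    # "8-vector-wide" main loop (assumed: 8 full vectors)

# Vector register width in bytes for each variant named in the commit message.
VECTOR_BYTES = {
    "AESNI+AVX": 16,         # 128-bit xmm registers
    "VAES+AVX2": 32,         # 256-bit ymm registers
    "VAES+AVX10/256": 32,    # 256-bit vectors
    "VAES+AVX10/512": 64,    # 512-bit zmm registers
}


def bytes_per_iteration(variant):
    """Bytes of keystream produced by one main-loop iteration (illustrative)."""
    return VECTORS_PER_ITERATION * VECTOR_BYTES[variant]


def blocks_per_iteration(variant):
    """Number of AES blocks covered by one main-loop iteration (illustrative)."""
    return bytes_per_iteration(variant) // AES_BLOCK_SIZE


for name in VECTOR_BYTES:
    print(f"{name:15s} {bytes_per_iteration(name):4d} bytes, "
          f"{blocks_per_iteration(name):2d} AES blocks per iteration")
```

Under these assumptions, messages shorter than one full iteration never
enter the main loop at all on the wider variants, which would be
consistent with the smaller gains at short message lengths in the tables.
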
>
> As was the case for AES-GCM, the new assembly code also has
> a much smaller binary size, as it fixes the excessive unrolling by data
> length and key length present in the old code.  Specifically, the new
> assembly file compiles to about 9 KB of text vs. 28 KB for the old file.
> This is despite 4x as many implementations being included.
>
> The tables below show the percentage improvement in single-threaded
> throughput for repeated encryption of the given message length; an
> increase from 6000 MB/s to 12000 MB/s would be listed as 100%.  They were
> collected by directly measuring the Linux crypto API performance using a
> custom kernel module.
> The tested CPUs were all server processors from Google Compute Engine
> except for Zen 5 which was a Ryzen 9 9950X desktop processor.
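
For anyone reading the tables, the percentage convention described above
can be written down as a small helper.  This is only a sketch of the
stated convention; the function names are hypothetical, and the only
throughput figures used are the 6000 and 12000 MB/s from the example in
the commit message.

```python
def improvement_percent(old_mbps, new_mbps):
    """Percentage improvement in throughput, as defined in the commit message."""
    if old_mbps <= 0:
        raise ValueError("old throughput must be positive")
    return (new_mbps - old_mbps) / old_mbps * 100.0


def speedup_factor(percent):
    """Convert a table entry (percentage improvement) into a new/old ratio."""
    return 1.0 + percent / 100.0


# The example from the commit message: 6000 MB/s -> 12000 MB/s.
print(improvement_percent(6000, 12000))
# A table entry can be turned back into a ratio of new to old throughput.
print(round(speedup_factor(232), 2))
```

So a table entry of 232% means roughly 3.3 times the old throughput, and
a negative entry such as -9% means the new code reaches about 0.91 times
the old throughput at that message length.
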
>
> Table 1: AES-256-CTR throughput improvement,
>          CPU microarchitecture vs. message length in bytes:
>
>                      | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
> ---------------------+-------+-------+-------+-------+-------+-------+
> AMD Zen 5            |  232% |  203% |  212% |  143% |   71% |   95% |
> Intel Emerald Rapids |  116% |  116% |  117% |   91% |   78% |   79% |
> Intel Ice Lake       |  109% |  103% |  107% |   81% |   54% |   56% |
> AMD Zen 4            |  109% |   91% |  100% |   70% |   43% |   59% |
> AMD Zen 3            |   92% |   78% |   87% |   57% |   32% |   43% |
> AMD Zen 2            |    9% |    8% |   14% |   12% |    8% |   21% |
> Intel Skylake        |    7% |    7% |    8% |    5% |    3% |    8% |
>
>                      |   300 |   200 |    64 |    63 |    16 |
> ---------------------+-------+-------+-------+-------+-------+
> AMD Zen 5            |   57% |   39% |   -9% |    7% |   -7% |
> Intel Emerald Rapids |   37% |   42% |   -0% |   13% |   -8% |
> Intel Ice Lake       |   39% |   30% |   -1% |   14% |   -9% |
> AMD Zen 4            |   42% |   38% |   -0% |   18% |   -3% |
> AMD Zen 3            |   38% |   35% |    6% |   31% |    5% |
> AMD Zen 2            |   24% |   23% |    5% |   30% |    3% |
> Intel Skylake        |    9% |    1% |   -4% |   10% |   -7% |
>
> Table 2: AES-256-XCTR throughput improvement,
>          CPU microarchitecture vs. message length in bytes:
>
>                      | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
> ---------------------+-------+-------+-------+-------+-------+-------+
> AMD Zen 5            |  240% |  201% |  216% |  151% |   75% |  108% |
> Intel Emerald Rapids |  100% |   99% |  102% |   91% |   94% |  104% |
> Intel Ice Lake       |   93% |   89% |   92% |   74% |   50% |   64% |
> AMD Zen 4            |   86% |   75% |   83% |   60% |   41% |   52% |
> AMD Zen 3            |   73% |   63% |   69% |   45% |   21% |   33% |
> AMD Zen 2            |   -2% |   -2% |    2% |    3% |   -1% |   11% |
> Intel Skylake        |   -1% |   -1% |    1% |    2% |   -1% |    9% |
>
>                      |   300 |   200 |    64 |    63 |    16 |
> ---------------------+-------+-------+-------+-------+-------+
> AMD Zen 5            |   78% |   56% |   -4% |   38% |   -2% |
> Intel Emerald Rapids |   61% |   55% |    4% |   32% |   -5% |
> Intel Ice Lake       |   57% |   42% |    3% |   44% |   -4% |
> AMD Zen 4            |   35% |   28% |   -1% |   17% |   -3% |
> AMD Zen 3            |   26% |   23% |   -3% |   11% |   -6% |
> AMD Zen 2            |   13% |   24% |   -1% |   14% |   -3% |
> Intel Skylake        |   16% |    8% |   -4% |   35% |   -3% |
>
> Signed-off-by: Eric Biggers <ebiggers@xxxxxxxxxx>
> ---
>
> Changed in v4:
> - Used 'varargs' in assembly macros where appropriate.
>
> Changed in v3:
> - Dropped the patch removing the non-AVX AES-CTR for now
> - Changed license of aes-ctr-avx-x86_64.S to Apache-2.0 OR BSD-2-Clause,
>   same as what I used for the AES-GCM assembly files.  I've received
>   interest in this code possibly being reused in other projects.
> - Updated commit message to remove an ambiguous statement
> - Updated commit message to clarify that the non-AVX code is unchanged
> - Added comment above ctr_crypt_aesni() noting that it's non-AVX
>
> Changed in v2:
> - Split the removal of the non-AVX implementation of AES-CTR into a
>   separate patch, and removed the assembly code too.
> - Made some minor tweaks to the new assembly file, including fixing a
>   build error when aesni-intel is built as a module.
>
>  arch/x86/crypto/Makefile                |   2 +-
>  arch/x86/crypto/aes-ctr-avx-x86_64.S    | 592 +++++++++++++++++++++++
>  arch/x86/crypto/aes_ctrby8_avx-x86_64.S | 597 ------------------------
>  arch/x86/crypto/aesni-intel_glue.c      | 404 ++++++++--------
>  4 files changed, 803 insertions(+), 792 deletions(-)
>  create mode 100644 arch/x86/crypto/aes-ctr-avx-x86_64.S
>  delete mode 100644 arch/x86/crypto/aes_ctrby8_avx-x86_64.S
>

Reviewed-by: Ard Biesheuvel <ardb@xxxxxxxxxx>

I've tested this with CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y on a CPU
that supports all variants, so

Tested-by: Ard Biesheuvel <ardb@xxxxxxxxxx>
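
For anyone repeating the test, one way to see which ctr(aes) and
xctr(aes) implementations are registered, and what their self-test state
is, is to look at the "selftest" field in /proc/crypto.  The sketch below
parses text in that format.  The sample text and its driver names are
made up for illustration and are not the names registered by this patch;
on a real system the text would come from reading /proc/crypto.

```python
def parse_proc_crypto(text):
    """Split /proc/crypto-style text into one dict per algorithm entry."""
    entries = []
    for block in text.strip().split("\n\n"):
        entry = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                entry[key.strip()] = value.strip()
        if entry:
            entries.append(entry)
    return entries


def selftest_status(text, names=("ctr(aes)", "xctr(aes)")):
    """Map driver name -> selftest field for the given algorithm names."""
    return {
        entry.get("driver", "?"): entry.get("selftest", "unknown")
        for entry in parse_proc_crypto(text)
        if entry.get("name") in names
    }


# Hypothetical sample in /proc/crypto format; driver names are invented.
SAMPLE = """\
name         : ctr(aes)
driver       : example-ctr-driver-a
selftest     : passed

name         : xctr(aes)
driver       : example-xctr-driver-b
selftest     : passed

name         : sha256
driver       : example-sha256
selftest     : passed
"""

print(selftest_status(SAMPLE))
```

On a test machine, passing the contents of /proc/crypto instead of SAMPLE
would list every registered driver for the two algorithms.
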