On Mon, 10 Feb 2025 at 17:51, Eric Biggers <ebiggers@xxxxxxxxxx> wrote:
>
> From: Eric Biggers <ebiggers@xxxxxxxxxx>
>
> Delete aes_ctrby8_avx-x86_64.S and add a new assembly file
> aes-ctr-avx-x86_64.S which follows a similar approach to
> aes-xts-avx-x86_64.S in that it uses a "template" to provide AESNI+AVX,
> VAES+AVX2, VAES+AVX10/256, and VAES+AVX10/512 code, instead of just
> AESNI+AVX. Wire it up to the crypto API accordingly.
>
> This greatly improves the performance of AES-CTR and AES-XCTR on
> VAES-capable CPUs, with the best case being AMD Zen 5 where an over 230%
> increase in throughput is seen on long messages. Performance on
> non-VAES-capable CPUs remains about the same, and the non-AVX AES-CTR
> code (aesni_ctr_enc) is also kept as-is for now. There are some slight
> regressions (less than 10%) on some short message lengths on some CPUs;
> these are difficult to avoid, given how the previous code was so heavily
> unrolled by message length, and they are not particularly important.
> Detailed performance results are given in the tables below.
>
> Both CTR and XCTR support is retained. The main loop remains
> 8-vector-wide, which differs from the 4-vector-wide main loops that are
> used in the XTS and GCM code. A wider loop is appropriate for CTR and
> XCTR since they have fewer other instructions (such as vpclmulqdq) to
> interleave with the AES instructions.
>
> Similar to what was the case for AES-GCM, the new assembly code also has
> a much smaller binary size, as it fixes the excessive unrolling by data
> length and key length present in the old code. Specifically, the new
> assembly file compiles to about 9 KB of text vs. 28 KB for the old file.
> This is despite 4x as many implementations being included.
>
> The tables below show the detailed performance results. The tables show
> percentage improvement in single-threaded throughput for repeated
> encryption of the given message length; an increase from 6000 MB/s to
> 12000 MB/s would be listed as 100%. They were collected by directly
> measuring the Linux crypto API performance using a custom kernel module.
> The tested CPUs were all server processors from Google Compute Engine
> except for Zen 5 which was a Ryzen 9 9950X desktop processor.
>
> Table 1: AES-256-CTR throughput improvement,
>          CPU microarchitecture vs. message length in bytes:
>
>                      | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
> ---------------------+-------+-------+-------+-------+-------+-------+
> AMD Zen 5            |  232% |  203% |  212% |  143% |   71% |   95% |
> Intel Emerald Rapids |  116% |  116% |  117% |   91% |   78% |   79% |
> Intel Ice Lake       |  109% |  103% |  107% |   81% |   54% |   56% |
> AMD Zen 4            |  109% |   91% |  100% |   70% |   43% |   59% |
> AMD Zen 3            |   92% |   78% |   87% |   57% |   32% |   43% |
> AMD Zen 2            |    9% |    8% |   14% |   12% |    8% |   21% |
> Intel Skylake        |    7% |    7% |    8% |    5% |    3% |    8% |
>
>                      |   300 |   200 |    64 |    63 |    16 |
> ---------------------+-------+-------+-------+-------+-------+
> AMD Zen 5            |   57% |   39% |   -9% |    7% |   -7% |
> Intel Emerald Rapids |   37% |   42% |   -0% |   13% |   -8% |
> Intel Ice Lake       |   39% |   30% |   -1% |   14% |   -9% |
> AMD Zen 4            |   42% |   38% |   -0% |   18% |   -3% |
> AMD Zen 3            |   38% |   35% |    6% |   31% |    5% |
> AMD Zen 2            |   24% |   23% |    5% |   30% |    3% |
> Intel Skylake        |    9% |    1% |   -4% |   10% |   -7% |
>
> Table 2: AES-256-XCTR throughput improvement,
>          CPU microarchitecture vs. message length in bytes:
>
>                      | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
> ---------------------+-------+-------+-------+-------+-------+-------+
> AMD Zen 5            |  240% |  201% |  216% |  151% |   75% |  108% |
> Intel Emerald Rapids |  100% |   99% |  102% |   91% |   94% |  104% |
> Intel Ice Lake       |   93% |   89% |   92% |   74% |   50% |   64% |
> AMD Zen 4            |   86% |   75% |   83% |   60% |   41% |   52% |
> AMD Zen 3            |   73% |   63% |   69% |   45% |   21% |   33% |
> AMD Zen 2            |   -2% |   -2% |    2% |    3% |   -1% |   11% |
> Intel Skylake        |   -1% |   -1% |    1% |    2% |   -1% |    9% |
>
>                      |   300 |   200 |    64 |    63 |    16 |
> ---------------------+-------+-------+-------+-------+-------+
> AMD Zen 5            |   78% |   56% |   -4% |   38% |   -2% |
> Intel Emerald Rapids |   61% |   55% |    4% |   32% |   -5% |
> Intel Ice Lake       |   57% |   42% |    3% |   44% |   -4% |
> AMD Zen 4            |   35% |   28% |   -1% |   17% |   -3% |
> AMD Zen 3            |   26% |   23% |   -3% |   11% |   -6% |
> AMD Zen 2            |   13% |   24% |   -1% |   14% |   -3% |
> Intel Skylake        |   16% |    8% |   -4% |   35% |   -3% |
>
> Signed-off-by: Eric Biggers <ebiggers@xxxxxxxxxx>
> ---
>
> Changed in v4:
> - Used 'varargs' in assembly macros where appropriate.
>
> Changed in v3:
> - Dropped the patch removing the non-AVX AES-CTR for now
> - Changed license of aes-ctr-avx-x86_64.S to Apache-2.0 OR BSD-2-Clause,
>   same as what I used for the AES-GCM assembly files. I've received
>   interest in this code possibly being reused in other projects.
> - Updated commit message to remove an ambiguous statement
> - Updated commit message to clarify that the non-AVX code is unchanged
> - Added comment above ctr_crypt_aesni() noting that it's non-AVX
>
> Changed in v2:
> - Split the removal of the non-AVX implementation of AES-CTR into a
>   separate patch, and removed the assembly code too.
> - Made some minor tweaks to the new assembly file, including fixing a
>   build error when aesni-intel is built as a module.
>
>  arch/x86/crypto/Makefile                |   2 +-
>  arch/x86/crypto/aes-ctr-avx-x86_64.S    | 592 +++++++++++++++++++++++
>  arch/x86/crypto/aes_ctrby8_avx-x86_64.S | 597 ------------------------
>  arch/x86/crypto/aesni-intel_glue.c      | 404 ++++++++--------
>  4 files changed, 803 insertions(+), 792 deletions(-)
>  create mode 100644 arch/x86/crypto/aes-ctr-avx-x86_64.S
>  delete mode 100644 arch/x86/crypto/aes_ctrby8_avx-x86_64.S
>

Reviewed-by: Ard Biesheuvel <ardb@xxxxxxxxxx>

I've tested this with CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y on a CPU that
supports all variants, so

Tested-by: Ard Biesheuvel <ardb@xxxxxxxxxx>
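
A note for anyone reading the quoted tables: the percentages are relative
increases, not ratios, so 100% means twice the throughput and a negative
entry is a slowdown. The small Python sketch below only restates that
convention from the commit message. The helper names improvement_percent()
and speedup_factor() are made up for illustration and are not part of the
patch or of the benchmark module that produced the numbers.

```python
# Hypothetical helpers illustrating the percentage convention used in the
# quoted tables; these names do not come from the patch.

def improvement_percent(old_mbps: float, new_mbps: float) -> float:
    """Relative throughput increase in percent (6000 -> 12000 MB/s is 100%)."""
    if old_mbps <= 0:
        raise ValueError("old throughput must be positive")
    return (new_mbps - old_mbps) / old_mbps * 100.0


def speedup_factor(percent: float) -> float:
    """Convert a listed percentage back into a new/old throughput ratio."""
    return 1.0 + percent / 100.0


if __name__ == "__main__":
    # The example given in the commit message.
    print(improvement_percent(6000, 12000))
    # Zen 5, 16384-byte AES-256-CTR entry (232%): a bit over 3.3x.
    print(round(speedup_factor(232), 2))
    # Zen 5, 64-byte AES-256-CTR entry (-9%): a ratio below 1.0.
    print(round(speedup_factor(-9), 2))
```

Read this way, the 232% entry for Zen 5 corresponds to roughly 3.3 times the
old throughput, and the -9% entry to about 0.91 times.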