On Fri, 29 Mar 2024 at 10:06, Eric Biggers <ebiggers@xxxxxxxxxx> wrote:
>
> This patchset adds new AES-XTS implementations that accelerate disk and
> file encryption on modern x86_64 CPUs.
>
> The largest improvements are seen on CPUs that support the VAES
> extension: Intel Ice Lake (2019) and later, and AMD Zen 3 (2020) and
> later. However, an implementation using plain AESNI + AVX is also added
> and provides a boost on older CPUs too.
>
> To try to handle the mess that is x86 SIMD, the code for all the new
> AES-XTS implementations is generated from an assembly macro. This makes
> it so that we e.g. don't have to have entirely different source code
> just for different vector lengths (xmm, ymm, zmm).
>
> To avoid downclocking effects, zmm registers aren't used on certain
> Intel CPU models such as Ice Lake. These CPU models default to an
> implementation using ymm registers instead.
>
> To make testing easier, all four new AES-XTS implementations are
> registered separately with the crypto API. They are prioritized
> appropriately so that the best one for the CPU is used by default.
>
> There's no separate kconfig option for the new implementations, as they
> are included in the existing option CONFIG_CRYPTO_AES_NI_INTEL.
>
> This patchset increases the throughput of AES-256-XTS by the following
> amounts on the following CPUs:
>
>                        | 4096-byte messages | 512-byte messages |
>  ----------------------+--------------------+-------------------+
>  Intel Skylake         |         6%         |        31%        |
>  Intel Cascade Lake    |         4%         |        26%        |
>  Intel Ice Lake        |        127%        |        120%       |
>  Intel Sapphire Rapids |        151%        |        112%       |
>  AMD Zen 1             |         61%        |         73%       |
>  AMD Zen 2             |         36%        |         59%       |
>  AMD Zen 3             |        138%        |         99%       |
>  AMD Zen 4             |        155%        |        117%       |
>
> To summarize how the XTS implementations perform in general, here are
> benchmarks of all of them on AMD Zen 4, with 4096-byte messages. (Of
> course, in practice only the best one for the CPU actually gets used.)
>
>     xts-aes-aesni              4247 MB/s
>     xts-aes-aesni-avx          5669 MB/s
>     xts-aes-vaes-avx2          9588 MB/s
>     xts-aes-vaes-avx10_256     9631 MB/s
>     xts-aes-vaes-avx10_512    10868 MB/s
>
> ... and on Intel Sapphire Rapids:
>
>     xts-aes-aesni              4848 MB/s
>     xts-aes-aesni-avx          5287 MB/s
>     xts-aes-vaes-avx2         11685 MB/s
>     xts-aes-vaes-avx10_256    11938 MB/s
>     xts-aes-vaes-avx10_512    12176 MB/s
>
> Notes about benchmarking methods:
>
> - All my benchmarks were done using a custom kernel module that invokes
>   the crypto_skcipher API. Note that benchmarking the crypto API from
>   userspace using AF_ALG, e.g. as 'cryptsetup benchmark' does, is bad at
>   measuring fast algorithms due to the syscall overhead of AF_ALG. I
>   don't recommend that method. Instead, I measured the crypto
>   performance directly, as that's what this patchset focuses on.
>
> - All numbers I give are for decryption. However, on all the CPUs I
>   tested, encryption performs almost identically to decryption.
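
A minimal sketch of such an in-kernel benchmark is included here for
readers who want to reproduce this kind of measurement. It is not Eric's
actual module: the function names, key material, buffer size, and
iteration count are placeholders, and error handling is abbreviated. It
simply times decryption through the crypto_skcipher API directly, as the
note above describes, instead of going through AF_ALG.

/* Sketch of an in-kernel AES-256-XTS microbenchmark (illustrative only). */
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/scatterlist.h>
#include <linux/ktime.h>
#include <crypto/skcipher.h>

#define BENCH_LEN	4096	/* bytes per request; placeholder */
#define BENCH_ITERS	100000	/* iterations; placeholder */

static int __init xts_bench_init(void)
{
	u8 key[64] = { 1 };	/* two 256-bit AES keys for XTS; dummy value */
	u8 iv[16] = { 0 };
	struct crypto_skcipher *tfm;
	struct skcipher_request *req = NULL;
	struct scatterlist sg;
	DECLARE_CRYPTO_WAIT(wait);
	void *buf = NULL;
	ktime_t t0;
	int i, err;

	/* Grabs the highest-priority xts(aes) implementation for this CPU. */
	tfm = crypto_alloc_skcipher("xts(aes)", 0, 0);
	if (IS_ERR(tfm))
		return PTR_ERR(tfm);

	err = crypto_skcipher_setkey(tfm, key, sizeof(key));
	if (err)
		goto out;

	req = skcipher_request_alloc(tfm, GFP_KERNEL);
	buf = kmalloc(BENCH_LEN, GFP_KERNEL);
	if (!req || !buf) {
		err = -ENOMEM;
		goto out;
	}

	sg_init_one(&sg, buf, BENCH_LEN);
	skcipher_request_set_callback(req, CRYPTO_TFM_REQ_MAY_SLEEP,
				      crypto_req_done, &wait);
	skcipher_request_set_crypt(req, &sg, &sg, BENCH_LEN, iv);

	/* Time repeated in-place decryptions of the same buffer. */
	t0 = ktime_get();
	for (i = 0; i < BENCH_ITERS; i++) {
		err = crypto_wait_req(crypto_skcipher_decrypt(req), &wait);
		if (err)
			goto out;
	}
	pr_info("%s: %lld ns per %d-byte decryption\n",
		crypto_skcipher_driver_name(tfm),
		ktime_to_ns(ktime_sub(ktime_get(), t0)) / BENCH_ITERS,
		BENCH_LEN);
out:
	kfree(buf);
	skcipher_request_free(req);
	crypto_free_skcipher(tfm);
	return err;
}

static void __exit xts_bench_exit(void)
{
}

module_init(xts_bench_init);
module_exit(xts_bench_exit);
MODULE_DESCRIPTION("AES-XTS microbenchmark sketch");
MODULE_LICENSE("GPL");

Allocating "xts(aes)" generically lets the priority-based selection
described in the cover letter pick the best implementation; to benchmark
one specific implementation, the driver name (one of the xts-aes-* names
from the tables above) can be requested instead.
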
> Open questions:
>
> - Is the policy that I implemented for preferring ymm registers to zmm
>   registers the right one? arch/x86/crypto/poly1305_glue.c thinks that
>   only Skylake has the bad downclocking. My current proposal is a bit
>   more conservative; it also excludes Ice Lake and Tiger Lake. Those
>   CPUs supposedly still have some downclocking, though not as much.
>
> - Should the policy on the use of zmm registers be in a centralized
>   place? It probably doesn't make sense to have random different
>   policies for different crypto algorithms (AES, Poly1305, ARIA, etc.).
>
> - Are there any other known issues with using AVX512 in kernel mode? It
>   seems to work, and technically it's not new because Poly1305 and ARIA
>   already use AVX512, including the mask registers and zmm registers up
>   to 31. So if there was a major issue, like the new registers not
>   being properly saved and restored, it probably would have already been
>   found. But AES-XTS support would introduce a wider use of it.
>
> - Should we perhaps not even bother with AVX512 / AVX10 at all for now,
>   given that on current CPUs most of the improvement is achieved by
>   going to VAES + AVX2? I.e. should we skip the last two patches? I'm
>   hoping the improvement will be greater on future CPUs, though.
>
> Changed in v2:
>    - Additional optimizations:
>        - Interleaved the tweak computation with AES en/decryption. This
>          helps significantly on some CPUs, especially those without VAES.
>        - Further optimized for single-page sources and destinations.
>        - Used fewer instructions to update tweaks in VPCLMULQDQ case.
>        - Improved handling of "round 0".
>        - Eliminated a jump instruction from the main loop.
>    - Other
>        - Fixed zmm_exclusion_list[] to be null-terminated.
>        - Added missing #ifdef to unregister_xts_algs().
>        - Added some more comments.
>        - Improved cover letter and some commit messages.
>        - Now that the next tweak is always computed anyways, made it be
>          returned unconditionally.
>        - Moved the IV encryption to a separate function.
>
> Eric Biggers (6):
>   x86: add kconfig symbols for assembler VAES and VPCLMULQDQ support
>   crypto: x86/aes-xts - add AES-XTS assembly macro for modern CPUs
>   crypto: x86/aes-xts - wire up AESNI + AVX implementation
>   crypto: x86/aes-xts - wire up VAES + AVX2 implementation
>   crypto: x86/aes-xts - wire up VAES + AVX10/256 implementation
>   crypto: x86/aes-xts - wire up VAES + AVX10/512 implementation

Retested this v2:

Tested-by: Ard Biesheuvel <ardb@xxxxxxxxxx>
Reviewed-by: Ard Biesheuvel <ardb@xxxxxxxxxx>

Hopefully, the AES-KL keylocker implementation can be based on this
template as well.

I wouldn't mind retiring the existing xts(aesni) code entirely, and
using the xts() wrapper around ecb-aes-aesni on 32-bit and on non-AVX
uarchs with AES-NI.
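
On the open question of centralizing the zmm policy: one possible shape,
sketched below purely as an illustration, is a shared exclusion list that
each glue module consults instead of keeping its own. The helper name
crypto_simd_prefer_zmm and the exact set of excluded CPU models are my
placeholders, not code from the patchset.

/* Hypothetical centralized "prefer zmm?" policy helper (sketch only). */
#include <asm/cpu_device_id.h>
#include <asm/intel-family.h>

/* CPU models reported to downclock when 512-bit vectors are used. */
static const struct x86_cpu_id zmm_exclusion_list[] = {
	X86_MATCH_INTEL_FAM6_MODEL(SKYLAKE_X, NULL),
	X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, NULL),
	X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_D, NULL),
	X86_MATCH_INTEL_FAM6_MODEL(ICELAKE, NULL),
	X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_L, NULL),
	X86_MATCH_INTEL_FAM6_MODEL(TIGERLAKE_L, NULL),
	X86_MATCH_INTEL_FAM6_MODEL(TIGERLAKE, NULL),
	{ /* list must stay null-terminated */ }
};

/*
 * True if 512-bit (zmm) code paths should be preferred on this CPU.
 * Availability of AVX512/AVX10 itself is assumed to be checked separately.
 */
static bool crypto_simd_prefer_zmm(void)
{
	return !x86_match_cpu(zmm_exclusion_list);
}

Where such a helper should live (arch code or the crypto layer) is exactly
the open question raised above; the sketch only shows the mechanism of a
single shared exclusion list.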