On Fri, 29 Mar 2024 at 10:06, Eric Biggers <ebiggers@xxxxxxxxxx> wrote:
>
> This patchset adds new AES-XTS implementations that accelerate disk and
> file encryption on modern x86_64 CPUs.
>
> The largest improvements are seen on CPUs that support the VAES
> extension: Intel Ice Lake (2019) and later, and AMD Zen 3 (2020) and
> later. However, an implementation using plain AESNI + AVX is also added
> and provides a boost on older CPUs too.
>
> To try to handle the mess that is x86 SIMD, the code for all the new
> AES-XTS implementations is generated from an assembly macro. This makes
> it so that we e.g. don't have to have entirely different source code
> just for different vector lengths (xmm, ymm, zmm).
>
> To avoid downclocking effects, zmm registers aren't used on certain
> Intel CPU models such as Ice Lake. These CPU models default to an
> implementation using ymm registers instead.
>
> To make testing easier, all four new AES-XTS implementations are
> registered separately with the crypto API. They are prioritized
> appropriately so that the best one for the CPU is used by default.
>
> There's no separate kconfig option for the new implementations, as they
> are included in the existing option CONFIG_CRYPTO_AES_NI_INTEL.
>
> This patchset increases the throughput of AES-256-XTS by the following
> amounts on the following CPUs:
>
>                        | 4096-byte messages | 512-byte messages |
>  ----------------------+--------------------+-------------------+
>  Intel Skylake         |         6%         |        31%        |
>  Intel Cascade Lake    |         4%         |        26%        |
>  Intel Ice Lake        |        127%        |        120%       |
>  Intel Sapphire Rapids |        151%        |        112%       |
>  AMD Zen 1             |         61%        |         73%       |
>  AMD Zen 2             |         36%        |         59%       |
>  AMD Zen 3             |        138%        |         99%       |
>  AMD Zen 4             |        155%        |        117%       |
>
> To summarize how the XTS implementations perform in general, here are
> benchmarks of all of them on AMD Zen 4, with 4096-byte messages. (Of
> course, in practice only the best one for the CPU actually gets used.)
>
>     xts-aes-aesni              4247 MB/s
>     xts-aes-aesni-avx          5669 MB/s
>     xts-aes-vaes-avx2          9588 MB/s
>     xts-aes-vaes-avx10_256     9631 MB/s
>     xts-aes-vaes-avx10_512    10868 MB/s
>
> ... and on Intel Sapphire Rapids:
>
>     xts-aes-aesni              4848 MB/s
>     xts-aes-aesni-avx          5287 MB/s
>     xts-aes-vaes-avx2         11685 MB/s
>     xts-aes-vaes-avx10_256    11938 MB/s
>     xts-aes-vaes-avx10_512    12176 MB/s
>
> Notes about benchmarking methods:
>
> - All my benchmarks were done using a custom kernel module that invokes
>   the crypto_skcipher API. Note that benchmarking the crypto API from
>   userspace using AF_ALG, e.g. as 'cryptsetup benchmark' does, is bad at
>   measuring fast algorithms due to the syscall overhead of AF_ALG. I
>   don't recommend that method. Instead, I measured the crypto
>   performance directly, as that's what this patchset focuses on.
>
> - All numbers I give are for decryption. However, on all the CPUs I
>   tested, encryption performs almost identically to decryption.
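
A minimal sketch of such an in-kernel benchmark is included here for
readers who want to reproduce this kind of measurement. It is not Eric's
actual module: the function names, key material, buffer size, and
iteration count are placeholders, and error handling is abbreviated. It
simply times decryption through the crypto_skcipher API directly, as the
note above describes, instead of going through AF_ALG.

/* Sketch of an in-kernel AES-256-XTS microbenchmark (illustrative only). */
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/scatterlist.h>
#include <linux/ktime.h>
#include <crypto/skcipher.h>

#define BENCH_LEN	4096	/* bytes per request; placeholder */
#define BENCH_ITERS	100000	/* iterations; placeholder */

static int __init xts_bench_init(void)
{
	u8 key[64] = { 1 };	/* two 256-bit AES keys for XTS; dummy value */
	u8 iv[16] = { 0 };
	struct crypto_skcipher *tfm;
	struct skcipher_request *req = NULL;
	struct scatterlist sg;
	DECLARE_CRYPTO_WAIT(wait);
	void *buf = NULL;
	ktime_t t0;
	int i, err;

	/* Grabs the highest-priority xts(aes) implementation for this CPU. */
	tfm = crypto_alloc_skcipher("xts(aes)", 0, 0);
	if (IS_ERR(tfm))
		return PTR_ERR(tfm);

	err = crypto_skcipher_setkey(tfm, key, sizeof(key));
	if (err)
		goto out;

	req = skcipher_request_alloc(tfm, GFP_KERNEL);
	buf = kmalloc(BENCH_LEN, GFP_KERNEL);
	if (!req || !buf) {
		err = -ENOMEM;
		goto out;
	}

	sg_init_one(&sg, buf, BENCH_LEN);
	skcipher_request_set_callback(req, CRYPTO_TFM_REQ_MAY_SLEEP,
				      crypto_req_done, &wait);
	skcipher_request_set_crypt(req, &sg, &sg, BENCH_LEN, iv);

	/* Time repeated in-place decryptions of the same buffer. */
	t0 = ktime_get();
	for (i = 0; i < BENCH_ITERS; i++) {
		err = crypto_wait_req(crypto_skcipher_decrypt(req), &wait);
		if (err)
			goto out;
	}
	pr_info("%s: %lld ns per %d-byte decryption\n",
		crypto_skcipher_driver_name(tfm),
		ktime_to_ns(ktime_sub(ktime_get(), t0)) / BENCH_ITERS,
		BENCH_LEN);
out:
	kfree(buf);
	skcipher_request_free(req);
	crypto_free_skcipher(tfm);
	return err;
}

static void __exit xts_bench_exit(void)
{
}

module_init(xts_bench_init);
module_exit(xts_bench_exit);
MODULE_DESCRIPTION("AES-XTS microbenchmark sketch");
MODULE_LICENSE("GPL");

Allocating "xts(aes)" generically lets the priority-based selection
described in the cover letter pick the best implementation; to benchmark
one specific implementation, the driver name (one of the xts-aes-* names
from the tables above) can be requested instead.
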
> Open questions:
>
> - Is the policy that I implemented for preferring ymm registers to zmm
>   registers the right one? arch/x86/crypto/poly1305_glue.c thinks that
>   only Skylake has the bad downclocking. My current proposal is a bit
>   more conservative; it also excludes Ice Lake and Tiger Lake. Those
>   CPUs supposedly still have some downclocking, though not as much.
>
> - Should the policy on the use of zmm registers be in a centralized
>   place? It probably doesn't make sense to have random different
>   policies for different crypto algorithms (AES, Poly1305, ARIA, etc.).
>
> - Are there any other known issues with using AVX512 in kernel mode? It
>   seems to work, and technically it's not new because Poly1305 and ARIA
>   already use AVX512, including the mask registers and zmm registers up
>   to 31. So if there was a major issue, like the new registers not
>   being properly saved and restored, it probably would have already been
>   found. But AES-XTS support would introduce a wider use of it.
>
> - Should we perhaps not even bother with AVX512 / AVX10 at all for now,
>   given that on current CPUs most of the improvement is achieved by
>   going to VAES + AVX2? I.e. should we skip the last two patches? I'm
>   hoping the improvement will be greater on future CPUs, though.
>
> Changed in v2:
>    - Additional optimizations:
>        - Interleaved the tweak computation with AES en/decryption. This
>          helps significantly on some CPUs, especially those without VAES.
>        - Further optimized for single-page sources and destinations.
>        - Used fewer instructions to update tweaks in VPCLMULQDQ case.
>        - Improved handling of "round 0".
>        - Eliminated a jump instruction from the main loop.
>    - Other
>        - Fixed zmm_exclusion_list[] to be null-terminated.
>        - Added missing #ifdef to unregister_xts_algs().
>        - Added some more comments.
>        - Improved cover letter and some commit messages.
>        - Now that the next tweak is always computed anyways, made it be
>          returned unconditionally.
>        - Moved the IV encryption to a separate function.
>
> Eric Biggers (6):
>   x86: add kconfig symbols for assembler VAES and VPCLMULQDQ support
>   crypto: x86/aes-xts - add AES-XTS assembly macro for modern CPUs
>   crypto: x86/aes-xts - wire up AESNI + AVX implementation
>   crypto: x86/aes-xts - wire up VAES + AVX2 implementation
>   crypto: x86/aes-xts - wire up VAES + AVX10/256 implementation
>   crypto: x86/aes-xts - wire up VAES + AVX10/512 implementation

Retested this v2:

Tested-by: Ard Biesheuvel <ardb@xxxxxxxxxx>
Reviewed-by: Ard Biesheuvel <ardb@xxxxxxxxxx>

Hopefully, the AES-KL keylocker implementation can be based on this
template as well.

I wouldn't mind retiring the existing xts(aesni) code entirely, and
using the xts() wrapper around ecb-aes-aesni on 32-bit and on non-AVX
uarchs with AES-NI.
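
On the open question of centralizing the zmm policy: one possible shape,
sketched below purely as an illustration, is a shared exclusion list that
each glue module consults instead of keeping its own. The helper name
crypto_simd_prefer_zmm and the exact set of excluded CPU models are my
placeholders, not code from the patchset.

/* Hypothetical centralized "prefer zmm?" policy helper (sketch only). */
#include <asm/cpu_device_id.h>
#include <asm/intel-family.h>

/* CPU models reported to downclock when 512-bit vectors are used. */
static const struct x86_cpu_id zmm_exclusion_list[] = {
	X86_MATCH_INTEL_FAM6_MODEL(SKYLAKE_X, NULL),
	X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, NULL),
	X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_D, NULL),
	X86_MATCH_INTEL_FAM6_MODEL(ICELAKE, NULL),
	X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_L, NULL),
	X86_MATCH_INTEL_FAM6_MODEL(TIGERLAKE_L, NULL),
	X86_MATCH_INTEL_FAM6_MODEL(TIGERLAKE, NULL),
	{ /* list must stay null-terminated */ }
};

/*
 * True if 512-bit (zmm) code paths should be preferred on this CPU.
 * Availability of AVX512/AVX10 itself is assumed to be checked separately.
 */
static bool crypto_simd_prefer_zmm(void)
{
	return !x86_match_cpu(zmm_exclusion_list);
}

Where such a helper should live (arch code or the crypto layer) is exactly
the open question raised above; the sketch only shows the mechanism of a
single shared exclusion list.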