On Tue, Mar 26, 2024 at 10:51:48AM +0200, Ard Biesheuvel wrote:
> > Open questions:
> >
> > - Is the policy that I implemented for preferring ymm registers to zmm
> >   registers the right one?  arch/x86/crypto/poly1305_glue.c thinks that
> >   only Skylake has the bad downclocking.  My current proposal is a bit
> >   more conservative; it also excludes Ice Lake and Tiger Lake.  Those
> >   CPUs supposedly still have some downclocking, though not as much.
> >
> > - Should the policy on the use of zmm registers be in a centralized
> >   place?  It probably doesn't make sense to have random different
> >   policies for different crypto algorithms (AES, Poly1305, ARIA, etc.).
> >
> > - Are there any other known issues with using AVX512 in kernel mode?  It
> >   seems to work, and technically it's not new because Poly1305 and ARIA
> >   already use AVX512, including the mask registers and zmm registers up
> >   to 31.  So if there was a major issue, like the new registers not
> >   being properly saved and restored, it probably would have already been
> >   found.  But AES-XTS support would introduce a wider use of it.
> >
>
> I don't have much input here, except that I think we should just
> disable AVX512 kernel-wide on systems where there is no benefit in
> terms of throughput. I suspect this might change with algorithms that
> rely more heavily on the masking, but so far, we have been making
> quite effective use of simple permute vectors and overlapping loads
> and stores to do the same. And as Eric points out, the only relevant
> use case in the kernel is blocks of size 2^n where n is at least 9.

There are several benefits to AVX512 besides the 512-bit zmm registers.
Besides masking, there are also twice as many SIMD registers which make it
possible to cache all the AES round keys.  There are also other new
instructions such as vpternlogd which I've used in AES-XTS to XOR values
together more efficiently.  That's why this patchset adds both
xts-aes-vaes-avx10_256 and xts-aes-vaes-avx10_512.
And I've adopted the new "AVX10" naming, maybe a bit early, to emphasize
that it's not just about 512-bit...

Consider Intel Ice Lake, for example.  These are the AES-256-XTS encryption
speeds on 4096-byte messages, in MB/s, that I'm seeing:

    xts-aes-aesni                  5136
    xts-aes-aesni-avx              5366
    xts-aes-vaes-avx2              9337
    xts-aes-vaes-avx10_256         9876
    xts-aes-vaes-avx10_512         10215

So yes, on that CPU the biggest boost comes just from VAES, staying on AVX2.
But taking advantage of AVX512 does help a bit more, first from the parts
other than 512-bit registers, then a bit more from 512-bit registers.

I do have Ice Lake on the exclusion list from xts-aes-vaes-avx10_512 anyway,
since the concern with downclocking is not really about the performance of
the code itself but rather the impact on unrelated code running on the CPU.

And I *think* the right policy is to just disable the use of the zmm
registers, as opposed to AVX512 entirely.  As AVX512 was originally
presented it did tie these together, but they don't have to be.  AVX10
(which supposedly future x86_64 CPUs will have) explicitly moves away from
that by repackaging the existing AVX512 features and making the zmm
registers optional.

- Eric
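
For illustration only, the "disable zmm, not AVX512" policy above can be
sketched in plain userspace C.  This is not the code from the patchset: the
struct cpu_caps fields, the enum, and pick_xts_impl() are hypothetical names
invented for this sketch, and the prefer_ymm flag stands in for whatever
per-CPU-model exclusion list the kernel would actually consult.  Only the
implementation names are taken from the benchmark table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical feature summary; the field names are illustrative only. */
struct cpu_caps {
	bool has_aesni;
	bool has_avx;
	bool has_vaes_avx2;	/* VAES usable with 256-bit AVX2 code */
	bool has_vaes_avx512;	/* VAES plus the AVX512 features needed */
	bool prefer_ymm;	/* CPU is on the zmm exclusion list */
};

enum xts_impl {
	XTS_NONE,
	XTS_AESNI,
	XTS_AESNI_AVX,
	XTS_VAES_AVX2,
	XTS_VAES_AVX10_256,
	XTS_VAES_AVX10_512,
};

/* Pick the most capable implementation the policy allows. */
enum xts_impl pick_xts_impl(const struct cpu_caps *c)
{
	assert(c != NULL);

	if (c->has_vaes_avx512) {
		/*
		 * Excluded CPUs still get masking, 32 registers and
		 * vpternlogd; only the 512-bit registers are avoided.
		 */
		return c->prefer_ymm ? XTS_VAES_AVX10_256
				     : XTS_VAES_AVX10_512;
	}
	if (c->has_vaes_avx2)
		return XTS_VAES_AVX2;
	if (c->has_aesni && c->has_avx)
		return XTS_AESNI_AVX;
	if (c->has_aesni)
		return XTS_AESNI;
	return XTS_NONE;
}

/* Map the choice back to the names used in the benchmark table. */
const char *xts_impl_name(enum xts_impl impl)
{
	switch (impl) {
	case XTS_AESNI:		 return "xts-aes-aesni";
	case XTS_AESNI_AVX:	 return "xts-aes-aesni-avx";
	case XTS_VAES_AVX2:	 return "xts-aes-vaes-avx2";
	case XTS_VAES_AVX10_256: return "xts-aes-vaes-avx10_256";
	case XTS_VAES_AVX10_512: return "xts-aes-vaes-avx10_512";
	default:		 return "none";
	}
}
```

With this sketch, a CPU that has every feature but is on the exclusion list
(Ice Lake, under the proposal above) ends up on xts-aes-vaes-avx10_256
rather than falling all the way back to xts-aes-vaes-avx2.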