On Mon, Apr 08, 2024 at 07:41:44AM +0000, David Laight wrote:
> From: Eric Biggers
> Sent: 05 April 2024 20:19
> ...
> > I did some tests on Sapphire Rapids using a system call that I customized to do
> > nothing except possibly a kernel_fpu_begin / kernel_fpu_end pair.
> >
> > On average the bare syscall took 70 ns.  The syscall with the kernel_fpu_begin /
> > kernel_fpu_end pair took 160 ns if the userspace program used xmm only, 340 ns
> > if it used ymm, or 360 ns if it used zmm...
> >
> > Note that without the kernel_fpu_begin / kernel_fpu_end pair, AES-NI
> > instructions cannot be used and the alternative would be xts(ecb(aes-generic)).
> > On the same CPU, encrypting a single 512-byte sector with xts(ecb(aes-generic))
> > takes about 2235 ns.  With xts-aes-vaes-avx10_512 it takes 75 ns...
>
> So most of the cost of a single 512-byte sector is the kernel_fpu_begin().
> But it is so much slower any other way it is still faster.
>

Yes.  To clarify, the 75 ns time I mentioned for a 512-byte sector is the
average for repeated calls, amortizing the XSAVE and XRSTOR.  For a real single
512-byte sector that eats the entire cost of the XSAVE and XRSTOR by itself, if
all state is in-use it should be about 75 + (360 - 70) = 365 ns (based on the
syscall benchmarks I did), with the XSAVE and XRSTOR accounting for 80% of that
time.  But yes, that's still over 6 times faster than the scalar alternative.

- Eric
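
For anyone who wants to re-check the arithmetic above, here is a minimal sketch in Python. It only replays the figures quoted in this thread; the constant and function names are made up for illustration and do not correspond to anything in the kernel. It assumes, as the estimate above does, that the kernel_fpu_begin / kernel_fpu_end cost is simply the difference between the two syscall timings and that it adds linearly to the amortized per-sector time.

```python
# Figures quoted in the thread (Sapphire Rapids, all in nanoseconds).
BARE_SYSCALL_NS = 70            # syscall that does nothing
SYSCALL_WITH_FPU_ZMM_NS = 360   # same syscall plus kernel_fpu_begin/end, zmm state in use
VAES_SECTOR_AMORTIZED_NS = 75   # xts-aes-vaes-avx10_512, averaged over repeated calls
GENERIC_SECTOR_NS = 2235        # xts(ecb(aes-generic)), one 512-byte sector


def single_sector_estimate(syscall_with_fpu_ns, bare_syscall_ns, amortized_ns):
    """Estimate the cost of one isolated sector (hypothetical helper).

    Assumes the XSAVE/XRSTOR overhead equals the difference between the
    two syscall timings and adds linearly to the amortized crypto time.
    Returns (fpu_overhead_ns, total_ns).
    """
    fpu_overhead_ns = syscall_with_fpu_ns - bare_syscall_ns
    return fpu_overhead_ns, amortized_ns + fpu_overhead_ns


fpu_overhead_ns, total_ns = single_sector_estimate(
    SYSCALL_WITH_FPU_ZMM_NS, BARE_SYSCALL_NS, VAES_SECTOR_AMORTIZED_NS
)
fpu_share = fpu_overhead_ns / total_ns    # fraction of time spent in XSAVE/XRSTOR
speedup = GENERIC_SECTOR_NS / total_ns    # advantage over the scalar alternative

print(f"XSAVE/XRSTOR overhead: {fpu_overhead_ns} ns")
print(f"single-sector total:   {total_ns} ns")
print(f"overhead share:        {fpu_share:.1%}")
print(f"speedup vs. generic:   {speedup:.2f}x")
```

This gives 290 ns of overhead, a 365 ns total, an overhead share of roughly 79.5%, and a speedup of roughly 6.1x, matching the "80%" and "over 6 times" figures above. Passing 160 or 340 instead of 360 would give the same estimate for the xmm-only and ymm cases, under the same linearity assumption.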