On Thu, Apr 04, 2024 at 07:53:48AM +0000, David Laight wrote:
> > > How much does the kernel_fpu_begin() cost on real workloads?
> > > (ie when the registers are live and it forces an extra save/restore)
> >
> > x86 Linux does lazy restore of the FPU state. The first kernel_fpu_begin()
> > can have a significant cost, as it issues an XSAVE (or equivalent)
> > instruction and causes an XRSTOR (or equivalent) instruction to be issued
> > when returning to userspace when it otherwise might not be needed.
> > Additional kernel_fpu_begin() / kernel_fpu_end() pairs without returning
> > to userspace have only a small cost, as they don't cause any more saves
> > or restores of the FPU state to be done.
> >
> > My new xts(aes) implementations have one kernel_fpu_begin() /
> > kernel_fpu_end() pair per message (if the message doesn't span any page
> > boundaries, which is almost always the case). That's exactly the same as
> > the current xts-aes-aesni.
>
> I realised after sending it that the code almost certainly already did
> kernel_fpu_begin() - so there probably isn't a difference because all the
> fpu state is always saved.
> (I'm sure there should be a way of getting access to (say) 2 ymm registers
> by providing an on-stack save area to allow wide data copies or special
> instructions - but that is a different issue.)
>
> > I think what you may really be asking is how much the overhead of the
> > XSAVE / XRSTOR pair associated with kernel-mode use of the FPU *increases*
> > if the kernel clobbers AVX or AVX512 state, instead of just SSE state as
> > xts-aes-aesni does. That's much more relevant to this patchset.
>
> It depends on what has to be saved, not on what is used.
> Although, since all the x/y/zmm registers are caller-saved I think they
> could be 'zapped' on syscall entry (and restored as zero later).
> Trouble is I suspect there is a single piece of code somewhere that relies
> on them being preserved across an inlined system call.
>
> > I think the answer is that there is no additional overhead. This is
> > because the XSAVE / XRSTOR pair happens regardless of the type of state
> > the kernel clobbers, and it operates on the userspace state, not the
> > kernel's. Some of the newer variants of XSAVE (XSAVEOPT and XSAVES) do
> > have a "modified" optimization where they don't save parts of the state
> > that are unmodified since the last XRSTOR; however, that is unimportant
> > here because the kernel's FPU state is never saved.
> >
> > (This would change if x86 Linux were to support preemption of kernel-mode
> > FPU code. In that case, we may need to take more care to minimize use of
> > AVX and AVX512 state. That being said, AES-XTS tends to be used for bulk
> > data anyway.)
> >
> > This is based on theory, though. I'll do a test to confirm that there's
> > indeed no additional overhead. And also, even if there's no additional
> > overhead, what the existing overhead actually is.
>
> Yes, I was wondering how it is used for 'real applications'.
> If a system call that would normally return immediately (or at least
> without a full process switch) hits the aes code it gets the cost of the
> XSAVE added.
> Whereas the benchmark probably doesn't do anywhere near as many.
>
> OTOH this is probably no different.

I did some tests on Sapphire Rapids using a system call that I customized to
do nothing except possibly a kernel_fpu_begin / kernel_fpu_end pair.
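For anyone curious, the test syscall was roughly of the following shape.
This is only a minimal sketch, wired up to a spare syscall number; the
syscall name, the flag argument, and the comments are illustrative, not the
exact code I ran:

#include <linux/syscalls.h>
#include <asm/fpu/api.h>

/*
 * Do-nothing test syscall: optionally enter and leave a kernel-mode FPU
 * section, so the cost of the section itself (and of the XSAVE/XRSTOR it
 * forces for the task) can be compared against a bare syscall.
 */
SYSCALL_DEFINE1(fpu_cost_test, unsigned int, use_fpu)
{
        if (use_fpu) {
                /* Saves the task's user FPU state (XSAVE) if it hasn't
                 * been saved yet; the kernel may now use SIMD. */
                kernel_fpu_begin();

                /* ... optionally clobber some SIMD registers here ... */

                /* The kernel may no longer use SIMD; the user's state is
                 * restored (XRSTOR) lazily on return to userspace. */
                kernel_fpu_end();
        }
        return 0;
}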
On average the bare syscall took 70 ns. The syscall with the
kernel_fpu_begin / kernel_fpu_end pair took 160 ns if the userspace program
used xmm only, 340 ns if it used ymm, or 360 ns if it used zmm.

I also tried making the kernel clobber different registers in the
kernel_fpu_begin / kernel_fpu_end section, and as I expected this did not
make any difference.

Note that without the kernel_fpu_begin / kernel_fpu_end pair, AES-NI
instructions cannot be used and the alternative would be
xts(ecb(aes-generic)). On the same CPU, encrypting a single 512-byte sector
with xts(ecb(aes-generic)) takes about 2235 ns. With xts-aes-vaes-avx10_512
it takes 75 ns. (Not a typo -- it really is almost 30 times faster!) So it
seems clear the FPU state save and restore is worth it even just for a
single sector using the traditional 512-byte sector size, let alone the
4096-byte sector size which is recommended these days.
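For reference, the per-sector numbers above come down to timing one skcipher
encryption of a 512-byte sector through the kernel crypto API. Below is a
rough sketch of that path; the key size, the zero tweak, and the assumption
of a synchronous algorithm are simplifications, not the exact benchmark code:

#include <crypto/skcipher.h>
#include <linux/scatterlist.h>
#include <linux/slab.h>
#include <linux/err.h>

/* Encrypt one 512-byte sector in place with AES-256-XTS. */
static int encrypt_one_sector(const u8 key[64], u8 sector[512])
{
        struct crypto_skcipher *tfm;
        struct skcipher_request *req;
        struct scatterlist sg;
        u8 iv[16] = {};         /* XTS tweak; normally the sector number */
        int err;

        tfm = crypto_alloc_skcipher("xts(aes)", 0, 0);
        if (IS_ERR(tfm))
                return PTR_ERR(tfm);

        err = crypto_skcipher_setkey(tfm, key, 64);
        if (err)
                goto out_free_tfm;

        req = skcipher_request_alloc(tfm, GFP_KERNEL);
        if (!req) {
                err = -ENOMEM;
                goto out_free_tfm;
        }

        sg_init_one(&sg, sector, 512);
        skcipher_request_set_crypt(req, &sg, &sg, 512, iv);

        /* Assumes a synchronous implementation; a real caller would handle
         * -EINPROGRESS, e.g. via crypto_wait_req(). */
        err = crypto_skcipher_encrypt(req);

        skcipher_request_free(req);
out_free_tfm:
        crypto_free_skcipher(tfm);
        return err;
}

- Eric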